Duplicate image detection with perceptual hashing in Python

Recently we implemented a duplicate image detector to avoid importing dupes into Jetsetter’s large image store. To achieve this, we wrote a Python implementation of the dHash perceptual hash algorithm and the nifty BK-tree data structure.

Jetsetter has hundreds of thousands of high-resolution travel photos, and we’re adding lots more every day. The problem is, these come from a variety of sources and are uploaded in a semi-automated way, so there are often duplicates or almost-identical photos that sneak in. And we don’t want our photo search page filled with dupes.

So we decided to automate the task of filtering out duplicates using a perceptual hashing algorithm. A perceptual hash is like a regular hash in that it is a smaller, comparable fingerprint of a much larger piece of data. However, unlike a typical hashing algorithm, the idea is that two perceptual hashes are “close” (or equal) if the original images are close.

Difference hash (dHash)

We use a perceptual image hash called dHash (“difference hash”), which was developed by Neal Krawetz in his work on photo forensics. It’s a very simple but surprisingly effective algorithm that involves the following steps (to produce a 128-bit hash value):

  • Convert the image to grayscale
  • Downsize to a 9x9 square of gray values (or 17x17 for a larger, 512-bit hash)
  • Calculate the “row hash”: for each row, move from left to right, and output a 1 bit if the next gray value is greater than the previous one, or a 0 bit otherwise (each 9-pixel row produces 8 bits of output)
  • Calculate the “column hash”: same as above, but for each column, move top to bottom
  • Concatenate the two 64-bit values together to get the final 128-bit hash

dHash is great because it’s fairly accurate, and very simple to understand and implement. It’s also fast to calculate (Python is not very fast at bit twiddling, but all the hard work of converting to grayscale and downsizing is done by a C library: ImageMagick+wand or PIL).
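
For example, with Pillow (PIL) the grayscale-and-downsize step is just a couple of calls. This is roughly what the library’s get_grays helper does when given a PIL image; the function name and resampling filter below are illustrative, not the exact implementation:

from PIL import Image

def get_grays_pil(path, width, height):
    # Convert to grayscale ('L' mode), shrink to width x height,
    # and return the pixels as a flat list of gray values
    image = Image.open(path)
    small = image.convert('L').resize((width, height), Image.LANCZOS)
    return list(small.getdata())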

Here’s what this process looks like visually. Starting with the original image:

Diver - original image

Grayscale and down-sized to 9x9 (but then magnified for viewing):

Diver - grayscale and down-sized

And the 8x8 row and column hashes, again magnified (black = 0 bit, white = 1 bit):

Diver - row hash

The core of the dHash code is as simple as a couple of nested for loops:

def dhash_row_col(image, size=8):
    # Downsize to a (size+1) x (size+1) grid of gray values
    width = size + 1
    grays = get_grays(image, width, width)

    row_hash = 0
    col_hash = 0
    for y in range(size):
        for x in range(size):
            offset = y * width + x
            # Row hash: 1 bit if the pixel to the right is brighter
            row_bit = grays[offset] < grays[offset + 1]
            row_hash = row_hash << 1 | row_bit

            # Column hash: 1 bit if the pixel below is brighter
            col_bit = grays[offset] < grays[offset + width]
            col_hash = col_hash << 1 | col_bit

    return (row_hash, col_hash)

It’s a simple enough algorithm to implement, but there are a few tricky edge cases, and we thought it’d be nice to roll it all together and open source it. Our Python code is available on GitHub and from the Python Package Index, so it’s only a pip install dhash away.
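
For example, here’s a minimal usage sketch with a Pillow image (the dhash_row_col and format_hex helpers are per the library’s README; the filename is just a placeholder):

import dhash
from PIL import Image

image = Image.open('diver.jpg')              # placeholder filename
row, col = dhash.dhash_row_col(image)
print(dhash.format_hex(row, col))            # the 128-bit hash as hex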

Dupe threshold

To determine whether two images are duplicates, you compare their dHash values. If the hash values are equal, the images are nearly identical. If the hash values differ by only a few bits, the images are very similar – so you calculate the number of bits that differ between the two values (the Hamming distance), and then check whether that’s under a given threshold.

Side note: there’s a helper function in our Python dhash library called get_num_bits_different() that calculates the delta. Oddly enough, in Python the fastest way to do this is to XOR the values, convert the result to a binary string, and count the number of '1' characters (this is because then you’re asking builtin functions written in C to do the hard work and the looping):

def get_num_bits_different(hash1, hash2):
    return bin(hash1 ^ hash2).count('1')

On our set of images (over 200,000 total) we set the 128-bit dHash threshold to 2. In other words, if the hashes are equal or only different in 1 or 2 bits, we consider them duplicates. In our tests, this is a large enough delta to catch most of the dupes. When we tried going to 4 or 5 it started catching false positives – images that had similar fingerprints but were too different visually.

For example, this was one of the image pairs that helped us settle on a threshold of 2 – these two images have a delta of 4 bits:

False positive "dupes" with dHash delta of 4 bits

MySQL bit counting

I’m a big PostgreSQL fan, but we’re using MySQL for this project, and one of the neat little functions it has that PostgreSQL doesn’t is BIT_COUNT, which counts the number of 1 bits in a 64-bit integer. So if you break up the 128-bit hash into two parts you can use two BIT_COUNT() calls to determine whether a binary hash column is within n bits of a given hash.

It’s a little roundabout, because MySQL doesn’t seem to have a way to convert part of a binary column to a 64-bit integer, so we did it by going to hex and back (let us know if there’s a better way!). Our dHash column is called dhash8, and dhash8_0 and dhash8_1 are the high and low 64-bit literal values of the hash we’re comparing against.

So here’s the query we use to detect duplicates when we upload a new image (well, we’re actually using the Python SQLAlchemy ORM, but close enough):

SELECT asset_id, dhash8
FROM assets
WHERE
    BIT_COUNT(CAST(CONV(HEX(SUBSTRING(dhash8, 1, 8)), 16, 10)
        AS UNSIGNED) ^ :dhash8_0) +    -- high part
    BIT_COUNT(CAST(CONV(HEX(SUBSTRING(dhash8, 9, 8)), 16, 10)
        AS UNSIGNED) ^ :dhash8_1)      -- plus low part
    <= 2                               -- within threshold?

The above is a relatively slow query that involves a full table scan, but we only do it once on upload, so taking an extra second or two isn’t a big deal.
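
For completeness, here’s a rough sketch of how the two halves might be computed and bound to that query from Python. We actually go through the SQLAlchemy ORM, so treat the raw-SQL plumbing and the find_duplicates name here as illustrative:

import dhash
from sqlalchemy import text

# The query shown above, with :dhash8_0 and :dhash8_1 as bind parameters
FIND_DUPES_SQL = text("""
    SELECT asset_id, dhash8
    FROM assets
    WHERE
        BIT_COUNT(CAST(CONV(HEX(SUBSTRING(dhash8, 1, 8)), 16, 10)
            AS UNSIGNED) ^ :dhash8_0) +
        BIT_COUNT(CAST(CONV(HEX(SUBSTRING(dhash8, 9, 8)), 16, 10)
            AS UNSIGNED) ^ :dhash8_1)
        <= 2
""")

def find_duplicates(connection, image):
    # The row hash forms the high 64 bits of the 128-bit value and the
    # column hash the low 64 bits, matching :dhash8_0 and :dhash8_1
    row_hash, col_hash = dhash.dhash_row_col(image)
    params = {'dhash8_0': row_hash, 'dhash8_1': col_hash}
    return connection.execute(FIND_DUPES_SQL, params).fetchall()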

BK-trees and fast dupe detection

However, searching for existing duplicates across our entire image set (about 150,000 photos at the time) quickly turns into an O(N^2) problem – for every photo, you have to look for dupes in every other photo. With well over a hundred thousand images, that’s billions of comparisons – way too slow, so we needed something better. Enter the BK-tree.

A BK-tree is an n-ary tree data structure specifically designed for finding “close” matches fast. For example, finding strings within a certain edit distance of a given string. Or in our case, finding dHash values within a certain bit distance of a given dHash. This turns an O(N^2) problem into something closer to an O(log N) one. (We discovered a truly marvelous proof of this, but the margin is too narrow to contain it.)

BK-trees are described by Nick Johnson in his “Damn Cool Algorithms” blog series. It’s dense reading, but the BK-tree structure is actually quite simple, especially in Python, where creating trees is very easy with a bunch of nested dictionaries. Here’s the BKTree.add() method that adds an item to the tree:

def add(self, item):
    node = self.tree
    if node is None:
        # First item added becomes the root: an (item, children) tuple
        self.tree = (item, {})
        return

    while True:
        parent, children = node
        distance = self.distance_func(item, parent)
        node = children.get(distance)
        if node is None:
            # No child at this distance yet, so hang the new item here
            children[distance] = (item, {})
            break
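
Lookup is the mirror image: walk the tree, collect items within the threshold, and (thanks to the triangle inequality) only descend into child subtrees whose edge distance could still contain a match. Here’s a sketch of a find() method along those lines – see the pybktree repo for the real implementation:

def find(self, item, n):
    # Return a sorted list of (distance, item) for all items within
    # distance n of the given item
    if self.tree is None:
        return []
    found = []
    candidates = [self.tree]
    while candidates:
        parent, children = candidates.pop()
        distance = self.distance_func(item, parent)
        if distance <= n:
            found.append((distance, parent))
        # Only subtrees at an edge distance within n of this node's
        # distance can possibly contain matches (triangle inequality)
        candidates.extend(child for d, child in children.items()
                          if distance - n <= d <= distance + n)
    found.sort()
    return found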

There were a couple of existing BK-tree libraries in Python (and I think more have appeared since we added ours), but one of them didn’t work for us and was buggy (ryanfox/bktree), and the one that looked good wasn’t on PyPI (ahupp/bktree), so we rolled our own.

So again, our Python code is available on GitHub and from the Python Package Index – so it’s only a pip install pybktree away.
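
Using it for the dupe scan looks roughly like this (pybktree provides a hamming_distance helper and a find() method per its README; the hash values below are a tiny made-up sample, not real dHashes):

import pybktree

# A tiny made-up sample; in practice these are the 128-bit dHash integers
hashes = [0b10110110, 0b10110111, 0b01001000]

tree = pybktree.BKTree(pybktree.hamming_distance, hashes)
for image_hash in hashes:
    # find() returns sorted (distance, item) tuples within the threshold
    for distance, match in tree.find(image_hash, 2):
        if match != image_hash:
            print('possible dupe:', bin(image_hash), bin(match), distance)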


Config-based document hydration

Here at Jetsetter we produce a lot of content. MikeP described how we recently revamped our CMS admin tool to enable our editors to produce great-looking content without having to type a single HTML tag.

If we have an article talking about why this hotel is just the absolute to-die-for destination this season, and we have this hotel listed on Jetsetter, then wouldn’t it be nice to show off the Jetsetter price when linking to the hotel page?

Options:

  1. Editors wake up extra early every morning and look up the Jetsetter prices of these hotels and update the articles with those prices
  2. When rendering the article, the client calls Jetsetter pricing APIs for each hotel to get the latest prices
  3. The CMS backend calls the APIs and includes pricing details in the article response

Option 1 was not ideal because we like our editors. Option 2 is workable but adds complexity to each client and can delay the final rendering of the article.

Option 3 looks good, but what backend engineer would agree to the ongoing support of whatever rich content the editors and front-end devs decide they want? Today it’s Jetsetter pricing, tomorrow it’s image metadata, the next day it’s Instagram embeds…

To flip the last bit on this Win-Win-Lose scenario we need to generalize the approach so that new data sources can be configured without any code change.

The backend doesn’t know what resources are referenced in these documents, and doesn’t want to. Nor does the backend know how to fetch said resources. Let the json-schema producer dev and the desktop client dev work it out between themselves.

Take this simple document as an example:

{
  "propertyId": 65
}

We will want to populate this object with the full details of the referenced property if possible. Even though this is a Jetsetter property, and this is all Jetsetter code, we (the backend) want nothing to do with how to fetch this data. To accomplish this, the client-facing CMS backend is configurable (via config files) for how to match and fetch any resources referenced in the document. For example:

{
    "property": {
        "tokens": {
            "property-id": ["propertyId", "propertyIds"]
        },
        "method": "GET",
        "header": { ... },
        "ssl": true,
        "host": "api.jetsetter.com",
        "port": 443,
        "path": "/v4/PropertyService/properties/${property-id}",
        "data": {},
        "max-connections": 10,
        "max-retries": 0
    },

Hydration happens in a few steps:

  • Extract matching metadata

The configuration describes “tokens” which match on field names in the document and eventually are used when fetching the given resource. In this case, “property-id”: [“propertyId”, “propertyIds”] means that any fields called “propertyId” or “propertyIds” will lead to this resource being fetched.

  • Fetch resources

Then it replaces tokens in braces with the extracted references:

"path": "/v4/PropertyService/properties/${property-id}"

becomes:

/v4/PropertyService/properties/65

  • Infuse API response into document (hydration)

Our original document:

{
  "propertyId": 65
}

After matching and fetching and infusing becomes:

{
  "propertyId": 65,
  "property": [
    {
      "status": 0,
      "data": {
        "name": "Solage Calistoga",
        "url": "/hotels/calistoga/california/65/solage-calistoga",
        

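To make the matching, substitution, and infusion concrete, here’s a rough Python sketch of the whole flow. Our backend isn’t actually Python, and the helper name, flat-document assumption, and use of requests are all illustrative:

import requests

def hydrate(document, resource_config):
    # Sketch only: assumes a flat document and single-valued tokens
    for resource_name, config in resource_config.items():
        # Extract matching metadata: find document fields named by each token
        values = {}
        for token, field_names in config['tokens'].items():
            for field in field_names:
                if field in document:
                    values[token] = str(document[field])
        if len(values) != len(config['tokens']):
            continue  # not every token matched, so skip this resource

        # Fetch resources: substitute ${token} placeholders into the path
        path = config['path']
        for token, value in values.items():
            path = path.replace('${' + token + '}', value)
        scheme = 'https' if config.get('ssl') else 'http'
        url = '{}://{}:{}{}'.format(scheme, config['host'], config['port'], path)
        response = requests.request(config.get('method', 'GET'), url)

        # Infuse the API response into the document under the resource name
        document.setdefault(resource_name, []).append(
            {'status': 0 if response.ok else 1, 'data': response.json()})
    return document
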
What about a more complex example, where one widget might need to be hydrated against different APIs depending on the content? What if the ‘token’ is just one piece of the loaded value? Note the different extractor fields below which will match and capture the relevant piece of the data:

{
    "twitter-oembed": {
        "tokens": {
            "social-id": ["socialId"]
        },
        "extractors": {
            "social-id": "^(.*twitter.com.*)$"
        },
        "method": "GET",
        "header": {},
        "ssl": true,
        "host": "publish.twitter.com",
        "port": 443,
        "path": "/oembed?url=${social-id}",
        "data": {},
        "max-connections": 10,
        "max-retries": 0
    },
    "instagram-oembed": {
        "tokens": {
            "social-id": ["socialId"]
        },
        "extractors": {
            "social-id": "^(.*instagram.com.*)$"
        },
        "method": "GET",
        "header": {},
        "ssl": true,
        "host": "api.instagram.com",
        "port": 443,
        "path": "/oembed/?url=${social-id}",
        "data": {},
        "max-connections": 10,
        "max-retries": 0
    }
}
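
The extractors are just regular expressions: an endpoint only handles a value if its extractor matches, and the captured group is what gets substituted into the path. Roughly, and again only as an illustration in Python:

import re

def route_social(social_id, configs):
    # Return the name of the config whose extractor matches this socialId,
    # plus the captured group that will be substituted for ${social-id}
    for name, config in configs.items():
        pattern = config.get('extractors', {}).get('social-id')
        match = re.match(pattern, social_id) if pattern else None
        if match:
            return name, match.group(1)
    return None, None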

In real life we have many more configured endpoints, and some documents reference a large number of external resources. We don’t want the performance of the website to depend on all of these external data sources, so one final step is to load the hydrated documents into an Elasticsearch index. This gives us horizontal scalability and decouples serving the documents from producing and hydrating them, and it also lets Elasticsearch do some of the heavy lifting of querying documents by custom fields.

graphyte: Python 3 library to send data to a Graphite metrics server

graphyte is a small Python library that sends data to a Graphite metrics server. We wrote it because the existing graphitesend library didn’t support Python 3 and required gevent for asynchronous use. graphyte is compatible with Python 3.4+ as well as Python 2.7, and uses the standard library’s threading module for asynchronous use.

We’ve recently started to use Python 3.5 in production at Jetsetter for various image processing tasks, so compatibility with Python 3.x is important to us. The library is small and stand-alone (pure Python, no external dependencies), so it wasn’t hard to release it as open source.

Jetsetter is using the official PyPI version on our own servers and it’s working well so far! Here’s a Grafana screenshot showing one of the metrics we’re pushing through using graphyte:

Grafana screenshot

graphyte is on the Python Package Index (PyPI), so to install it, fire up a command prompt, activate your virtualenv if you’re using one, and type:

pip install graphyte

Using graphyte is simple – just call init() to initialize the default sender and then send() to send a message. For example, to send system.sync.foo.bar 42 {timestamp}\n to graphite.example.com:2003 synchronously:

import graphyte
graphyte.init('graphite.example.com', prefix='system.sync')
graphyte.send('foo.bar', 42)

If you want to send asynchronously on a background thread (for example, in a web server context), just specify a send interval. For example, this will set up a background thread to send every 10 seconds:

graphyte.init('graphite.example.com', prefix='async', interval=10)
graphyte.send('foo.bar', 42)
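
If the module-level default sender doesn’t suit – say you need two Graphite destinations or different prefixes – you can also create a sender object directly. The sketch below assumes the Sender class described in the Jetsetter/graphyte README, and the metric names are made up:

import graphyte

sender = graphyte.Sender('graphite.example.com', prefix='system.images')
sender.send('dupes.detected', 3)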

For (moderately) more advanced examples and other info, see the Jetsetter/graphyte repo on GitHub.

Enjoy! And let us know if you have any feedback.

JSON-Schema Powered CMS


A few months ago, we at Jetsetter decided to revamp the tools that our editorial team used to create content for our online Magazine. Instead of listing all of the problems with our old CMS, I’ll let it speak for itself:

Old Jetsetter CMS

That’s pretty much all there was. Placing content other than plain text inside an article required the user to insert raw HTML, oftentimes with the help of an engineer. There was also very little support for images, which didn’t make much sense given our huge repository of awesome travel and hotel imagery. Needless to say, this was not anyone’s idea of an ideal system. So we set out to change that.


We researched existing CMS solutions, but decided to build our own for two reasons.

  • Easy integration with our existing platform — we wanted to allow users to make use of our large image repository and hotel data in creative ways.
  • Customized workflow — building a CMS ourselves would give us full control to quickly build tools that effectively met our team’s specific needs.

The old CMS offered very few tools to augment articles beyond plain text, and the ones it did offer were too cumbersome to use. We wanted a system that was flexible and powerful, while at the same time simple and intuitive to use.

We established the concept of “widgets” — modular, structured blocks of content that could be added to an article and easily reorganized. A large selection of widgets (rich text, photos, videos, social embeds, etc.) would allow for a broad range of expression, while also abstracting as many of the details away from the user as possible.

We knew we wanted to store articles as JSON because it’s a flexible format and is easy to work with, so when we began discussing how to implement “structured blocks of content”, we felt that JSON Schema was a natural fit in a system like this. Our friends at Oyster had success integrating JSON Schema into their rebuilt CMS, so we felt confident that it was something we could make use of as well.


JSON Schema allowed us to easily define the structure of our widgets and check whether the data for a specific widget was valid. Here’s a trimmed-down version of the schema we wrote that defines a widget for a social embed:

{
  "type": "object",
  "title": "Social",
  "required": [
    "socialUrl",
    "socialType"
  ],
  "properties": {
    "socialType": {
      "type": "string",
      "enum": [
        "Facebook",
        "Twitter",
        "Instagram",
        "Pinterest"
      ]
    },
    "socialUrl": {
      "type": "string",
      "title": "Social URL"
    }
  }
}

And the data for a Social widget would simply look like this:

{
    "socialType": "Facebook",
    "socialUrl": "https://www.facebook.com/Jetsetter/posts/10157238317965361"
}

Not only did JSON Schema enable us to document the structure of our data, it also let us use that documentation to validate the data on both the client and the server.
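
For instance, validating a widget against its schema in Python is a single call with the jsonschema package. This is just an illustration – the post doesn’t say which validator runs on the client or server – using a trimmed copy of the Social schema from above:

import jsonschema  # pip install jsonschema

social_schema = {
    "type": "object",
    "required": ["socialUrl", "socialType"],
    "properties": {
        "socialType": {"type": "string",
                       "enum": ["Facebook", "Twitter", "Instagram", "Pinterest"]},
        "socialUrl": {"type": "string"},
    },
}

widget = {
    "socialType": "Facebook",
    "socialUrl": "https://www.facebook.com/Jetsetter/posts/10157238317965361",
}

jsonschema.validate(widget, social_schema)  # raises ValidationError if invalid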

However, we still had to create the new UI that would collect this data from the user. At Jetsetter, we use React, so it would have been easy to create some kind of SocialWidget component that was rendered whenever the user wanted to add a social embed to their article. The SocialWidget component could have extracted the different socialTypes and put them into a select element and rendered an input element for the socialUrl.

However, this wasn’t going to scale that well at all. We would have had to update the React component whenever we wanted to make even a small change to the schema. Do you want bugs? Because that’s how you get bugs.

The big breakthrough for us was when we realized that we could use the schema to generate the UI. There were a few libraries out there that could do this, but we settled on the appropriately-named react-jsonschema-form because it was very customizable and actively maintained.

All we needed to do was feed our widget schemas into react-jsonschema-form and…

React form

This was HUGE. Now our entire CMS could be powered by our schemas. If we wanted to add a new socialType or add a field to the Social widget or even simply change the label for the input, all we needed to do was update the schema and everything just worked.


In addition to defining schemas for our widgets, we needed to define schemas for our different article types so that an entire article’s data could be validated and not just the individual widgets. An article consists of several singular inputs, like the article’s title, and a list of widgets. Here’s an abbreviated example of a schema for our Longform articles:

{
  "type": "object",
  "required": [
    "title",
    "deck",
    "widgets"
  ],
  "properties": {
    "title": {
      "type": "string",
      "title": "Title"
    },
    "deck": {
      "type": "string",
      "title": "Deck"
    },
    "widgets": {
      "type": "array",
      "title": "Widgets",
      "items": {
        "oneOf": [
          {"$ref": "#/definitions/content"},
          {"$ref": "#/definitions/photo"},
          {"$ref": "#/definitions/pullQuote"},
          {"$ref": "#/definitions/video"},
          {"$ref": "#/definitions/social"}
        ]
      }
    }
  },
  "definitions": {...}
}

Notice that this schema makes use of the oneOf keyword, which tells us that each item in the widgets array must validate against exactly one of the schemas listed. This allows our article schemas to be very flexible in the size and structure of the articles that validate against them, but it presented us with another problem: react-jsonschema-form didn’t know how to render this kind of schema.

To solve this problem, we kept track of a special schema — a “dynamic” schema — inside our application’s state that react-jsonschema-form did know how to render. Whenever a user wanted to add a new widget to their article, we added that widget’s schema to the dynamic schema and then fed that back into react-jsonschema-form.

Here’s what the dynamic schema would look like after the article was first created:

{
  "type": "object",
  "required": [...],
  "properties": {
    "title": {...},
    "deck": {...},
    "widgets": {
      "type": "array",
      "title": "Widgets",
      "items": []
    }
  },
  "definitions": {...}
}

And here’s what the dynamic schema would look like after the user added a Content widget, a Social widget, and another Content widget:

{
  "type": "object",
  "required": [...],
  "properties": {
    "title": {...},
    "deck": {...},
    "widgets": {
      "type": "array",
      "title": "Widgets",
      "items": [
          {"$ref": "#/definitions/content"},
          {"$ref": "#/definitions/social"},
          {"$ref": "#/definitions/content"}
      ]
    }
  },
  "definitions": {...}
}
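
Mechanically, adding a widget just appends another $ref to the dynamic schema’s items array before handing the schema back to the form. Our real code lives in the React app in JavaScript; here’s the idea in a few illustrative lines of Python:

def add_widget(dynamic_schema, widget_type):
    # widget_type is e.g. 'content' or 'social'; appending its $ref makes
    # react-jsonschema-form render one more sub-form for that widget
    ref = {"$ref": "#/definitions/" + widget_type}
    dynamic_schema["properties"]["widgets"]["items"].append(ref)
    return dynamic_schema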

We made use of react-jsonschema-form’s customizability in order to give the user the ability to add new widgets to an article. We overrode the default library behavior to add an “Add Widget” button beneath each of the widgets:

Add Widget

That same kind of customizability also allowed us to integrate a rich text editor component (we chose react-rte) so that our writers never have to write HTML again, and to add the ability to reorder and remove widgets so users can quickly restructure their content.

The new CMS was originally built to power our Magazine, but it can drive other parts of Jetsetter.com too. All we need to do is define the document schema and any new widget schemas we need. Then we can generate the UI automatically, collect user input, and construct a JSON object that can be stored in our database and used however we want. We’ve already created a Homepage document with a Homepage Hero Item widget that controls the carousel of images on Jetsetter’s homepage.

The combination of React, JSON Schema, and react-jsonschema-form resulted in a relatively small codebase that powers an extremely flexible and robust system that is easy to use, maintain, and expand.