Image retrieval with the consumer in mind

As a continuation of my blog post “Assumptions about the end user”, I want to explain what else should be considered when designing image retrieval systems with the end user in mind.

Don’t cause the user more work

To summarize the post I mentioned above: “Algorithms should not create new work for the user, but remove (some of) it.” An algorithm should be rather conservative in its decisions, because a user will perceive an algorithm that, for instance, creates wrong tags that the user has to correct in the end as faulty and not helpful at all.

Don’t dethrone the user

All too often there is also no option for the user to easily override the algorithm’s decision without having to disable it and lose all of its support.

Lifelong learning

The algorithm should not only allow me to retag an image or move it to a different cluster, but also use this information to retag other affected images and make better decisions in the future.

For instance, Wang et al. show in “Intelligent photo clustering with user interaction and distance metric learning” how corrections made by the user can be used to improve the distance calculation for photo clustering.

Solving the wrong problem

Unfortunately, unconstrained* object recognition is still far from solved and usable. The best system so far is the one from Alex Krizhevsky (University of Toronto), which uses Deep Convolutional Neural Networks.

His system achieved a top-5 error rate** of 15.3%, compared to 26% for the second-best system, on one of the most demanding benchmark databases, with 1.2 million images and 1000 object classes.

That’s very impressive, but it also means that roughly every 6th image gets assigned 5 labels, none of which is correct.

Nevertheless, this system was so groundbreaking that he, together with his supervisor, Geoffrey Hinton, and another grad student, was hired by Google in March of this year.
This system now runs the Google+ photo search.

But do we need such a system? How does it help you if the algorithm detects that there is a plant or a chair in your images? Isn’t it much more useful to analyze the scene of the picture and to tag pictures with broader scene descriptions like group picture, living room or mountains?

In 2010, a team from MIT and Brown University showed that even with existing methods one can achieve 90% recognition for 15 different scene classes like office, living room, inside city and forest with only 100 training images per class.

The authors wanted to push their new dataset, which contains nearly 400 scene classes and for which they reach a recognition rate of just under 40%. While this is academically much more demanding and thus more interesting, I don’t think consumers have a use for a system that can differentiate an oil refinery from an ordinary factory most of the time.

I am convinced that a simpler system that gets a few categories right ‘all’ the time is much more useful.

* unconstrained means that the algorithm does not need the environment or the object to be controlled in some way.
Most working systems only work with controlled lighting, background and perspective, and with no or limited clutter and occlusion.

** top-5 error rate is the fraction of test images for which the correct label is not among the five labels considered most probable by the model
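
To make the definition of the top-5 error rate concrete, here is a minimal sketch in Python. The function name `top_k_error_rate` and the toy scores are my own illustration and have nothing to do with the actual benchmark code; I simply assume that the model outputs one score per class for every image.

```python
def top_k_error_rate(scores, true_labels, k=5):
    """Fraction of images whose true label is NOT among the k best-scored classes.

    scores: one list of per-class scores per image (hypothetical model output)
    true_labels: the index of the correct class for each image
    """
    misses = 0
    for class_scores, true_label in zip(scores, true_labels):
        # Rank the class indices from highest to lowest score.
        ranked = sorted(range(len(class_scores)),
                        key=lambda c: class_scores[c], reverse=True)
        if true_label not in ranked[:k]:
            misses += 1
    return misses / len(true_labels)


# Toy example with 3 images and 6 classes (made-up numbers).
toy_scores = [
    [0.50, 0.20, 0.10, 0.10, 0.05, 0.05],  # true class 0: ranked first
    [0.30, 0.25, 0.20, 0.15, 0.07, 0.03],  # true class 5: ranked last
    [0.10, 0.10, 0.40, 0.20, 0.15, 0.05],  # true class 4: ranked third
]
toy_labels = [0, 5, 4]

print(top_k_error_rate(toy_scores, toy_labels))       # top-5 error rate
print(top_k_error_rate(toy_scores, toy_labels, k=1))  # top-1 error rate
```

In the toy data only the second image is a miss for k=5, because its correct class has the lowest of the six scores. By the same definition, an error rate of 15.3% corresponds to one miss in about every 6.5 images.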

AI, the new secret weapon in the cloud photo-storage war.

Gigaom posted an article on “The Dropbox computer vision acquisition that slipped under the radar”. But I think the article should have been called:

AI, the new secret weapon in the cloud photo-storage war.

Okay, this title is probably hyperbole. But all the big internet companies offer a way to store and share your photos online. And to make their offers more compelling, Yahoo, Google, and Dropbox all recently bought computer vision start-ups that will provide image recognition for their users’ uploaded photos. While Yahoo bought LookFlow, Google bought DNNresearch.

Microsoft has been researching image recognition for a long time, and I am sure they will soon integrate some of their algorithms into their cloud products. And Facebook just founded an internal AI group.

And to get a glimpse of the future without having to upload all your photographs to the internet, try the iOS app Impala. The app analyses and categorises all the photographs on your device. It was created by EUVision technologies, a spin-off of the University of Amsterdam that is commercializing its research efforts.

After the negative conclusion of my last post about the closure of Everpix, this is positive news for the machine learning market.

How good is Google Drive’s image recognition engine?

As announced via Twitter, I took the time to test Google Drive’s image recognition feature. Google Drive was announced two weeks ago with a blog post, which contained the bold claim:

Search everything. Search by keyword and filter by file type, owner and more. … We also use image recognition so that if you drag and drop photos from your Grand Canyon trip into Drive, you can later search for [grand canyon] and photos of its gorges should pop up. This technology is still in its early stages, and we expect it to get better over time.

This sparked my curiosity, so I evaluated Google Drive’s performance the way I would evaluate the image recognition frameworks I do my research on. First I uploaded an image dataset with images containing known objects, and then I counted how many of the pictures Google Drive’s search would find when I searched for these objects.

As the dataset, I used the popular Caltech 101 dataset, which contains pictures of objects belonging to 101 different categories. There are about 40 to 800 images per category and roughly 4500 images in total. While being far from perfect, it is a well-known contender.

These are my first findings:

  • Google Drive only finds a fraction of the images, but the images it finds it categorizes correctly.

  • In numbers: Precision is 83% (std=36%) and the recall is 8% (std=11%) (averaged over all categories)
  • It achieves the best results for the two ‘comic’ categories ‘Snoopy’ and ‘Garfield’ and for iconic symbols like the dollar bill and the stop sign.
  • As the Caltech 101 dataset was created using Google’s image search, the high precision is at least partly a result of a ‘simple’ duplicate detection with the Google index and not of a successful similarity search.
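
For readers who want to reproduce numbers like the ones above, this is roughly how per-category precision and recall can be computed and averaged. It is only a sketch with made-up data: the names `precision_recall` and `summarize` and the toy image IDs are hypothetical, and leaving categories without any search result out of the precision average is an assumption made for this example, not necessarily how the figures above were produced.

```python
from statistics import mean, pstdev


def precision_recall(returned, relevant):
    """Precision and recall for one search query.

    returned: image IDs the search gave back for the category name
    relevant: image IDs that really belong to the category
    """
    returned, relevant = set(returned), set(relevant)
    true_positives = len(returned & relevant)
    # Precision is undefined when nothing is returned (assumption: report None).
    precision = true_positives / len(returned) if returned else None
    recall = true_positives / len(relevant) if relevant else None
    return precision, recall


def summarize(search_results, ground_truth):
    """Mean and standard deviation of precision and recall over all categories."""
    precisions, recalls = [], []
    for category, relevant in ground_truth.items():
        p, r = precision_recall(search_results.get(category, []), relevant)
        if p is not None:  # assumption: skip categories with no results
            precisions.append(p)
        if r is not None:
            recalls.append(r)
    return {
        "precision_mean": mean(precisions), "precision_std": pstdev(precisions),
        "recall_mean": mean(recalls), "recall_std": pstdev(recalls),
    }


# Made-up toy data: three categories with a handful of image IDs each.
ground_truth = {
    "snoopy": ["s1", "s2", "s3", "s4"],
    "chair": ["c1", "c2", "c3", "c4", "c5"],
    "lamp": ["l1", "l2"],
}
search_results = {
    "snoopy": ["s1", "s2"],  # both hits are correct, but half are missing
    "chair": ["c1", "s3"],   # one correct hit, one wrong one
    "lamp": [],              # the search finds nothing
}

summary = summarize(search_results, ground_truth)
print(summary)
```

The toy data shows the same pattern as the real test: what the search returns is mostly correct, but it returns only a small part of what it should.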


Like all vision systems working in such an unconstrained environment, it is far from being actually usable. One cannot rely on it, but once or twice it will surprise you by adding an image to the result list that you hadn’t thought of.

Further resources:


Link to Matlab code which achieves 65% precision with 100% recall.*

* The numbers are not comparable 1-to-1, as the two use different evaluation approaches. The Matlab script assigns to each image of the dataset its most likely class, while Google Drive tries to find a concept or object in the image.