Image retrieval with the consumer in mind

As a continuation of my blog post Assumptions about the end user, I want to explain what else should be considered when designing image retrieval systems with the end user in mind.

Don’t cause the user more work

To summarize the post I mentioned above: “Algorithms should not create new work for the user, but remove (some of) it.” An algorithm should be rather conservative in its decisions, because a user will perceive an algorithm that, for instance, creates wrong tags that the user has to correct in the end as faulty and not helpful at all.
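One simple way to make a tagger conservative is to apply a tag automatically only when the model is very confident and to leave everything else untouched. The following is a minimal sketch of that idea; the function name, the confidence scores and the threshold of 0.9 are all assumptions made for illustration, not values from any real system.

```python
def conservative_tags(scores, threshold=0.9):
    """Return only the tags whose confidence is at least `threshold`.

    `scores` maps a tag name to a confidence in [0, 1]. Both the scores
    and the default threshold are made-up values for illustration.
    """
    return sorted(tag for tag, conf in scores.items() if conf >= threshold)


# Hypothetical classifier output for one photo.
scores = {"beach": 0.97, "dog": 0.55, "sunset": 0.91, "chair": 0.30}
print(conservative_tags(scores))  # ['beach', 'sunset']
```

A higher threshold means fewer tags overall, but also fewer wrong tags that the user would have to clean up.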

Don’t dethrone the user

Also, too often there is no option for the user to easily override the decision of the algorithm without having to disable it and lose all of its support.

Lifelong learning

The algorithm should not only allow me to retag an image or move it to a different cluster, but use this information to retag other affected images and make better decisions in the future.

For instance, Wang et al. show in Intelligent photo clustering with user interaction and distance metric learning how corrections made by the user can be used to improve the distance calculation for photo clustering.

Solving the wrong problem

Unfortunately unconstrained* object recognition is still far from solved and useable. The best system so far is the one from Alex Krizhevsky (University of Toronto) using Deep Convolutional Neural Networks.

His system achieved a top-5 error rate** of 15.3%, compared to 26% for the second-best system, on one of the most demanding benchmark databases with 1.2 million images and 1000 object classes.

That’s very impressive, but it also means that for roughly every 6th image none of the 5 assigned labels is correct.

Nevertheless, this system was so groundbreaking that he, together with his supervisor, Geoffrey Hinton, and another grad student, was hired by Google in March of this year.
This system now runs the Google+ photo search.

But do we need such a system? How does it help you if the algorithm detects that there is a plant or a chair in your images? Isn’t it much more useful to analyze the scene of the picture and to tag pictures with broader scene descriptions such as group picture, living room or mountains?

In 2010 a team from MIT and Brown University showed that even with existing methods one can achieve 90% recognition for 15 different scene classes like office, living room, inside city and forest with only 100 training images per class.

The authors wanted to push their new dataset, which contains nearly 400 scene classes and for which they reach a recognition rate of just under 40%. While this is academically much more demanding and thus interesting, I don’t think consumers have a use for a system that can differentiate an oil refinery from an ordinary factory most of the time.

I am convinced that a simpler system that gets a few categories right ‘all’ the time is much more useful.

* unconstrained means that the algorithm does not need the environment or the object to be controlled in some way.
Most working systems only work with controlled lighting, background and perspective, and with no or limited clutter and occlusion.

** top-5 error rate is the fraction of test images for which the correct label is not among the five labels considered most probable by the model
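
The definition of the top-5 error rate can be written down in a few lines. This is a minimal sketch on made-up toy data; the function name and the labels are assumptions for illustration, and the example uses k=2 and k=3 only because the toy rankings have just three entries each.

```python
def top_k_error(ranked_predictions, true_labels, k=5):
    """Fraction of images whose true label is not among the k top-ranked labels.

    `ranked_predictions[i]` lists the labels for image i, most probable first.
    """
    misses = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_labels)
        if truth not in ranked[:k]
    )
    return misses / len(true_labels)


# Toy data: four images, three ranked guesses each.
predictions = [
    ["cat", "dog", "fox"],
    ["car", "bus", "van"],
    ["oak", "elm", "fir"],
    ["cup", "mug", "jar"],
]
truth = ["dog", "van", "oak", "pot"]

print(top_k_error(predictions, truth, k=2))  # 0.5
print(top_k_error(predictions, truth, k=3))  # 0.25
```

With an error rate of 15.3%, about one in 1 / 0.153 ≈ 6.5 test images has none of its five proposed labels correct.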

AI, the new secret weapon in the cloud photo-storage war.

Gigaom posted an article on “The Dropbox computer vision acquisition that slipped under the radar“. But I think the article should have been called:

AI, the new secret weapon in the cloud photo-storage war.

Okay, this title is probably hyperbole. But all the big internet companies offer a way to store and share your photos online. And to make their offer more compelling, Yahoo, Google, and Dropbox all recently bought computer vision start-ups that will provide image recognition for their users’ uploaded photos. While Yahoo bought LookFlow, Google bought DNNresearch.

Microsoft has been researching image recognition for a long time, and I am sure they will soon integrate some of their algorithms into their cloud products. And Facebook just founded an internal AI group.

And to get a look into the future without having to upload all your photographs to the internet, try the iOS app Impala. The app will analyse and categorise all the photographs on your device. It was created by EUVision technologies, a spin-off of the University of Amsterdam that is commercializing their research efforts.

After the negative conclusion of my last post about the closure of Everpix, this is positive news for the machine learning market.

The end of Everpix, a sad week for photographers and machine learning researchers.

This week the photo storage service Everpix announced that they will close down. They did not have enough paying customers and could not find new investors.

That is sad. Not only because it was the world’s best photo startup according to the Verge, but also because it was the only company besides Google that used new machine learning techniques to help people manage their photo mess.

Everpix home screen

Their closure can be seen as an indicator that end users and investors are not ready yet to spend additional money on machine learning algorithms.

Flashback mail

Having read some articles and the associated comments[1, 2], it is clear to me that the more popular feature was not their use of sophisticated machine learning algorithms but the daily ‘flashback’ email with pictures taken on the same day in previous years. In fact, I did not see a single comment about the algorithms that analysed the pictures.

But maybe their algorithms were just not good enough.

Unfortunately, I could not try out their algorithms myself. My pictures finished processing just a few days before they announced the shutdown. But I found a comment by one of the founders on Hacker News, saying that they used a deep convolutional neural network with 3 layers for the semantic image analysis. This is the same technology Google now uses for their photo search.

But they were unhappy with the results of the algorithm, so in January this year they changed their approach, as their CTO, Kevin Quennesson, explains in ‘To Reclaim Your Photos, Kill the Algorithm’. He writes: “If a user is a food enthusiast and takes a lot of food close-ups, are we going to tell him that this photo is not the photo of a dish because an algorithm only learned to model some other kind of dishes?” They found that the algorithm’s errors were not comprehensible to the end user.

So they planned to change their system. As I understand it, their old system learned and used concepts independently of the individual user. But the new system also uses pictures of the same user to infer the content of a new picture. He calls this “feature based image backlinks”.

Explanation Feature-based Image Backlinks

The graph shows how a picture of a dish can be correctly identified because the content can be inferred by similar pictures of the user that the system identified correctly before. – from Quennesson’s blog post
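
As I understand the idea, a new picture borrows its label from visually similar pictures of the same user that were already identified. The sketch below illustrates this with a nearest-neighbour lookup over feature vectors. It is not Everpix’s implementation: the function names, the three-dimensional feature vectors, the use of cosine similarity and the similarity threshold are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equally long feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def infer_label(new_features, user_photos, min_similarity=0.8):
    """Return the label of the most similar labelled photo of the same user.

    Returns None when no photo is at least `min_similarity` similar, so the
    system stays silent instead of guessing.
    """
    best_label, best_sim = None, min_similarity
    for features, label in user_photos:
        sim = cosine_similarity(new_features, features)
        if sim >= best_sim:
            best_label, best_sim = label, sim
    return best_label


# Hypothetical photos of one user that were already identified correctly.
user_photos = [
    ([0.9, 0.1, 0.0], "dish"),
    ([0.1, 0.9, 0.2], "landscape"),
]

print(infer_label([0.8, 0.2, 0.1], user_photos))  # dish
print(infer_label([0.0, 0.0, 1.0], user_photos))  # None
```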

Regardless of the success of Everpix, I think making more use of the context of an image is a helpful and necessary approach to building systems that will reliably predict the content of an image in the future.

In any case, I wish we would hear more about the underlying algorithms: what they tried, what worked and what did not.

Assumptions about the end user

I am in the middle of a little literature review on using machine learning for photo organisation and came across a statement that struck me as misconceived. The paper’s topic is segmenting photo streams into events, and it states at the end of page 5:

We believe that for end users, having a low miss rate is more valuable than having a low false alarm rate.

I believe this is a false assumption that will lead to frustrated end users. From my own experience I am convinced that the opposite is true.

They continue: “To correct a false alarm is a one-step process of removing the incorrect segment boundary. But to correct a miss, the user must first realize that there is a miss, then figure out the position of the segment boundary.”

As with face detection, users will be happy about a correct detection but unhappy about an algorithm that creates wrong boundaries they have to correct manually.

And if we assume that a conservative algorithm still finds all the strong boundaries, the user might not miss the undetected boundaries after all.

Algorithms should not create new work for the user, but remove (some of) it.
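
To make the trade-off between misses and false alarms concrete, here is a small sketch that computes both rates for two hypothetical segmenters. The function name, the boundary positions and the definition of the false alarm rate (false alarms divided by the number of positions that are not true boundaries) are assumptions for illustration; the paper may define the rates differently.

```python
def boundary_rates(true_boundaries, predicted_boundaries, n_positions):
    """Return (miss_rate, false_alarm_rate) for event boundary detection.

    A position is the gap between two consecutive photos, so a stream of
    20 photos has 19 positions.
    """
    true_set, pred_set = set(true_boundaries), set(predicted_boundaries)
    misses = len(true_set - pred_set)        # true boundaries not found
    false_alarms = len(pred_set - true_set)  # boundaries that do not exist
    miss_rate = misses / len(true_set)
    false_alarm_rate = false_alarms / (n_positions - len(true_set))
    return miss_rate, false_alarm_rate


true_boundaries = [4, 9, 14]            # 20 photos, 19 positions
aggressive = [2, 4, 7, 9, 12, 14, 17]   # finds everything, plus 4 wrong ones
conservative = [4, 14]                  # finds only the two strongest ones

print(boundary_rates(true_boundaries, aggressive, 19))
print(boundary_rates(true_boundaries, conservative, 19))
```

In this toy example the aggressive segmenter has a miss rate of zero, but the user has to remove four wrong boundaries. The conservative one leaves at most one boundary to add, and only if the user notices that it is missing.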

How good is Google Drive’s image recognition engine?

As announced via Twitter, I took the time to test Google Drive’s image recognition feature. Google Drive was announced two weeks ago with a blog post, which contained the bold claim:

Search everything. Search by keyword and filter by file type, owner and more. … We also use image recognition so that if you drag and drop photos from your Grand Canyon trip into Drive, you can later search for [grand canyon] and photos of its gorges should pop up. This technology is still in its early stages, and we expect it to get better over time.

This sparked my curiosity, so I evaluated Google Drive’s performance as I would evaluate the image recognition frameworks I do my research on. First I uploaded an image dataset with images containing known objects, and then I counted how many of the pictures Google Drive’s search would find when I searched for these objects.

As the dataset I used the popular Caltech 101 dataset, containing pictures of objects belonging to 101 different categories. There are about 40 to 800 images per category and roughly 4500 images in total. While far from perfect, it is a well-known benchmark.

These are my first findings:

  • Google Drive only finds a fraction of the images, but the images it finds it categorizes correctly.

  • In numbers: Precision is 83% (std=36%) and the recall is 8% (std=11%) (averaged over all categories)
  • It achieves its best results for the two ‘comic’ categories ‘Snoopy’ and ‘Garfield’ and for iconic symbols like the dollar bill and the stop sign.
  • As the Caltech 101 dataset was created using Google’s image search, the high precision is at least partly a result of a ‘simple’ duplicate detection against the Google index and not of a successful similarity search.
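
The numbers in the list above are per-category precision and recall, averaged over all categories. A minimal sketch of such an evaluation could look like this. The file names and search results are made up, precision is counted as 0 when a search returns nothing, and the population standard deviation is used; all of these are assumptions for illustration rather than a record of how the figures above were computed.

```python
from statistics import mean, pstdev


def precision_recall(retrieved, relevant):
    """Precision and recall of one search, given sets of image names."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0  # assumed convention
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def summarize(search_results, ground_truth):
    """Mean and standard deviation of precision and recall over categories."""
    precisions, recalls = [], []
    for category, relevant in ground_truth.items():
        p, r = precision_recall(search_results.get(category, set()), relevant)
        precisions.append(p)
        recalls.append(r)
    return {
        "precision_mean": mean(precisions),
        "precision_std": pstdev(precisions),
        "recall_mean": mean(recalls),
        "recall_std": pstdev(recalls),
    }


# Hypothetical ground truth and search results for two categories.
ground_truth = {
    "snoopy": {"s1", "s2", "s3", "s4"},
    "chair": {"c1", "c2", "c3", "c4", "c5"},
}
search_results = {
    "snoopy": {"s1", "s2"},  # 2 of 4 found, both correct
    "chair": {"c1", "s3"},   # 1 of 5 found, plus one wrong image
}

print(summarize(search_results, ground_truth))
```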


Like all vision systems working in such an unconstrained environment, it is far from being actually usable. One cannot rely on it, but once or twice it will surprise you by adding an image to the result list that you hadn’t thought of.

Further resources:


Link to Matlab code which achieves 65% precision with 100% recall.*

* The numbers are not comparable 1-to-1, as the two use different evaluation approaches. The Matlab script assigns to each image of the dataset its most likely class, while Google Drive tries to find a concept or object in the image.

Building Rome in a Day with Photosynth

Remember Photosynth (video), the Microsoft Research software that turned hundreds of Flickr images of one building into a 3D view? Together with researchers from the University of Washington and Cornell, they have improved the software to a point where they can model the city of Rome from 150,000 Flickr photos in less than 21 hours.

More videos and details are available from the project website.

The paper, “Building Rome in a Day” by Sameer Agarwal, Noah Snavely, Ian Simon, Steven M. Seitz, and Richard Szeliski, was presented at the International Conference on Computer Vision (ICCV 2009) on September 30th.

via Improved Photosynth can model a city in hours | Computer Vision Central.

Open problem photo management

When I got a new Mac and with it the latest photo management system iPhoto from them, I thought that this will convince me that photo management is a nearly solved problem. Boy, was I wrong in thinking that!!! Far from it — apart from some ‘apple experience’ I am disappointed in the system. […]  It appears that most photo systems feel that if they can do nice slide show and help publish photos in different forms, their job is done. These systems ignore that now people literally shoot anything that they find mildly interesting. …

Like Prof. Jain, I bought a Mac with iPhoto ’09 and was a little bit disappointed. The geotagging workflow is kind of awkward if you have to tag the photos by hand, and once the euphoria has worn off, the face detection is more useful as a party gag, as it is too cumbersome to find all the faces it missed and label them by hand.

And we are not even talking about automatic event detection or clustering of near duplicates.

via Ramesh Jain’s Blog » Photo Management remains an open problem.