Image retrieval with the consumer in mind

As a continuation of my blog post Assumptions about the end user, I want to explain what else should be considered when designing image retrieval systems with the end user in mind.

Don’t cause the user more work

To summarize the post mentioned above: “Algorithms should not create new work for the user, but remove (some of) it.” An algorithm should be rather conservative in its decisions, because a user will perceive an algorithm that, for instance, creates wrong tags they have to correct in the end as faulty and not helpful at all.

Don’t dethrone the user

Also, too often there is no option for the user to easily override the algorithm’s decision without having to disable it entirely and lose all its support.

Lifelong learning

The algorithm should not only allow me to retag an image or move it to a different cluster, but also use this information to retag other affected images and to make better decisions in the future.

For instance, Wang et al. show in Intelligent photo clustering with user interaction and distance metric learning how corrections made by the user can be used to improve the distance calculation for photo clustering.
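
To make this concrete, here is a minimal sketch of the general idea, not the authors’ actual method: photos are assumed to be feature vectors, and a weighted Euclidean distance is adjusted whenever the user moves a photo into another cluster (a “must-link” correction) or out of one (a “cannot-link” correction).

    import numpy as np

    def weighted_distance(x, y, w):
        # Weighted Euclidean distance between two photo feature vectors.
        return np.sqrt(np.sum(w * (x - y) ** 2))

    def update_weights(w, x, y, must_link, lr=0.1):
        # Shrink weights on dimensions where a must-link pair differs,
        # grow them on dimensions where a cannot-link pair differs.
        diff = (x - y) ** 2
        w = w - lr * diff if must_link else w + lr * diff
        w = np.clip(w, 1e-6, None)       # keep weights positive
        return w * len(w) / w.sum()      # renormalize to an average weight of 1

    # Example: the user says photos a and b belong to the same event.
    a, b, w = np.random.rand(64), np.random.rand(64), np.ones(64)
    print(weighted_distance(a, b, w))
    w = update_weights(w, a, b, must_link=True)
    print(weighted_distance(a, b, w))    # smaller after the correction

With the weights stored per user, every correction refines all future clustering decisions instead of only fixing a single photo.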

Solving the wrong problem

Unfortunately, unconstrained* object recognition is still far from solved and usable. The best system so far is the one from Alex Krizhevsky (University of Toronto), using deep convolutional neural networks.

His system achieved a top-5 error rate** of 15.3%, compared to 26% for the second-best system, on one of the most demanding benchmark databases, ImageNet, with 1.2 million images and 1000 object classes.

That’s very impressive, but it also means that for roughly every 6th image none of the five most probable labels is correct.
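
For concreteness, a small sketch of how the top-5 error rate** is computed, with random scores standing in for a real model’s output:

    import numpy as np

    def top5_error(scores, labels):
        # scores: (n_images, n_classes) model outputs; labels: (n_images,) true class ids.
        top5 = np.argsort(scores, axis=1)[:, -5:]        # five highest-scoring classes
        hits = (top5 == labels[:, None]).any(axis=1)     # true class among them?
        return 1.0 - hits.mean()

    rng = np.random.default_rng(0)
    scores = rng.random((1000, 1000))                    # 1000 images, 1000 classes
    labels = rng.integers(0, 1000, size=1000)
    print(top5_error(scores, labels))                    # ~0.995 for random guessing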

Nevertheless, this system was so groundbreaking that he, together with his supervisor Geoffrey Hinton and another grad student, was hired by Google in March of this year.
This system now powers the Google+ photo search.

But do we need such a system? How does it help you if the algorithm detects that there is a plant or a chair in your images? Isn’t it much more useful to analyze the scene of the picture and tag it with broader scene descriptions like group picture, living room or mountains?

In 2010, a team from MIT and Brown University showed that even with existing methods one can achieve 90% recognition for 15 different scene classes, like office, living room, inside city and forest, with only 100 training images per class.

The authors wanted to push their new dataset, which contains nearly 400 scene classes and for which they reach a recognition rate of just under 40%. While academically much more demanding and thus more interesting, I don’t think consumers have a use for a system that can differentiate an oil refinery from an ordinary factory only most of the time.

I am convinced that a simpler system that gets a few categories right ‘all’ the time is much more useful.
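
As a toy illustration of what I mean, and not any published system: a classifier over a handful of broad scene classes that only tags an image when it is very confident, and otherwise stays silent instead of creating wrong tags.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    CLASSES = ["group picture", "living room", "mountains"]

    # Stand-in data: in practice these would be image features such as GIST descriptors.
    rng = np.random.default_rng(0)
    X_train = rng.random((300, 64)) + np.repeat(np.arange(3), 100)[:, None]
    y_train = np.repeat(np.arange(3), 100)               # 100 training images per class

    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    def conservative_tag(features, threshold=0.9):
        # Only tag when the classifier is very sure; an untagged photo is a smaller
        # nuisance than a wrongly tagged one the user has to clean up.
        probs = clf.predict_proba(features[None, :])[0]
        best = int(np.argmax(probs))
        return CLASSES[best] if probs[best] >= threshold else None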

* Unconstrained means that the algorithm does not need the environment or the object to be controlled in some way.
Most working systems only work with controlled lighting, background and perspective, and with no or limited clutter and occlusion.

** The top-5 error rate is the fraction of test images for which the correct label is not among the five labels the model considers most probable.

Assumptions about the end user

I am in the middle of a little literature review on using machine learning for photo organisation and came across a statement that struck me as misconceived. The paper’s topic is segmenting photo streams into events, and it states at the end of page 5:

We believe that for end users, having a low miss rate is more valuable than having a low false alarm rate.

I believe this is a false assumption that will lead to frustrated end users. From my own experience, I am convinced that the opposite is true.

They continue: “To correct a false alarm is a one-step process of removing the incorrect segment boundary. But to correct a miss, the user must first realize that there is a miss, then figure out the position of the segment boundary.”

As with face detection, users will be happy about a correct detection but unhappy about an algorithm that creates wrong boundaries they have to correct manually.

And if we assume that a conservative algorithm still finds all the strong boundaries, the user might not even miss the undetected ones.
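
To make the trade-off concrete, here is a minimal sketch with a deliberately naive scoring function (a real system would also use location and visual similarity): raising the decision threshold trades a few misses for far fewer false alarms.

    from datetime import datetime

    def boundary_scores(timestamps):
        # Score each gap between consecutive photos by its length in hours.
        return [(b - a).total_seconds() / 3600
                for a, b in zip(timestamps, timestamps[1:])]

    def segment(timestamps, threshold):
        # Return indices of photos that start a new event; a conservative (high)
        # threshold only cuts at strong boundaries.
        return [i + 1 for i, s in enumerate(boundary_scores(timestamps))
                if s >= threshold]

    photos = [datetime(2013, 6, 1, 10, 0), datetime(2013, 6, 1, 10, 5),
              datetime(2013, 6, 1, 14, 0), datetime(2013, 6, 3, 9, 0)]
    print(segment(photos, threshold=1.0))    # eager: cuts at the ~4-hour and the two-day gap
    print(segment(photos, threshold=24.0))   # conservative: only the two-day gap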

Algorithms should not create new work for the user, but remove (some of) it.