As a continuation of my blog post Assumptions about the end user I want to explain what else should be thought of when designing image retrieval systems with the end user in mind.
Don’t cause the user more work
To summarize the post I mentioned above: “Algorithms should not create new work for the user, but remove (some of) it.” An algorithm should be rather conservative in its decisions, because a user will perceive an algorithm that, for instance creates wrong tags, that the user have to correct in the end, as faulty at not helpful at all.
Don’t dethrone the user
Also to often there is no option for the user to easily override the decision of the algorithm, without the need to disable it and losing all the support.
The algorithm should not only allow me to retag an image or move it to a different cluster, but use this information to retag other affected images and make better decisions in the future.
For instance Wang et al. show in Intelligent photo clustering with user interaction and distance metric learning how it is possible to use corrections made by the user to improve the distance calculation for photo clustering.
Solving the wrong problem
Unfortunately unconstrained* object recognition is still far from solved and useable. The best system so far is the one from Alex Krizhevsky (University of Toronto) using Deep Convolutional Neural Networks.
His system achieved a top-5 error rate** of 15.3%, compared to 26% of the second best system for one of the most demanding benchmark databases with 1.2 million images and 1000 object classes.
That’s very impressive, but it also means, that every 6th image gets assigned 5 labels, which are incorrect.
Nevertheless this system was so ground breaking that he together with his supervisor, Geoffrey Hinton, and another grad student where hired by Google in March of this year.
This system now runs the google+ photo search.
But do we need such a system? What does it help you if the algorithm detects that there is a plant or a chair in your images? Isn’t it much more useful to analyze the scene of the picture, to tag pictures with broader scene descriptions like, group picture, living room or mountains?
In 2010 a team from MIT and Brown University showed, that even with existing methods on can achieve 90% recognition for 15 different scene classes like office, living room, inside city and forest with only 100 training images per class.
The authors wanted to push their new dataset, that contains nearly 400 scene classes, for which they reach a recognition rate of just under 40%. While academically much more demanding and thus interesting, I don’t think consumers have a use for a system that can differentiate an oil refinery from an ordinary factory most of the time.
I am convinced that a simpler system, that gets a few categories right ‘all’ the time, is much more useful.
* unconstrained means that the algorithm does not need the environment or the object to be controlled in some way.
Most working system only work with lighting or background, perspective and with no or limited clutter and occlusion.
** top-5 error rate is the fraction of test images for which the correct label is not among the ﬁve labels considered most probable by the model