My plan for a content-based image search

I saw this job posting from EyeEm, a photo-sharing app and service, in which they express their plan to build a search engine that can ‘identify and understand beautiful photographs’. That got me thinking about how I would approach building a system like that.

Here is how I would start:

1. Define what you are looking for

EyeEm already has a search engine based on tags and geo-location. So I assume they want to prevent low-quality pictures from appearing in the results and to add missing tags to pictures based on the image’s content. One could also group similar-looking pictures or rank pictures lower that “don’t contain their tags”. For the Brandenburger Tor, for instance, there are a lot of similar-looking pictures and even some that don’t contain the gate at all.

But for which concepts should one train the algorithms? Modern image retrieval systems are trained for hundreds of concepts, but I don’t think it is wise to start with that many. Even the most sophisticated, fine-tuned systems have high error rates for most of the concepts, as can be seen in this year’s results of the Large Scale Visual Recognition Challenge.

For instance, the team from EUVision / University of Amsterdam, which placed 6th in the classification challenge, selected only 16 categories for their consumer app Impala. For a consumer application I think their tags are a good choice:

  • Architecture
  • Babies
  • Beaches
  • Cars
  • Cats (sorry, no dogs)
  • Children
  • Food
  • Friends
  • Indoor
  • Men
  • Mountains
  • Outdoor
  • Party life
  • Sunsets and sunrises
  • Text
  • Women

But of course EyeEm has the luxury of looking at their log files to find out what their users are actually searching for.

And on the comparable task of classifying pictures into 15 scene categories, a team from MIT under Antonio Torralba showed that even with established algorithms one can achieve nearly 90% accuracy [Xiao10]. So I think it’s a good idea to start with a limited number of standard and EyeEm-specific concepts, which allows for usable recognition accuracy even with less sophisticated approaches.

But what about identifying beautiful photographs? I think in image retrieval there is no other concept that is more desirable and challenging to master. What does beautiful actually mean? What features make a picture beautiful? How do you quantify these features? Is beautiful even a sensible concept for image retrieval? Might it be more useful to try to predict which pictures will be `liked` or `hearted` a lot? These questions have to be answered before one can even start experimenting. I think for now it is wise to start with just filtering out low-quality pictures and trying to predict which factors make a picture popular.

2. Gather datasets

Not only do the systems need to be trained with example photographs for which we know the depicted concepts, we also need data to evaluate our system, to be sure that the implemented system really works as intended. But gathering useful datasets for learning and benchmarking is one of the hardest and most overlooked tasks. To draw meaningful conclusions, the dataset must consist of huge quantities of realistic example pictures with high-quality and consistent metadata. In our case, I would aggregate existing datasets that contain labeled images for the categories we want to learn.

For starters, the ImageNet, the Scene Understanding and the Faces in the Wild databases seem usable. Additionally, one could manually add pictures from Flickr, Google image search and EyeEm’s users.
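
To make the aggregation step concrete, here is a minimal sketch of how labels from different source datasets could be mapped onto our own tag list. The dataset keys and category strings are illustrative assumptions, not the actual label names used by those databases.

```python
# Sketch: map dataset-specific labels onto our own tag vocabulary.
# The dataset keys and category strings below are illustrative assumptions,
# not the real label names used by ImageNet or the scene databases.
TAG_MAPPING = {
    "imagenet": {"sports car": "Cars", "tabby cat": "Cats", "pizza": "Food"},
    "scene_understanding": {"beach": "Beaches", "mountain": "Mountains", "living_room": "Indoor"},
}

def to_target_tag(dataset, original_label):
    """Return our tag for a dataset-specific label, or None if it is unmapped."""
    return TAG_MAPPING.get(dataset, {}).get(original_label)

print(to_target_tag("scene_understanding", "beach"))  # -> "Beaches"
```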

Apart from a rather limited dataset of paintings and pictures of nature from the Computational Aesthetics Group of the University of Jena, Germany, I don’t know of any good dataset for evaluating how well a system detects beautiful images. Researchers either harvest photo communities that offer peer-rated ‘beautifulness’ scores, such as photo.net [Datta06] or dpchallenge.com [Poga12], or they collect photos themselves and rate them for visual appeal by hand [Poga12, Tang13].

The problem with datasets harvested from photo communities is that they suffer from self-selection bias, because users only upload their best shots. As a result, there are few low-quality shots to train the system on.

Nevertheless I would advise collecting the data in-house. If labeling an image as low quality takes one second, one person can label 30,000 images in less than 10 hours. And even if we accept that each picture has to be labeled by multiple people to minimize unwanted subjectivity, this approach would ensure that the system has the same notion of beauty as the one favored by EyeEm.
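
The back-of-the-envelope numbers behind that claim; the number of annotators per image is my own assumption:

```python
# Rough labeling-effort estimate; the 3 annotators per image are my assumption.
seconds_per_image = 1
images = 30000
annotators_per_image = 3

hours_one_person = images * seconds_per_image / 3600.0
print(hours_one_person)                         # ~8.3 hours for a single pass
print(hours_one_person * annotators_per_image)  # ~25 person-hours with 3 annotators
```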

3. Algorithms to try

I would start with established techniques like the bag of visual words (BoW) approach. As the aforementioned MIT paper describes, over 80% accuracy can already be achieved with this method on the comparable task of classifying 15 indoor and outdoor scenes [Xiao10]. While this approach originally relies on the patented SIFT feature detector and descriptor, one can choose from a whole list of newer, free alternatives that deliver comparable performance while being much faster and having a lower memory footprint [Miksik2012]. In the MIT paper they also combined BoW with other established methods to increase the accuracy to nearly 90%.
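
To illustrate what such a baseline involves, here is a minimal BoW sketch in Python. It is not the VLFeat demo: it uses OpenCV’s free ORB features instead of the patented SIFT, a k-means vocabulary and a linear SVM, and it assumes that lists of image paths and labels are already prepared.

```python
# Minimal bag-of-visual-words sketch: ORB features (a free alternative to SIFT),
# a k-means vocabulary, per-image word histograms and a linear SVM.
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

orb = cv2.ORB_create()

def extract_descriptors(path):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, desc = orb.detectAndCompute(img, None)
    return desc if desc is not None else np.empty((0, 32), dtype=np.uint8)

def to_histogram(desc, vocabulary):
    hist = np.zeros(vocabulary.n_clusters)
    if len(desc):
        for word in vocabulary.predict(desc.astype(np.float32)):
            hist[word] += 1
    return hist / max(hist.sum(), 1)  # normalise the word counts

def bow_pipeline(train_paths, train_labels, test_paths, n_words=300):
    train_desc = [extract_descriptors(p) for p in train_paths]
    vocabulary = KMeans(n_clusters=n_words).fit(np.vstack(train_desc).astype(np.float32))
    X_train = np.array([to_histogram(d, vocabulary) for d in train_desc])
    clf = LinearSVC().fit(X_train, train_labels)
    X_test = np.array([to_histogram(extract_descriptors(p), vocabulary) for p in test_paths])
    return clf.predict(X_test)
```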

The next step would then be to use Alex Krizhevsky’s implementation of a deep convolutional neural network, which he used to win last year’s Large Scale Visual Recognition Challenge. The code is freely available online. While much more powerful, this system is also much harder to train, with many hyperparameters to set and without good existing heuristics for doing so.

But these two approaches won’t really help with assessing the beauty of pictures or identifying the low-quality ones. If one agrees with Microsoft Research’s view of photo quality, defined by simplicity, (lack of) realism and quality of craftsmanship, one could start with the algorithms they designed to separate high-quality professional photos from low-quality snapshots [Ke06].
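
As a starting point for the low-quality filter, here is a sketch of a few simple quality cues one could threshold: sharpness via the variance of the Laplacian, plus global contrast and brightness. This is not the Ke et al. algorithm, and the thresholds are arbitrary assumptions that would need tuning on labeled data.

```python
# Simple photo-quality cues (not the Ke et al. method): blur via Laplacian
# variance, plus global contrast and brightness. Thresholds are assumptions.
import cv2

def quality_features(path):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()  # low values suggest blur
    contrast = gray.std()                              # low values suggest flat images
    brightness = gray.mean()
    return sharpness, contrast, brightness

def looks_low_quality(path, min_sharpness=50.0, min_contrast=20.0):
    sharpness, contrast, _ = quality_features(path)
    return sharpness < min_sharpness or contrast < min_contrast
```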

Caveats

Specifically for the case at hand, I predict that the filters will cause problems. They change the colors, and some of them add high- and low-frequency elements. This will decrease the accuracy of the algorithms. To prevent this, the analysis either has to be performed on the phone or the unaltered image has to be uploaded as well.

Low quality or not?

If I remember correctly, I once read that EyeEm applies the filters in full resolution on their servers and downloads the result to the user’s phone afterwards. If this is still the case, both approaches are feasible. But as phones get more and more powerful, a system that works on the phone is to be preferred, as it is inherently more scalable.

Another challenge would be to distinguish between low-quality pictures and pictures that break the rules of photography on purpose. The picture on the right, for example, has a blue undertone, low contrast and is quite blurry. But while these features make the image special, they would also trigger the low-quality detector. It will be interesting to see if machine learning algorithms can learn to distinguish between the two cases.

So to recap:

1. Make sure the use case is sound.
2. Collect loads of data to train and evaluate.
3. Start with simple, proven algorithms and increase the complexity step by step.

Image retrieval with the consumer in mind

As a continuation of my blog post Assumptions about the end user, I want to explain what else should be thought of when designing image retrieval systems with the end user in mind.

Don’t cause the user more work

To summarize the post I mentioned above: “Algorithms should not create new work for the user, but remove (some of) it.” An algorithm should be rather conservative in its decisions, because a user will perceive an algorithm that, for instance, creates wrong tags they have to correct in the end as faulty and not helpful at all.

Don’t dethrone the user

All too often there is no option for the user to easily override a decision of the algorithm without having to disable it entirely and losing all of its support.

Lifelong learning

The algorithm should not only allow me to retag an image or move it to a different cluster, but also use this information to retag other affected images and make better decisions in the future.

For instance, Wang et al. show in Intelligent photo clustering with user interaction and distance metric learning how corrections made by the user can be used to improve the distance calculation for photo clustering.
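
To sketch the general idea (this is not Wang et al.’s actual algorithm): photos the user pulled into the same cluster should end up closer together, photos the user separated should end up further apart. A crude way to do that is to reweight the feature dimensions accordingly:

```python
# Crude sketch of reweighting a distance metric from user corrections.
# must_link: pairs the user says belong together; cannot_link: pairs they separated.
import numpy as np

def learn_feature_weights(features, must_link, cannot_link, eps=1e-8):
    """features: (n_photos, n_dims) array; *_link: lists of (i, j) index pairs."""
    same = np.array([(features[i] - features[j]) ** 2 for i, j in must_link])
    diff = np.array([(features[i] - features[j]) ** 2 for i, j in cannot_link])
    # Weight each dimension by how strongly it separates cannot-link pairs
    # relative to must-link pairs.
    weights = (diff.mean(axis=0) + eps) / (same.mean(axis=0) + eps)
    return weights / weights.sum()

def weighted_distance(x, y, weights):
    return np.sqrt(np.sum(weights * (x - y) ** 2))
```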

Solving the wrong problem

Unfortunately, unconstrained* object recognition is still far from solved and usable. The best system so far is the one from Alex Krizhevsky (University of Toronto), which uses deep convolutional neural networks.

His system achieved a top-5 error rate** of 15.3%, compared to 26% for the second-best system, on one of the most demanding benchmark databases, with 1.2 million images and 1000 object classes.

That’s very impressive, but it also means that roughly every 6th image gets assigned 5 labels, all of which are incorrect.

Nevertheless, this system was so groundbreaking that he, together with his supervisor Geoffrey Hinton and another grad student, was hired by Google in March of this year. This system now powers the Google+ photo search.

But do we need such a system? What does it help you if the algorithm detects that there is a plant or a chair in your images? Isn’t it much more useful to analyze the scene of the picture and to tag pictures with broader scene descriptions like group picture, living room or mountains?

In 2010 a team from MIT and Brown University showed that even with existing methods one can achieve 90% recognition for 15 different scene classes like office, living room, inside city and forest, with only 100 training images per class.

The authors wanted to push their new dataset, which contains nearly 400 scene classes and for which they reach a recognition rate of just under 40%. While academically much more demanding and thus interesting, I don’t think consumers have a use for a system that can differentiate an oil refinery from an ordinary factory most of the time.

I am convinced that a simpler system that gets a few categories right ‘all’ the time is much more useful.

* Unconstrained means that the algorithm does not require the environment or the object to be controlled in some way. Most working systems only work with controlled lighting, background and perspective, and with little or no clutter and occlusion.

** The top-5 error rate is the fraction of test images for which the correct label is not among the five labels considered most probable by the model.
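
For illustration, this is how the top-5 error rate from the footnote would be computed, assuming a matrix of per-class scores and the correct class index for each test image:

```python
# Compute the top-5 error rate: fraction of images whose true label is not
# among the five highest-scoring classes.
import numpy as np

def top5_error(scores, labels):
    """scores: (n_images, n_classes) array; labels: correct class index per image."""
    top5 = np.argsort(scores, axis=1)[:, -5:]   # five highest-scoring classes per image
    hits = [label in row for row, label in zip(top5, labels)]
    return 1.0 - np.mean(hits)
```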

AI, the new secret weapon in the cloud photo-storage war.

Gigaom posted an article on “The Dropbox computer vision acquisition that slipped under the radar”. But I think the article should have been called:

AI, the new secret weapon in the cloud photo-storage war.

Okay, this title is probably hyperbole. But all the big internet companies offer a way to store and share your photos online. And to make their offers more compelling, Yahoo, Google and Dropbox all recently bought computer vision start-ups that will provide image recognition for their users’ uploaded photos: Yahoo bought LookFlow and Google bought DNNresearch.

Microsoft has been researching image recognition for a long time, and I am sure they will soon integrate some of their algorithms into their cloud products. And Facebook just founded an internal AI group.

And to get a look into the future without having to upload all your photographs to the internet, try the iOS app Impala. The app analyses and categorises all your photographs on your device. It was created by EUVision Technologies, a spin-off of the University of Amsterdam that commercializes their research efforts.

After the negative conclusion of my last post about the closure of Everpix, this is positive news for the machine learning market.

The end of Everpix, a sad week for photographers and machine learning researchers.

This week the photo storage service Everpix announced that they will close down. They did not have enough paying customers and could not find new investors.

That is sad, not only because it was the world’s best photo startup according to The Verge, but also because it was the only company besides Google that used new machine learning techniques to help people manage their photo mess.

Everpix home screen

Their closure can be seen as an indicator that end users and investors are not ready yet to spend additional money on machine learning algorithms.

Flashback mail

Having read some articles and the associated comments [1, 2], it is clear to me that the daily ‘flashback’ email with pictures taken on the same day in previous years was the more popular feature, not their use of sophisticated machine learning algorithms. In fact, I did not see a single comment about the algorithms that analysed the pictures.
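
For what it’s worth, the basic query behind such a flashback feature is simple; a sketch (certainly not Everpix’s actual implementation) could look like this:

```python
# Sketch of a 'flashback' query: photos taken on today's month and day in earlier years.
from datetime import date

def flashback(photos, today=None):
    """photos: iterable of (capture_date, path) tuples with datetime.date objects."""
    today = today or date.today()
    return [path for capture_date, path in photos
            if (capture_date.month, capture_date.day) == (today.month, today.day)
            and capture_date.year < today.year]
```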

But maybe their algorithms were just not good enough.

Unfortunately, I could not try out their algorithms myself; my pictures finished processing only a few days before they announced the shutdown. But I found a comment from one of the founders on Hacker News saying that they used a deep convolutional neural network with 3 layers for the semantic image analysis. This is the same technology Google now uses for their photo search.

But they were unhappy with the results of the algorithm, so in January of this year they changed their approach, as their CTO, Kevin Quennesson, explains in ‘To Reclaim Your Photos, Kill the Algorithm’. He writes: “If a user is a food enthusiast and takes a lot of food close-ups, are we going to tell him that this photo is not the photo of a dish because an algorithm only learned to model some other kind of dishes?” They found that the algorithm’s errors were not comprehensible to the end user.

So they planned to change their system. As I understand it, their old system learned and used concepts independently of the individual user. But the new system also uses pictures from the same user to infer the content of a new picture. He calls this “feature based image backlinks”.

Feature-based image backlinks: the graph shows how a picture of a dish can be correctly identified because its content can be inferred from similar pictures by the same user that the system identified correctly before (from Quennesson’s blog post).
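
My reading of the idea, as a rough sketch rather than Everpix’s implementation: when a new photo arrives, look up the user’s own photos that are visually similar and already carry confident tags, and let them vote on the new photo’s tags.

```python
# Sketch of propagating tags from a user's own, already-tagged photos to a new one.
import numpy as np

def propagate_tags(new_feature, user_features, user_tags, k=5, min_votes=3):
    """user_features: (n, d) array; user_tags: list of tag sets, one per photo."""
    distances = np.linalg.norm(user_features - new_feature, axis=1)
    neighbours = np.argsort(distances)[:k]      # k most similar photos of this user
    votes = {}
    for idx in neighbours:
        for tag in user_tags[idx]:
            votes[tag] = votes.get(tag, 0) + 1
    return {tag for tag, count in votes.items() if count >= min_votes}
```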

Regardless of the success of Everpix, I think making more use of an image’s context is a helpful and necessary approach for building systems that will reliably predict the content of an image in the future.

In any case, I wish we would hear more about the underlying algorithms: what they tried, what worked and what did not.

Assumptions about the end user

I am in the middle of a little literature review on using machine learning for photo organisation and came across a statement that struck me as misconceived. The paper’s topic is segmenting photo streams into events, and it states at the end of page 5:

We believe that for end users, having a low miss rate is more valuable than having a low false alarm rate.

I believe this is a false assumption that will lead to frustrated end users. From my own experience, I am convinced that the opposite is true.

They continue: “To correct a false alarm is a one-step process of removing the incorrect segment boundary. But to correct a miss, the user must first realize that there is a miss, then figure out the position of the segment boundary.”

As with face detection, users will be happy about correct detections but unhappy about an algorithm that creates wrong boundaries they have to correct manually.

And if we assume that a conservative algorithm still finds all the strong boundaries, the user might not even miss the undetected ones.
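
As a toy illustration of such a conservative strategy (my sketch, not the paper’s method): only insert an event boundary where the boundary score clears a high threshold, trading missed boundaries for fewer false alarms.

```python
# Conservative event segmentation: keep only boundaries with a high confidence score.
def conservative_boundaries(boundary_scores, threshold=0.9):
    """boundary_scores: score in [0, 1] for a potential boundary after each photo."""
    return [i for i, score in enumerate(boundary_scores) if score >= threshold]
```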

Algorithms should not create new work for the user, but remove (some of) it.

Content-based image classification with the bag of visual words model in Python

Even with the ever-growing interest in deep learning, I still find myself using the bag of visual words approach, if only to have a familiar baseline to test my new fancy algorithms against. I especially like the BoW demo script from the VLFeat team, which reaches a solid 65% accuracy on the, admittedly outdated, Caltech101 dataset. The script has the advantage that it contains all the usual steps in one place (feature extraction, training of the classifier and evaluation of the whole pipeline) and that it can easily be adapted to other datasets.

The only problem was that it is a Matlab script, and Matlab licences are, in my experience, often scarce even at research institutes due to their high price. So I rewrote the script in Python using the incomplete VLFeat Python wrapper.

You can find my code as usual on github: https://github.com/shackenberg/phow_caltech101.py

In case you are just diving into the world of BoW I recommend my minimal BoW image classifier code, which might be easier to understand.

How to archive my tweets?

[UPDATE]

So, it looks like IFTTT deleted my recipe to avoid drawing Twitter’s wrath on themselves. I will mail customer support and ask what happened…

[Original article]

tldr: IFTTT recipe: Archive my tweets to dropbox!

Even though I am a happy Twitter user, it makes me feel very uncomfortable that I cannot easily access all my old tweets. For that reason I used IFTTT and Dropbox for the last several months to archive all my new tweets, so that I have access to those at least. But after Twitter forbade IFTTT to directly access tweets, this solution no longer worked. But as clever users found out, Twitter still provides an RSS feed for each user containing all publicly sent tweets.

As a result it was very easy to get the whole pipeline working again. If you want to try it yourself, just use my public IFTTT recipe: Archive my tweets to dropbox! You only have to replace my Twitter handle with yours…

Btw: Please share if you have other useful recipes!