The end of Everpix, a sad week for photographers and machine learning researchers.

This week the photo storage service Everpix announced that they will close down. They did not have enough paying customers and could not find new investors.

That is sad, not only because it was the world’s best photo startup according to The Verge, but also because it was the only company besides Google that used new machine learning techniques to help people manage their photo mess.

Everpix home screen

Their closure can be seen as an indicator that end users and investors are not yet ready to spend additional money on machine learning algorithms.

Flashback mail

Having read some articles and the associated comments[1, 2], it is clear to me that the more popular feature was not their use of sophisticated machine learning algorithms but the daily ‘flashback’ email with pictures taken on the same day in previous years. In fact, I did not see a single comment about the algorithms that analysed the pictures.

But maybe their algorithms were just not good enough.

Unfortunately, I could not try out their algorithms myself: my pictures finished processing only a few days before they announced the shutdown. But I found a comment by one of the founders on Hacker News saying that they used a deep convolutional neural network with 3 layers for the semantic image analysis. This is the same technology Google now uses for their photo search.

But they were unhappy with the results of the algorithm, so in January this year they changed their approach, as their CTO, Kevin Quennesson, explains in ‘To Reclaim Your Photos, Kill the Algorithm’. He writes: “If a user is a food enthusiast and takes a lot of food close-ups, are we going to tell him that this photo is not the photo of a dish because an algorithm only learned to model some other kind of dishes?” They found that the algorithm’s errors were not comprehensible to the end user.

So they planned to change their system. As I understand it, their old system learned and used concepts independently of the individual user, whereas the new system also uses pictures from the same user to infer the content of a new picture. He calls this “feature based image backlinks”.

Explanation Feature-based Image Backlinks

The graph shows how a picture of a dish can be correctly identified because its content can be inferred from similar pictures by the same user that the system identified correctly before. – from Quennesson’s blog post
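To make the idea more concrete, here is a minimal sketch of how such an inference could work. It is only my reading of the concept, not Everpix’s implementation: the function names, the cosine similarity, the voting scheme and the thresholds `k` and `min_similarity` are all assumptions made for illustration, and the feature vectors are toy values.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def infer_label_from_user_photos(new_features, labelled_photos, k=3, min_similarity=0.8):
    """Hypothetical helper: let the k most similar, already labelled photos
    of the same user vote on the label of a new photo.

    labelled_photos is a list of (feature_vector, label) pairs.
    Returns None when no photo is similar enough to act as a "backlink".
    """
    scored = [(cosine_similarity(new_features, features), label)
              for features, label in labelled_photos]
    # Keep only sufficiently similar photos, most similar first.
    neighbours = sorted((s for s in scored if s[0] >= min_similarity),
                        key=lambda s: s[0], reverse=True)[:k]
    if not neighbours:
        return None
    # Similarity-weighted vote over the labels of the neighbours.
    votes = {}
    for similarity, label in neighbours:
        votes[label] = votes.get(label, 0.0) + similarity
    return max(votes, key=votes.get)

# Toy feature vectors standing in for real image descriptors.
user_photos = [
    ([0.9, 0.1, 0.0], "dish"),
    ([0.8, 0.2, 0.1], "dish"),
    ([0.0, 0.1, 0.9], "beach"),
]
print(infer_label_from_user_photos([0.85, 0.15, 0.05], user_photos))
print(infer_label_from_user_photos([0.0, 1.0, 0.0], user_photos))
```

The second call has no sufficiently similar photo to lean on, so the sketch declines to guess instead of forcing a label, which is one way to keep errors comprehensible for the user.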

Regardless of the success of Everpix, I think making more use of an image’s context is a helpful and necessary approach for building systems that will reliably predict the content of an image in the future.

In any case, I wish we would hear more about the underlying algorithms: what they tried, what worked and what did not.

Content based image classification with the bag of visual words model in Python

Even with the ever-growing interest in deep learning, I still find myself using the bag of visual words approach, if only to have a familiar baseline to test my fancy new algorithms against. I especially like the BoW demo script from the VLFeat team, which reaches a solid 65% accuracy on the admittedly outdated Caltech101 dataset. The script has the advantage that it contains all the usual steps in one place (feature extraction, training of the classifier and evaluation of the whole pipeline) and that it can easily be adapted to other datasets.

The only problem was that it is a Matlab script, and in my experience Matlab licences are often scarce, even at research institutes, because of their high price. So I rewrote the script in Python using the incomplete VLFeat Python wrapper.

You can find my code, as usual, on GitHub.

In case you are just diving into the world of BoW, I recommend my minimal BoW image classifier code, which might be easier to understand.
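For readers who have never seen the approach, the core of a BoW pipeline can be sketched in plain Python. This is neither my script nor the VLFeat demo: the 2D toy “descriptors” stand in for real local features such as SIFT, the codebook is built with a plain k-means, and a 1-nearest-neighbour classifier is swapped in for the classifier a real pipeline would train. All function names are made up for this illustration.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_word(descriptor, codebook):
    """Index of the visual word (cluster centre) closest to the descriptor."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(descriptor, codebook[i]))

def build_codebook(descriptors, num_words, iterations=10, seed=0):
    """Plain k-means (Lloyd's algorithm) over all training descriptors."""
    rng = random.Random(seed)
    codebook = [list(d) for d in rng.sample(descriptors, num_words)]
    for _ in range(iterations):
        clusters = [[] for _ in range(num_words)]
        for d in descriptors:
            clusters[nearest_word(d, codebook)].append(d)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centre
                codebook[i] = [sum(col) / len(members) for col in zip(*members)]
    return codebook

def bow_histogram(descriptors, codebook):
    """Normalised histogram of visual word occurrences for one image."""
    counts = [0] * len(codebook)
    for d in descriptors:
        counts[nearest_word(d, codebook)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(codebook)

def classify(histogram, labelled_histograms):
    """1-nearest-neighbour on histograms with L1 distance (a stand-in classifier)."""
    def l1(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))
    return min(labelled_histograms,
               key=lambda label: l1(histogram, labelled_histograms[label]))

# Toy training set: one "image" per class, each a list of 2D descriptors.
training_images = {
    "stripes": [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [9.8, 10.0]],
    "dots": [[10.0, 9.9], [9.9, 10.1], [10.2, 10.0], [0.1, 0.0]],
}
all_descriptors = [d for ds in training_images.values() for d in ds]
codebook = build_codebook(all_descriptors, num_words=2)
training_histograms = {label: bow_histogram(ds, codebook)
                       for label, ds in training_images.items()}

query = [[0.1, 0.1], [0.0, 0.2], [10.0, 10.0]]
predicted = classify(bow_histogram(query, codebook), training_histograms)
print(predicted)
```

The three steps (codebook, histogram, classifier) mirror the stages listed above; in a real pipeline each of them would be replaced by something stronger.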

Paper: Rendering Synthetic Objects into Legacy Photographs

Inserting 3D objects into existing photographs


This fascinating video presents a new method to insert 3D objects into existing photographs. It is based on the research of Kevin Karsch, Varsha Hedau, David Forsyth and Derek Hoiem (all University of Illinois at Urbana-Champaign). Their main contribution is the algorithm that generates the light model for the scene. The algorithm needs only one photograph and a few manual markings by a novice user, together with a ground truth data set, to create a near-realistic insertion. The ground truth data set was generated with 200 images from 20 indoor scenes under varying lighting conditions.

The video is well done and I am surprised by what is possible, but I would like to see how much user input is really necessary and how well the algorithm and the ground truth perform with other images. What do you think?

More details can be found at Kevin Karsch’s website.

Reconsidering evaluation data sets

In this blog post I want to share some interesting articles that deal with data sets in computer vision. For starters, in this blog post Tomasz Malisiewicz draws attention to a video lecture by Peter Norvig (Google) in which Mr Norvig showed some interesting results

where algorithms that obtained the best performance on a small dataset no longer did the best when the size of the training set was increased by an order of magnitude. … Also, the mediocre algorithms in the small training size regime often outperformed their more complicated counterparts once more data was utilized.

This is indeed interesting, as it is always hard to say how much training and test data is necessary, and most scientists, myself included, are far more interested in working on their precious algorithm than in collecting a solid ground truth. Furthermore, as I pointed out in a comment on Tomasz’ blog post, using 10 times as many pictures would mean that I could only evaluate 3 feature combinations in the time in which I could otherwise have evaluated 30.
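The arithmetic behind that trade-off fits in a few lines. The sketch assumes that the cost of one evaluation run grows linearly with the number of pictures, which is a simplification, and the 30-hour budget and 1 hour per run are hypothetical numbers chosen only to reproduce the 30 versus 3 from above.

```python
def evaluations_within_budget(budget_hours, baseline_hours_per_run, dataset_scale):
    """Number of feature combinations that fit into a fixed time budget,
    assuming the cost of one run scales linearly with the dataset size."""
    hours_per_run = baseline_hours_per_run * dataset_scale
    return int(budget_hours // hours_per_run)

# Hypothetical budget: 30 hours, 1 hour per run on the original data set.
small = evaluations_within_budget(30, 1, dataset_scale=1)
large = evaluations_within_budget(30, 1, dataset_scale=10)
print(small, large)
```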

In answer to my question about how to handle that trade-off, he advocates nonparametric* approaches and

combining learning with data-driven approaches to reduce test time complexity.

I agree with him that we definitely should spend more time and effort creating larger ground truth sets, instead of optimizing our algorithms for a ground truth that is too small to reveal anything.

For further reading I refer to Prof. Jain’s blog, where he claims in his post Evaluating Multimedia Algorithms that the existing data sets for photo retrieval are

too small such as the Corel or Pascal datasets, too specific like the TRECVID dataset, or without ground truth, such as the several recent efforts by MIT and MSRA that gathered millions of Web images for testing,

and promotes his concept for gathering controlled data ground truths.

As a third read, the Scienceblog features a story about James DiCarlo, a neuroscientist in the McGovern Institute for Brain Research at MIT, graduate student Nicolas Pinto, and David Cox of the Rowland Institute at Harvard, who

argue that natural photographic image sets, like the widely used Caltech101 database, have design flaws that enable computers to succeed where they would fail with more authentically varied images. For example, photographers tend to center objects in a frame and to prefer certain views and contexts. The visual system, by contrast, encounters objects in a much broader range of conditions.

They go on:

“We suspected that the supposedly natural images in current computer vision tests do not really engage the central problem of variability, and that our intuitions about what makes objects hard or easy to recognize are incorrect.”

I think all three articles remind us to reconsider the data sets we use for evaluation with regard to their size, their noisiness and their ‘naturalness’.

* nonparametric as in using rank or order of the images