Content-based image classification with the bag-of-visual-words model in Python

Even with the ever-growing interest in deep learning, I still find myself using the bag-of-visual-words (BoW) approach, if only to have a familiar baseline to test my new fancy algorithms against. I especially like the BoW demo script from the VLFeat team, which reaches a solid 65% accuracy on the, admittedly outdated, Caltech101 dataset. The script has the advantage that it contains all the usual steps in one script (feature extraction, training of the classifier and evaluation of the whole pipeline) and that it can also be easily adapted to other datasets.

The only problem was that it is a Matlab script, and in my experience Matlab licences are often scarce, even at research institutes, due to their high price. So I rewrote the script in Python using the incomplete VLFeat Python wrapper.

You can find my code, as usual, on GitHub: https://github.com/shackenberg/phow_caltech101.py

In case you are just diving into the world of BoW I recommend my minimal BoW image classifier code, which might be easier to understand.
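To give a rough idea of what such a minimal pipeline looks like, here is a sketch of the usual steps. It uses OpenCV's SIFT and scikit-learn instead of VLFeat's PHOW features, so it is not the code from the repositories above, and the vocabulary size, parameters and dataset split are placeholders:

```python
# Generic, minimal BoW image classifier sketch using OpenCV and scikit-learn
# (illustrative only, not the VLFeat pipeline; parameters are placeholders).
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC


def extract_descriptors(image_paths):
    """Compute local SIFT descriptors for every image."""
    sift = cv2.SIFT_create()
    descriptors = []
    for path in image_paths:
        image = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, desc = sift.detectAndCompute(image, None)
        descriptors.append(desc if desc is not None else np.zeros((0, 128), np.float32))
    return descriptors


def build_vocabulary(descriptor_list, n_words=300):
    """Cluster the training descriptors into a visual vocabulary."""
    return KMeans(n_clusters=n_words, n_init=3).fit(np.vstack(descriptor_list))


def bow_histograms(descriptor_list, vocabulary):
    """Quantize descriptors against the vocabulary and build normalized histograms."""
    histograms = np.zeros((len(descriptor_list), vocabulary.n_clusters), np.float32)
    for i, desc in enumerate(descriptor_list):
        if len(desc) == 0:
            continue
        words = vocabulary.predict(desc)
        counts = np.bincount(words, minlength=vocabulary.n_clusters)
        histograms[i] = counts / counts.sum()
    return histograms


def train_and_evaluate(train_paths, train_labels, test_paths, test_labels):
    """Train a linear SVM on BoW histograms and return the accuracy on the test split."""
    train_desc = extract_descriptors(train_paths)
    vocabulary = build_vocabulary(train_desc)
    classifier = LinearSVC().fit(bow_histograms(train_desc, vocabulary), train_labels)
    test_histograms = bow_histograms(extract_descriptors(test_paths), vocabulary)
    return classifier.score(test_histograms, test_labels)
```

The VLFeat demo gets most of its accuracy from additions on top of this skeleton, such as dense multi-scale features and spatial histograms, but the skeleton above is all there is to the BoW idea.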


How good is Google Drive’s image recognition engine?

As announced via Twitter, I took the time to test Google Drive’s image recognition feature. Google Drive was announced two weeks ago with a blog post which contained the bold claim:

Search everything. Search by keyword and filter by file type, owner and more. … We also use image recognition so that if you drag and drop photos from your Grand Canyon trip into Drive, you can later search for [grand canyon] and photos of its gorges should pop up. This technology is still in its early stages, and we expect it to get better over time.

This sparked my curiosity, so I evaluated Google Drive’s performance the same way I would evaluate the image recognition frameworks I do my research on: first I uploaded an image dataset with images containing known objects, and then I counted how many of the pictures Google Drive’s search would find when I searched for these objects.

As a dataset I used the popular Caltech 101 dataset, which contains pictures of objects belonging to 101 different categories. There are about 40 to 800 images per category and roughly 4500 images in total. While far from perfect, it is a well-known contender.

These are my first findings:

  • Google Drive only finds a fraction of the images, but the images it does find it categorizes correctly.
  • In numbers: precision is 83% (std = 36%) and recall is 8% (std = 11%), averaged over all categories (see the sketch after this list for how these are computed).
  • It achieves the best results for the two ‘comic’ categories ‘Snoopy’ and ‘Garfield’ and for iconic symbols like the dollar bill and the stop sign.
  • As the Caltech 101 dataset was created using Google’s image search, the high precision is at least partly a result of a ‘simple’ duplicate detection against the Google index and not of a successful similarity search.
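This is roughly how the numbers are computed: per category, precision is the fraction of returned images that actually belong to the category, recall is the fraction of the category’s images that are returned, and both are then averaged over the categories. A small sketch (the helper names are mine, the per-category sets of returned and relevant images come from the manual searches in Google Drive, and categories for which the search returns nothing are counted with precision 0 here):

```python
import numpy as np


def category_precision_recall(returned_ids, relevant_ids):
    """Precision and recall for one category: `returned_ids` are the images Drive
    returned for the category's search term, `relevant_ids` are the images that
    actually belong to the category."""
    returned, relevant = set(returned_ids), set(relevant_ids)
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0  # convention for empty result lists
    recall = hits / len(relevant)
    return precision, recall


def macro_averaged_scores(results_per_category):
    """Mean and standard deviation of precision and recall over all categories.
    `results_per_category` maps a category name to (returned_ids, relevant_ids)."""
    precisions, recalls = zip(*(category_precision_recall(returned, relevant)
                                for returned, relevant in results_per_category.values()))
    return (np.mean(precisions), np.std(precisions)), (np.mean(recalls), np.std(recalls))
```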

Verdict:

Like all vision systems working in such an unconstrained environment, it is far from being actually usable. One cannot rely on it, but every once in a while it will surprise you by adding an image to the result list that one hadn’t thought of.

Further resources:

[update]

Link to Matlab code which achieves 65% precision with 100% recall.*

* The numbers are not comparable 1-to-1, as the two use different evaluation approaches: the Matlab script assigns its most likely class to every image in the dataset (hence the 100% recall), while Google Drive tries to find a concept or object in the image and often finds none.

Reconsidering evaluation data sets

In this blog post I want to share some interesting articles that deal with data sets in computer vision. For starters, in his blog post Tomasz Malisiewicz draws attention to a video lecture by Peter Norvig (Google) in which Norvig showed some interesting results

where algorithms that obtained the best performance on a small dataset no longer did the best when the size of the training set was increased by an order of magnitude. … Also, the mediocre algorithms in the small training size regime often outperformed their more complicated counterparts once more data was utilized.

This is indeed interesting, as it is always hard to say how much training and test data is necessary, and most scientists, myself included, are far more interested in working on their precious algorithms than in collecting a solid ground truth. Furthermore, as I pointed out in a comment on Tomasz’ blog post, using 10 times as many pictures would mean I could only evaluate 3 feature combinations in the time I could otherwise have evaluated 30.

Answering my question on how to handle that trade-off, he advocates nonparametric* approaches and

combining learning with data-driven approaches to reduce test time complexity.

I agree with him that we should definitely spend more time and effort creating larger ground truth sets, instead of optimizing our algorithms for a ground truth that is too small to reveal anything.

For further reading I refer to Prof. Jain’s blog, where he claims in his post Evaluating Multimedia Algorithms that the existing data sets for photo retrieval are

too small such as the Corel or Pascal datasets, too specific like the TRECVID dataset, or without ground truth, such as the several recent efforts by MIT and MSRA that gathered millions of Web images for testing,

and promotes his concept for gathering ground truths in a controlled way.

As the third read, the Scienceblog features a story about James DiCarlo, a neuroscientist at the McGovern Institute for Brain Research at MIT, and graduate students Nicolas Pinto and David Cox of the Rowland Institute at Harvard, who

argue that natural photographic image sets, like the widely used Caltech101 database, have design flaws that enable computers to succeed where they would fail with more authentically varied images. For example, photographers tend to center objects in a frame and to prefer certain views and contexts. The visual system, by contrast, encounters objects in a much broader range of conditions.

They go on:

We suspected that the supposedly natural images in current computer vision tests do not really engage the central problem of variability, and that our intuitions about what makes objects hard or easy to recognize are incorrect.

I think all three articles remind us to reconsider the data sets we use for evaluation, regarding their size, noisiness and their ‘naturality’.

* nonparametric as in using rank or order of the images