On the importance of context

For people who are just starting to think about computer vision, it is hard to understand why computers have such a difficult time finding faces in images, even though it is so easy for us.

Adding to one of my earlier articles, Why is vision hard?, which points out that computers are missing concepts, there is another reason: they are also missing context. It is so easy for us to spot faces because we know where to look for them, where to expect them. When we see a person, we know where to look for the face, and when we see a house, we know that we won't find one. But computers don't. And to show you that even we are lost without context, I present you this nice picture of coffee beans. Your job is to find the face.

Where is the face in the coffee beans?

Did you find the face? You probably didn't spot it at once, but had to scan the picture until you found it. It took me nearly 30 seconds, which is much slower than any recent face detection software.
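
For comparison, here is a minimal sketch of how such software typically finds faces, assuming OpenCV with its bundled Haar cascade; the file names are just placeholders. The detector slides a window over every position and scale of the image, with no notion of where a face is likely to be.

import cv2

# Load the coffee bean picture (placeholder file name) and convert to grayscale.
img = cv2.imread("coffee_beans.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# The frontal-face Haar cascade shipped with OpenCV.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

# detectMultiScale scans every location at every scale -- an exhaustive,
# context-free search, which is exactly what we humans avoid doing.
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

# Draw a rectangle around every detection and save the result.
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("coffee_beans_faces.jpg", img)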

So how can we improve our algorithm with context?

p.s.

About the picture: I got it from a very interesting talk on classification held at TID by Aleix M. Martinez from Ohio State University. His main topic: PCA and LDA. For starters, check out his paper PCA versus LDA (2001).

Reconsidering evaluation data sets

In this blog post I want to share some interesting articles which deal with data sets in computer vision. For starters, Tomasz Malisiewicz draws attention in his blog post to a video lecture by Peter Norvig (Google), in which Norvig showed some interesting results

where algorithms that obtained the best performance on a small dataset no longer did the best when the size of the training set was increased by an order of magnitude. … Also, the mediocre algorithms in the small training size regime often outperformed their more complicated counterparts once more data was utilized.

This is indeed interesting, as it is always hard to say how much training and test data is necessary, and most scientists, myself included, are far more interested in working on their precious algorithms than in collecting a solid ground truth. Furthermore, as I pointed out in a comment on Tomasz's blog post, using 10 times as many pictures would mean I could only evaluate 3 feature combinations in the time I could otherwise have evaluated 30.

Answering my question on how to handle that trade-off, he advocates nonparametric* approaches and

combining learning with data-driven approaches to reduce test time complexity.

I agree with him that we should definitely spend more time and effort creating larger ground truth sets, instead of optimizing our algorithms for a ground truth that is too small to reveal anything.
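
To make the quoted idea a bit more concrete, here is a small sketch of what a nonparametric*, data-driven approach could look like: a k-nearest-neighbour classifier that only uses the rank order of distances between image features and fits no model parameters at all. The features and labels below are random placeholders, purely for illustration.

import numpy as np

# Classify a query image by the labels of its k nearest neighbours in the
# ground truth set. Only the order of the distances matters, not their
# values, and there are no model parameters to tune on a small data set.
def knn_label(query_feat, train_feats, train_labels, k=5):
    dists = np.linalg.norm(train_feats - query_feat, axis=1)
    nearest = np.argsort(dists)[:k]          # rank the training images
    labels, counts = np.unique(train_labels[nearest], return_counts=True)
    return labels[np.argmax(counts)]         # majority vote

# Toy usage: 1000 "images" with 128-dimensional features and 10 classes.
rng = np.random.default_rng(0)
train_feats = rng.normal(size=(1000, 128))
train_labels = rng.integers(0, 10, size=1000)
print(knn_label(rng.normal(size=128), train_feats, train_labels))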

For further reading I refer to Prof. Jain's blog, where he argues in his post Evaluating Multimedia Algorithms that the existing data sets for photo retrieval are

too small such as the Corel or Pascal datasets, too specific like the TRECVID dataset, or without ground truth, such as the several recent efforts by MIT and MSRA that gathered millions of Web images for testing,

and promotes his concept for gathering controlled ground truth data.

As a third read, Scienceblog features a story about James DiCarlo, a neuroscientist at the McGovern Institute for Brain Research at MIT, and graduate students Nicolas Pinto and David Cox of the Rowland Institute at Harvard, who

argue that natural photographic image sets, like the widely used Caltech101 database, have design flaws that enable computers to succeed where they would fail with more authentically varied images. For example, photographers tend to center objects in a frame and to prefer certain views and contexts. The visual system, by contrast, encounters objects in a much broader range of conditions.

They go on

We suspected that the supposedly natural images in current computer vision tests do not really engage the central problem of variability, and that our intuitions about what makes objects hard or easy to recognize are incorrect.

I think all three articles remind us to reconsider the data sets we use for evaluation, regarding their size, their noisiness and their 'naturalness'.

* nonparametric as in using rank or order of the images

Why is vision hard?


Why is computer vision so difficult? Because it is the difference between seeing and perceiving.

Modern cameras can see the world nearly as well as we humans do, but they are not able to comprehend what they see. It is like me looking at a piece of modern art: I can see it, but I sure won't understand it. I can identify lines, squares and other shapes, but that won't help me uncover the meaning of the picture. (I know that modern art is not only about comprehension and meaning, but I think you get my point.)

But computers don't even know the concept of a square or any other shape. They have to learn it from examples or be taught by humans. But how do you describe a circle so that a computer can recognise it, even when it is the wheel of a bike leaning against a wall? Likewise, I can look at hundreds of pictures of modern art, but that won't help me if nobody tells me what (features) to look for. Writing algorithms that automatically extract the right features from pictures, so that the computer can recognise scenes, objects and people, is what researchers in computer vision are working on.
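
The circle question at least has a classical answer: the Hough transform describes a circle not as a set of pixels but as centre-and-radius hypotheses that edge points vote for. Here is a minimal sketch with OpenCV, where the file name and parameter values are only placeholders.

import cv2
import numpy as np

# Load a grayscale picture (placeholder file name) and smooth it slightly,
# since the Hough transform is sensitive to noisy edges.
img = cv2.imread("bike_against_wall.jpg", cv2.IMREAD_GRAYSCALE)
img = cv2.medianBlur(img, 5)

# Edge points vote for the circle centres and radii they are consistent
# with; peaks in the vote space are reported as detected circles.
circles = cv2.HoughCircles(img, cv2.HOUGH_GRADIENT, dp=1, minDist=50,
                           param1=100, param2=40,
                           minRadius=20, maxRadius=200)

if circles is not None:
    for x, y, r in np.round(circles[0]).astype(int):
        print(f"circle at ({x}, {y}) with radius {r}")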

Of course there are also other topics. Fusing image data with laser rangefinders is important for robot vision, and in image retrieval people are also interested in features that describe the technical or aesthetic quality of the pictures. Not vision per se, but still always important, are faster and more accurate learning algorithms and the optimization of the algorithms involved; image processing is very expensive in CPU time and memory.