As you can see, we have a variety of images. They all are in different resolutions. Cats and dogs are in different shapes, positions, colors. They may be sitting, they may not. They may be happy or sad. Cats may be sleeping, dogs may be barking. Photographs can be taken from diffrent angles with a different zoom.
An infinite number of configurations is possible. For a human, recognizing pets in a scene in a set of heterogeneous photographs comes naturally with no effort. This is however not a trivial task for a machine. In fact, automatic classification assumes to know how to robustly describe what makes a cat a cat and makes a dog a dog. This assumes to know the intrinsic features that describe each animal.
What makes deep neural networks very effective at classifying images is their ability to automatically learn multiple levels of abstraction that simply characterize each class in a given classification task. They can recognize patterns with extreme variability, and with robustness to distortions and simple geometric transformations.
Let’s see how deep neural nets handle this.