Convolutional Neural Networks

A convolutional neural network (CNN) is a type of deep, feed-forward neural network that is used in a variety of areas, including image and pattern recognition, speech recognition, natural language processing, and video analysis. Yann LeCun, who later led Facebook’s AI Research group, developed one of the first convolutional neural networks, LeNet, in the late 1980s.

CNNs are a special case of neural networks. A CNN consists of one or more convolutional layers, often each followed by a subsampling layer, which are followed by one or more fully connected layers as in a standard neural network. CNNs need little pre-processing compared to other image classification algorithms: the network learns the filters that in traditional algorithms were hand-engineered. This independence from prior knowledge and human effort in feature design is a major advantage.
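To make the layer order concrete, here is a minimal sketch in plain Python of a forward pass through one convolutional layer, one subsampling layer, and one fully connected layer. Everything in it is an illustrative assumption: the function names, the 6 x 6 input, the hand-picked filter and weights (a real network would learn these), and the choice of 2 x 2 max pooling as the subsampling step.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(len(image[0]) - kw + 1)
        ]
        for i in range(len(image) - kh + 1)
    ]


def max_pool2x2(fmap):
    """Subsample by keeping the largest value in each 2 x 2 block."""
    return [
        [max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
         for j in range(0, len(fmap[0]) - 1, 2)]
        for i in range(0, len(fmap) - 1, 2)
    ]


def fully_connected(vector, weights, biases):
    """One output per weight row: weighted sum of the inputs plus a bias."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, biases)]


# Hypothetical 6 x 6 input: a bright vertical stripe on a dark background.
image = [[0, 0, 1, 1, 0, 0] for _ in range(6)]

# Hypothetical hand-picked filter and weights; a trained CNN learns these.
kernel = [[-1, 1, 0],
          [-1, 1, 0],
          [-1, 1, 0]]
weights = [[0.5, 0.0, 0.5, 0.0],
           [0.0, 1.0, 0.0, 1.0]]
biases = [0.0, 0.0]

feature_map = conv2d(image, kernel)            # 4 x 4 feature map
pooled = max_pool2x2(feature_map)              # 2 x 2 after subsampling
flat = [v for row in pooled for v in row]      # flatten to 4 values
scores = fully_connected(flat, weights, biases)
print(pooled, scores)
```

The convolutional and subsampling steps keep the two-dimensional layout of the image, and only the last step flattens it into a plain vector, which is what a standard fully connected layer expects.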

Convolutional layers

Just like any other layer, a convolutional layer receives input, transforms it in some way, and passes the transformed result to the next layer. Strictly speaking, the operation that convolutional layers perform is a cross-correlation, a measure of similarity between two signals, rather than a true convolution, which would flip the filter first. For each convolutional layer, we need to specify the number of filters the layer should have. These filters are what actually detect the patterns.
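The difference between the two operations is easy to see on a tiny example. The sketch below, in plain Python, assumes a stride of 1 and no padding; the function names and the 3 x 3 input are made up for illustration. A true convolution is the same sliding computation with the filter flipped along both axes.

```python
def cross_correlate2d(image, kernel):
    """Slide the kernel over the image and sum the elementwise products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def convolve2d(image, kernel):
    """True convolution: cross-correlation with the kernel flipped both ways."""
    flipped = [row[::-1] for row in kernel[::-1]]
    return cross_correlate2d(image, flipped)


# Hypothetical input and filter, chosen so the two results differ.
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]

xcorr = cross_correlate2d(image, kernel)
conv = convolve2d(image, kernel)
print(xcorr)  # [[-4, -4], [-4, -4]]
print(conv)   # [[4, 4], [4, 4]]
```

Because the filter values are learned during training, the flip makes no practical difference to what the network can represent.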


Let’s expand on precisely what we mean when we say that the filters are able to detect patterns. Think about how much may be going on in any single image. Patterns we could isolate include edges, shapes, textures, curves, objects, and colors.

One type of pattern that a filter can detect in an image is edges, so such a filter would be called an edge detector. This kind of filter is what we’d see at the start of a convolutional neural network. The deeper the network goes, the more sophisticated the patterns its filters can detect.

To understand what’s actually happening here with these convolutional layers and their respective filters, let’s look at an example.


Suppose we have a convolutional neural network that accepts images of handwritten digits and classifies each image according to the digit it shows (1, 2, 3, and so on).

A filter can be thought of as a relatively small matrix (a kernel). We decide how many rows and columns this matrix has, and its values are initialized with random numbers.
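As a rough sketch of that idea, a filter is just a small grid of randomly initialized numbers. The helper name `make_filter`, the uniform range of -1 to 1, and the fixed seed are assumptions made for this example; real frameworks use their own initialization schemes.

```python
import random


def make_filter(rows, cols, seed=None):
    """Return a rows x cols filter filled with random values in [-1, 1]."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)]
            for _ in range(rows)]


# A 3 x 3 filter; the seed only makes the example repeatable.
kernel = make_filter(3, 3, seed=0)
for row in kernel:
    print(row)
```

These starting values carry no meaning on their own. Training adjusts them until the filter responds to a useful pattern.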

In this example, on the first convolutional layer, we’re going to specify that we want the layer to contain one filter of size 3 x 3.

This image shows us one step in the convolution process. For each position on the blue input channel, the 3 x 3 filter does a computation that maps the colored part of the blue input channel to the corresponding colored part of the red output channel.

This convolutional layer receives an input channel, and the filter slides over the input one 3 x 3 block of pixels at a time until it has covered every 3 x 3 block in the entire image.
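To get a feel for how many blocks that is, the sketch below lists the top-left corner of every 3 x 3 block the filter visits. It assumes the filter moves one pixel at a time and never hangs over the border, and it uses a 28 x 28 input, the size of an MNIST digit image. The function name is hypothetical.

```python
def window_positions(height, width, size=3):
    """Top-left (row, col) of every size x size block that fits in the image."""
    return [(r, c)
            for r in range(height - size + 1)
            for c in range(width - size + 1)]


positions = window_positions(28, 28)
print(len(positions))                # 676 blocks, i.e. a 26 x 26 grid
print(positions[0], positions[-1])   # (0, 0) (25, 25)
```

Each position produces one output value, so under these assumptions a 28 x 28 input yields a 26 x 26 output.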

Convolution operation

The blue input channel is a matrix representation of an image from the MNIST dataset. The values in this matrix are the individual pixels from the image. These images are grayscale images, and so we only have a single input channel which we are going to pass to a convolutional layer.

We’ve specified that the first convolutional layer has only one filter, and this filter is going to convolve (slide) across each 3 x 3 block of pixels from the input. When the filter lands on its first 3 x 3 block of pixels, the dot product of the filter with that block is computed and stored. The same happens at every subsequent position.
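A single step of this process is small enough to write out by hand. In the sketch below, the 3 x 3 block of pixels and the filter values are made up for illustration. The dot product multiplies each pixel by the filter value at the same position and adds up the results.

```python
# Hypothetical 3 x 3 block of pixels the filter is currently sitting on.
patch = [[0, 0, 0],
         [0, 1, 1],
         [0, 1, 0]]

# Hypothetical 3 x 3 filter.
kernel = [[-1, -1, -1],
          [ 1,  1,  1],
          [ 0,  0,  0]]

# Multiply matching entries and sum them: one number per filter position.
value = sum(p * k
            for patch_row, kernel_row in zip(patch, kernel)
            for p, k in zip(patch_row, kernel_row))
print(value)  # 2
```

That single number is what gets stored in the output at the position corresponding to this block.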

After this filter has convolved the entire input, we’ll be left with a new representation of our input, which is now stored in the output channel. This output channel is called a feature map, which is going to be the input of the next layer.
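Putting the steps together, the sketch below slides one filter over a whole (tiny) input and collects the results into a feature map. The 5 x 5 input, the filter values, and the function name are assumptions for illustration, and the filter is assumed to move one pixel at a time without padding.

```python
def feature_map_for(image, kernel):
    """Compute the dot product of the kernel with every block of the image."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = [[0] * out_w for _ in range(out_h)]
    for i in range(out_h):          # move the filter down
        for j in range(out_w):      # move the filter right
            total = 0
            for a in range(kh):
                for b in range(kw):
                    total += image[i + a][j + b] * kernel[a][b]
            output[i][j] = total
    return output


# Hypothetical 5 x 5 input: a bright horizontal bar on a dark background.
image = [[0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0],
         [1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1],
         [0, 0, 0, 0, 0]]

# Hypothetical filter that responds where dark pixels sit above bright ones.
kernel = [[-1, -1, -1],
          [ 1,  1,  1],
          [ 0,  0,  0]]

feature_map = feature_map_for(image, kernel)
for row in feature_map:
    print(row)
# [0, 0, 0]
# [3, 3, 3]
# [0, 0, 0]
```

The middle row of the feature map is large because that is where the filter sat on the top edge of the bar; everywhere else the response is zero.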

This was just a very simple illustration, but as mentioned earlier, we can think of these filters as pattern detectors.

Input and output channels

Suppose that this grayscale image (single color channel) of a seven from the MNIST dataset is our input:

Let’s suppose that we have four 3 x 3 filters for our first convolutional layer, and these filters are filled with the values you see below. These values can be represented visually by having -1s correspond to black, 1s correspond to white, and 0s correspond to grey.

In the image above we can see a convolutional layer with four filters. If we convolve our original image of a seven with each of these four filters individually, this is what the output would look like for each filter:

We can see that all four of these filters detect edges. In the output channels, the brightest pixels can be interpreted as what the filter has detected. The first filter detects the top horizontal edges of the seven, as indicated by the brightest (white) pixels.

The second detects left vertical edges, again displayed with the brightest pixels. The third detects bottom horizontal edges, and the fourth detects right vertical edges.
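We can reproduce this behavior on a toy input. The sketch below uses a 7 x 7 image of a bright square instead of the seven, and four filters built from -1s, 0s, and 1s. The filter values are assumptions chosen to behave like the four edge detectors described here, not necessarily the values from the original figure, and the function names are hypothetical.

```python
def cross_correlate2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(len(image[0]) - kw + 1)
        ]
        for i in range(len(image) - kh + 1)
    ]


def brightest(feature_map):
    """Positions (row, col) holding the largest value in the feature map."""
    best = max(max(row) for row in feature_map)
    return [(i, j)
            for i, row in enumerate(feature_map)
            for j, v in enumerate(row) if v == best]


# Hypothetical input: a 3 x 3 bright square in the middle of a dark 7 x 7 image.
image = [[1 if 2 <= r <= 4 and 2 <= c <= 4 else 0 for c in range(7)]
         for r in range(7)]

# Assumed filter values for the four edge detectors.
filters = {
    "top":    [[-1, -1, -1], [1, 1, 1], [0, 0, 0]],
    "left":   [[-1, 1, 0], [-1, 1, 0], [-1, 1, 0]],
    "bottom": [[0, 0, 0], [1, 1, 1], [-1, -1, -1]],
    "right":  [[0, 1, -1], [0, 1, -1], [0, 1, -1]],
}

feature_maps = {name: cross_correlate2d(image, k) for name, k in filters.items()}
peaks = {name: brightest(fmap) for name, fmap in feature_maps.items()}
for name, positions in peaks.items():
    print(name, positions)
# top [(1, 2)]
# left [(2, 1)]
# bottom [(3, 2)]
# right [(2, 3)]
```

Each filter peaks at a different position in its 5 x 5 feature map: the top filter responds most strongly above the center, the left filter to the left of it, and so on, which matches the idea that the brightest output pixels mark what each filter has detected.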

These filters, as we mentioned before, are really basic and just detect edges. These are filters we may see towards the start of a convolutional neural network. More complex filters would be located deeper in the network and would gradually be able to detect more sophisticated patterns.

Effectiveness and limitations

CNN architectures can recognize objects and faces in images from industry benchmark datasets with up to 95% accuracy, slightly above the roughly 94% accuracy attributed to humans. Even so, convolutional neural networks have their limitations:

  • Require high processing power. Models are typically trained on high-cost machines with specialized graphics processing units (GPUs).
  • Can fail when images are rotated or tilted, or when an image has the features of the desired object but not in the correct order or position, for example, a face with the nose and mouth switched around. A newer architecture, the capsule network (CapsNet), has been proposed to address this limitation.