Convolutional neural networks: what they are and why they matter
Key points
- Convolutional neural networks (CNNs) are a class of neural networks that help computers see and understand images and video.
- Their defining building blocks are convolutional layers, which let CNNs learn complex visual features and make more accurate predictions about visual content.
- CNNs are used for facial recognition, autonomous driving, medical imaging and natural language processing.
How are convolutional neural networks built?
CNNs loosely mimic the human brain and use sets of learned rules to help a computer detect features in images, interpret them and make sense of the information.
Each layer processes the data and passes the detected features to the next layer for further processing. Within a convolutional layer, small filters slide across the image and highlight salient features such as edges or shapes.
Applying a filter to visual data produces a convolved image, also called a feature map, which the CNN then analyses to identify important features. This process is called feature extraction.
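To make this concrete, here is a minimal sketch of what a single filter does, using NumPy and a hand-coded Sobel edge-detection kernel. In a real CNN the kernel values are learned during training rather than fixed in advance:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide a kernel over a 2-D image (no padding, stride 1)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A fixed Sobel kernel that responds strongly to vertical edges.
# In a CNN these values would be learned, not hand-written.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

image = np.random.rand(8, 8)       # stand-in for a grayscale image
feature_map = convolve2d(image, sobel_x)
print(feature_map.shape)           # (6, 6): the "convolved image"
```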
Beyond convolutional layers, CNNs include:
- pooling layers, which shrink the feature maps so the network runs faster and generalises better;
- normalisation layers, which stabilise and speed up training and can help reduce overfitting;
- fully connected layers, which are used for classification.
How do they work?
Convolutional neural networks operate as follows (a code sketch follows the list):
- input data, such as images or video, arrive at the input layer;
- convolutional layers extract features from the input using filters that detect edges, shapes, textures and other characteristics;
- after each convolutional layer, a ReLU activation function is applied to introduce non-linearity, without which the stacked layers could only model linear relationships;
- a pooling layer then reduces the dimensionality of the feature maps by selecting the most important values from each region;
- fully connected layers take the output of the pooling layer and use learned weights for classification or prediction, combining the extracted features to make a final decision.
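A minimal sketch of this pipeline, written with PyTorch; the layer counts and sizes are illustrative assumptions, not taken from any particular model:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolution: 16 learned filters
    nn.ReLU(),                                   # non-linearity
    nn.MaxPool2d(2),                             # pooling: halves spatial size
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),                                # one-dimensional feature vector
    nn.Linear(32 * 8 * 8, 2),                    # fully connected classifier
)

x = torch.randn(1, 3, 32, 32)   # one 32x32 RGB image
logits = model(x)
print(logits.shape)             # torch.Size([1, 2])
```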
An example
Suppose a CNN must classify images of cats and dogs. The pipeline might look like this:
- input layer: receives RGB colour images of a dog or a cat, where each pixel is represented by intensity values of the red, green and blue channels;
- convolutional layer: applies filters to the image to highlight features such as edges, corners and shapes;
- ReLU layer: adds non-linearity by applying the ReLU activation function to the convolutional output;
- pooling layer: reduces the dimensionality of the features by selecting maximum values in each region of the feature map;
- layer repetition: multiple convolutional and pooling layers are stacked to extract increasingly complex features from the input image;
- flattening layer: converts the previous output into a one-dimensional vector representing all features;
- fully connected layer: takes the flattened output and applies weights to classify the image as a dog or a cat.
The CNN is trained on labelled images. During training, the weights of the filters and fully connected layers are adjusted to reduce the error between the network’s predictions and the correct labels.
Once trained, the CNN can accurately identify what appears in new, unseen images of cats and dogs. It uses the learned features and patterns to make the correct classification.
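A training step for such a classifier might look like the following sketch; it reuses the `model` from the earlier snippet, and the random tensors are stand-ins for a real labelled dataset of cat and dog photos (0 = cat, 1 = dog):

```python
import torch
import torch.nn as nn

images = torch.randn(8, 3, 32, 32)     # placeholder for real photos
labels = torch.randint(0, 2, (8,))     # placeholder for real labels

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(10):                 # a few gradient-descent steps
    optimizer.zero_grad()
    logits = model(images)             # forward pass
    loss = criterion(logits, labels)   # error vs. the correct labels
    loss.backward()                    # backpropagation
    optimizer.step()                   # weights move to reduce the loss
```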
What types of convolutional neural networks exist?
- traditional CNNs, also known as “conventional” CNNs, consist of a series of convolutional and subsampling (pooling) layers followed by one or more fully connected layers. Each convolutional layer performs convolutions with learnable filters to extract features from the input image. An example is the LeNet-5 architecture, one of the first successful CNNs for handwritten-digit recognition: it comprises two sets of convolutional and subsampling layers followed by fully connected layers (a code sketch of this layout appears after the list). LeNet-5 demonstrated the effectiveness of CNNs for image recognition, and the approach became widely used in computer vision;
- recurrent neural networks (RNNs) are, strictly speaking, a separate class of network rather than a kind of CNN, but they are often discussed alongside CNNs because they cover what CNNs handle poorly: sequential data, where the context of previous values matters. Unlike feedforward networks that process a fixed-size input in one pass, RNNs handle variable-length inputs and make inferences that depend on previous inputs. They are widely used in natural language processing, where they can generate text and perform translation after training on paired sentences in two languages. An RNN processes a sentence token by token, producing an output sequence in which each step depends on the inputs and outputs that came before; this tracking of context lets it translate even complex texts;
- fully convolutional networks (FCNs) are widely used in computer-vision tasks such as image segmentation, object detection and image classification. They are trained end-to-end using backpropagation for categorising or segmenting images. Backpropagation helps a neural network compute gradients of the loss function with respect to the weights. The loss function measures how well a machine-learning model predicts the expected outcome for a given input. Unlike traditional CNNs, FCNs have no fully connected layers and rely entirely on convolutional layers, making them more flexible and computationally efficient;
- spatial transformer networks (STNs) are used in computer vision to improve a network’s ability to recognise objects or patterns in an image regardless of their position, orientation or scale—so-called spatial invariance. For example, a network may apply a transformation to an input image before processing it, aligning objects, correcting perspective distortions or making other changes to improve performance in a given task. STNs help a network handle the spatial characteristics of images and improve recognition in varied conditions.
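For the traditional-CNN bullet above, a PyTorch sketch of the LeNet-5 layout could look like this (the original used tanh activations and average pooling; exact details vary between descriptions of the paper):

```python
import torch
import torch.nn as nn

lenet5 = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),   # C1: 32x32 -> 6 maps of 28x28
    nn.Tanh(),
    nn.AvgPool2d(2),                  # S2: -> 6 maps of 14x14
    nn.Conv2d(6, 16, kernel_size=5),  # C3: -> 16 maps of 10x10
    nn.Tanh(),
    nn.AvgPool2d(2),                  # S4: -> 16 maps of 5x5
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120),       # fully connected stages
    nn.Tanh(),
    nn.Linear(120, 84),
    nn.Tanh(),
    nn.Linear(84, 10),                # ten digit classes
)

digit = torch.randn(1, 1, 32, 32)     # one 32x32 grayscale digit image
print(lenet5(digit).shape)            # torch.Size([1, 10])
```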
What are the advantages of CNNs?
One major advantage is shift invariance: a CNN can recognise objects in an image regardless of their location.
Another is parameter sharing, meaning the same set of parameters is applied across the entire input image. This makes the network more compact and efficient, since it need not memorise separate parameters for each region. Instead it generalises feature knowledge across the image, which is particularly useful with large datasets.
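A quick way to see the effect of parameter sharing is to count weights; the layer sizes below are illustrative:

```python
import torch.nn as nn

# A 3x3 convolution reuses the same 16 filters at every image location,
# while a fully connected layer mapping the same input to a similarly
# sized output needs separate weights for every input-output pair.
conv = nn.Conv2d(3, 16, kernel_size=3)           # shared filters
dense = nn.Linear(3 * 32 * 32, 16 * 30 * 30)     # no sharing

def count(m):
    return sum(p.numel() for p in m.parameters())

print(count(conv))    # 448 parameters (16*3*3*3 weights + 16 biases)
print(count(dense))   # 44,251,200 parameters: roughly 100,000x more
```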
Other benefits include hierarchical representations that model complex structures, and robustness to variations, making CNNs reliable under different imaging conditions. In addition, they can be trained end-to-end—from inputs to outputs—speeding training and improving overall performance.
CNNs learn features at different levels of abstraction: lower layers capture simple elements such as edges and textures; higher layers capture more complex parts and shapes. This hierarchy helps with demanding tasks such as object detection and segmentation.
Moreover, CNNs can be trained end-to-end: gradient descent optimises all parameters simultaneously, adjusting each weight in the direction that reduces the training loss, which speeds convergence.
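The update rule behind this is worth stating explicitly (a standard formulation, not specific to any one source): each weight w is repeatedly replaced by w − η·∇L(w), where L is the loss function, ∇L(w) is its gradient with respect to w, and η is the learning rate controlling the step size.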
And the drawbacks?
Training CNNs typically requires large labelled datasets and can be time-consuming, owing to heavy computational demands.
Architecture choices, such as how many layers to use and of what type, affect performance. Adding layers may improve accuracy but increases complexity and computing requirements. Deep CNNs are also prone to overfitting, where the network memorises the training data and fails to generalise to new, unseen data.
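One common mitigation for this overfitting, not discussed above but standard practice, is dropout, which randomly zeroes activations during training so the network cannot simply memorise the training set. A minimal sketch, with illustrative sizes:

```python
import torch.nn as nn

classifier = nn.Sequential(
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 128),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # zeroes half the activations, in training mode only
    nn.Linear(128, 2),
)
```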
For tasks that require strong contextual understanding, such as natural language processing, CNNs can be limited. Other types of neural networks that specialise in sequence analysis and capture contextual dependencies are often preferred.
Despite these drawbacks, CNNs remain widely used and highly effective in deep learning. They are a key tool in artificial neural networks, especially in computer vision.
This material was prepared with the help of language models developed by OpenAI. The information presented here is partly based on machine learning rather than real-world experience or empirical research.