{"id":37441,"date":"2021-11-08T09:00:00","date_gmt":"2021-11-08T07:00:00","guid":{"rendered":"https:\/\/forklog.com\/en\/?p=37441"},"modified":"2025-08-29T17:07:06","modified_gmt":"2025-08-29T14:07:06","slug":"what-are-transformers-machine-learning","status":"publish","type":"post","link":"https:\/\/forklog.com\/en\/what-are-transformers-machine-learning\/","title":{"rendered":"What are transformers? (machine learning)"},"content":{"rendered":"<div class=\"wp-block-text-wrappers-cards single_card\">\n<h2 class=\"card_label\"><strong>What are transformers?<\/strong><\/h2>\n<p>Transformers are a relatively new type of neural network designed to handle sequences while coping easily with long-range dependencies. Today they are the most advanced technique in natural-language processing (NLP).<\/p>\n<p>They can translate text, write poems and articles, and even generate computer code. Unlike recurrent neural networks (RNNs), transformers do not process sequences strictly in order. For example, if the input is text, they do not need to process the end of a sentence only after the beginning. 
This makes such networks easy to parallelise and much faster to train.<\/p>\n<\/div>\n<div class=\"wp-block-text-wrappers-cards single_card\">\n<h2 class=\"card_label\"><strong>When did they appear?<\/strong><\/h2>\n<p>Transformers were first described by engineers at Google Brain in the 2017 paper <a href=\"https:\/\/arxiv.org\/abs\/1706.03762\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">\u201cAttention Is All You Need\u201d<\/a>.<\/p>\n<p>One key difference from earlier methods is that the input sequence can be fed in parallel, allowing efficient use of graphics processors and higher training speeds.<\/p>\n<\/div>\n<div class=\"wp-block-text-wrappers-cards single_card\">\n<h2 class=\"card_label\"><strong>Why use transformers?<\/strong><\/h2>\n<p>Until 2017, engineers used deep learning to understand text via recurrent neural networks.<\/p>\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"517\" src=\"https:\/\/forklog.com\/wp-content\/uploads\/NN_RNN-min-1024x517.png\" alt=\"What are transformers? (machine learning)\" class=\"wp-image-151139\" srcset=\"https:\/\/forklog.com\/wp-content\/uploads\/NN_RNN-min-1024x517.png 1024w, https:\/\/forklog.com\/wp-content\/uploads\/NN_RNN-min-300x152.png 300w, https:\/\/forklog.com\/wp-content\/uploads\/NN_RNN-min-768x388.png 768w, https:\/\/forklog.com\/wp-content\/uploads\/NN_RNN-min.png 1200w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n<p>Suppose we are translating a sentence from English into Russian. An RNN would take the English sentence as input, process its words one by one, and then output their Russian counterparts sequentially. The crucial word here is \u201csequential\u201d. Word order matters in language; you cannot simply shuffle words.<\/p>\n<p>RNNs run into several problems. First, they struggle to process long sequences of text. By the time they reach the end of a paragraph, they can \u201cforget\u201d the beginning. 
For instance, a translation model based on RNNs may struggle to remember the gender of an entity in a long passage.<\/p>\n<p>Second, RNNs are hard to train. They suffer from the so-called <a href=\"https:\/\/towardsdatascience.com\/the-exploding-and-vanishing-gradients-problem-in-time-series-6b87d558d22\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">vanishing\/exploding gradient<\/a> problem.<\/p>\n<p>Third, because they process words sequentially, they are difficult to parallelise. That means you cannot easily speed up training by using more graphics processors. Consequently, training on very large datasets is impractical.<\/p>\n<\/div>\n<div class=\"wp-block-text-wrappers-cards single_card\">\n<h2 class=\"card_label\"><strong>How do transformers work?<\/strong><\/h2>\n<p>The main components of transformers are an encoder and a decoder.<\/p>\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"735\" height=\"1024\" src=\"https:\/\/forklog.com\/wp-content\/uploads\/transformers-architecture-735x1024.jpg\" alt=\"What are transformers? (machine learning)\" class=\"wp-image-155333\" srcset=\"https:\/\/forklog.com\/wp-content\/uploads\/transformers-architecture-735x1024.jpg 735w, https:\/\/forklog.com\/wp-content\/uploads\/transformers-architecture-215x300.jpg 215w, https:\/\/forklog.com\/wp-content\/uploads\/transformers-architecture-768x1070.jpg 768w, https:\/\/forklog.com\/wp-content\/uploads\/transformers-architecture.jpg 974w\" sizes=\"auto, (max-width: 735px) 100vw, 735px\" \/><figcaption>Transformer architecture. Source: <a href=\"https:\/\/arxiv.org\/abs\/1706.03762\">\u201cAttention Is All You Need\u201d<\/a>.<\/figcaption><\/figure>\n<p>The encoder transforms the input (for example, text) into a vector (a set of numbers). 
The decoder then renders it as a new sequence (for example, an answer) or as words in another language, depending on the model\u2019s purpose.<\/p>\n<p>Other innovations behind transformers boil down to three core ideas:<\/p>\n<ul class=\"wp-block-list\">\n<li>positional encodings;<\/li>\n<li>attention;<\/li>\n<li>self-attention.<\/li>\n<\/ul>\n<p>Start with positional encodings. Suppose we need to translate English text into Russian. Standard RNN models \u201cunderstand\u201d word order and process words sequentially. But that hampers parallelisation.<\/p>\n<p>Positional encodings overcome this. The idea is to take every word in the input sequence (here, the English sentence) and attach to each its position. You \u201cfeed\u201d the network a sequence like this:<\/p>\n<p><strong> [(\u201cRed\u201d, 1), (\u201cfox\u201d, 2), (\u201cjumps\u201d, 3), (\u201cover\u201d, 4), (\u201clazy\u201d, 5), (\u201cdog\u201d, 6)]<\/strong><\/p>\n<p>Conceptually, this shifts the burden of understanding word order from the network\u2019s structure onto the data themselves.<\/p>\n<p>At first, before training on any data, transformers do not know how to interpret these positional codes. But as the model sees more examples of sentences and their encodings, it learns to use them effectively.<\/p>\n<p>The sketch above is simplified (the original paper used sinusoidal functions to construct positional encodings rather than the plain integers 1, 2, 3, 4), but the point is the same. By keeping word order as data rather than structure, the network becomes easier to train.<\/p>\n<p>Attention is a neural-network mechanism <a href=\"https:\/\/arxiv.org\/abs\/1409.0473\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">introduced to machine translation in 2015<\/a>. 
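<\/p>
<p>The two ideas introduced so far, positional encodings and attention, can be sketched in a few lines of NumPy. This is an illustrative toy with small shapes and random values, not a trained model:<\/p>

```python
# Toy NumPy sketch (assumed shapes, random values, nothing learned):
# sinusoidal positional encodings plus scaled dot-product attention.
import numpy as np

def positional_encoding(n_positions, d_model):
    # PE[pos, 2i] = sin(pos / 10000**(2i/d)), PE[pos, 2i+1] = cos(same angle)
    pos = np.arange(n_positions)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / 10000 ** (2 * i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V -- each output row is a weighted
    # mixture of the value vectors
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)  # each row of weights sums to 1
    return w @ V, w

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 16))        # 6 tokens, 16-dim embeddings
X = X + positional_encoding(6, 16)  # word order carried by the data
out, w = attention(X, X, X)         # self-attention over the sequence
print(out.shape, w.shape)           # (6, 16) (6, 6)
```

<p>Dividing by the square root of the key dimension keeps the softmax inputs in a reasonable range as dimensionality grows; a real transformer also applies learned projection matrices to produce Q, K and V, which are omitted here.<\/p>
<p>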
To grasp it, consider the original paper.<\/p>\n<p>Imagine we need to translate into French the phrase:<\/p>\n<p><strong>\u201cThe agreement on the European Economic Area was signed in August 1992\u201d.<\/strong><\/p>\n<p>The French equivalent is:<\/p>\n<p><strong>\u201cL\u2019accord sur la zone \u00e9conomique europ\u00e9enne a \u00e9t\u00e9 sign\u00e9 en ao\u00fbt 1992\u201d.<\/strong><\/p>\n<p>The worst way to translate would be to look up word-for-word correspondences in order. That fails for several reasons.<\/p>\n<p>First, some words are reordered in French:<\/p>\n<p><strong>\u201cEuropean Economic Area\u201d<\/strong> versus <strong>\u201cla zone \u00e9conomique europ\u00e9enne\u201d<\/strong>.<\/p>\n<p>Second, French marks gender. To match the feminine noun <strong>\u201cla zone\u201d<\/strong>, the adjectives <strong>\u201c\u00e9conomique\u201d<\/strong> and <strong>\u201ceurop\u00e9enne\u201d<\/strong> must also take the feminine form.<\/p>\n<p>Attention helps avoid such pitfalls. It lets a text model \u201clook\u201d at each word in the source sentence when deciding how to translate. This is illustrated in a visualisation from the original paper:<\/p>\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"558\" height=\"548\" src=\"https:\/\/forklog.com\/wp-content\/uploads\/Neural_Machine_Translation_by_Jointly_Learning_to_Align_and_Translate_heatmap.png\" alt=\"What are transformers? (machine learning)\" class=\"wp-image-155334\" srcset=\"https:\/\/forklog.com\/wp-content\/uploads\/Neural_Machine_Translation_by_Jointly_Learning_to_Align_and_Translate_heatmap.png 558w, https:\/\/forklog.com\/wp-content\/uploads\/Neural_Machine_Translation_by_Jointly_Learning_to_Align_and_Translate_heatmap-300x295.png 300w\" sizes=\"auto, (max-width: 558px) 100vw, 558px\" \/><figcaption>Visualisation of machine translation using attention mechanisms. 
Source: <a href=\"https:\/\/arxiv.org\/abs\/1409.0473\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">\u201cNeural Machine Translation by Jointly Learning to Align and Translate (2015)\u201d<\/a>.<\/figcaption><\/figure>\n<p>This is a kind of <a href=\"https:\/\/ru.wikipedia.org\/wiki\/%D0%A2%D0%B5%D0%BF%D0%BB%D0%BE%D0%B2%D0%B0%D1%8F_%D0%BA%D0%B0%D1%80%D1%82%D0%B0\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">heat map<\/a> showing what the model \u201cpays attention to\u201d when translating each word in the French sentence. As expected, when the model outputs <strong>\u201ceurop\u00e9enne\u201d<\/strong>, it heavily considers both input words: <strong>\u201cEuropean\u201d<\/strong> and <strong>\u201cEconomic\u201d<\/strong>.<\/p>\n<p>Training data teach the model which words to \u201cattend\u201d to at each step. By observing thousands of English and French sentences, the algorithm learns which kinds of words depend on one another. It learns to account for gender, plurality and other grammatical rules.<\/p>\n<p>Attention has been hugely useful for NLP since 2015, but in its original form it was paired with recurrent networks. 
The 2017 transformer paper\u2019s innovation was in part to dispense with RNNs entirely; hence the title \u201cAttention Is All You Need\u201d.<\/p>\n<p>The last ingredient is a twist on attention called \u201cself-attention\u201d.<\/p>\n<p>If attention helps align words when translating between languages, self-attention helps a model grasp meaning and patterns within a language.<\/p>\n<p>Consider two Russian sentences:<\/p>\n<p><strong>\u00ab\u041d\u0438\u043a\u043e\u043b\u0430\u0439 \u043f\u043e\u0442\u0435\u0440\u044f\u043b \u043a\u043b\u044e\u0447 \u043e\u0442 \u043c\u0430\u0448\u0438\u043d\u044b\u00bb<\/strong> (\u201cNikolai lost the car key\u201d)<\/p>\n<p><strong>\u00ab\u0416\u0443\u0440\u0430\u0432\u043b\u0438\u043d\u044b\u0439 \u043a\u043b\u044e\u0447 \u043d\u0430\u043f\u0440\u0430\u0432\u0438\u043b\u0441\u044f \u043d\u0430 \u044e\u0433\u00bb<\/strong> (\u201cA wedge of cranes headed south\u201d)<\/p>\n<p>The word <strong>\u00ab\u043a\u043b\u044e\u0447\u00bb<\/strong> here means two very different things: a key in the first sentence and a wedge of migrating birds in the second. Humans can easily tell these apart from context. Self-attention allows a network to understand a word in the context of its neighbours.<\/p>\n<p>Thus, when the model processes <strong>\u00ab\u043a\u043b\u044e\u0447\u00bb<\/strong> in the first sentence, it can attend to <strong>\u00ab\u043c\u0430\u0448\u0438\u043d\u044b\u00bb<\/strong> (\u201ccar\u201d) and infer this is a metal key for a lock, not something else.<\/p>\n<p>In the second sentence, the model can attend to <strong>\u00ab\u0436\u0443\u0440\u0430\u0432\u043b\u0438\u043d\u044b\u0439\u00bb<\/strong> (\u201ccrane\u201d) and <strong>\u00ab\u044e\u0433\u00bb<\/strong> (\u201csouth\u201d) to relate <strong>\u00ab\u043a\u043b\u044e\u0447\u00bb<\/strong> to a flock of birds. 
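<\/p>
<p>A toy NumPy sketch of this disambiguation effect, with random stand-in embeddings rather than anything learned (the English token names are purely illustrative):<\/p>

```python
# Self-attention gives the same word different representations in
# different contexts. Embeddings are random stand-ins, not learned.
import numpy as np

rng = np.random.default_rng(42)
vocab = ["nikolai", "lost", "key", "car", "crane", "flock", "south"]
emb = {w: rng.normal(size=4) for w in vocab}

def self_attention(tokens):
    X = np.stack([emb[t] for t in tokens])  # (n_tokens, 4)
    scores = X @ X.T / np.sqrt(X.shape[1])  # every token attends to every token
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ X  # each row is now a context-weighted mixture

out1 = self_attention(["nikolai", "lost", "key", "car"])
out2 = self_attention(["crane", "flock", "key", "south"])
# "key" starts from the same vector in both sentences, but its
# contextualised representation differs because the neighbours differ:
print(np.allclose(out1[2], out2[2]))  # False
```

<p>In a trained transformer the mixing weights are shaped by learned projections and many stacked layers, so the contextual differences become semantically meaningful rather than random, but the mechanism is the same.<\/p>
<p>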
Self-attention helps neural networks resolve ambiguity, perform <a href=\"https:\/\/ru.wikipedia.org\/wiki\/%D0%A7%D0%B0%D1%81%D1%82%D0%B5%D1%80%D0%B5%D1%87%D0%BD%D0%B0%D1%8F_%D1%80%D0%B0%D0%B7%D0%BC%D0%B5%D1%82%D0%BA%D0%B0\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">part-of-speech tagging<\/a>, learn semantic roles and more.<\/p>\n<\/div>\n<div class=\"wp-block-text-wrappers-cards single_card\">\n<h2 class=\"card_label\"><strong>Where are they used?<\/strong><\/h2>\n<p>Transformers were initially pitched as networks for processing and understanding natural language. In the four years since their debut they have gained popularity and now underpin many services used daily by millions.<\/p>\n<p>One of the simplest examples is Google\u2019s <a href=\"https:\/\/blog.google\/products\/search\/search-language-understanding-bert\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">BERT language model<\/a>, introduced in 2018.<\/p>\n<p>On October 25th 2019 the tech giant announced it had begun using the algorithm in the English-language version of its search engine in the United States. A month and a half later, the company <a href=\"https:\/\/twitter.com\/searchliaison\/status\/1204152378292867074\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">expanded support<\/a> to 70 languages, including Russian, Ukrainian, Kazakh and Belarusian.<\/p>\n<p>The original English model was trained on the BooksCorpus dataset of 800m words and on Wikipedia articles. Base BERT had 110m parameters; the larger version had 340m.<\/p>\n<p>Another popular transformer-based language model is OpenAI\u2019s GPT (Generative Pre-trained Transformer).<\/p>\n<p>Today the most up-to-date version is GPT-3. 
It was trained on a dataset of 570GB and has 175bn parameters, making it one of the largest language models.<\/p>\n<p>GPT-3 can generate articles, answer questions, power chatbots, perform semantic search and produce summaries.<\/p>\n<p>GitHub Copilot, an AI assistant for automatic code writing, was also built on GPT-3. It uses Codex, a GPT-3 variant specially trained on code. Researchers have estimated that since its release in 2021, 30% of new code on GitHub has been written with Copilot\u2019s help.<\/p>\n<p>Beyond that, transformers are increasingly used in Yandex services such as Search, News and Translate, in Google products, in chatbots and more. Sber has released its own GPT modification trained on 600GB of Russian-language text.<\/p>\n<\/div>\n<div class=\"wp-block-text-wrappers-cards single_card\">\n<h2 class=\"card_label\"><strong>What are the prospects for transformers?<\/strong><\/h2>\n<p>Transformers\u2019 potential is far from exhausted. They have proved themselves on text, and lately this class of network has been explored for other tasks such as computer vision.<\/p>\n<p>In late 2020, transformer-based computer-vision models showed strong results on popular benchmarks such as object detection on the COCO dataset and image classification on ImageNet.<\/p>\n<p>In December 2020 researchers at Facebook AI Research published a paper describing <a href=\"https:\/\/arxiv.org\/abs\/2012.12877\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">Data-efficient Image Transformers (DeiT)<\/a>, a transformer-based model. 
They said they had found a way to train the algorithm without a huge labelled dataset and achieved roughly 85% top-1 accuracy on ImageNet image classification.<\/p>\n<p>In May 2021 specialists at Facebook AI Research presented <a href=\"https:\/\/ai.facebook.com\/blog\/dino-paws-computer-vision-with-self-supervised-transformers-and-10x-more-efficient-training\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">DINO<\/a>, an open-source computer-vision algorithm that automatically segments objects in photos and videos without manual labelling. It is also based on transformers, and its self-supervised features reach about 80% top-1 accuracy on ImageNet classification.<\/p>\n<p>Thus, beyond NLP, transformers are increasingly finding use in other tasks as well.<\/p>\n<\/div>\n<div class=\"wp-block-text-wrappers-cards single_card\">\n<h2 class=\"card_label\"><strong>What risks do transformers pose?<\/strong><\/h2>\n<p>Alongside obvious advantages, transformers in NLP carry risks. The creators of GPT-3 have repeatedly <a href=\"https:\/\/openai.com\/blog\/openai-api\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">stated<\/a> that the network could be used for mass spam, harassment or disinformation.<\/p>\n<p>Language models are also prone to bias against certain groups. Although the developers have reduced GPT-3\u2019s toxicity, they are still not ready to grant broad access to the tool.<\/p>\n<p>In September 2020 researchers at the Middlebury Institute of International Studies published a <a href=\"https:\/\/www.middlebury.edu\/institute\/sites\/www.middlebury.edu.institute\/files\/2020-09\/The_Radicalization_Risks_of_GPT_3_and_Advanced_Neural_Language_Models_0.pdf\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">report<\/a> on the risks of radicalisation associated with the spread of large language models. They noted that GPT-3 shows \u201csignificant improvements\u201d in generating extremist texts compared with its predecessor, GPT-2.<\/p>\n<p>The technology has also drawn criticism from one of the \u201cfathers of deep learning\u201d, Yann LeCun. 
He <a href=\"https:\/\/www.facebook.com\/yann.lecun\/posts\/10157253205637143\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">said<\/a> that many expectations about the capabilities of large language models are unrealistic.<\/p>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>\u201cTrying to build intelligent machines by scaling up language models is like building airplanes to fly to the Moon. You may break altitude records, but flying to the Moon will require a completely different approach,\u201d wrote LeCun.<\/p>\n<\/blockquote>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Transformers are a relatively new type of neural network designed to handle sequences while coping easily with long-range dependencies.<\/p>\n","protected":false},"author":1,"featured_media":37442,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"select":"1","news_style_id":"1","cryptorium_level":"2","_short_excerpt_text":"","creation_source":"","_metatest_mainpost_news_update":false,"footnotes":""},"categories":[2113],"tags":[2130,438],"class_list":["post-37441","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-cryptorium","tag-101-artificial-intelligence","tag-artificial-intelligence"],"aioseo_notices":[],"amp_enabled":true,"views":"46","promo_type":"1","layout_type":"1","short_excerpt":"","is_update":"","_links":{"self":[{"href":"https:\/\/forklog.com\/en\/wp-json\/wp\/v2\/posts\/37441","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/forklog.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/forklog.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/forklog.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/forklog.com\/en\/wp-json\/wp\/v2\/comments?post=37441"}],"version-history":[{"count":1,"href":"https:\/\/forklog.com\/en\/wp-json\
/wp\/v2\/posts\/37441\/revisions"}],"predecessor-version":[{"id":37443,"href":"https:\/\/forklog.com\/en\/wp-json\/wp\/v2\/posts\/37441\/revisions\/37443"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/forklog.com\/en\/wp-json\/wp\/v2\/media\/37442"}],"wp:attachment":[{"href":"https:\/\/forklog.com\/en\/wp-json\/wp\/v2\/media?parent=37441"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/forklog.com\/en\/wp-json\/wp\/v2\/categories?post=37441"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/forklog.com\/en\/wp-json\/wp\/v2\/tags?post=37441"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}