{"id":95136,"date":"2026-03-11T17:41:13","date_gmt":"2026-03-11T14:41:13","guid":{"rendered":"https:\/\/forklog.com\/en\/?p=95136"},"modified":"2026-03-13T08:34:56","modified_gmt":"2026-03-13T05:34:56","slug":"set-in-silicon","status":"publish","type":"post","link":"https:\/\/forklog.com\/en\/set-in-silicon\/","title":{"rendered":"Etched in Silicon"},"content":{"rendered":"<p>Consumer <span data-descr=\"graphics processing units\" class=\"old_tooltip\">GPU<\/span> have traditionally been built for video games and rendering. Yet they can also tackle other tasks that demand parallel computation.<\/p>\n<p>A graphics processor can, for instance, run a <a href=\"https:\/\/forklog.com\/en\/news\/what-is-the-proof-of-work-pow-algorithm\">PoW<\/a> miner to mine cryptocurrencies, but competition from specialised rigs has pushed GPU farms into niches.<\/p>\n<p>A similar story is playing out in AI. Graphics cards have become the mainstay for neural networks. As the industry has advanced, demand has risen for dedicated AI hardware. ForkLog examines the current state of this new lap in the artificial-intelligence race.<\/p>\n<h2 class=\"wp-block-heading\">Optimising silicon for AI<\/h2>\n<p>There are several approaches to building hardware specialised for AI tasks.<\/p>\n<p>Consumer GPUs are the starting point for specialisation. Their prowess at parallel matrix maths proved handy for deploying neural networks and, especially, deep learning, but there was ample room for improvement.<\/p>\n<p>One of the chief problems of running AI on a GPU is the need constantly to shuttle large volumes of data between system memory and the GPU. These housekeeping chores can take more time and energy than the useful computation itself.<\/p>\n<p>Another issue stems from GPUs\u2019 generality. 
Their architecture caters to a wide range of jobs\u2014from graphics rendering to general-purpose compute\u2014leaving some hardware blocks redundant for targeted AI workloads.<\/p>\n<p>Data formats pose a further constraint. Historically, graphics processors were tuned for FP32\u201432-bit <a href=\"https:\/\/ru.wikipedia.org\/wiki\/%D0%A7%D0%B8%D1%81%D0%BB%D0%BE_%D1%81_%D0%BF%D0%BB%D0%B0%D0%B2%D0%B0%D1%8E%D1%89%D0%B5%D0%B9_%D0%B7%D0%B0%D0%BF%D1%8F%D1%82%D0%BE%D0%B9\" target=\"_blank\" rel=\"noopener\" title=\"\">floating-point numbers<\/a>. Inference and training typically use lower-precision formats: 16-bit FP16 and BF16, or integer INT4 and INT8.<\/p>\n<h3 class=\"wp-block-heading\">Nvidia H200 and B200<\/h3>\n<p>Some of the most popular products for inference and training\u2014<a href=\"https:\/\/www.nvidia.com\/en-us\/data-center\/h200\/\" target=\"_blank\" rel=\"noopener\" title=\"\">H200<\/a> chips and <a href=\"https:\/\/www.nvidia.com\/en-us\/data-center\/dgx-b200\/\" target=\"_blank\" rel=\"noopener\" title=\"\">DGX B200<\/a> server systems\u2014are, in essence, beefed-up GPUs for data centres.<\/p>\n<p>The core AI-oriented feature of these accelerators is <span data-descr=\"tensor cores\" class=\"old_tooltip\">tensor cores,<\/span> designed for ultra-fast matrix operations such as training models and <span data-descr=\"batch inference\" class=\"old_tooltip\">batch inference<\/span>.<\/p>\n<p>To cut data-access latency, Nvidia equips its cards with vast amounts of high-bandwidth memory (HBM). 
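<\/p>
<p>Why that matters: in single-stream inference every model weight must be read from memory for each generated token, so bandwidth, not raw compute, often sets the ceiling on throughput. A back-of-the-envelope sketch (the 70B-parameter model and the exact figures are hypothetical; real systems add batching, caching and overlap):<\/p>

```python
# Rough upper bound on single-stream decode speed when streaming the
# weights dominates: tokens per second <= bandwidth / weight bytes.
def max_tokens_per_second(params_billion, bytes_per_param, bandwidth_tb_s):
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / weight_bytes

# A hypothetical 70B-parameter model in FP16 (2 bytes per parameter)
# read at 4.8 TB/s is capped near 34 tokens per second.
print(round(max_tokens_per_second(70, 2, 4.8), 1))
```

<p>Halving the precision, say from FP16 to INT8, halves the bytes per parameter and roughly doubles this ceiling, which is one reason the lower-precision formats mentioned above matter.<\/p>
<p>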
The H200 integrates 141GB of HBM3e delivering 4.8TB\/s; on the B200 the figures are higher still, depending on configuration.<\/p>\n<h3 class=\"wp-block-heading\">Tensor Processing Unit<\/h3>\n<p>By 2015 Google had developed the <a href=\"https:\/\/docs.cloud.google.com\/tpu\/docs\/system-architecture-tpu-vm\">Tensor Processing Unit (TPU)<\/a>, an ASIC built on <a href=\"https:\/\/ru.wikipedia.org\/wiki\/%D0%A1%D0%B8%D1%81%D1%82%D0%BE%D0%BB%D0%B8%D1%87%D0%B5%D1%81%D0%BA%D0%B8%D0%B9_%D0%BC%D0%B0%D1%81%D1%81%D0%B8%D0%B2\" target=\"_blank\" rel=\"noopener\" title=\"\">systolic arrays<\/a> for machine learning.<\/p>\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"730\" src=\"https:\/\/forklog.com\/wp-content\/uploads\/img-f481d662fa0cd0d0-8455907027100403-1024x730.jpg\" alt=\"Tensor_Processing_Unit_3.0\" class=\"wp-image-276640\" srcset=\"https:\/\/forklog.com\/wp-content\/uploads\/img-f481d662fa0cd0d0-8455907027100403-1024x730.jpg 1024w, https:\/\/forklog.com\/wp-content\/uploads\/img-f481d662fa0cd0d0-8455907027100403-300x214.jpg 300w, https:\/\/forklog.com\/wp-content\/uploads\/img-f481d662fa0cd0d0-8455907027100403-768x548.jpg 768w, https:\/\/forklog.com\/wp-content\/uploads\/img-f481d662fa0cd0d0-8455907027100403-1536x1096.jpg 1536w, https:\/\/forklog.com\/wp-content\/uploads\/img-f481d662fa0cd0d0-8455907027100403.jpg 1960w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Tensor Processing Unit 3.0. 
Source: <a href=\"https:\/\/ru.wikipedia.org\/wiki\/%D0%A2%D0%B5%D0%BD%D0%B7%D0%BE%D1%80%D0%BD%D1%8B%D0%B9_%D0%BF%D1%80%D0%BE%D1%86%D0%B5%D1%81%D1%81%D0%BE%D1%80_Google#\/media\/%D0%A4%D0%B0%D0%B9%D0%BB:Tensor_Processing_Unit_3.0.jpg\" target=\"_blank\" rel=\"noopener\" title=\"\">Wikipedia<\/a>.<\/figcaption><\/figure>\n<p>In conventional CPU and GPU architectures, each operation entails reading, processing and writing intermediate data to memory.<\/p>\n<p>The TPU streams data through an array of blocks, each performing a mathematical operation and passing the result to the next. Memory is touched only at the beginning and end of the computation sequence.<\/p>\n<p>This approach spends less time and energy on AI maths than a non-specialised GPU, though reliance on external memory remains a constraint.<\/p>\n<h3 class=\"wp-block-heading\">Cerebras <\/h3>\n<p>American firm Cerebras has figured out how to use an entire silicon wafer as a single processor, instead of dicing it into smaller chips.<\/p>\n<p>In 2019 the company <a href=\"https:\/\/www.cerebras.ai\/whitepapers\" target=\"_blank\" rel=\"noopener\" title=\"\">introduced<\/a> its first 300mm Wafer-Scale Engine. 
In 2024 it released the improved WSE-3, a 46,225mm\u00b2 die with 900,000 cores.<\/p>\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"804\" src=\"https:\/\/forklog.com\/wp-content\/uploads\/img-a338bd88ada90bd7-8455726269135373-1024x804.png\" alt=\"image\" class=\"wp-image-276639\" srcset=\"https:\/\/forklog.com\/wp-content\/uploads\/img-a338bd88ada90bd7-8455726269135373-1024x804.png 1024w, https:\/\/forklog.com\/wp-content\/uploads\/img-a338bd88ada90bd7-8455726269135373-300x236.png 300w, https:\/\/forklog.com\/wp-content\/uploads\/img-a338bd88ada90bd7-8455726269135373-768x603.png 768w, https:\/\/forklog.com\/wp-content\/uploads\/img-a338bd88ada90bd7-8455726269135373-1536x1207.png 1536w, https:\/\/forklog.com\/wp-content\/uploads\/img-a338bd88ada90bd7-8455726269135373.png 1600w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Cerebras WSE-3 and two Nvidia B200 chips. Source: Cerebras.<\/figcaption><\/figure>\n<p>The Cerebras architecture distributes <a href=\"https:\/\/ru.wikipedia.org\/wiki\/SRAM_(%D0%BF%D0%B0%D0%BC%D1%8F%D1%82%D1%8C)\" target=\"_blank\" rel=\"noopener\" title=\"\">SRAM<\/a> blocks right next to logic modules on the same wafer. Each core works with its own 48KB of local memory, avoiding contention with other cores.<\/p>\n<p>According to the company, a single WSE-3 suffices for many inference workloads. 
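<\/p>
<p>The per-core numbers imply a very large aggregate. A quick sanity check of the total on-wafer memory, using the core count and per-core SRAM cited above:<\/p>

```python
# Aggregate on-wafer SRAM: 900,000 cores x 48 KB of local memory each.
cores = 900_000
local_kb = 48
total_gb = cores * local_kb / 1e6  # KB to GB (decimal units)
print(total_gb)  # 43.2
```

<p>Tens of gigabytes of SRAM sitting next to the logic is what lets model weights stay on the wafer instead of crossing an external memory bus.<\/p>
<p>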
For larger jobs, multiple chips can be clustered.<\/p>\n<h3 class=\"wp-block-heading\">Groq LPU<\/h3>\n<p>Groq (not to be confused with Grok from xAI) offers its own inference ASIC built on a Language Processing Unit (<a href=\"https:\/\/cdn.sanity.io\/files\/chol0sk5\/production\/4e3e6966d98da9ef4bfd9834dbfc00921da58252.pdf\">LPU<\/a>) architecture.<\/p>\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"1024\" src=\"https:\/\/forklog.com\/wp-content\/uploads\/img-5840ed09f68c252d-8455523251131636-1024x1024.png\" alt=\"image\" class=\"wp-image-276636\" srcset=\"https:\/\/forklog.com\/wp-content\/uploads\/img-5840ed09f68c252d-8455523251131636-1024x1024.png 1024w, https:\/\/forklog.com\/wp-content\/uploads\/img-5840ed09f68c252d-8455523251131636-300x300.png 300w, https:\/\/forklog.com\/wp-content\/uploads\/img-5840ed09f68c252d-8455523251131636-150x150.png 150w, https:\/\/forklog.com\/wp-content\/uploads\/img-5840ed09f68c252d-8455523251131636-768x768.png 768w, https:\/\/forklog.com\/wp-content\/uploads\/img-5840ed09f68c252d-8455523251131636.png 1125w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Groq chip. Source: <a href=\"https:\/\/groq.com\/groqcloud\" target=\"_blank\" rel=\"noopener\" title=\"\">Groq<\/a>.<\/figcaption><\/figure>\n<p>One key trait of Groq\u2019s chips is their optimisation for sequential operations.<\/p>\n<p>Inference generates tokens step by step: each step must finalise the previous one. Under such constraints, performance depends more on the speed of a single stream than on the number of streams.<\/p>\n<p>Unlike familiar general-purpose processors\u2014and some AI-specialised devices\u2014Groq does not assemble machine instructions on the fly. 
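<\/p>
<p>A toy sketch of the alternative, illustrative only and not Groq\u2019s actual toolchain:<\/p>

```python
# Toy statically scheduled program: each operation is assigned a start
# cycle and duration at compile time; nothing is decided at run time.
schedule = [
    ('load_weights', 0, 2),  # (operation, start_cycle, duration)
    ('matmul',       2, 4),
    ('activation',   6, 1),
    ('store',        7, 1),
]

def total_cycles(schedule):
    # With no runtime arbitration, latency is a static property.
    return max(start + dur for _, start, dur in schedule)

print(total_cycles(schedule))  # 8
```

<p>Because the finish time of every step is fixed in advance, timing is deterministic, which suits the strictly sequential token loop described above.<\/p>
<p>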
Every operation is pre-planned in a sort of \u201cschedule\u201d and pinned to a specific moment in the processor\u2019s work.<\/p>\n<p>As with several other AI accelerators, the LPU combines logic and memory on one die to minimise data movement.<\/p>\n<h3 class=\"wp-block-heading\">Taalas<\/h3>\n<p>All of the above assume a high degree of programmability: the model and its weights are loaded into rewritable memory. At any moment an operator can swap in a different model or tweak it.<\/p>\n<p>With this approach, performance hinges on memory availability, speed and capacity.<\/p>\n<p>Taalas <a href=\"https:\/\/taalas.com\/the-path-to-ubiquitous-ai\/\" target=\"_blank\" rel=\"noopener\" title=\"\">goes further<\/a>, choosing to hard-wire a specific model with fixed weights directly into the chip at the transistor level.<\/p>\n<p>What is usually software becomes hardware, eliminating the need for a separate general-purpose data store\u2014and the costs that come with it.<\/p>\n<p>In its first product, the HC1 inference card, the company used the open Llama 3.1 8B model.<\/p>\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" src=\"https:\/\/forklog.com\/wp-content\/uploads\/img-fd33f723b5972ba4-8455627323924195.webp\" alt=\"image\" class=\"wp-image-276638\"\/><figcaption class=\"wp-element-caption\">Taalas HC1. Source: Taalas.<\/figcaption><\/figure>\n<p>The card supports low-bit precision down to 3-bit and 6-bit parameters, speeding up processing. 
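<\/p>
<p>Whatever encoding the HC1 uses internally, the storage saving from sub-byte weights is easy to see with a generic 4-bit packing sketch (not Taalas\u2019s format):<\/p>

```python
# Pack pairs of 4-bit values (0-15) into single bytes, halving storage.
def pack4(vals):
    assert len(vals) % 2 == 0 and all(0 <= v < 16 for v in vals)
    return bytes(vals[i] | (vals[i + 1] << 4) for i in range(0, len(vals), 2))

def unpack4(packed):
    out = []
    for byte in packed:
        out += [byte & 0x0F, byte >> 4]
    return out

weights = [1, 15, 7, 0, 9, 3]
packed = pack4(weights)
assert unpack4(packed) == weights
print(len(packed), len(weights))  # 3 6
```

<p>Real low-bit schemes add per-group scale factors so the packed integers approximate the original floating-point weights; the principle, moving half or a quarter of the bytes per parameter, is what buys the speed.<\/p>
<p>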
According to Taalas, the HC1 handles up to 17,000 tokens per second while remaining relatively inexpensive and power-thrifty.<\/p>\n<p>The firm touts thousandfold performance gains over GPUs on a cost-and-power basis.<\/p>\n<p>Such hard-wiring, however, has a fundamental drawback: the model cannot be updated without replacing the chip.<\/p>\n<p>Even so, the HC1 supports <a href=\"https:\/\/www.ibm.com\/think\/topics\/lora\" target=\"_blank\" rel=\"noopener\" title=\"\">LoRA<\/a>, a method for fine-tuning <span data-descr=\"large language models\" class=\"old_tooltip\">LLMs<\/span> by adding extra weights. With the right LoRA configuration, a model can be turned into a specialist in a particular field.<\/p>\n<p>Another difficulty is the design and manufacturing process for such \u201cphysical models\u201d. ASIC development is costly and can take years\u2014a serious constraint in a fast-moving AI market.<\/p>\n<p>Taalas says it has a new method for generating processor architectures to address this. An automated system converts a model and weight set into a finished chip design within a week.<\/p>\n<p>By the company\u2019s own estimates, the production cycle\u2014from receiving a new, previously unseen model to shipping chips that embody it in hardware\u2014will take about two months.<\/p>\n<h2 class=\"wp-block-heading\">The future of local inference<\/h2>\n<p>New specialised AI chips are first landing in vast data-centre installations, powering metered cloud services. The more radical options\u2014even \u201cphysical models\u201d realised directly in silicon\u2014are no exception.<\/p>\n<p>For consumers, the engineering leap will show up as cheaper services and faster response.<\/p>\n<p>At the same time, simpler, cheaper and more energy-efficient chips lay the groundwork for widespread local inference.<\/p>\n<p>Specialised AI chips are already in smartphones and laptops, security cameras and even doorbells. 
They enable local processing with low latency, autonomy and privacy.<\/p>\n<p>Radical optimisation\u2014even at the expense of flexibility in choosing and swapping models\u2014greatly expands what such devices can do, allowing simple AI components to permeate cheap mass-market products.<\/p>\n<p>If most users begin routing their queries to models that run on local devices, the load on data-centre capacity may ease, reducing the risk of overstrain. Perhaps then there will be no need to seek radical ways to expand compute\u2014such as launching it <a href=\"https:\/\/forklog.com\/en\/news\/the-economist-predicts-the-dawn-of-space-based-data-centres\">into orbit<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Why GPUs are hitting a wall with neural networks\u2014and the specialised chips rising to replace them.<\/p>\n","protected":false},"author":1,"featured_media":95137,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"select":"1","news_style_id":"1","cryptorium_level":"","_short_excerpt_text":"How AI chips overcome the 'memory wall'","creation_source":"ai_translated","_metatest_mainpost_news_update":false,"footnotes":""},"categories":[1144],"tags":[438,1295,1294],"class_list":["post-95136","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-longreads","tag-artificial-intelligence","tag-chips","tag-nvidia"],"aioseo_notices":[],"amp_enabled":true,"views":"189","promo_type":"1","layout_type":"1","short_excerpt":"How AI chips overcome the 'memory 
wall'","is_update":"0","_links":{"self":[{"href":"https:\/\/forklog.com\/en\/wp-json\/wp\/v2\/posts\/95136","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/forklog.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/forklog.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/forklog.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/forklog.com\/en\/wp-json\/wp\/v2\/comments?post=95136"}],"version-history":[{"count":2,"href":"https:\/\/forklog.com\/en\/wp-json\/wp\/v2\/posts\/95136\/revisions"}],"predecessor-version":[{"id":95187,"href":"https:\/\/forklog.com\/en\/wp-json\/wp\/v2\/posts\/95136\/revisions\/95187"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/forklog.com\/en\/wp-json\/wp\/v2\/media\/95137"}],"wp:attachment":[{"href":"https:\/\/forklog.com\/en\/wp-json\/wp\/v2\/media?parent=95136"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/forklog.com\/en\/wp-json\/wp\/v2\/categories?post=95136"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/forklog.com\/en\/wp-json\/wp\/v2\/tags?post=95136"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}