Researchers at Meta AI have released ESM-2, a ‘protein language model’ with 15 billion parameters, along with the ESM Metagenomic Atlas, a database containing over 600 million predicted structures of metagenomic proteins.
Proteins are complex molecules built from chains of 20 types of amino acids, and they perform all kinds of biological functions in organisms. They fold into intricate three-dimensional structures, whose shapes directly determine how they work.
Determining a protein's shape helps scientists understand how it functions. Structural data also helps them find ways to imitate, modify or counter that behaviour.
A protein's final structure cannot be read off from its amino acid sequence by simple rules, and physical simulations and laboratory experiments take a long time.
According to the announcement, ESM-2 is a transformer-based large language model designed to “study evolutionary patterns and generate accurate predictions of interactions directly from a protein sequence.”
The system processes protein sequences using a self-supervised learning method called masked language modelling.
According to the researchers, the model was trained on a dataset of sequences from millions of natural proteins.
“With this approach, the model should correctly fill in words in a snippet of text, for example ‘To __ or not __, that is __’. We trained the language model to fill in gaps in protein sequences like ‘GL_KKE_AHY_G’ across millions of diverse proteins,” the study says.
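The masking step described in the quote can be sketched in a few lines. This is a toy illustration only, not Meta's training code: it shows how residues are hidden so a model can be trained to recover them from context (the `mask_sequence` helper and the 30% rate here are illustrative choices, not from the paper).

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
MASK = "_"

def mask_sequence(seq, mask_rate=0.15, rng=None):
    """Randomly hide a fraction of residues, as in masked language modelling.

    Returns the masked sequence plus the hidden (position, residue) pairs
    that the model would be trained to predict back.
    """
    rng = rng or random.Random(0)
    chars = list(seq)
    targets = []
    for i, aa in enumerate(chars):
        if rng.random() < mask_rate:
            targets.append((i, aa))
            chars[i] = MASK
    return "".join(chars), targets

masked, targets = mask_sequence("GLVKKEQAHYLG", mask_rate=0.3)
# Training objective: predict each hidden residue from the visible context.
```

In ESM-2 the prediction itself is done by a transformer; the point of the sketch is only that the training signal comes from the sequence itself, with no structural labels needed.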
ESM-2 is the largest and most capable neural network of its kind. Scientists say the algorithm is 60 times faster than other contemporary systems such as AlphaFold from DeepMind.
The algorithm was used to create the ESM Metagenomic Atlas, predicting 617 million structures from the MGnify90 protein database in just two weeks on a cluster of 2,000 GPUs. Modelling a 384‑amino‑acid chain takes about 14.2 seconds on a single Nvidia V100 GPU.
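A quick back-of-envelope check puts these figures in perspective. Assuming (hypothetically) that every prediction took the full 14.2 seconds quoted for a 384-residue chain, a 2,000-GPU cluster would produce:

```python
# Rough throughput estimate; assumes the 14.2 s/protein figure applies
# uniformly, which it does not in practice (shorter metagenomic
# sequences are predicted faster than a 384-residue chain).
seconds_in_two_weeks = 14 * 24 * 3600   # 1,209,600 s
gpus = 2000
per_gpu = seconds_in_two_weeks / 14.2   # predictions per GPU
total = gpus * per_gpu                  # ~170 million predictions
```

That upper bound of roughly 170 million is well under the reported 617 million, which is consistent with the database's average sequence being much shorter (and thus faster to predict) than the 384-residue benchmark case.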
“With current computing tools, predicting the structure of hundreds of millions of proteins could take years, even with the resources of a major research institution. To make predictions at the metagenomics scale, a breakthrough in speed is crucial,” the developers noted.
Meta AI hopes that ESM-2 and the ESM Metagenomic Atlas will advance science and help researchers studying evolutionary history or combating disease and climate change.
“We are also exploring ways to apply language models to develop new proteins and help address health and environmental problems,” the scientists added.
In July, DeepMind’s AlphaFold predicted the structures of nearly all known proteins found in plants, bacteria and animals.
In the same month, MIT researchers developed the deep-learning model EquiBind, which binds molecules to proteins for drug design 1,200 times faster than its rivals.
In July 2021, DeepMind’s AI modeled 20,000 human protein structures.
