Protein Language Models could transform outbreak response
Published: 30 March 2026
CVR researchers investigated whether pre-trained protein language models could be applied to viral protein sequences when very little data is available, and so provide useful insights during an emerging outbreak.
The emergence of viruses with pandemic potential is becoming increasingly likely as the human-wildlife interface expands and global connectivity increases. When a novel virus is discovered, or when new variants emerge, scientists rely on laboratory experiments alongside computational evolutionary and epidemiological methods to help characterise the threat.
However, this is a time-consuming process. Laboratory experiments often require weeks to complete, while evolutionary and modelling methods depend on the availability of large numbers of viral sequences. Because outbreaks can develop rapidly, there is a pressing need for tools that can provide useful insights in the earliest days of an emerging outbreak.
Researchers are now exploring whether artificial intelligence approaches known as protein language models (PLMs) could offer a solution. PLMs apply deep learning techniques developed for natural language processing to “learn the language” of protein sequences. Trained on the wide range of protein sequences available from across the tree of life, they learn the statistical dependencies between amino acids that reflect structural, functional, and evolutionary constraints. This allows PLMs to predict several useful properties of sequences, such as their 3D structure, and to predict where amino acid changes, i.e., mutations, are likely to be accommodated. Prior studies suggest that PLMs can predict the impact of viral protein mutations. However, this has not been well characterised for foundation PLMs such as ESM-2.
In the new research, the team investigated whether pre-trained protein language models could be applied to viral protein sequences when very little data is available – even when only a single sequence exists.
Strikingly, the researchers found that by comparing just one SARS-CoV-2 sequence with a reference sequence, PLMs could reveal important biological information about the virus’s spike glycoprotein, the protein responsible for enabling the virus to enter human cells.
The study shows that the model ESM-2 can identify where mutations are likely to accumulate in the spike protein. This is possible because the model has learned patterns from proteins across the broader sarbecovirus group and beyond.
Using ESM-2, the researchers demonstrated that the model could help identify sites that are critical to the protein’s function, detect mutational epistasis (where mutations influence each other’s effects), and, when additional sequences are available, highlight potential variants of concern.
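To make the idea of mutational epistasis concrete, the short Python sketch below compares the effect of a double mutant with the sum of its two single-mutant effects. It is an illustration only, not the study's method: the function names, the additive definition of epistasis and the `toy_score` stand-in (which plays the role a PLM-derived sequence score might play) are all assumptions made for this example.

```python
from typing import Callable, Sequence, Tuple

Mutation = Tuple[int, str]  # (0-based position, new amino acid)


def apply_mutations(seq: str, mutations: Sequence[Mutation]) -> str:
    """Return a copy of `seq` with each listed substitution applied."""
    residues = list(seq)
    for pos, aa in mutations:
        residues[pos] = aa
    return "".join(residues)


def epistasis(reference: str, mut_a: Mutation, mut_b: Mutation,
              score: Callable[[str], float]) -> float:
    """Double-mutant effect minus the sum of the single-mutant effects.

    A value of zero means the two mutations act independently under `score`;
    a non-zero value means each changes the other's effect.
    """
    base = score(reference)
    effect_a = score(apply_mutations(reference, [mut_a])) - base
    effect_b = score(apply_mutations(reference, [mut_b])) - base
    effect_ab = score(apply_mutations(reference, [mut_a, mut_b])) - base
    return effect_ab - (effect_a + effect_b)


def toy_score(seq: str) -> float:
    """Hypothetical stand-in for a PLM sequence score (this is NOT ESM-2)."""
    total = -0.5 * sum(1 for aa in seq if aa == "W")  # each W penalised independently
    if seq[1] == "K" and seq[3] == "E":               # invented interacting pair
        total += 2.0
    return total


interacting = epistasis("MASTQ", (1, "K"), (3, "E"), toy_score)
independent = epistasis("MASTQ", (1, "W"), (3, "W"), toy_score)
```

With this toy scorer, `interacting` is non-zero because the two substitutions only pay off together, while `independent` is zero because each substitution contributes the same penalty regardless of the other. In practice the `score` argument would be replaced by a score derived from a real model.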
To investigate these capabilities, the team used protein language models to conduct in silico deep mutational scans of viral proteins. The results from these scans were then used to fit predictive models to evaluate how well PLMs capture the behaviour of viral proteins. The findings show that PLMs can provide useful insights into how viral proteins function and evolve. The research also clarifies how concepts such as “grammaticality” and “semantics”, commonly used in language models, translate to the analysis of viral proteins and what these scores can reveal about mutation patterns.
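As a rough illustration of what an in silico deep mutational scan involves, the Python sketch below scores every possible single amino acid substitution in a short sequence. It is a minimal sketch under stated assumptions rather than the team's pipeline: `toy_model` is a hypothetical stand-in for a PLM, the log-ratio score is one commonly used heuristic, and reading “grammaticality” as the probability the model assigns to the mutant residue is an assumption borrowed from earlier viral language-model work, not a definition taken from this article.

```python
import math
from typing import Callable, Dict, List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Signature assumed for illustration: given a sequence and a 0-based position,
# return a probability for each amino acid at that position. A real PLM would
# typically mask the position and predict it from the rest of the sequence.
PositionModel = Callable[[str, int], Dict[str, float]]


def deep_mutational_scan(reference: str,
                         model: PositionModel) -> List[Tuple[str, float, float]]:
    """Score every single-residue substitution of `reference`.

    Returns (label, probability of mutant residue, log-ratio score) tuples,
    where the score is log p(mutant) - log p(reference residue).
    """
    results = []
    for pos, ref_aa in enumerate(reference):
        probs = model(reference, pos)
        for mut_aa in AMINO_ACIDS:
            if mut_aa == ref_aa:
                continue
            score = math.log(probs[mut_aa]) - math.log(probs[ref_aa])
            label = f"{ref_aa}{pos + 1}{mut_aa}"  # 1-based position, e.g. M1A
            results.append((label, probs[mut_aa], score))
    return results


def toy_model(seq: str, pos: int) -> Dict[str, float]:
    """Hypothetical stand-in for a PLM: favours the observed residue 10:1."""
    weights = {aa: (10.0 if aa == seq[pos] else 1.0) for aa in AMINO_ACIDS}
    total = sum(weights.values())
    return {aa: w / total for aa, w in weights.items()}


scan = deep_mutational_scan("MFVF", toy_model)
```

Here `scan` holds one entry for each of the 19 alternative residues at each of the four positions. Because the toy model always prefers the observed residue, every substitution receives a negative score; a real model would instead produce position-specific scores, with higher values suggesting sites where mutations are more readily accommodated.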
Overall, the study demonstrates that pre-trained protein language models can help scientists understand viral proteins even when only limited sequence data is available. This makes them particularly valuable during the earliest stages of an outbreak, when rapid insights are most needed.
The researchers outline how these models could be applied throughout different stages of outbreak response – from the moment a viral sequence first becomes available, through to longer-term horizon scanning as part of routine surveillance.
When combined with experimental data, PLMs can become even more effective predictors and complement traditional laboratory and computational approaches.
As protein language models continue to improve, they may play an increasingly important role in analysing viral sequences and anticipating how viruses could evolve in the future. While the study highlights both the benefits and limitations of the approach, it suggests that these tools could one day become a central part of how scientists assess emerging viral threats and support rapid responses to future outbreaks.