Johannes Söding is a Research Group Leader at the Max Planck Institute for Multidisciplinary Sciences in Göttingen, Germany.
He sent the following comment:
Martin Steinegger's comments are pretty complete and summarize very well the development of protein sequence search methods that facilitated the development of deep learning models trained on billions of sequences, such as AlphaFold2.
However, I would place more emphasis on the critical importance of the Linclust algorithm, both for enabling the training of protein language models and for ensuring the generation of the sufficiently diverse multiple sequence alignments (MSAs) that AlphaFold2 requires for high-quality predictions. I think it is no exaggeration to say that Linclust is at the core of the breakthroughs in protein language models, deep-learning-based protein structure prediction, and protein engineering. I would rearrange the content of the two paragraphs that Martin proposed in a slightly different way:
2016: MMseqs2: Fast iterative profile searches for building MSAs
Exploiting the huge metagenomic sequence sets for iterative sequence searching to build MSAs required a fast sequence profile search tool that could handle datasets of billions of sequences. MMseqs2 filled that gap, with a search speed two to three orders of magnitude faster than PSI-BLAST or HMMER at similar sensitivity. It would later enable the fast generation of MSAs for AlphaFold2 and ColabFold.
Steinegger, M., & Söding, J. (2017). MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, 35, 1026–1028.
2017: Linear-time sequence clustering enabled the exploitation of huge metagenomic sequence corpora
Just as large language models have profited from the ever-increasing size of their training corpora, the deep-learning revolution in protein biology, including AlphaFold, relies critically on training protein language models on huge non-redundant sets of protein sequences. AlphaFold2, for instance, was trained on a collection of representative sequences obtained by clustering 4 billion sequences from metagenomic and genomic sources (BFD database) and 1.6 billion sequences from MGnify v18. Generating such huge reference sets only became possible with Linclust, the first algorithm whose runtime scaled linearly instead of quadratically with the size of the input sequence set. Before Linclust, the practical limit for sequence clustering was around 100 million sequences. AlphaFold2 also profits in a second way from huge and diverse databases such as MGnify and BFD clustered with Linclust. The model quality depends on sufficient diversity in the MSA built from the query sequence, and that diversity may depend crucially on the diversity of the sequence databases that are searched for homologous sequences. (Removing both MGnify and BFD from the MSA generation reduced AlphaFold2’s mean GDT score by 6.1.)
Steinegger, M., & Söding, J. (2018). Clustering huge protein sequence sets in linear time. Nature Communications, 9(1), 2542.
Ovchinnikov, S., et al. (2017). Protein structure determination using metagenome sequence data. Science, 355(6322), 294–298.