SCIENTIFIC COLLECTIVE and ARTIFICIAL INTELLIGENCE

From PDB to AlphaFold
This timeline is an ongoing collective history and reflection about the datasets (especially the PDB) and the scientific advances that gave rise to AlphaFold.
Contributions are welcome: for example, detailed accounts and verifications from the individuals involved, interviews with them, identification of relevant documents, and reflections on the future of AI in science inspired by this history. The following have already contributed:
Jane Richardson (Duke), Helen Berman (USC and Rutgers, a former Director of PDB, see Interview), Pierre Baldi (UC Irvine, see comment and Interview), Søren Brunak (University of Copenhagen, see comment and Interview), Adam Godzik (UC Riverside), Tomaso Poggio (MIT, see comment), Alyssa Cruz (Sanford Burnham Prebys), Martin Steinegger (Seoul National University, a co-author of the original AlphaFold2 paper, see comment), Johannes Söding (Max-Planck Institute for Multidisciplinary Sciences, see comment), Joel Sussman (Weizmann Institute, a former Director of PDB, see comment), Philip Campbell (former Editor in Chief of Nature, see comment), Alexander Wlodawer (NIH, see comment), Tom Cech (former HHMI President, see comment), Harold Varmus (former NIH Director, see comment), Jake Feala (co-founder of Lila Sciences, see comment) and Mohammed AlQuraishi (Columbia University, see comment).
Comments and suggestions can be added to the dedicated section of the Discussion Forum.
A detailed timeline with more information and references and a summary are available.
1958-1960 The first two protein structures, those of myoglobin and haemoglobin, were determined by John Kendrew and Max Perutz.
1965 MIT's Project MAC used computer graphics for the first time to display a protein structure.
1965 Margaret O. Dayhoff's Atlas of Protein Sequence and Structure compiled ~70 known protein sequences.
1969 Cyrus Levinthal described the paradox of protein folding: the folding process must be guided by specific interactions and not by a random search through all possible conformations, which would take an immensely long time.
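A common back-of-the-envelope version of the argument (the numbers below are illustrative assumptions, not Levinthal's own figures) takes only a few lines of Python:

# Illustrative Levinthal estimate: a 100-residue chain, 3 conformations
# per residue, and 10**12 conformations sampled per second (all of these
# numbers are assumptions chosen for illustration).
conformations = 3 ** 100                # about 5e47 possible states
seconds = conformations / 1e12          # time for an exhaustive search
print(f"{seconds / 3.15e7:.1e} years")  # about 1.6e28 years

Even at this generous sampling rate an exhaustive search would take vastly longer than the age of the universe, so folding cannot proceed by random search.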
1970–1982 The Needleman–Wunsch algorithm was the first systematic method for global sequence alignment; the Smith–Waterman algorithm later provided a framework for local alignments that detect conserved regions.
These seminal works and their refinements underlie the dynamic-programming algorithms still used today to compute pairwise protein alignments.
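For illustration, here is a minimal Needleman–Wunsch global alignment score in Python (the match, mismatch and gap scores are arbitrary example values, not published parameters):

# Minimal Needleman–Wunsch global alignment (illustrative sketch).
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    n, m = len(a), len(b)
    # F[i][j] holds the best score for aligning a[:i] with b[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,  # align the two residues
                          F[i - 1][j] + gap,    # gap in b
                          F[i][j - 1] + gap)    # gap in a
    return F[n][m]

print(needleman_wunsch("HEAGAWGHEE", "PAWHEAE"))

Smith–Waterman differs mainly in clamping each cell at zero and reporting the maximum score anywhere in the matrix, which yields local rather than global alignments.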
1971 June. At a meeting at Cold Spring Harbor, Helen Berman and her young colleagues presented the idea of a data bank of protein structures.
Walter Hamilton, a senior scientist at Brookhaven National Laboratory, volunteered to set up the American protein data bank.
1971 August. The first 3D molecular graphics film was shown in a lecture by Joel L. Sussman on the structure of a very small RNA.
1971-1976 The Protein Data Bank (PDB) was established. It originally contained 7 structures and at first grew slowly: by 1976 the database held a total of 13 structures.
1972 Christian Anfinsen received the Nobel Prize in Chemistry for his work showing that all the information needed to specify a protein's 3D structure is contained in its amino acid sequence.
1982 The sequence databases at GenBank in the US and at EMBL in Europe were opened to the public.
1988 First applications of neural networks to predict the secondary structure of proteins from their sequences.
1989 An article by Marcia Barinaga in Science, "The Missing Crystallography Data", provided a snapshot of the ongoing discussions about sharing protein structure data. Letters and a petition signed by scientists encouraged sharing, and NIGMS supported this view.
There was a difference of opinion among the editors of the major scientific journals.
The International Union of Crystallography recommended deposition of data but, as a compromise among different viewpoints, allowed the release of some data (coordinates and structure factors) to be delayed.
1990 Altschul et al. introduced BLAST, a tool that revolutionized sequence searches by enabling rapid detection of sequence similarities.
1994 CASP (Critical Assessment of protein Structure Prediction) was co-founded by John Moult and Krzysztof Fidelis as a blind, independent test of software for the prediction of protein structure from sequence.
The results improved until 2002 but were essentially flat after that date. The next major improvements came in 2018 and 2020 with AlphaFold and AlphaFold2.
1995 The first PDB Browser was released. Selected proteins in the PDB could be easily downloaded, and their molecular structures visualized on lab computers.
1996 The PDB released AutoDep, the first web-based tool for macromolecular structure deposition and validation.
1996 November. A conference celebrating the 25th Anniversary of the PDB and the 10th Anniversary of Swiss-Prot was held in Jerusalem. This was one of the first meetings at which 3D structural data and sequence information were analyzed synergistically.
1997 Building upon the BLAST framework, PSI-BLAST used iterative, position-specific profile searches, allowing researchers to detect even more remote homologous relationships efficiently.
1998 Sean Eddy developed HMMER, an efficient suite of methods that applies hidden Markov models to sequence search and alignment. HMMER significantly improved the sensitivity of sequence comparisons, establishing itself as a critical tool for the large-scale annotation of protein families and domains.
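As background, a hidden Markov model assigns a probability to a sequence via the forward algorithm. The toy two-state model below is invented for illustration; the profile HMMs used by HMMER are richer, with match, insert and delete states for each alignment column:

import numpy as np

# Toy forward algorithm for a 2-state HMM over a DNA alphabet.
# All parameters here are invented for illustration only.
start = np.array([0.5, 0.5])                # initial state probabilities
trans = np.array([[0.9, 0.1],               # state transition matrix
                  [0.2, 0.8]])
emit = {"A": np.array([0.4, 0.1]), "C": np.array([0.1, 0.4]),
        "G": np.array([0.1, 0.4]), "T": np.array([0.4, 0.1])}

def forward_prob(seq):
    alpha = start * emit[seq[0]]            # initialize with first symbol
    for ch in seq[1:]:
        alpha = (alpha @ trans) * emit[ch]  # propagate states, then emit
    return alpha.sum()                      # total sequence probability

print(forward_prob("ACGT"))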
1998 Nature, Science and PNAS required the immediate release of high-resolution structural coordinate data upon publication, citing the opinion of a significant majority of the scientific community.
The NIH required grant recipients to deposit atomic coordinates for immediate release upon publication. Other funders, such as HHMI, similarly changed their policies.
1999 The Research Collaboratory for Structural Bioinformatics (RCSB) became the new manager of the PDB. A new data representation format (mmCIF), fully machine readable and facilitating quality control, started to be adopted by the RCSB.
2000 Larry Page, who co-founded Google in 1998, predicted the importance of AI for answering search queries and for the future of Google.
2000 The US National Institute of General Medical Sciences (NIGMS) at NIH supported the Protein Structure Initiative (PSI) for 15 years, from 2000 to 2015, providing grants totalling around $1 billion. The PSI was created to solve novel protein structures in a high-throughput manner.
There was much debate during the period of NIH funding of the PSI about the relative merits of investigator-initiated, hypothesis-motivated science versus more systematic discovery science. The PSI was the largest of the structural genomics consortia, which were based not only in the US but also in Europe, Canada and Japan.
2002 PIR (originating from Dayhoff's Atlas), Swiss-Prot (established by Amos Bairoch in 1986) and TrEMBL received support from NIH and merged to form UniProt. The extensive openly available protein sequence data in UniProt was indispensable for generating the diverse multiple sequence alignments critical for training AlphaFold2.
2003 The worldwide PDB (wwPDB) was announced, with the goal of "maintaining a single archive of macromolecular structural data that is freely and publicly available to the global community".
2005–2012 Johannes Söding and colleagues developed HHsearch and HHblits, which greatly enhanced sensitivity for detecting remote homology, enabling the discovery of extremely distant relationships.
2008 The first significant use of GPUs (graphics processing units) in machine learning applications. GPUs were initially developed for digital image processing and used by the videogame industry, but were later found to considerably speed up the calculations needed in AI applications.
2008 PDB depositions began to require not only the coordinates but also the structure factors (the experimental data), allowing more comprehensive validation.
2009 Initial publication of ImageNet, a very large and systematic dataset of labelled images built by a group led by Fei-Fei Li and designed to support AI vision research. There was increased recognition of the importance of both algorithms and datasets in AI research.
2010 August. Demis Hassabis presented the key ideas behind DeepMind at a conference in San Francisco, suggesting that machine learning and knowledge of neuroscience could be combined to design artificial general intelligence.
In November DeepMind was officially founded by Hassabis, Shane Legg and Mustafa Suleyman.
2010 to early 2011 DeepMind received funding from venture capital groups, led by Peter Thiel; Tomaso Poggio, a well-known MIT scientist, was also a minor investor. According to Shane Legg there was little support at the time for approaches to AI that aimed to build artificial general intelligence. One of DeepMind's first breakthroughs was an algorithm that could learn to play many different Atari videogames.
2012 AlexNet, a convolutional neural network, designed by Alex Krizhevsky, Ilya Sutskever and Geoffrey Hinton from the University of Toronto, won the ImageNet Large Scale Visual Recognition Challenge. It was the first model based on neural networks to win the competition, achieving a large improvement compared to previous methods. The three authors were hired by Google in 2013.
It has been widely commented that the ImageNet 2012 competition triggered the most recent explosion of interest in AI.
2014 January. Google bought DeepMind for around $600m, but DeepMind initially remained a separate entity. DeepMind gained access to a large computational infrastructure and to capital for expanding its team and acquiring top talent.
2014 The total number of structures deposited in PDB surpassed 100,000.
2016 The AlphaGo match against Lee Sedol was another proof of principle for DeepMind, after which DeepMind started a serious effort on the protein folding problem.
2016 Exploiting the huge metagenomic sequence sets for iterative sequence searching to build multiple sequence alignments required a fast sequence profile search tool that could handle datasets of billions of sequences. MMseqs2 filled that gap, and would later enable the fast generation of multiple sequence alignments for AlphaFold2 and ColabFold.
2017 The integration of metagenomic sequencing data dramatically expanded the pool of available protein sequences by adding billions of sequences from diverse microbial communities. This expansion has vastly improved the breadth and accuracy of multiple sequence alignments used in protein structure prediction and other analyses.
2017-2018 Linear-time sequence clustering enabled the exploitation of huge metagenomic sequence corpora.
Additionally, the Uniclust resource was established to provide deeply clustered and annotated protein sequence databases based on UniProt data and was utilized for AlphaFold2 training.
2017 Publication of the "Attention Is All You Need" paper about transformers by a group from Google. Transformers are a type of AI architecture that can process entire sequences in parallel and better capture long-range dependencies. They have become the foundation of many state-of-the-art AI systems, including large language models like GPT, image classification systems and protein structure prediction systems like AlphaFold2.
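For intuition, the core transformer operation is scaled dot-product attention, in which every position computes a weighted mixture over all positions at once; a minimal NumPy sketch (the sizes are arbitrary toy values):

import numpy as np

# Minimal scaled dot-product attention, the core transformer operation.
# All positions are processed in parallel, which is how transformers
# capture long-range dependencies.
def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # pairwise similarities
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)       # softmax over the keys
    return w @ V                             # weighted mixture of values

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 8))              # 6 positions, 8 features
print(attention(x, x, x).shape)              # self-attention: (6, 8)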
2017 The PDB began producing a validation report that an increasing number of journals require for review. The report provides metrics to evaluate the quality of the experimental data, the structural model, and the fit between them.
2018 AlphaFold from DeepMind won CASP13.
2020 DeepMind's AlphaFold2 won CASP14 and is considered by many to have essentially solved the protein folding problem. The authors stated that "This bioinformatics approach has benefited greatly from the steady growth of experimental protein structures deposited in the Protein Data Bank (PDB), the explosion of genomic sequencing and the rapid development of deep learning techniques to interpret these correlations."
AlphaFold2 is based on a modified transformer architecture. The Evoformer uses comparative evolutionary information from multiple sequence alignments together with a pairwise residue representation, and then passes this information to the structure module, a further attention-based network that generates the 3D coordinates.
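A heavily simplified sketch of the Evoformer idea: the real model interleaves gated row- and column-wise attention on the MSA representation (biased by the pair representation) with triangle updates, but the basic axial-attention pattern over a toy MSA tensor can be shown in a few lines:

import numpy as np

# Axial attention over a toy MSA tensor (sequences x residues x features).
# A sketch of the pattern only, not the actual Evoformer, which adds
# gating, pair-representation bias and triangle updates.
def attention(Q, K, V):
    w = np.exp(Q @ K.T / np.sqrt(Q.shape[-1]))
    return (w / w.sum(-1, keepdims=True)) @ V

msa = np.random.default_rng(1).standard_normal((4, 10, 8))

# Row-wise attention: residues within each aligned sequence attend to each other.
msa = np.stack([attention(row, row, row) for row in msa])

# Column-wise attention: the same residue position attends across sequences.
cols = msa.transpose(1, 0, 2)
msa = np.stack([attention(c, c, c) for c in cols]).transpose(1, 0, 2)

print(msa.shape)  # (4, 10, 8)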
The DeepMind team working on AlphaFold was led by John Jumper and supervised by Demis Hassabis. A discussion started about the reasons for their success.
2021 RoseTTAFold, developed by a team led by David Baker, incorporated ideas from AlphaFold2 and achieved accuracies approaching those of AlphaFold2.
2022 ColabFold made AlphaFold2 predictions widely accessible to researchers and practitioners without access to large-scale computing infrastructure.
2022 DeepMind released structure predictions for 218 million proteins, nearly all known proteins.
2023 - 2024 A paper describing AlphaMissense, developed by DeepMind, was published in 2023. It predicted the pathogenicity of all possible human single amino acid substitutions. All the components of the AlphaFold and AlphaFold2 AI models had been shared openly, but in the case of AlphaMissense some parameters were not shared.
In 2024 DeepMind released AlphaFold3, which adds a diffusion-based method to predict binding structures and interactions of proteins with other molecules.
When AlphaFold3 was published in Nature the code was not provided, and a petition signed by more than one thousand scientists expressed disappointment with the lack of disclosure of the code at the time of publication.
Six months after publication the code of AlphaFold3 was released for academic use.
2024 The lab of David Baker released RoseTTAFold All-Atom, which predicts 3D structures of assemblies of proteins with small molecules and other components.
OpenFold, an open-source implementation of AlphaFold2 including the code and data required to train new models, was produced by a large academic collaboration and yielded insights into the model's learning mechanisms and capacity for generalization.
2024 Oct. The Nobel Prize in Chemistry was awarded to David Baker, Demis Hassabis and John Jumper "for computational protein design and protein structure prediction".
2024 Nov. At the AI for Science Forum, co-hosted by Google DeepMind and the Royal Society, Janet Thornton, who was closely involved with the PDB, said that it took 20 years for every scientist to come around to the idea of sharing the data. Some of the most famous scientists did not initially share; a change of culture was needed.
Siddhartha Mukherjee pointed out that patients might freely share their data to benefit the public good but might be less inclined to agree to do it for the benefit of a company.
Anna Greka suggested that a dataset that could play the same role as the PDB for future AI models of the cell could be obtained by systematically perturbing human cells.
Paul Nurse said that science has increased in complexity and silos have been created; we must begin by seeing how we can break down those silos and get the different parts of the scientific community talking to each other, especially with respect to artificial intelligence, because we are all being influenced by it.
2024 Dec 8. Nobel lectures by Baker, Hassabis and Jumper. All the speakers said that the PDB data had been essential for their work.
A list of criteria was suggested by Demis Hassabis to determine if a scientific problem is suitable for an AI solution:
1- Massive combinatorial search space
2- Clear objective function (metric) to optimize against
3- Either lots of data and/or an accurate and efficient simulator
2025 January. Demis Hassabis stated that most of the original AlphaFold team at Google DeepMind is now working on the Virtual Cell, building an AI simulation of a working cell. They expect to solve this problem within the next five years.
Other approaches to AI in science are also being explored, as shown by the comments on this website from industry and academic scientists, including Jake Feala, Aviv Regev and Sarah Teichmann, Gene Yeo, Jack Gilbert, Gary Siuzdak and Bruno Conti, Andrea Califano, Pierre Baldi, Søren Brunak, Talmo Pereira, Andrew McCulloch and many others, in the sections on Interviews, Roundtable and Surveys.