SCIENTIFIC COLLECTIVE and ARTIFICIAL INTELLIGENCE
Forum Posts
Giovanni Paternostro
Sep 15, 2025
In AI and Collective Intelligence
Summary of older AI and Collective Intelligence discussions
The recent progress in Artificial Intelligence provides both challenges and opportunities for scientific collective intelligence. Among the most notable examples of AI progress are ChatGPT and other Large Language Models, which have shown unexpected capabilities (Wei 2022, Mitchell 2023), and AlphaFold, which can predict the 3D shape of proteins from their genetic sequence with unprecedented accuracy (Jumper 2021). The development of AlphaFold was recognized by the award of the 2024 Nobel Prize in Chemistry.
Challenges
AI poses specific challenges for science. There are many reports of errors in statements from ChatGPT and from other AI systems. The types of errors and blind spots seem different from those more common in humans.
These AI systems consist of neural networks with billions to trillions of parameters (Mitchell, 2023) and it is therefore not possible to provide a simple explanation for their outputs.
Another concern is that the most powerful AI systems are currently privately owned and not as transparent as they could be. One of the topics that emerged from our discussion is the potential benefit to society of a public or nonprofit AI effort with the same scale and level of funding as the current large private efforts. Many comments have pointed out that, if the most advanced science were done only in the private sector, the lack of transparency would decrease trust in science, support for academic research would decline, and society would not be able to fully benefit from the opportunities provided by AI in science.
Opportunities
There is wide support for the view that human intelligence evolved in response to intellectual challenges, possibly posed by social interactions. AI systems can be an opportunity to stimulate our collective intelligence.
There are several examples of major challenges that have promoted collective scientific efforts. Among these are:
• The World War II codebreaking effort at Bletchley Park, where Alan Turing played a key role; responding to German advances in cryptography, this effort led to the first purely electronic digital computers.
• The Manhattan Project, prompted by the discovery of nuclear fission by German scientists and leading to advances in nuclear physics, for both military and peaceful applications.
• The creation of NASA, which led to the Moon landing; it was sparked by the Sputnik launch and the space race with the Soviet Union.
AI can make large scale discussions possible by finding new ways to connect individuals and ideas. This could be an iterative process in which human scientists can decide which avenues to pursue and provide novel contributions.
In both artificial and natural neural networks some capabilities emerge when a certain size is reached. In both cases (AI and brain) other properties like connectivity are likely to synergize with size (Wei 2022, Tattersall 2023). It is an open question what properties might emerge in the case of collective intelligence as it grows further.
According to a survey from the Pew Research Center (Pew 2024), 76% of US adults express a great deal or fair amount of confidence in scientists to act in the public’s best interests. Scientists are held in higher regard than many other prominent groups, including elected officials, journalists and business leaders. Society is therefore likely to take scientists’ views about AI seriously, if these views are reached after an open debate. An example of the effectiveness of transparent, community-level consultations among scientists in increasing support from the public and from funders is the Snowmass process in particle physics. Other fields are also adopting similar processes, as shown by the Decadal Survey on Astronomy and Astrophysics.
A fundamental aspect of scientific collective intelligence is the communication of ideas among scientists.
Renato Dulbecco shared his memories about a more open time in biomedical scientific communication. Multiple historical sources confirm his statements, and show that scientific habits that might seem immutable do change, "responding to the changing world" (in Dulbecco's words). Dulbecco also pointed out that scientific communication depends more on the motivations of individuals than on technical means.
The section on Science Incentives outlines a strategy for the scientific community to self-motivate an open discussion about AI in science.
Discussion platforms can be biased and platforms for scientific collective intelligence would benefit from mechanisms that ensure trust.
In the case of this discussion, we intend to establish two oversight groups:
• An oversight group composed of younger scientists, with representatives nominated by associations of postdocs and graduate students. Trainees do not have a long-term link with a particular institution, are more likely to rapidly learn new AI techniques and are often more open to innovative ideas.
• An oversight group composed of current and former scientific leaders. Some of these are already involved in the current discussion and have participated by sharing ideas and by conducting interviews.
These two groups will serve as a system of checks and balances to ensure that the discussion is run for the benefit of the entire scientific community and of human society. They will determine their own internal structures and procedures.
An example of the benefit of involving both these groups of scientists can be found in the history of AI in science. The development of AlphaFold was only possible because of the data contained in the PDB (Protein Data Bank). The PDB started in 1971 as a grassroots proposal from a group of young scientists, supported by scientific leaders like Walter Hamilton and Max Perutz (Berman 2008, Strasser 2019). The history of PDB also provides an example of many potential obstacles for scientific sharing and collaboration (Barinaga 1989), which were in this case eventually overcome.
REFERENCES
- Barinaga, M., 1989. The missing crystallography data. Science, 245(4923), pp.1179-1181.
- Berman, H.M., 2008. The protein data bank: a historical perspective. Acta Crystallographica Section A: Foundations of Crystallography, 64(1), pp.88-95.
- Jumper, J., et al., 2021. Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873), pp.583-589.
- Mitchell, M. and Krakauer, D.C., 2023. The debate over understanding in AI’s large language models. Proceedings of the National Academy of Sciences, 120(13), p.e2215907120.
- Pew Research Center, 2024. Public Trust in Scientists and Views on Their Role in Policymaking.
https://www.pewresearch.org/science/2024/11/14/public-trust-in-scientists-and-views-on-their-role-in-policymaking/
- Strasser, B.J., 2019. Collecting experiments: Making big data biology. University of Chicago Press.
- Tattersall, I., 2023. Endocranial volumes and human evolution. F1000Research, 12.
- Wei, J., et al., 2022. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682.
Giovanni Paternostro
Aug 12, 2025
In From PDB to AlphaFold
Guy Salvesen and Giovanni Paternostro have spoken with Adam Godzik. Adam is the Bruce D. and Nancy B. Varner Presidential Endowed Chair in Cancer Research at the UC Riverside School of Medicine, Division of Biomedical Sciences. Adam was closely involved from an early stage in CASP, as a participant, and in the Joint Center for Structural Genomics (JCSG), one of the centers supported by the Protein Structure Initiative (PSI).
He sent the following comments:
I think that the success of the PDB was driven by it being built by the crystallographic community itself; it was an effort from within, not from outside. It became widely accepted relatively early in its history, definitely before I got into the field.
There was another development in bioinformatics that enabled AlphaFold – residue-residue interaction predictions from MSA (work of Debora S. Marks, for instance https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0028766) or a more general contact map prediction field with quite a long history. This is really how AlphaFold works – predicting a contact map from MSA and then refining it.
Clustering of large databases to get manageable and uniform datasets started from UniProt90 and UniProt50 and our CD-HIT program before it was taken over by Uniclust. UniProt50 and UniProt90, now done with MMseqs2, still remain a main source of representative protein sequences.
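To make the contact-map idea mentioned above more concrete, the sketch below scores co-variation between columns of a toy multiple sequence alignment using plain mutual information, a deliberately crude stand-in for the direct-coupling and deep-learning methods actually used in this line of work; the five-sequence alignment is invented purely for illustration.

```python
import numpy as np
from itertools import combinations

# Toy MSA (rows = homologous sequences, columns = alignment positions).
# The sequences are invented purely for illustration.
msa = ["MKVLA", "MRVIA", "MKVLG", "MRVIG", "MKILA"]

def column(msa, j):
    """Return alignment column j as a list of residues."""
    return [seq[j] for seq in msa]

def mutual_information(col_i, col_j):
    """Mutual information (in nats) between two alignment columns."""
    n = len(col_i)
    mi = 0.0
    for a in set(col_i):
        for b in set(col_j):
            p_ab = sum(1 for x, y in zip(col_i, col_j) if x == a and y == b) / n
            if p_ab == 0.0:
                continue
            p_a = col_i.count(a) / n
            p_b = col_j.count(b) / n
            mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

L = len(msa[0])
scores = {(i, j): mutual_information(column(msa, i), column(msa, j))
          for i, j in combinations(range(L), 2)}

# Column pairs that co-vary most strongly are predicted to be in contact.
for (i, j), s in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
    print(f"columns {i}-{j}: MI = {s:.3f}")
```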
Giovanni Paternostro
Jun 18, 2025
In From PDB to AlphaFold
Genentech and DeepMind are examples of start-ups that achieved major scientific advances, while parallel efforts by academic groups and by large companies on the same problems were not successful.
A comparative analysis might better show the reasons behind their achievements.
The research on DeepMind is described in the timeline, while the research on Genentech is based on published historical reconstructions, such as the books by Stephen Hall (1987) and Sally Hughes (2011), on interviews with many protagonists available from the Berkeley Library Digital Collections, and on conversations with former employees, including Roberto Crea, who was one of the first five employees and, even before that, a key author of the papers describing the initial work done at City of Hope with Genentech support (Crea et al, 1978; Itakura et al, 1977; Hirose et al 1978; Goeddel et al, 1979).
The comparative analysis is ongoing, but the following points are emerging:
1- In both cases there was initially widespread skepticism, but the novel approaches were supported by some scientific leaders.
In the case of Genentech, one demonstration of this skepticism is the grant application submitted to the NIH by Riggs and Itakura, from City of Hope. It was not funded by the NIH, but the work was done with the support of Genentech and provided a key proof-of-principle result for the company. As described in the book by Hughes (2011):
"To conduct the experiment, Riggs and Itakura needed funding. In February 1976 they submitted a grant application to the NIH entitled “Human Peptide Hormone Production in E. coli.” They asked for $400,000 for a three-year project to make somatostatin using DNA synthesis and recombinant DNA technologies. They went on to state with notable confidence, considering the uncertainties involved:
The work proposed here will lead to the production of human hormone peptides in E. coli. We think that E. coli can be used to produce human hormones more cheaply and of better quality than can be made by synthetic peptide [protein components] chemistry. The availability of inexpensive, high quality human hormones will have many clinical application[s].
That fall the NIH turned down their grant application. The reviewers decided that Riggs and Itakura could not accomplish the proposed research in the stipulated three years and labeled it “an academic exercise” without practical merit."
Riggs (2021) has stated in a review paper:
"Itakura and I then wrote and submitted to the National Institutes of Health (NIH) (in early 1976) a grant application in which we proposed to chemically synthesize the gene for somatostatin, clone it in E. coli, and assay for the production of the somatostatin polypeptide. In this grant, we also stated that if using somatostatin was successful, we would use similar technology to produce insulin. The grant was reviewed moderately well but not funded. The summary statement of the rejection noted: “In conclusion, the goals reflect extremely complex and time-consuming projects which may not be reasonably accomplished in three years . . . the only possible outcome of this work would be to confirm that these manipulations can lead to the synthesis of a human peptide in E. coli. Because of the poor choice of the biological system, this appears as an academic exercise.” In hindsight, our project on somatostatin was indeed an academic exercise, but a novel one that provided strong patents and quickly led to a flourishing new industry. "
The work was in reality completed in a few months, but it required several novel methods that had not been described in the grant application, for example the strategy of synthesizing a fusion protein to protect the peptide from degradation and more efficient methods for the chemical synthesis of nucleotide oligomers (Itakura et al, Science 1977).
Several large companies contacted by Herb Boyer were initially equally skeptical. Even Michael Bishop (Nobel Medicine, 1989) stated in his Introduction to Herb Boyer's Oral History: " I recall that Herb did offer me one main chance--the opportunity to be an early investor in Genentech; I declined. It seemed a dubious scheme to me."
In the case of DeepMind, Shane Legg mentioned how they started the company at a time when deep learning methods in AI were not widely accepted.
In both cases, however, some well-known scientists supported the novel approach: for Genentech, Herb Boyer from UCSF, in addition to the City of Hope team; for DeepMind, Tomaso Poggio from MIT.
2- Venture capital provided initial funding. This is a type of funding that does not require consensus in the scientific community.
The leading investors were Kleiner & Perkins for Genentech and Peter Thiel for DeepMind.
A policy of gradual risk reduction was adopted, with follow-up investments in steps (rounds), depending on the results obtained.
The risk reduction strategy used for Genentech was explained in detail by Tom Perkins in his oral interview (Berkeley Library Digital Collections):
" A week or so after that they had put together the nucleus of a business proposal to do genetic engineering, Bob [Swanson] brought it to me for financing. It was very conventional, in that I would put up the money, they would hire the people, and it would be a straightforward venture. I took the view that the technical risk was so enormous. I remember asking, “Would God let you make a newform of life like this?” I was very skeptical. I said that I would agree to meet with Boyer. He came in that same week, and we sat down in our conference room for about three hours. Of course, I have a background in physics, electronics, optics, computers, lasers. Biology was never a strength for me. I really didn't know what kind of questions to ask. So I said, “Let 's just go through it, step by step. Tell me what you are going to do. What equipment you 'll need. How will you know if you have succeeded? How long will it take?” I was very impressed with Boyer. He had thought through the whole thing. He had an answer for all those questions - you'll need this equipment, these basic chemicals, and take these measurements, and on and on. I concluded that the experiment might not work, but at least they know how to do the experiment.
I still felt the risk was stupendous. The next day I got together with Swanson, and I took the view that I am willing to go along with this thing but that we have got to figure out a way to take some of the risk out of it - something instead of me giving you all of the money, then you renting the facility, buying the equipment, and hiring the people. With that approach, you'll have spent maybe a million dollars by the time you get to actually performing the experiment. Then if it doesn't work, it is all over and all that money is lost.
“Can’t we figure out some way to subcontract this experiment to different institutions each of which already had part of these capabilities?” Nobody had all of the capabilities, that was very clear. In order to give some incentive to do that, to subcontract the work, I said I would be willing to finance the thing in phases, to put up less money upfront. If this thing starts to work, then I will put up more and more money at higher and higher prices, and you and Boyer will end up owning more of the company than if we just do it the conventional way. I'll want to own most of the company if I'm going to take all of that conventional risk. Swanson thought that was not a ridiculous suggestion. He went back to Boyer and a few days later they had come up with three institutions that could do this work."
In the case of DeepMind multiple rounds of funding also took place, before the acquisition by Google. The initial focus on video games might also be considered a risk reduction strategy, given the previous track record of Demis Hassabis in this field.
3- A start-up originated the project, but large companies later supported their efforts, after proof-of-principle data were obtained.
Eli Lilly and then other companies for Genentech, Google for DeepMind.
4- Both start-ups generated publications, and the scientists were eventually recognized by major scientific prizes.
The companies encouraged publications, to provide recognition for the company and for the scientists, and to facilitate recruitment of the best scientists. In the case of Genentech, proprietary information was covered by patents.
5- Interdisciplinary teamwork was essential.
In both cases it was noted that this was done on a scale difficult to reach for academic groups. The motivation was the success of the company and not the career of the individual, as is often the case in academic labs.
One of the early employees of Genentech, Herbert Heyneker, stated in his Oral History:
" In academe, the motivation is quite different. Graduate students are there to get a PhD thesis, so they focus on their little aspect. That’s all there is to it. They don’t have to integrate into a bigger project. The postdocs are there to make a name for themselves because they want to become assistant professors, so they have to publish. Those are the most productive years. But again, the goal is very personal. “What contribution can I make to a certain understanding of whatever.” It can be very individualistic. In industry, the goals are more clearly defined, but often you need different disciplines to reach them. So, indeed, out of Genentech came articles with twelve or fifteen names on them, and it was always viewed by academe as a funny way of doing science. I found the contrary; it was a very different way of doing science, because this was a demonstration that you can accomplish a lot by working together with different disciplines."
In the case of DeepMind, it has been noted as remarkable that the AlphaFold2 paper (Jumper et al., Nature 2021) had 19 authors listed as having contributed equally.
6- Young scientists played a key role.
DeepMind was founded by postdocs.
Younger scientists enjoyed a large degree of independence in the initial activities of Genentech, especially after the company acquired its own lab facilities, while senior scientists played an advisory and strategic role. According to Hughes (2011) in the early days of Genentech "The young scientists banded together into flexible multidisciplinary teams that exhibited inexhaustible engagement, camaraderie, and a willingness to pull together to reach common ends."
REFERENCES
Berkeley Library Digital Collections
Bioscience Oral Histories
https://digicoll.lib.berkeley.edu/search?ln=en&cc=Bioscience+Oral+Histories
Berkeley Library Digital Collections
Science, Tech, & Health Oral Histories
https://digicoll.lib.berkeley.edu/search?ln=en&cc=Science%2C+Tech%2C+%26+Health+Oral+Histories
Crea, Roberto, Adam Kraszewski, Tadaaki Hirose, and Keiichi Itakura. "Chemical synthesis of genes for human insulin." Proceedings of the National Academy of Sciences 75, no. 12 (1978): 5765-5769.
Goeddel, David V., Dennis G. Kleid, Francisco Bolivar, Herbert L. Heyneker, Daniel G. Yansura, Roberto Crea, Tadaaki Hirose, Adam Kraszewski, Keiichi Itakura, and Arthur D. Riggs. "Expression in Escherichia coli of chemically synthesized genes for human insulin." Proceedings of the National Academy of Sciences 76, no. 1 (1979): 106-110.
Hall, Stephen S. "Invisible frontiers: The race to synthesize a human gene." New York: Atlantic Monthly Press, 1987.
Hirose, T., R. Crea, and K. Itakura. "Rapid synthesis of trideoxyribonucleotide blocks." Tetrahedron Letters 19, no. 28 (1978): 2449-2452.
Hughes, Sally Smith. “Genentech: the beginnings of biotech.” University of Chicago Press, 2011.
Itakura, Keiichi, Tadaaki Hirose, Roberto Crea, Arthur D. Riggs, Herbert L. Heyneker, Francisco Bolivar, and Herbert W. Boyer. "Expression in Escherichia coli of a chemically synthesized gene for the hormone somatostatin." Science 198, no. 4321 (1977): 1056-1063.
Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A. and Bridgland, A., et al, 2021. Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873), pp.583-589.
Riggs, Arthur D. "Making, cloning, and the expression of human insulin genes in bacteria: the path to Humulin." Endocrine Reviews 42, no. 3 (2021): 374-380.
Giovanni Paternostro
May 27, 2025
In From PDB to AlphaFold
Mohammed AlQuraishi is an Assistant Professor in the Department of Systems Biology at Columbia University. He is one of the leaders of the OpenFold consortium (https://openfold.io).
Thanks for reaching out about this.
With regards to additions, one piece from my own work is the RGN paper (https://www.sciencedirect.com/science/article/pii/S2405471219300766),
which was the first paper to do end-to-end differentiable learning of protein structure, and the first to show that a protein can be folded implicitly using a neural network. This ended up being the approach that AlphaFold2 ultimately took (with many more additions and elaborations on top of course).
Outside of my own work, this paper (https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005324)
from Jinbo Xu anticipated much of what comprised AlphaFold1, and came out before. It was arguably the first paper to show that deep learning can really move the needle on protein structure prediction.
Another interesting paper is this one (https://openreview.net/forum?id=Byg3y3C9Km),
from John Ingraham, which did differentiable protein simulation,
and this one (https://www.mit.edu/~vgarg/GenerativeModelsForProteinDesign.pdf),
which introduced some primitives that were used in AlphaFold2.
Hope this is of some help.
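To give a minimal sense of the end-to-end differentiable idea mentioned above, the sketch below (PyTorch) builds a chain of C-alpha atoms from a set of angles through differentiable geometry and backpropagates a structure-level loss all the way to those angles. The angle parameterization, chain length and target structure are invented for illustration only; this is not the RGN or AlphaFold2 architecture.

```python
import torch

torch.manual_seed(0)

n_res = 10     # length of a toy peptide
ca_dist = 3.8  # approximate consecutive C-alpha distance in angstroms

# A stand-in for a predictor: in the RGN these angles come from a recurrent
# network run over the amino-acid sequence; here they are free parameters so
# that the example stays self-contained.
angles = torch.nn.Parameter(torch.randn(n_res - 1, 2))

def angles_to_coords(angles):
    """Differentiable conversion of per-step direction angles into 3D C-alpha
    coordinates (a toy parameterization, not the RGN's geometric unit)."""
    theta, phi = angles[:, 0], angles[:, 1]
    directions = torch.stack(
        [torch.sin(theta) * torch.cos(phi),
         torch.sin(theta) * torch.sin(phi),
         torch.cos(theta)], dim=1)
    steps = ca_dist * directions
    return torch.cat([torch.zeros(1, 3), torch.cumsum(steps, dim=0)], dim=0)

def pairwise_sq_dist(x):
    """Squared pairwise distances, so the loss is rotation- and translation-invariant."""
    diff = x[:, None, :] - x[None, :, :]
    return (diff ** 2).sum(-1)

# Invented "native" structure (an extended chain along x), used only as a target.
target = torch.tensor([[i * ca_dist, 0.0, 0.0] for i in range(n_res)])

opt = torch.optim.Adam([angles], lr=0.05)
for step in range(200):
    coords = angles_to_coords(angles)
    loss = ((pairwise_sq_dist(coords) - pairwise_sq_dist(target)) ** 2).mean()
    opt.zero_grad()
    loss.backward()   # gradients flow from the 3D structure back to the angles
    opt.step()

print(f"final distance-matrix loss: {loss.item():.4f}")
```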
Giovanni Paternostro
Apr 24, 2025
In From PDB to AlphaFold
Jake Feala is a cofounder at Lila Sciences, a company unveiled in March 2025 aiming to use AI and autonomous labs to accelerate scientific discovery.
We invited him to contribute to our historical timeline, from PDB to AlphaFold. Specifically, we asked him about the reasons why the protein folding problem was solved by a VC-backed company, DeepMind, and not by an academic group, and also if he thinks that this achievement provides a general solution for the future of AI in science.
Thanks for the opportunity to contribute!
I think it's not exactly the right question to ask why the protein structure prediction problem was solved by a company and not an academic group. A more fitting question is why DeepMind solved it and not some other entity, academic or not. Back then DeepMind was a totally unique company and not only beat out academia but also the entire biopharma industry which had plenty of resources and interest in the problem.
After the competition, by far the best take on what had happened, and better than I could provide, was in a blog post from Mohammed AlQuraishi. He poses the question of "why DeepMind" as well, and points out that AlphaFold's success was not only a win over academia, but just as much an indictment of big pharma's inability to innovate.
Some context on my own perspective at the time: we were working on deep learning for proteins at Generate Biomedicines in 2018, just before the CASP competition where AlphaFold was unveiled. We were trying to solve a different problem -- generate protein sequences for a given structure or function, rather than predict structure from sequence -- but we were still watching the field closely. Here are the best answers I heard at the time for why DeepMind won.
Engineers accelerate the researchers
AlQuraishi points out that "...competitively-compensated research engineers with software and computer science expertise are almost entirely absent from academic labs, despite the critical role they play in industrial research labs. Much of AlphaFold’s success likely stems from the team’s ability to scale up model training to large systems, which in many ways is primarily a software engineering challenge."
This is completely true but is too generous to "industrial research labs." Lack of engineering investment has been a problem with the industry for my whole career. I've been lucky enough to be part of some computationally well-resourced biotech companies, but these were the exception. Most biopharma companies and nearly all academic research groups are starving for talented software engineers relative to the tech industry. This has started to change recently, but we still have a lot of catching up to do in terms of culture, compensation, and technological maturity.
Protein folding as a game
Demis Hassabis is a master at picking problems. Recognizing that reinforcement learning (RL), DeepMind's bread and butter, worked best at solving games back then, he strategically went after both literal games (chess, Go, video games) and problems that could be "gamified." Keep in mind that at the time there already was essentially a massively multiplayer game for protein folding (Folding@home).
Typically a game has a fixed environment with known rules that the RL algorithm can self-play toward mastery. While that's not exactly the case with protein structure prediction, and there was much more to the solution than RL, there is nevertheless a game-like aspect to the problem. It has a very clear objective where you know when you've won (a "finite game"). It has rules and, while not all of them are known, there is enough prior understanding (symmetries, 3D distances, bond angles, etc.) to get a head start. As they learned with AlphaZero, the algorithm can learn a strategy for playing the game while simultaneously learning the rules.
Protein structure of course also has lots of data. While there may be other problems in biology that can be gamified, none had a vast, clean dataset so perfectly matched to the objective of the game.
Finally, games invite competition, and competitions have winners. Over and over, DeepMind chose AI problems that they could objectively win in a loud, splashy way. The existence of the CASP competition was likely very enticing, if not a key reason they chose to work on this problem.
The machine
Once the perfect game was identified, it seemed like the specifics of the problem almost didn't matter -- DeepMind seemed to apply the same formula: recruit incredible talent, compensate them well, supply them with endless resources, and leave them alone to win the game. They could point their machine at any well-posed game and have a great chance of winning.
I know very little about the internal culture of DeepMind except that John Jumper was widely respected and a fantastic pick to lead the group, and that the team applied more compute and more software and data engineering resources to the problem than any other competing group.
I would add that DeepMind is not at all a typical VC-backed venture, especially for that time. During the 2010s, VC-backed software startups were mostly in the "Lean Startup" tradition of customer obsession and finding early product-market fit. DeepMind is pretty much the opposite of that. Their success, and that of OpenAI, SpaceX, etc., is part of why we see many more very future-looking, heavily funded "moonshot" companies now than we did back then. But back then they were completely unique in their huge upfront funding, lack of attention to near-term products or revenue, and grand long-term vision.
The future of AI for science
I have much to say about the future of AI in science that I won't get into here. Suffice to say it will obviously be a major driver of progress, but not the whole story. For one thing, I'm skeptical of aspirations to build ground-up simulations of biology, or to train a superintelligent "oracle" that can answer any scientific question. I think nature is too complex, and obviously we'll always need deep and constant contact with reality through experimentation.
For this reason, while I highly respect DeepMind, I have doubts about their further aspirations in biology. There may be other problems in the field that can be similarly "gamified” with existing data, but I think the opportunities are limited. In interviews, Demis has hinted at building a "virtual cell," which is a worthy but wildly underspecified challenge, especially for his approach. The datasets in cellular biology are messier and harder to interpret than protein structures, and there are so many potential objective functions to choose from that any successful solution has a much narrower range of values.
For example, you might build a model to perfectly predict gene expression profiles from single-cell sequencing data. Great! Extremely cool and useful. Or you might extend AlphaFold to predict structure of multi-protein complexes and binding to other molecules such as RNA or metabolites. That would be truly amazing! But while these would be incredible capabilities, they are both far from a "virtual cell," which would require dozens or even hundreds of such models across all of the metabolic, structural, and information processing systems of the cell, integrated and trained over every possible context (e.g. cell type, tissue or organ) or perturbation (e.g. a drug or mechanical stimulus).
Another problem is that there are few competitions to win in these areas, and so your model will instead be subjected to the less sexy arena of peer review and citations to signal superiority. Or worse, you'll have to compete for market share as a tool for the struggling drug industry. The incentives start to trail off both for investors and talent.
I hope that I’m wrong and genuinely wish them the best, as we're all working toward the same long-term goal of improving human health, but we are pursuing a different approach at Lila. You can read more on our website, but essentially we are working toward integrating AI with automated experimentation in a continuous loop, through which AI autonomously learns and proposes and carries out the best next experiment. We believe science is a process that can be accelerated, not a game that can be won.
REFERENCES
- https://www.prnewswire.com/news-releases/flagship-pioneering-unveils-lila-sciences-to-build-superintelligence-in-science-302397198.html
- https://www.lila.ai/
- Steve Lohr - The Quest for A.I. ‘Scientific Superintelligence’ New York Times, March 10, 2025 https://www.nytimes.com/2025/03/10/technology/ai-science-lab-lila.html
Giovanni Paternostro
Apr 21, 2025
In From PDB to AlphaFold
The timeline traces the historical milestones that led to AlphaFold, a landmark achievement in protein structure prediction powered by artificial intelligence (AI). It highlights key scientific, methodological, and cultural developments spanning over six decades, beginning with the first protein structures solved by Kendrew and Perutz (1958-1960).
Significant early milestones were the establishment of protein sequence and structure repositories, particularly the Protein Data Bank (PDB), initiated in the early 1970s by an effort including both senior and junior scientists. The PDB grew from these grassroots efforts amid debates about data-sharing practices, progressing gradually over several decades. The adoption of open data-sharing policies was a consequence of community letters and petitions, of initiatives prompted by the PDB leaders, and of decisions by scientific societies, journals (like Nature and Science) and funders (like HHMI and NIH).
Bioinformatics methods and computational tools evolved considerably, from algorithms for sequence alignment (1970s-80s) to many other tools in the following decades, adopting open-source software development practices and significantly enhancing sequence analysis capabilities.
The Critical Assessment of Protein Structure Prediction (CASP), launched in 1994, benchmarked computational predictions. After initial improvements, there was no progress in the prediction metrics for more than ten years, until DeepMind’s AlphaFold achieved breakthroughs in 2018 and 2020.
The advent of general-purpose GPU computing (2008), large datasets like ImageNet (2009) and algorithmic innovations like transformers catalyzed advances in AI.
DeepMind, a company co-founded by Hassabis in 2010 with venture capital support and acquired by Google in 2014, leveraged computational resources and data advancements, notably UniProt's extensive sequence datasets and the PDB's comprehensive structure archives. Multiple sequence alignments provided evolutionary information. AlphaFold2 (2020) employed transformer-based neural networks, significantly outperforming prior methods in CASP14 and largely solving structure prediction for most single-chain globular proteins. Substantial challenges remained for disordered regions, multi-protein complexes, and dynamic conformational landscapes.
This historical perspective underscores crucial contributions by numerous scientists and institutions toward open data sharing, algorithmic innovation, and interdisciplinary collaboration. These advances led to AlphaFold and to its open-source academic derivatives, including RoseTTAFold, OpenFold and ColabFold.
Current efforts focus on extending AI applications to more complex cellular and biomolecular interactions, including the virtual cell, driving the next frontiers in science and biology.
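As a minimal illustration of the attention operation at the heart of the transformer-based networks mentioned above, the sketch below implements generic scaled dot-product attention over per-residue features; it is not AlphaFold2's Evoformer, and the toy sizes and random projection matrices are invented for illustration.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each query position builds its output as a weighted average of the value
    vectors, with weights given by query-key similarity (softmax over keys)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n_q, n_k) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                               # (n_q, d_v)

rng = np.random.default_rng(0)
n_res, d = 8, 16                 # toy sizes: 8 residues, 16-dimensional features
x = rng.normal(size=(n_res, d))  # placeholder per-residue features

# In a real transformer Q, K and V are learned linear projections of x;
# random projection matrices are used here to keep the sketch self-contained.
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
print(out.shape)                 # (8, 16): one updated feature vector per residue
```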
Giovanni Paternostro
Apr 16, 2025
In From PDB to AlphaFold
Harold Varmus is a Professor of Medicine at Weill Cornell Medical College. His work has been recognized by the award of the 1989 Nobel Prize in Physiology or Medicine. He is a former Director of the NIH (1993-1999) and of NCI (2010-2015).
He sent the following comment about the change in policy at NIH in 1999 regarding the immediate release of structural data upon publication. We also asked him how this decision was influenced by his often-stated support for open science.
Like Tom Cech’s recollections, my memory of the dates and conversations pertinent to the history of making protein structural coordinates publicly accessible is a bit hazy. But the decision to promote rapid release of such information was based heavily on the very successful adoption of a very similar policy for DNA sequence information (the Bermuda Rules), which was then being generated by the Human Genome Project.
In addition, as you point out, the decision was also strongly influenced by my own feelings that publicly funded work should, in general, be made available as quickly as possible for others to use. That attitude was, as you note, closely linked to my interests, at about the same time, to create a public digital library (now known as PubMed Central) for open sharing of published work supported by the NIH and eventually by other funding organizations.
I am pleased that it has worked out so well and that you are developing this history to tell others that practices could have been different, and science could have advanced less quickly.
Giovanni Paternostro
Apr 09, 2025
In From PDB to AlphaFold
Thomas (Tom) Cech is a Distinguished Professor at the University of Colorado Boulder. He is a former President of HHMI (2000-2009). Dr. Cech's work has been recognized by the Heineken Prize of the Royal Netherlands Academy of Sciences (1988), the Albert Lasker Basic Medical Research Award (1988), the Nobel Prize in Chemistry (1989), and the National Medal of Science (1995). In 1987 he was elected to the U.S. National Academy of Sciences.
He sent the following comment about the change in policy at HHMI regarding the immediate release of structural data upon publication:
I appreciate your project, but my memory is hazy regarding the timeline of my own contributions (if any!) to the important requirement of depositing x-ray crystallography data on the PDB.
We began doing RNA crystallography in 1991, when Jennifer Doudna joined my lab as a postdoc, and began publishing structures in 1996. I became president of HHMI in January 2000, and I frankly don’t recall if Alex Wlodawer’s credit to me is deserved or if the HHMI policy was already in place.
I do recall that there was not unanimity about the policy – some wanted to keep their data secret for as long as possible, to thwart competition or perhaps to develop small-molecule inhibitors of an enzyme. Others shared my commitment to open science – in short, you aren’t required to publish your structure, but once you publish it you must make all data available. This allows the community to build on your work (and perhaps to falsify it!) and it moves science forward. The same principle of sharing applies to cell lines, transgenic mice, and computer code.
Giovanni Paternostro
Apr 05, 2025
In From PDB to AlphaFold
Alexander Wlodawer is a Senior Investigator at the Laboratory of Cell Biology, NCI, National Institutes of Health.
He sent the following comment:
In the history you stress the role of the journals and their rules regarding deposition of both coordinates and structure factors, but I did not see one other crucial development. That was the requirement, first by HHMI and then by NIH, to deposit such data as a condition of being funded. I don’t remember the exact dates, but I do remember talking to Tom Cech, then the director of HHMI, convincing him that such requirements should be put in place. He also promised to talk to Harold Varmus, then director of the NIH. Very soon after that conversation the rules were officially announced.
Giovanni Paternostro
Apr 01, 2025
In From PDB to AlphaFold
Philip Campbell is a former Editor in Chief of Nature. He is an astrophysicist and a Fellow of the Royal Society. He was knighted for services to science in 2015.
He sent the following comment about the change in policy at Nature and Science in 1998 regarding the immediate release of protein structural data upon publication:
Thanks for getting in touch about this topic and the valuable work on this strand of research history.
Background: I had the responsibility for working with colleagues in developing editorial policies for Nature and the Nature journals throughout my time as Editor in Chief of Nature (1995-2018) and then as Editor in Chief of Springer Nature, until my retirement from science publishing in 2023.
The substantial policy change you mention was one of the first that I can recall making. Memory plays tricks, especially over decades. However, as I recall, the trigger for the joint statement came from Floyd Bloom.
Floyd was Editor in Chief of Science at the time I (re)joined Nature. He contacted me at some point to introduce himself. He and my predecessor John Maddox had been in contact as the two members of what John had called “the smallest club in the world”: Editors in Chief of the two multidisciplinary science journals that also pursued journalism.
I had good relationships with Floyd and his successors. A key principle of these relationships was that, in the interests of researchers, editorial policy can be developed collaboratively between otherwise competitive journals, rather than being a basis of that competitiveness.
The first time this principle was experienced by me was when (as I hazily recall) Floyd raised the issue of open structural data. He was in favour of the journals simultaneously making the change, to compulsory openness.
This was certainly in response to changing needs and interests expressed by structural biologists. In general, editorial policy has sometimes followed in the wake of community change and on other occasions has come in at an early stage, after consultation, in order to encourage such community change. In retrospect, I’d say that this particular policy change was belated.
I discussed Floyd’s proposal with my colleagues, who were well aware of the community advocacy and controversies about our previous policies. No doubt we had already been considering the idea of a policy change. Colleagues were understandably cautious because there could be authors, especially those in industry, who might not be able or willing to submit important papers under such a condition.
As a compromise, I and my colleagues decided that Nature should make the change, but that Nature Structural Biology (NSB - now Nature Structural and Molecular Biology) should allow a six-month embargo period. NSB’s founding Chief Editor Guy Riddihough (or maybe a successor) subsequently changed to the fully open policy (I cannot remember when), by which time they no doubt perceived that the risk of losing important research publications was smaller than feared.
The idea that Science would make the change at the same time as Nature was an important source of internal reassurance, while also sending a strong signal for the need for change to those in the community who might still resist it.
So Floyd and I agreed on that simultaneous policy change in Nature and Science.
To me, the research and development benefits of sharing structural data were obvious. The broader push then and ever since has always been in that direction of openness, whether for research data, materials or, more recently, computer code.
However, any new editorial policy is likely to place an additional demand on researchers, editors and, often, infrastructure. In particular, the resources needed to enable data openness should never be forgotten. In 2012 the Royal Society published a report ‘Science as an Open Enterprise’. I was a co-author and had successfully urged that the report include specific examples of databases, with their services and costs included. The substantial services and costs of the PDB at that time can be found on page 92 of the report:
https://royalsociety.org/-/media/policy/projects/sape/2012-06-20-saoe.pdf
Giovanni Paternostro
Mar 31, 2025
In From PDB to AlphaFold
Joel L. Sussman is Professor Emeritus of Structural Biology at the Weizmann Institute of Science in Rehovot and is Co-Director of the Israel Structural Proteomics Center. In 1994–99, he was also the director of the Protein Data Bank (PDB).
He sent the following comment:
Please see attached a few changes and additions I've made to "From PDB to AlphaFold". [these are listed below]
1) The few points I've added in the early days of molecular graphics are essential as they were crucial for the PDB and AlphaFold. This is especially true for the first 3D PDB Browser.
2) AutoDep revolutionized the deposition and validation process at the PDB, and it opened up a close collaboration with the PDBe at the EBI. It streamlined the deposition process and significantly improved the quality of the entries in the PDB.
3) The 25th Anniversary of the PDB and the 10th Anniversary of Swiss-Prot was held in Jerusalem. It was a real turning point in the field, as it was one of the first (if not THE first) meetings that brought together the sequence and 3D structure field. It is unlikely that you can include all the photos I've sent (which are only a few from the conference), but you can maybe include thumbnail-size images of them, which would expand if clicked on. Many scientists would be happy to see some of these early photos of the scientists who became real stars in the field. I'd hate losing them if something happened to my faculty website.
4) I had an incredible meeting with Nature's Editor-in-Chief, Philip Campbell. When I visited his office in London early in 1998, I thought there was no chance that he would consider reversing Nature's long-standing policy of NOT requiring deposition and release of the 3D structure of biomacromolecules as a requirement for publication in Nature. To my absolute shock, he agreed virtually immediately. With me in his office, he phoned the Editor-in-Chief of Science, Floyd Bloom. They decided to publish similar editorials stating that this policy change was unequivocal. The shift in policy by Nature and Science made all the difference in the world for requiring the release of structural data, and almost all other journals quickly followed suit.
[Suggested changes and additions to "From PDB to AlphaFold". Several have been already added to the timeline, and for others further historical research is ongoing, to help present their context.]
1965 At MIT’s Project MAC, Cyrus Levinthal and Bob Langridge used computer graphics for the first time to display a protein structure, namely myoglobin. 3D visualization was achieved simply by rotating the structure on the screen (Levinthal 1966).
1969 Cyrus Levinthal described the paradox of protein folding: the folding process must be guided by specific interactions and not by a random search through all possible conformations, which would take an immensely long time (Levinthal, 1969).
1971 August. At the ACA Conference in Ames, Iowa, the first 3D molecular graphics film was shown in a lecture by Joel L. Sussman on a very small RNA structure, UpA [https://www.youtube.com/watch?v=PraieqBi048] (Seeman et al, 1971).
1995 The first 3DB Browser was released by the PDB-BNL. It dramatically enhanced the PDB's printed index listings and various ad hoc search protocols that had been developed to find PDB entries. Selected proteins in the PDB could be easily downloaded, and their molecular structures visualized on lab computers (Stampf et al, 1995 & Sussman et al, 2001) via RasMol (Sayle, 1995) and other 3D visualization tools.
Figure 2: 3DB Browser as a tool to visualize recently published structures. (1) Search for author: Hendrickson; text query: HIV. (2) Six hits obtained, PDB ID Code 1GC1 highlighted. (3) 3DB Browser Atlas page. Ovals highlight the expression systems used for the different components in the multicomponent system. (4) Structure as visualized with MDL's Chemscape Chime plug-in.
1996 The PDB released AutoDep, the first web-based tool for macromolecular structure deposition and validation. It was developed at the PDB-BNL, but was also given to the PDBe, which used it as the first remote site for deposition in the PDB. Within 3 months of its release, over 50% of all new submissions were deposited via AutoDep. n.b. This was when the WWW was very young, and people weren't as familiar with it as they are today. In fact, AutoDep predated any web submission of papers to journals (Lin, 2000).
1996 November. A conference celebrating the 25th Anniversary of the PDB and the 10th Anniversary of Swiss-Prot was held in Jerusalem. [http://www.weizmann.ac.il/csb/faculty_pages/Sussman/pdb25sp10]. This was one of the first meetings at which 3D structural data and sequence information were analyzed synergistically. It is remarkable how many scientists attended this meeting in 1996 and continued to be active in the field for many years following: Enrique Abola, Lia Addadi, Amos Bairoch, Nir Ben-Tal, Frances C. Bernstein, Herbert J. Bernstein, Helen Berman, Tom Blundell, Steven E. Brenner, Stephen H. Bryant, Cyrus Chothia, Miroslaw Cygler, Meir Edelman, David Eisenberg, Ken Fasman, Alan Fersht, Gary Gilliland, Adrian Goldman, Arthur Grollman, Mitchell Guss, Michal Harel, Osnat Herzberg, Barry Honig, Leroy Hood, Amnon Horovitz, Joel Janin, Chen Keasar, Ephraim Katzir, John Kendrew, Olga Kennard, Michael Levitt, Laua Lai, Doron Lancet, Olivier Lichtarge, Dawie Lin, N.O. Manning, Edgar Meyer, Leonid Mirny, John Moult, John Norvell, Ruth Nussinov, Wilma Olson, Manuel C. Peitsch, Shmuel Pietrokovski, Jaime Prilusky, Otto Ritter, John Rosenberg, Mark Safro, Chris Sander, Gideon Schreiber, Boaz Shaanan, Manfred Sippl, Jeffrey Skolnick, Bill Studier, Joel L. Sussman, Muttaiya Sundaralingam, Janet Thornton, Ed Trifonov, Tomitake Tsukihara, Ron Unger, Keith D. Watenpaugh, Shoshana Wodak, and Ada Yonath.
Photographs of some of these young scientists can be seen at: https://www.weizmann.ac.il/csb/faculty_pages/Sussman/pdb25sp10/Picture-index.html and a few are shown below
Figure 3. Cover of abstract book of the conference celebrating the 25th Anniversary of the PDB and the 10th Anniversary of Swiss-Prot that was held in Jerusalem in Nov 1996.
Figure 4. Photographs from the 25th Anniversary of the PDB and the 10th Anniversary of Swiss-Prot, which was held in Jerusalem, November 1996. (a) Left to right, Edgar Meyer, unknown, Enrique Abola, Nancy Manning & the 7th President of Israel, Ezer Weizman; (b) Keith Watenpaugh & Eli Admon; (c) Amos Bairoch; (d) Olga Kennard; (e) Janet Thornton; (f) Left to right, 4th President of Israel, Ephraim Katzir, Chris Sander, Janet Thornton, Olga Kennard, Barry Honig & Shneior Lifson; (g) Left to right, Helen M. Berman, Janet Thornton, Shoshana Wodak & Olga Kennard; (h) Chris Sander; (i) Shoshana Wodak & Michael Levitt; (j) Joel L. Sussman & the 4th President of Israel, Ephraim Katzir; (k) center Edgar Meyer; (l) Wilma Olson & Muttaiya Sundaralingam; (m) Gideon Schreiber & Cyrus Chothia; (n) Bill Studier; (o) Leroy Hood; (p) Cyrus Chothia, Janet Thornton, Joel Janin, unknown, Shoshana Wodak, Joel L. Sussman & Helen M. Berman; (q) Peer Bork; (r) Joel L. Sussman, John Kendrew & the 4th President of Israel, Ephraim Katzir; (s) Tom Blundell; (t) John Moult, Stephen Bryant & Osnat Herzberg; (u) Janet Thornton & Stephen Bryant; (v) Jaime Prilusky; (w) Frances Bernstein
1998 Nature, Science and PNAS reversed their long-standing policy of not requiring the immediate release of high-resolution structural coordinate data upon publication. This occurred in the Spring of 1998, following a visit by Joel L. Sussman, Head of the PDB, to Philip Campbell, the Editor-in-Chief of Nature, in London, to discuss the possibility of changing Nature’s policy of not requiring deposition of 3D macromolecular structures in the PDB. Campbell was very open to the idea and immediately phoned Floyd Bloom, the Editor-in-Chief of Science. They discussed the matter and immediately agreed to issue a joint statement with their new policy of requiring the deposition of structures in the PDB and their release at the time of publication. Joel was amazed that Nature was speaking with Science(!).
Nature stated in their 9-Jul-1998 issue that "It is clear that there is a significant majority opinion in the community against permitting a one-year hold. Accordingly, Nature, simultaneously with Science, is changing its policy. Any paper containing new structural data received on or after 1 October 1998 will not be accepted without an accession number from the Brookhaven Protein Data Bank (PDB) accompanied by an assurance that unrestricted (“layer-1”) release will occur at or before the time of publication." Floyd Bloom, of Science, published a very similar editorial in their 10-Jul-1998 issue: https://www.science.org/doi/full/10.1126/science.281.5374.175c. The Editor-in-Chief of PNAS published a similar editorial in the 28-Apr-1998 issue.
Figure 5. Editorial by Nick Cozzarelli in PNAS detailing the new policy requiring deposition of structures in the PDB-BNL.
REFERENCES
(see also links within the text)
Levinthal, C. 1966. Molecular model-building by computer, Scientific American, 214(6). pp. 42-52.
Levinthal, C. (1969) How to fold graciously, in: J.T.P. DeBrunner, E. Munck (Eds.) Mossbauer Spectroscopy in Biological Systems Proceedings, Univ of Illinois Press, Illinois, 67(41), pp. 22-24.
Lin, D., Manning, N.O., Jiang, J., Abola, E.E., Stampf, D., Prilusky, J., and Sussman, J.L. 2000. AutoDep©: A web based system for deposition and validation of macromolecular structural information, Acta Crystallographica D Biological Crystallography, 56(7) pp. 828-841.
Sayle R.A. and Milner-White, E.J. (1995). RASMOL: biomolecular graphics for all. TIBS, 20(9) pp. 374-376
Seeman, N.C., Sussman, J.L., Berman, H.M., & Kim, S.-H. (1971). Nucleic acid conformation: crystal structure of a naturally occurring dinucleoside phosphate (UpA). Nature New Biology, 233(37). pp. 90-92.
Stampf, D.R., Felder, C.E., & Sussman, J.L. (1995). PDBbrowse--a graphics interface to the Brookhaven Protein Data Bank, Nature, 374(6522) pp. 572-574.
Sussman, J.L, Lin, D., Jiang, J., Manning, N.O., Prilusky, J. & Abola, E.E. 2001. The protein data bank at Brookhaven, in: M.G. Rossmann, E. Arnold (Eds.) International Tables for Crystallography, Volume F. Crystallography of Biological Macromolecules, Kluwer Academic Publishers, Dordrecht, pp. 649-656.
The following additional comment was received from Joel, with information about the extensive efforts in 1996 to promote deposition of structural data at the time of publication (with delayed release still allowed upon request) and the subsequent changes in the rate of submissions. It also contains more details about the 3DB Browser and its importance.
1) 1996 At an ‘International Seminar-Cum-School on Macromolecular Crystallographic Data’ held at Calcutta, India, in November 1995, a formal discussion of the archival journal requirements for data deposition took place. This resulted in a letter to the editor of Nature (as well as to seven other journals) that contained, in part, the following:
“… We recommend, therefore, that publication of macromolecular crystal structures should be accompanied by deposition of atomic parameters and also structure amplitudes …”
The 8 letters are:
· Baker, E. N. , Blundell, T. L., Vijayan, M., Dodson, E., Dodson, G., Gilliland, G. I. and Sussman, J.L. 1996. Crystallographic data deposition Nature 379(6562) pp. 202.
· Baker, E. N. , Blundell, T. L., Vijayan, M., Dodson, E., Gilliland, G. I. and Sussman, J.L. 1996. Deposition of macromolecular data. Acta Crystallographica D Biological Crystallography 52(3), pp. 609.
· Baker, E. N. , Blundell, T. L., Vijayan, M., Dodson, E., Dodson, G., Gilliland, G. I. and Sussman, J.L. 1996. Archival journal requirements for data deposition. Biophysical Journal 70(6) pp. 2994.
· Baker, E. N. , Blundell, T. L., Vijayan, M., Dodson, E., Dodson, G., Gilliland, G. I. and Sussman, J.L. 1996. International Seminar-cum-School on Macromolecular Crystallographic Data at Calcutta, India. Anticancer Drug Design 11(2) pp. 173-174.
· Baker, E. N. , Blundell, T. L., Vijayan, M., Dodson, E., Dodson, G., Gilliland, G. I. and Sussman, J.L. 1996. A formal discussion of the archival journal requirements for data deposition. Biochem Biophys Res Commun 219(3) pp. 976-977.
· Baker, E. N. , Blundell, T. L., Vijayan, M., Dodson, E., Dodson, G., Gilliland, G. I. and Sussman, J.L. 1996. Publication of macromolecular crystal structures. FEBS Lett 380(3) pp.301
· Baker, E. N. , Blundell, T. L., Vijayan, M., Dodson, E., Dodson, G., Gilliland, G. I. and Sussman, J.L. 1996. Diffraction data deposition. Structure 4(2) pp. 217.
· Baker, E. N. , Blundell, T. L., Vijayan, M., Dodson, E., Dodson, G., Gilliland, G. I. and Sussman, J.L. 1996. Archival journal requirements of macromolecular crystallographic data. Journal of Biomololecular and Structural Dynamics 13(4) pp. 583.
The effect of these letters, together with a change of attitude in the scientific community in favor of depositing experimental data along with the 3D coordinates in the PDB, can be seen in the letter to the editor (Jiang et al., 1998). It contains a table showing a significant increase in the percentage of X-ray structures deposited in the PDB that included structure factors, from 25% in 1994 to 63% in 1997.
It stresses the importance of a standard format for experimental data: “. . . In order to facilitate the use of deposited structure factors, we at the PDB, together with a number of macromolecular crystallographers and the IUCr Working Group on Macromolecular CIF, developed a standard interchange format for structure factors [PDB Structure Factor mmCIF at Protein Data Bank Quart. Newslett. No. 74, p. 1 (1995), https://files.wwpdb.org/pub/pdb/doc/newsletters/bnl/news74_oct95/newslttr.txt].” Finally, it concludes with:
“ …The ready availability of structure-factor files in a standard format has made it possible for any scientist to validate a structure in the PDB versus its experimentally observed data. … The PDB has also observed that one of the most popular uses for these stored structure factors is for the crystallographer who did the experiment to be able to retrieve his/her own data which have been misplaced in their laboratory.”
Jiang, J., Abola, E. and Sussman, J.L. (1998). Deposition of structure factors at the protein data bank. Acta Crystallographica D Biological Crystallography 55(1) pp. 4.
2) 1995 The first PDB Browser was released by the PDB-BNL. It dramatically enhanced the PDB's printed index listings and various ad hoc search protocols developed to find PDB entries. Selected proteins in the PDB could be easily downloaded, and their molecular structures visualized on lab computers (Stampf et al, 1995 & Sussman et al., 2001) via RasMol (Sayle, 1995) and other 3D visualization tools. The following year, the browser was significantly improved, becoming the “3DB Browser” (Prilusky et al, 1996, Sussman et al, 1998)
[See Figure 2 above for image of 3DB Browser]
Figure 2: 3DB Browser (Prilusky et al., 1996, Sussman et al., 1998) as a tool to search and visualize 3D biomacromolecular structures. (1) Search for author: Hendrickson; text query: HIV. (2) Six hits were obtained, with PDB ID Code 1GC1 highlighted. (3) 3DB Browser Atlas page. Ovals highlight the expression systems used for the different components in the multicomponent system. (4) Structure as visualized with MDL's Chemscape Chime plug-in or alternatively via RasMol (Sayle, 1995)
Prilusky, J., Sussman, J.L. and Abola, E.E. (1996). Three dimensional database of biomacromolecules structure (3DB): a 'multi-tool' based browser as a solution for complex data and complex queries. In Abstracts of the 10th Anniversary of the Swiss-Prot Database and the 25th Anniversary of the Protein Data Bank [https://www.weizmann.ac.il/csb/faculty_pages/Sussman/pdb25sp10/abstracts/Abola.html]
Sussman, J.L., Lin, D., Jiang, J., Manning, N.O., Prilusky, J., Ritter, O. and Abola, E.E. (1998). Protein Data Bank (PDB): a database of 3D structural information of biological macromolecules. Acta Crystallographica D Biological Crystallography 54(6), pp. 1078-1084.
Giovanni Paternostro
Mar 17, 2025
In From PDB to AlphaFold
Johannes Söding is a Research Group Leader at the Max Planck Institute for Multidisciplinary Sciences in Göttingen, Germany.
He sent the following comment:
Martin Steinegger's comments are pretty complete and summarize very well the development of protein sequence search methods that facilitated the development of deep learning models trained on billions of sequences, such as AlphaFold2.
However, I would emphasize more the critical importance of the Linclust algorithm, both for enabling the training of protein language models and for ensuring the generation of sufficiently diverse multiple sequence alignments, which AlphaFold2 requires for high-quality predictions. I think it is no exaggeration to say that Linclust is at the core of the breakthroughs in protein language models and in deep-learning-based protein structure prediction and protein engineering. I would rearrange the content of the two paragraphs that Martin proposed in a slightly different way:
2016: MMseqs2: Fast iterative profile searches for building MSAs
The exploitation of the huge metagenomics sequence sets for iterative sequence searching to build MSAs required a fast sequence profile search tool that can handle datasets of billions of sequences. MMseqs2 filled that gap, with a search speed two to three orders of magnitude faster than PSI-BLAST or HMMER yet similar sensitivity. It would later enable the fast generation of MSAs for AlphaFold2 and ColabFold.
Steinegger, M., & Söding, J. (2017). MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology
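For readers who want to try this, a minimal sketch of how such an iterative profile search might be scripted from Python is shown below. File names are placeholders, and the flags reflect the MMseqs2 documentation as we recall it, so they should be checked against `mmseqs -h` before use.

```python
# Hypothetical sketch: running an iterative MMseqs2 profile search
# (PSI-BLAST-like) from Python. File names are placeholders; check the
# current flags with `mmseqs -h` / `mmseqs search -h`.
import subprocess

def run(cmd):
    print(" ".join(cmd))
    subprocess.run(cmd, check=True)

# Build MMseqs2 databases from FASTA files (query and a large target set).
run(["mmseqs", "createdb", "query.fasta", "queryDB"])
run(["mmseqs", "createdb", "metagenomic_seqs.fasta", "targetDB"])

# Iterative profile search: each iteration builds a profile from the hits
# of the previous one, increasing sensitivity for remote homologs.
run(["mmseqs", "search", "queryDB", "targetDB", "resultDB", "tmp",
     "--num-iterations", "3"])

# Export the alignments in BLAST tab-separated format.
run(["mmseqs", "convertalis", "queryDB", "targetDB", "resultDB", "hits.m8"])
```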
2017: Linear-time sequence clustering enabled the exploitation of huge metagenomic sequence corpora
Just as large language models have profited from ever-increasing training corpora, the deep-learning revolution in protein biology, including AlphaFold, relies critically on training protein language models with huge, non-redundant sets of protein sequences. AlphaFold2, for instance, was trained on a collection of representative sequences obtained by clustering 4 billion sequences from metagenomic and genomic sources (the BFD database) and 1.6 billion sequences from MGnify v18. Generating such huge reference sets only became possible with Linclust, the first algorithm whose runtime scaled linearly instead of quadratically with the size of the input sequence set. Before Linclust, the practical limit for sequence clustering was around 100 million sequences. AlphaFold2 profits in another way from huge and diverse databases such as MGnify and BFD clustered with Linclust. The model quality depends on a sufficient diversity of the MSA built from the query sequence, and that diversity may depend crucially on the diversity of the sequence databases in which homologous sequences are searched for. (Removing both MGnify and BFD from the MSA generation reduced AlphaFold2’s mean GDT score by 6.1.)
Steinegger, M. & Söding, J. (2018). Clustering huge protein sequence sets in linear time. Nature Communications 9(1), 2542.
Ovchinnikov, S. et al. (2018). Protein structure determination using metagenome sequence data. Science
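The reason linear-time clustering is possible at all can be sketched in a few lines: instead of comparing all pairs of sequences, each sequence contributes a small, fixed number of selected k-mers, sequences sharing a selected k-mer are grouped under the longest of them, and only those few candidate pairs need to be verified. The toy sketch below illustrates only this idea; the real Linclust adds hash-based k-mer selection, alignment-based verification and several further steps.

```python
# Toy sketch of linear-time greedy clustering in the spirit of Linclust.
# Work is proportional to (#sequences x k-mers per sequence), so runtime
# scales linearly with the input instead of requiring all-vs-all comparison.
from collections import defaultdict

def toy_linclust(seqs, k=5, m=3):
    """Cluster sequences by shared selected k-mers.

    seqs: dict mapping sequence id -> sequence string
    k:    k-mer length
    m:    number of k-mers selected per sequence
    Returns a dict mapping member id -> representative (centre) id.
    """
    # 1. Select m k-mers per sequence (here: lexicographically smallest,
    #    a stand-in for the hash-based selection used in practice).
    selected = {
        sid: sorted({s[i:i + k] for i in range(len(s) - k + 1)})[:m]
        for sid, s in seqs.items()
    }

    # 2. Group sequences by selected k-mer.
    groups = defaultdict(list)
    for sid, kmers in selected.items():
        for kmer in kmers:
            groups[kmer].append(sid)

    # 3. Assign every sequence to the longest centre it shares a k-mer with.
    assignment = {sid: sid for sid in seqs}          # start as own centre
    for members in groups.values():
        centre = max(members, key=lambda sid: len(seqs[sid]))
        for sid in members:
            if len(seqs[centre]) > len(seqs[assignment[sid]]):
                assignment[sid] = centre
    return assignment

if __name__ == "__main__":
    toy = {
        "A": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
        "B": "MKTAYIAKQRQISFVKSHFSRQLEERL",       # near-duplicate of A
        "C": "GSHMLEDPVDAFRTLVKQRA",              # unrelated
    }
    print(toy_linclust(toy))    # B is assigned to A; C stays on its own
```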
Giovanni Paternostro
Mar 17, 2025
In From PDB to AlphaFold
Martin Steinegger is an Assistant Professor in the biology department at Seoul National University. He is a co-author of the 2021 paper describing AlphaFold2 (Jumper et al., Nature 596(7873), pp. 583-589).
He sent the following comment:
These are my thoughts on the milestones for MSAs required for the success of AlphaFold2.
The term “MSA” might be misinterpreted in this context. In my text, I’m referring to query-centered MSAs generated through homology searches, not the global alignments that we can obtain from progressive aligners like ClustalW once we have more data.
1965: Margaret Oakley Dayhoff's Atlas of Protein Sequence and Structure
Margaret O. Dayhoff pioneered the systematic collection and analysis of protein sequences with the publication of the Atlas of Protein Sequence and Structure. This work compiled all ~70 protein sequences known at the time. Dayhoff's efforts laid the groundwork for bioinformatics and for the development of substitution matrices—such as the PAM (Point Accepted Mutation) matrices—which are used for sequence alignment scoring to this day.
Dayhoff, M.O. et al. (1965). Atlas of Protein Sequence and Structure, National Biomedical Research Foundation.
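The construction behind the PAM family can be illustrated in a few lines: a PAM-1 matrix of mutation probabilities, estimated from closely related sequences, is raised to the n-th power to model longer evolutionary distances and then converted to log-odds scores against background frequencies. The sketch below uses an invented three-letter alphabet purely for illustration; the real matrices are 20x20 and derived from the alignments curated in the Atlas.

```python
# Toy illustration of the PAM construction: PAM-n = (PAM-1)^n, then
# log-odds scoring. The 3x3 mutation matrix and background frequencies
# below are invented for illustration only.
import numpy as np

alphabet = ["A", "B", "C"]                 # hypothetical residues
background = np.array([0.5, 0.3, 0.2])     # hypothetical frequencies

# Hypothetical PAM-1: rows = original residue, columns = replacement,
# each row sums to 1 and is dominated by the diagonal (1% accepted mutation).
pam1 = np.array([
    [0.990, 0.006, 0.004],
    [0.010, 0.985, 0.005],
    [0.010, 0.008, 0.982],
])

def pam_logodds(n, scale=10):
    """Return a rounded log-odds matrix for evolutionary distance PAM-n."""
    pam_n = np.linalg.matrix_power(pam1, n)        # n steps of PAM-1
    odds = pam_n / background[np.newaxis, :]       # P(a -> b) / P(b)
    return np.round(scale * np.log10(odds)).astype(int)

print("PAM-250-style log-odds (toy):")
print(pam_logodds(250))
```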
1970–1982: From Needleman–Wunsch and Smith–Waterman to Gotoh
In 1970, Saul Needleman and Christian Wunsch introduced the Needleman–Wunsch algorithm, the first systematic method for global sequence alignment using dynamic programming. This method was followed by the Smith–Waterman algorithm in 1981, which provided a framework for local alignments to detect conserved regions. In 1982, Osamu Gotoh refined these methods by devising an elegant approach to compute affine gap penalties, thereby enabling rapid and biologically accurate sequence alignments. Together these seminal works developed the algorithm that is now executed billions of times daily to compute pairwise protein alignments.
Needleman, S.B., & Wunsch, C.D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology
Smith, T.F., & Waterman, M.S. (1981). Identification of common molecular subsequences. Journal of Molecular Biology
Gotoh, O. (1982). An improved algorithm for matching biological sequences. Journal of Molecular Biology
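To make the dynamic-programming idea concrete, here is a minimal Needleman–Wunsch implementation with a simple match/mismatch score and a linear gap penalty. This is a sketch only: production tools use substitution matrices and, following Gotoh, affine gap penalties; Smith–Waterman differs mainly in clamping scores at zero and tracing back from the best-scoring cell.

```python
# Minimal Needleman–Wunsch global alignment (dynamic programming).
# Simple match/mismatch scores and a linear gap penalty are used here.

def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    n, m = len(a), len(b)
    # Score matrix, initialised for alignments against the empty prefix.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # Fill: best of diagonal (match/mismatch), up (gap in b), left (gap in a).
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Traceback from the bottom-right corner.
    out_a, out_b, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

if __name__ == "__main__":
    s, top, bottom = needleman_wunsch("HEAGAWGHEE", "PAWHEAE")
    print(s)
    print(top)
    print(bottom)
```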
1986–2003: Swiss-Prot, TrEMBL, and UniProt
In 1986, Amos Bairoch established Swiss-Prot, a database of curated protein sequences. Later, to handle the exponential growth in protein sequence data, TrEMBL was introduced in 1996 as a complementary database containing computationally annotated entries. In 2003, Swiss-Prot and TrEMBL merged to form the Universal Protein Resource (UniProt). The extensive, openly available protein sequence data in UniProt were indispensable for generating the diverse multiple sequence alignments critical for training AlphaFold2.
1990: BLAST (Basic Local Alignment Search Tool)
Altschul et al. introduced BLAST, a tool that revolutionized sequence searches by enabling rapid detection of sequence similarities. This innovation dramatically improved researchers' ability to search ever-growing protein databases using a seed-and-extend alignment scheme. Additionally, the introduction of the E-value—derived from the Karlin–Altschul statistical framework—provided a robust measure for assessing the likelihood of a match occurring by chance, thereby grounding sequence alignment in solid statistical principles.
Altschul, S.F. et al. (1990). Basic local alignment search tool. Journal of Molecular Biology
Karlin, S., & Altschul, S.F. (1990). Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proceedings of the National Academy of Sciences
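For ungapped local alignments, the Karlin–Altschul statistics reduce to a simple expectation: E = K·m·n·e^(−λS), the expected number of chance alignments scoring at least S between a query of length m and a database of length n, with K and λ fitted to the scoring system. A small sketch follows; the K and λ values used are placeholders, since BLAST reports the fitted values for each scoring scheme.

```python
# Karlin–Altschul expectation for an ungapped local alignment score:
#   E = K * m * n * exp(-lambda * S)
# K and lam below are placeholders; real values depend on the scoring
# matrix and are reported in BLAST output.
import math

def evalue(score, query_len, db_len, K=0.13, lam=0.32):
    return K * query_len * db_len * math.exp(-lam * score)

def pvalue(e):
    # Probability of at least one chance hit, assuming Poisson-distributed hits.
    return 1.0 - math.exp(-e)

if __name__ == "__main__":
    for s in (30, 50, 80):
        e = evalue(s, query_len=300, db_len=5_000_000)
        print(f"score {s:3d}: E = {e:10.3g}  P = {pvalue(e):.3g}")
```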
1997: PSI-BLAST (Position-Specific Iterated BLAST)
Building upon the BLAST framework, PSI-BLAST constructs sequence profiles from initial alignments and iteratively searches the database with them. This allowed researchers to efficiently detect even more remote homologous relationships.
Altschul, S.F. et al. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research
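The core of the profile idea can be sketched briefly: count how often each residue occurs at each column of the alignment of hits, add pseudocounts, convert the column frequencies to log-odds scores against background frequencies, and use the resulting position-specific scoring matrix (PSSM) in the next search iteration. The sketch below keeps only this core; PSI-BLAST itself adds sequence weighting and more elaborate pseudocounts.

```python
# Simplified position-specific scoring matrix (PSSM) built from an MSA of
# gap-free hit segments; only the core log-odds idea is shown.
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(msa, background=None, pseudocount=1.0):
    """msa: list of equal-length, gap-free aligned strings."""
    if background is None:
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    pssm = []
    for col in range(len(msa[0])):
        counts = Counter(seq[col] for seq in msa)
        total = len(msa) + pseudocount * len(AMINO_ACIDS)
        scores = {
            aa: math.log2(((counts[aa] + pseudocount) / total) / background[aa])
            for aa in AMINO_ACIDS
        }
        pssm.append(scores)
    return pssm

def score_sequence(pssm, seq):
    """Score a gap-free candidate segment of the same length as the profile."""
    return sum(col[aa] for col, aa in zip(pssm, seq))

if __name__ == "__main__":
    msa = ["MKTAY", "MKSAY", "MRTAY", "MKTAF"]
    pssm = build_pssm(msa)
    print(round(score_sequence(pssm, "MKTAY"), 2))   # close homolog: high score
    print(round(score_sequence(pssm, "GGGGG"), 2))   # unrelated: low score
```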
1998: HMMER: profile hidden Markov models for sequence search and alignment
Sean Eddy developed HMMER, an efficient suite of methods that applies hidden Markov models to sequence search and alignment. By incorporating probabilities for insertions and deletions into the profile scoring, HMMER significantly improved the sensitivity of sequence comparisons, establishing itself as a critical tool for the large-scale annotation of protein families and domains.
Eddy, S.R. (1998). Profile hidden Markov models. Bioinformatics, 14(9), 755–763.
Eddy, S.R. (2011). Accelerated Profile HMM Searches. PLoS Computational Biology
2005–2012: HH-suite: fast HMM–HMM alignment
Johannes Söding and colleagues developed HHsearch and HHblits, which compare hidden Markov models against hidden Markov models (HMM–HMM). This method greatly enhances sensitivity for detecting remote homology, enabling the discovery of extremely distant relationships that might be missed by traditional approaches.
Söding, J. (2005). Protein homology detection by HMM–HMM comparison. Bioinformatics,
Remmert, M. et al. (2012) HHblits: lightning-fast iterative protein sequence searching by HMM–HMM alignment. Nature Methods.
2016–2019: MMseqs2 and Linclust: clustering proteins in linear time
Efforts to cluster vast amounts of sequence data led to the development of tools like MMseqs2 (Many-against-Many sequence searching) and Linclust. These methods enable ultra-fast clustering and detection of distant homologs in large-scale sequence datasets, facilitating the construction of large-scale reference databases and comprehensive multiple sequence alignments for downstream analyses. Additionally, the Uniclust resource was established to provide deeply clustered and annotated protein sequence databases based on UniProt data, which were used for AlphaFold2 training.
Hauser, M., Steinegger, M., & Söding, J. (2016). MMseqs: software suite for fast and deep clustering and searching of large sequence sets. Bioinformatics
Steinegger, M., & Söding, J. (2017). MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology
Mirdita, M. et al. (2017). Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Research
2017: Metagenomic Data Integration for structure prediction
The integration of metagenomic sequencing data dramatically expanded the pool of available protein sequences by adding billions of sequences from diverse microbial communities. This expansion—driven by a global effort to sequence environmental samples and deposit the resulting data—has vastly improved the breadth and accuracy of the multiple sequence alignments used in protein structure prediction and other analyses.
Ovchinnikov, S. et al. (2018). Protein structure determination using metagenome sequence data. Science
2021: Making AlphaFold2 accessible to all through ColabFold
ColabFold made AlphaFold2 predictions widely accessible to researchers and practitioners without access to large-scale computing infrastructure by providing high-quality, rapid and free-of-charge multiple-sequence alignment (MSA) generation through a publicly accessible MMseqs2 search server and a user-friendly Google Colab-based notebook interface.
Mirdita, M. et al. (2022). ColabFold: making protein folding accessible to all. Nature Methods
Giovanni Paternostro
Mar 05, 2025
In From PDB to AlphaFold
Søren Brunak is a Professor of Disease Systems Biology at the University of Copenhagen and Professor of Bioinformatics at the Technical University of Denmark. He is also Research Director at the Novo Nordisk Foundation Center for Protein Research at the University of Copenhagen Medical School.
He sent the following comment:
I agree that access to data was key to AlphaFold and to previous work in this field. I still have in my office the original magnetic tapes that contained the PDB data used to train the machine learning methods in the 1988 and 1990 papers.
The AlphaFold method predicts inter-residue distance distributions and then converts the predicted distance probabilities into a statistical potential, which is minimized to obtain 3D coordinates.
In 1990, we were the first to predict the distance matrix of proteins with neural networks (Bohr et al., 1990). At my Center for Biological Sequence Analysis, we later developed energy minimization methods that converted the distance matrices into coordinates. Many other people did that, but the 1990 paper made the important step of designing neural networks to predict distance matrices. We also organized a meeting in 1993 around distance-based methods and published the contributions as proceedings (Bohr & Brunak, eds., 1994).
Bohr H, Bohr J, Brunak S, Cotterill RM, Fredholm H, Lautrup B, Petersen SB. A novel approach to prediction of the 3-dimensional structures of protein backbones by neural networks. FEBS Lett. 1990 Feb 12;261(1):43-6. doi: 10.1016/0014-5793(90)80632-s. PMID: 19928342.
Eds. Bohr H, Brunak S. Protein Structure by Distance Analysis. IOS Press, Amsterdam, 352 pp., 1994.
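One simple way to see how a distance matrix determines 3D coordinates is classical multidimensional scaling: the squared distances are converted into a Gram matrix of inner products, and the top three eigenvectors give the coordinates up to rotation and reflection. The sketch below shows only this geometric step; the energy-minimization methods mentioned above, and the statistical potentials used by AlphaFold, additionally handle the fact that predicted distances are noisy and probabilistic.

```python
# Minimal sketch: recovering 3D coordinates from an (exact) inter-residue
# distance matrix via classical multidimensional scaling.
import numpy as np

def coords_from_distances(D):
    """D: (N, N) symmetric matrix of pairwise distances. Returns (N, 3) coords."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    G = -0.5 * J @ (D ** 2) @ J                  # Gram matrix of inner products
    eigval, eigvec = np.linalg.eigh(G)           # ascending eigenvalues
    idx = np.argsort(eigval)[::-1][:3]           # top three components
    return eigvec[:, idx] * np.sqrt(np.maximum(eigval[idx], 0.0))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_xyz = rng.normal(size=(10, 3))                       # toy "structure"
    D = np.linalg.norm(true_xyz[:, None] - true_xyz[None, :], axis=-1)
    xyz = coords_from_distances(D)
    # The recovered coordinates reproduce the distances (up to rotation/reflection):
    D_rec = np.linalg.norm(xyz[:, None] - xyz[None, :], axis=-1)
    print(np.allclose(D, D_rec))                              # True
```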
Giovanni Paternostro
Feb 20, 2025
In From PDB to AlphaFold
George Blumberg told us about talking with his father, Medicine Nobel laureate Baruch (Barry) Blumberg, about the importance of exploratory research in science.
We discussed these statements from Barry Blumberg's autobiography:
“Arriving at a medical diagnosis can be viewed as an example of scientific process. […]
In the induction phase, we first collect data and with them formulate a hypothesis. […]
Pure induction is always modified by some preconception of where the investigation will go. The fact that a decision was made to collect data in a particular place (the beach at Cape Cod, the surface of the moon, the glaciers of Greenland) and within a particular category of nature (plants, mollusks, clouds, quarks, nucleic acids, etc.) indicate the existence of a working hypothesis. […]
In the deductive phase, the hypothesis is stated first, and then experiments are devised and observation made in an attempt to support (“prove”) or reject it.” (Pages 24-25)
“It is a common experience of scientists that unexpected data are often the most interesting because they generate totally new kinds of ideas. Recognizing this we began to organize our study design so as to produce unexpected results. This might seem semantically facetious – if you expect something unexpected, can it really be unexpected? – but it works.” (Page 26)
Blumberg, B.S., 2002. Hepatitis B: The hunt for a killer virus. Princeton University Press.
Helen Berman also described her discussions with Barry about data collection and discovery-based research.
Reading these reflections now, some questions are: Which types of hypotheses and data collection are more productive in AI-based science? What are the roles best played by human scientists and by AI?
Giovanni Paternostro
Feb 20, 2025
In From PDB to AlphaFold
Why was it DeepMind, rather than an academic group, that built AlphaFold2?
Pierre Baldi (UC Irvine, Director of the AI in Science Institute):
The main reason is hardware and infrastructure. You need a large cluster of GPUs (most if not all academic labs do not have one) and the corresponding IT/software-engineering infrastructure.
Mohammed AlQuraishi (Columbia University):
[from https://moalquraishi.wordpress.com/2020/12/08/alphafold2-casp14-it-feels-like-ones-child-has-left-home/#s5 where a longer reflection is presented]
First and foremost it has to do with the people who make up the AF2 [AlphaFold2] team. One should not pretend that they are substitutable. Even within DeepMind, if it were a different set of people we would likely have had a different outcome. This may seem obvious but I repeatedly heard people treat the AF2 team as an amorphous blob. Let us not forget that the main reason they did so well is because of who they are, their talents, and their dedication. In this most important sense, it is not about DeepMind at all.
Resources also helped and this is not to be underestimated, but I would like to focus on organizational structure as I believe it is the key factor beyond the individual contributors themselves. DeepMind is organized very differently from academic groups. There are minimal administrative requirements, freeing up time to do research. This research is done by professionals working at the same job for years and who have achieved mastery of at least one discipline. Contrast this with academic labs where there is constant turnover of students and postdocs. This is as it should be, as their primary mission is the training of the next generation of scientists. Furthermore, at DeepMind everyone is rowing in the same direction. There is a reason that the AF2 abstract has 18 co-first authors and it is reflective of an incentive structure wholly foreign to academia. Research at universities is ultimately about individual effort and building a personal brand, irrespective of how collaborative one wants to be. This means the power of coordination that DeepMind can leverage is never available to academic groups. Taken together these factors result in a “fast and focused” research paradigm.
Jake Feala (cofounder at Lila Sciences) also addressed this question as part of his contribution.
Giovanni Paternostro
Feb 20, 2025
In From PDB to AlphaFold
Tomaso Poggio is a Professor at MIT and co-director of the Center for Brains, Minds, and Machines.
He sent the following comment:
The main remark is that the real breakthrough did not happen with deep learning but rather with machine learning about 25-30 years earlier, when the central paradigm in computer science changed dramatically from programming to training. It was such a radical revolution that most of the key figures—Vapnik, Fukushima, Hinton, Baldi, Schoelkopf, myself—were not computer scientists! In machine learning, until around 2010, the best architectures were shallow networks (Support Vector Machines, Radial Basis Functions). The first successes in biology, generative graphics, finance, autonomous driving, and vision took place in the 1990s.
Tomaso has also played a role in this history as one of the initial investors in DeepMind. This is a translation of a relevant paragraph from one of his books:
"Around that time, a brilliant young visiting scientist from London arrived in my laboratory, someone who had a lot to do with games. Demis Hassabis had already been a chess prodigy at age four, and by 17—having finished high school two years ahead of his peers—he had co-designed and programmed Theme Park, a video game that went on to sell more than ten million copies.
After meeting Peter Thiel, co-founder of PayPal, at a Google-organized conference, and finding that other major investors were also interested, Demis realized his dream of founding an artificial intelligence startup focused on gaming was feasible. His idea was that gaming could be the initial goal, with the eventual aim of tackling more significant scientific problems. After a long discussion, Demis asked me to join the first round of investment. Hesitant, I invited him to dinner at our home to talk more about the project, although my secret intention was to introduce him to Barbara. After dinner, Barbara had no doubts: "Go ahead and invest," she told me. Thus, with my very modest contribution, DeepMind—the mind of "deep" learning—was born, starting as a small group of computer scientists and software engineers confined to a London office.
The turning point came in 2014, when Google acquired DeepMind and, notably, the brilliant minds working there, beginning with Demis and co-founders Shane Legg and Mustafa Suleyman."
translated from
Poggio, T. & Magrini, M. Cervelli menti algoritmi. (Sperling & Kupfer, 2023).
at <https://www.sperling.it/libri/cervelli-menti-algoritmi-marco-magrini>
Giovanni Paternostro
Jan 29, 2025
In From PDB to AlphaFold
Giovanni Paternostro, Guy Salvesen and Silvia Vicenzi have spoken with Helen Berman.
Helen has had a key role in starting and promoting the Protein Data Bank (PDB) and was for many years the Director of the PDB. Helen has been recognized with many awards, notably the 2006 Buerger Award from the American Crystallographic Association and the 2012 Carl Brändén Award from the Protein Society. In 2023 Helen was elected to the National Academy of Sciences. More details are shown in her online memoir.
Dear Helen,
Do you have any comments about the timeline describing the history "From PDB to AlphaFold" and any suggestions on how we could improve it? [Helen's comments and suggestions have been incorporated into the current version of the timeline]
Helen: Some parts of it I really liked a lot but there were a couple of omissions and one mistake. The mistake is that the picture you mention with me, Sung-Hou Kim, Joel Sussman and Ned Seeman was not taken when we were on our way to the Cold Spring Harbor meeting in 1971. But that is the car we used and several of those in the picture went to that meeting.
There is another thing that is not mentioned: the management structure of the RCSB consortium. When this consortium took over the management of the PDB in 1999, I became the Director; John Westbrook from Rutgers, Peter Arzberger (and then Phil Bourne) from SDSC, and Gary Gilliland from NIST became the co-Directors.
Another thing that's missing, that I know was and continues to be very important for allowing the data to be used effectively, is the creation of a new data representation called mmCIF. This was done under the auspices of the IUCr, by a group headed by Paula Fitzgerald. The difference between the legacy PDB format and mmCIF is that mmCIF is machine readable and self-defining, with complete data item definitions. The relationships among data items are completely explicit; for example, the relationship between the atomic coordinates and sequence in the atom records and the sequence in the experiment. That format (now called PDBx/mmCIF), which is an entirely new way of doing data representation, was absolutely key to the AlphaFold success.
Do you remember when it was officially adopted?
Helen: We started working on mmCIF in about 1990, we had our first version in 96, the RCSB began using it then, before we became in charge of the PDB, and then it was used by the PDB and it finally got complete buy-in in 2011 and then officially became the master format in 2015. It's really important that some description of mmCIF is there.
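To make the contrast with the legacy fixed-column format concrete, here is a small sketch of reading a PDBx/mmCIF entry into named data items with Biopython; the entry and the tags shown are examples, and gemmi or the wwPDB tools could be used in the same way.

```python
# Small sketch: reading a PDBx/mmCIF entry into a tag -> value dictionary
# with Biopython. The entry (4HHB, haemoglobin) is just an example and is
# assumed to have been downloaded from the PDB beforehand. The point is that
# every data item is named and machine readable, unlike the fixed-column
# legacy PDB format.
from Bio.PDB.MMCIF2Dict import MMCIF2Dict

cif = MMCIF2Dict("4hhb.cif")

# Self-defining data items: every value is addressed by category and item name.
print(cif["_struct.title"])      # entry title
print(cif["_exptl.method"])      # experimental method

# Atomic coordinates are just another named category, explicitly linked to the
# polymer sequence through items such as _atom_site.label_seq_id.
print(len(cif["_atom_site.Cartn_x"]), "atom records")
```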
In one of your presentations, in 2021, called "The Evolution of the Data Sharing Culture in Structural Biology" (https://www.youtube.com/watch?v=oeY0mB7xVTQ), you pointed out as a key step that from 2008 everyone had to deposit not only the coordinates but also the structure factors in the PDB.
Helen: Yes, that was key to validation because without the structure factors you could only validate the geometry but with the structure factors you could validate against the processed data, so that was very important.
We hope scientists in different fields will start considering what was done and what they could do next in their field, using the PDB example. That seems to need a collective effort.
Helen: Yes, that would be a very good thing. I was asked to give a talk at a Chan Zuckerberg Initiative meeting about the Future of Imaging. They specifically asked me to speak about the history of the PDB. The reason they wanted it is that there's a huge amount of data in many different formats and they want to know how they can help organize their community, to put the data in some form that can be used. A lot of the talk I gave there was about the sociology of all of this. I belong to a collaborative called the Stakeholders Alliance where we discuss effective ways of collaboration and communication. That's obviously very important. I believe part of the success of the PDB had to do with continuous communication.
What you said about the sociology aspect is very relevant. How was the evolution of this social process, of convincing everyone to share data?
Helen: It was a constant conversation. People talked to one another, at the beginning just to set the thing up, and there were a few of us doing that, then we had that petition in 1971, then we managed to get the PDB set up at Brookhaven and then we had to write letters to everybody to convince them to put the data in. There was no requirement to put data in the PDB.
In the very early days, some people were really afraid to let go of their data because they thought they wouldn't get the full recognition, and/or they wouldn't be able to do all the derivative experiments.
Finally, especially because of the emergence of HIV in the 80s, people felt very strongly that the data had to be completely shared, and had to be in the PDB in order to develop new drugs. That was the background of the petitions that were done by Fred Richards and others. The International Union of Crystallography (IUCr) set up a committee to decide what data should go in the PDB. Part of the reason for the success in this was that the crystallographic community is extremely well organized. In the crystallographic community it was always understood that we had to have standards and procedures. It's maybe a part of the discipline, in order to do a structure, you have to be super organized. There's also the social influence of the early actors in structural biology. They were social thinkers themselves, like JD Bernal, and that had a big influence on the people who did the work. There was a very different kind of culture in the early crystallography community, compared with the culture that evolved in molecular biology.
How important do you think visualization was in the development of the PDB?
Helen: The beginnings of the PDB came partially as a result of my visiting MIT and seeing Project MAC, where they had one of the earliest computers used to do visualization; we were able to see the molecules, and that really had a big influence. After we convinced Walter Hamilton at Brookhaven to set up the PDB, we formed a project called CRYSNET. The idea of CRYSNET was to network the main computer at BNL to computers around the country that would have computer graphics. We were able to do all the calculations at Brookhaven using the ARPANET, and then visualize the proteins on home computers. We actually got a grant in January of 1973 to do just that. Most structural biologists are really into visualization and they can't write a paper about a structure without figuring out how to show it and how to illustrate the different features. Visualization is extremely important and was part of the motivation for the PDB.
What do you think about CASP, were you involved with that?
Helen: I wasn't involved with CASP, except to help make sure that they got what they needed at the right time. In other words, the PDB had to cooperate closely with CASP, because we were being asked to hold back data long enough for the prediction to happen, without too much of a delay.
On one hand we were supposed to get the data out but on the other hand if the data were out then they wouldn't be able to properly test the methods. We had to work very closely with CASP to make sure the data were out to them at the right time.
Torsten Schwede, who is based in Switzerland, was very much involved with the CASP evaluations. He asked whether we would be willing to publish the sequences of the structures with a short delay of four or five days, so that they could use computational methods to do rapid predictions. He used a program called CAMEO which did automated structure predictions, and we had to work closely, and I would say cooperatively, so that we wouldn't compromise the PDB by holding back structures longer than absolutely necessary.
Another point I would like to make clear is that people often say, and it really annoys me: oh well, the PDB data are special and that's why it worked. The only reason the PDB data are special is that we took the time and effort and energy to define everything that was in the PDB, to curate all the data. Anybody can do that if they're willing to do it, but many people will not take the time. If you do data curation for a living, people don't think it's much of anything, they think it's basically secretarial work; they have no idea how much is involved and so they really belittle the whole thing. Sometimes they do not treat the curators properly. We wouldn't have AlphaFold if it weren't for the curators.
What was your opinion of the importance of the JCSG in the development of the field? The JCSG was the Joint Center for Structural Genomics, which had a large NIH grant and also a lot of industrial input, working on the cloning, expression and structure determination of many proteins.
Helen: I was part of that. It was super important. It was part of an initiative supported by the NIH, the Protein Structure Initiative.
JCSG was one of the really excellent centers. I don't remember exactly how many centers there were, there might have been seven core centers and maybe 10 other kinds of centers. I think the initiative was super important and what it achieved was that in addition to getting the structures of a lot of proteins it improved the methodology for crystallography enormously, so that now many steps are done with robots. People said: oh if you do fast structure determination it's going to degrade the quality of the structures, but in fact the structures done by structural genomics centers including JCSG were of much higher quality than the usual structures. They were very well done and the tragedy in my opinion is that NIH ended the whole program abruptly and we lost a lot as a community, as a crystallographic community and as a scientific community. That was not a good decision on the part of NIH, because, if they had just let the initiative go on, we would have a whole lot more structures. They were producing huge numbers of structures of very high quality and improved the methodology enormously. They were really important, so I think it was tragic when that was ended.
We read that you were for many years at the Fox Chase Cancer Center, and I wonder if you ever met Nobel laureate Baruch (Barry) Blumberg there.
Helen: Yes, and he had a big influence on me.
I used to have lunch with him often. He believed that if you collect the data and put them in good form, someday you'll figure out what to do with it. We used to talk all the time about how you organize data. He used relational databases for all his blood samples. When we set up the PDB we used similar database technology.
He was a terrific guy. He got the Nobel prize for discovering the hepatitis B virus and then some people got really angry because he was not doing hypothesis driven research. In fact, he was doing discovery-based research which eventually led to the creation of the hepatitis B vaccine.
Giovanni Paternostro
Oct 27, 2024
In AI Roundtable
To expand the cellcomm.org AI discussion and encourage participation of early career scientists, Zinia Charlotte Dsouza and Cristiana Dondi, the chairs of the SBP-Science Network (the society of postdocs and students), along with Guy Salvesen and Giovanni Paternostro, hosted a Roundtable on the AI Revolution in Science at Sanford Burnham Prebys in La Jolla on October 22, 2024. Invited speakers were Giorgio Quer (Scripps), Talmo Pereira (Salk), Sanjeev Ranade (SBP), Karen Mei (UCSD), Ani Deshpande (SBP), Will Wang (SBP) and Sanju Sinha (SBP).
The following question, which emerged from previous discussions and surveys, has been addressed in recent interviews and by the roundtable participants:
What could be achieved if there was a public or nonprofit AI effort with the same scale and level of funding as the current large private efforts? What would be the benefits for society?
This question can only be answered by a wide and open sharing and integration of ideas by the scientific community.
The in-person roundtable presentations and Q & A can be read on this page.
You are welcome to post extended remarks here, or enter shorter comments or questions about AI in Science below. Please add your name and affiliation. Criticisms of individuals or groups are not appropriate; we focus on discussion and analysis of ideas. You can also comment on how the debate should be organized and motivated.
Giovanni Paternostro
Oct 11, 2024
In Interviews with Experts
Guy Salvesen and Giovanni Paternostro have interviewed Talmo Pereira, a Fellow and Principal Investigator at the Salk Institute. His lab builds computational tools that leverage deep learning and computer vision to study complex biological systems. He has developed widely used tools that track movements for animal studies of behavior (1).
Dear Talmo,
What could be achieved if there was a public or nonprofit AI effort with the same scale and level of funding as the current large private efforts? What would be the benefits for society?
Talmo:
That's a really great question, and frankly, it's one that needs to be addressed at multiple levels. I think that Europe is developing some interesting initiatives to centralize around clusters of high-performance computing that enable AI. When we talk about the kind of resources that are necessary to do AI, really what we're talking about is GPUs. It's a specialized kind of hardware. It is very expensive, because NVIDIA has a monopoly on it, and because the supply chain is deadlocked in geopolitics now, with TSMC and these other manufacturers.
The CHIPS Act has helped a little bit, but at the moment all the software is really designed around NVIDIA's chips, and it is really hard to break out of that, unless you're at Google, and using Google's TPUs, which are really the only other feasible option. These massive pools of GPUs are needed to enable training of what we call foundation models.
These are large-scale models, models of the same scale and capacity as ChatGPT. In science we've seen that some of these already begin to emerge, certainly in protein folding, and also in multi-omics, and omics in general. There's been a series of papers recently on training these foundation models on large-scale genetic data. That's an obvious application. It's sequential and easy to encode in a way that makes sense. In the same way that ChatGPT has all these emergent properties and capabilities, just by digesting a lot of text, you can imagine we could have a lot of properties and capabilities that emerge out of doing this with genetic information.
The most reductionist description I've heard is that you will spend a hundred million dollars to do RNA-seq on every single cell in the brain or any other part of the body, and even in cancer, and then you throw it all into a UMAP plot. This is a very simplistic description of what AI models could enable. You could fine-tune these representations, steering the kinds of information captured in omics data, to then do a lot more: everything from open reading frame prediction, to augmenting protein folding, to predicting binding sites.
If you include regulatory data and expression data, then it can begin to infer the structure of regulatory networks. If you have a cell state that you want to get to, you can do what's called a reverse perturbation.
Essentially, the big idea with these models is that they enable in silico experimentation. Virtual biology: not computational biology, not bioinformatics, but virtual biology. While something like bioinformatics seeks to process the data better, and computational biology seeks to model it better, virtual biology seeks to emulate the process of doing bench science with a sufficiently capable simulacrum of the biological system.
If you have a sufficient amount of data in a sufficiently capable model, it can reproduce some part of that biological system. And you can design it in such a way that it is not just a description of the data, not just a hypothesized model for it, but really a direct analog. You could point to a specific layer in your neural network and say, this corresponds to this gene, this corresponds to this neuron. Then the experiments that fall out of this type of modeling are directly testable.
Guy:
Who will do the experiments needed to test these predictions?
Talmo:
We find that many experimental biologists would very gladly use an AI simulation of the work they're doing at the bench that can tell them how a new result, integrated into all the previous results, might predict the next experiment. And maybe not just what happens if they do a specific experiment, but what happens if they do this whole class? For example, ablating every type of neuron or trying every type of stimulation pattern. The model will give you a ranked list of things that modify and achieve the phenotype that you want, and then you can go and do it in the lab. Scientists already go to NCBI and look at the genome browsers before doing a perturbation and designing CRISPR probes. These kinds of tools are already part of the workflow.
The challenge is to make AI models more accessible, in the sense of the technical barriers but also the cost, such that they can become a run-of-the-mill tool. And when we get there, then you will not need folks like me anymore to run those things and tell you what those hypotheses are. Instead, my experimental collaborator will do it directly. This direct contact will help to ground the algorithmic process in the types of questions that experimentalists want to ask.
There is one ongoing initiative by NSF called NAIRR that is attempting to move us in that direction, but it is still at an early stage.
Another challenge I want to mention is infrastructural. I have worked at Google during part of my PhD work. This made me appreciate how fundamentally important it is to have specialized support. It's not just enough having the GPUs. You need to be able to harvest them.
High-performance computing is a new form of computing, especially when it comes to GPUs. At Google they had all these systems to help you scale from 1 GPU to 1000 or 10,000 GPUs. Many software engineers were dedicated to this task, to make sure that all the GPUs were up and running and monitoring their progress. This support is invaluable.
Guy:
What can we accomplish in the public domain that large private players like Google DeepMind can't or won't do?
Talmo:
If we look at some fields in science and engineering, we can see clear examples of what that route might look like. We might look at space research. Initially, we wanted to compete with the Russians, so we poured billions upon billions of dollars into that basic research, and it was really spread out. A lot of it was done at NASA, but they also just funded technology development everywhere. And then the knock-on effects were incredible. There are so many positive externalities of doing that. What ends up being the big advantage is that we are a little bit looser and less mission-oriented with public funding. You never know where the next great idea is going to come from. In the private sector you can see how Google is now decreasing the resources dedicated to open-ended science, which was done by DeepMind, and putting more resources into LLMs, to face the competition of OpenAI.
We are encouraging researchers at different career stages to share ideas about complex science problems that could benefit from a large-scale AI effort. We found that motivation and recognition could be provided if you and other well-known scientists were willing to talk to people that suggest the best ideas. You would be the judge and decide if any idea is deserving of your attention. Any scientist selected might receive advice but could also be a potential collaborator. Many ideas will be produced, and society will take notice. Would you be willing to talk to any of these scientists?
Talmo:
Yes, of course. I have encountered competition, as any other scientist, but it does not stop me from pursuing a philosophy of openness, sharing and collaboration. This is an approach to science that has served us quite well.
I think that, as the current generation of junior scientists grows up in science, there's going to be a rapid culture shift towards a better understanding of how team science needs to evolve, and how recognition and credit assignment need to evolve. It's already happening. I sit on search committees now, and we do not just look at who is the first versus the co-first author; we try to understand exactly what their contributions were.
REFERENCES
1- Marx, V. 20 years of Nature Methods: how some papers shaped science and careers. Nat Methods 21, 1786–1791 (2024). https://doi.org/10.1038/s41592-024-02452-x