Giovanni Paternostro, Guy Salvesen and Silvia Vicenzi have spoken with Helen Berman.
Helen has had a key role in starting and promoting the Protein Data Bank (PDB) and was for many years the Director of the PDB. Helen has been recognized with many awards, notably the 2006 Buerger Award from the American Crystallographic Association and the 2012 Carl Brändén Award from the Protein Society. In 2023 Helen was elected to the National Academy of Sciences. More details are shown in her online memoir.
Dear Helen,
Do you have any comments about the timeline describing the history "From PDB to AlphaFold" and any suggestions on how we could improve it? [Helen's comments and suggestions have been incorporated into the current version of the timeline]
Helen: Some parts of it I really liked a lot but there were a couple of omissions and one mistake. The mistake is that the picture you mention with me, Sung-Hou Kim, Joel Sussman and Ned Seeman was not taken when we were on our way to the Cold Spring Harbor meeting in 1971. But that is the car we used and several of those in the picture went to that meeting.
There is another thing that is not mentioned: the management structure of the RCSB consortium. When this consortium took over the management of the PDB in 1999, I became the Director; John Westbrook from Rutgers, Peter Arzberger (and then Phil Bourne) from SDSC, and Gary Gilliland from NIST became the co-Directors.
Another thing that's missing, which I know was and continues to be very important for allowing the data to be used effectively, is the creation of a new data representation called mmCIF. This was done under the auspices of the IUCr by a group headed by Paula Fitzgerald. The difference between the legacy PDB format and mmCIF is that mmCIF is machine readable and self-defining, with complete data item definitions. The relationships among data items are completely explicit: for example, the relationship between the atomic coordinates and sequence in the atom records and the sequence in the experiment. That format (now called PDBx/mmCIF), which is an entirely new way of doing data representation, was absolutely key to the AlphaFold success.
Do you remember when it was officially adopted?
Helen: We started working on mmCIF in about 1990 and had our first version in '96. The RCSB began using it then, before we took charge of the PDB, and then it was used by the PDB. It finally got complete buy-in in 2011 and officially became the master format in 2015. It's really important that some description of mmCIF is there.
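[Editors' illustration: to make the point about explicit relationships among data items concrete, here is a minimal sketch in Python. It is not the RCSB's software. The fragment is hand-written (the entry ID DEMO and the coordinates are invented), and the helper parse_simple_cif is a hypothetical toy parser that handles only whitespace-separated values and loop_ tables; real PDBx/mmCIF files also contain quoted and multi-line values, which it ignores. The item names follow the PDBx/mmCIF dictionary as we understand it.]

```python
# Hand-written, simplified fragment in the PDBx/mmCIF style (illustrative data only).
FRAGMENT = """
data_demo
_entry.id DEMO
loop_
_atom_site.id
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.label_seq_id
_atom_site.Cartn_x
1 N  MET 1 10.000
2 CA MET 1 11.200
3 N  ALA 2 13.500
loop_
_entity_poly_seq.num
_entity_poly_seq.mon_id
1 MET
2 ALA
"""


def parse_simple_cif(text):
    """Toy parser (an assumption for illustration): returns {category: [row dicts]}."""
    tables = {}
    tokens = text.split()
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "loop_":
            i += 1
            # Collect the column names, each written as _category.item
            names = []
            while i < len(tokens) and tokens[i].startswith("_"):
                names.append(tokens[i])
                i += 1
            # Collect values until the next item name, loop_, or data block
            values = []
            while (i < len(tokens) and tokens[i] != "loop_"
                   and not tokens[i].startswith(("_", "data_"))):
                values.append(tokens[i])
                i += 1
            category = names[0][1:].split(".", 1)[0]
            items = [n.split(".", 1)[1] for n in names]
            width = len(items)
            rows = [dict(zip(items, values[k:k + width]))
                    for k in range(0, len(values), width)]
            tables.setdefault(category, []).extend(rows)
        elif tok.startswith("_"):
            # Single key-value item outside a loop
            category, item = tok[1:].split(".", 1)
            tables.setdefault(category, [{}])[0][item] = tokens[i + 1]
            i += 2
        else:
            i += 1  # skips the data_ block header
    return tables


tables = parse_simple_cif(FRAGMENT)

# The sequence as declared for the polymer: position number -> residue name
sequence = {row["num"]: row["mon_id"] for row in tables["entity_poly_seq"]}

# Each atom record points back to that sequence through label_seq_id,
# so the two descriptions can be checked against each other mechanically.
mismatches = [atom["id"] for atom in tables["atom_site"]
              if sequence.get(atom["label_seq_id"]) != atom["label_comp_id"]]
print("atoms whose residue disagrees with the declared sequence:", mismatches)
```

Because every value is named by its category and item, a program can find the atom records and the declared sequence without relying on fixed column positions, and can cross-check them through the shared sequence number.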
In one of your presentations, in 2021, called "The Evolution of the Data Sharing Culture in Structural Biology" (https://www.youtube.com/watch?v=oeY0mB7xVTQ), you pointed out as a key step that from 2008 everyone had to deposit not only the coordinates but also the structure factors in the PDB.
Helen: Yes, that was key to validation because without the structure factors you could only validate the geometry but with the structure factors you could validate against the processed data, so that was very important.
We hope scientists in different fields will start considering what was done and what they could do next in their field, using the PDB example. That seems to need a collective effort.
Helen: Yes, that would be a very good thing. I was asked to give a talk at a Chan Zuckerberg Initiative meeting about the Future of Imaging. They specifically asked me to speak about the history of the PDB. The reason they wanted it is that there's a huge amount of data in many different formats and they want to know how they can help organize their community, to put the data in some form that can be used. A lot of the talk I gave there was about the sociology of all of this. I belong to a collaborative called the Stakeholders Alliance where we discuss effective ways of collaboration and communication. That's obviously very important. I believe part of the success of the PDB had to do with continuous communication.
What you said about the sociology aspect is very relevant. How did this social process of convincing everyone to share data evolve?
Helen: It was a constant conversation. People talked to one another, at the beginning just to set the thing up, and there were a few of us doing that, then we had that petition in 1971, then we managed to get the PDB set up at Brookhaven and then we had to write letters to everybody to convince them to put the data in. There was no requirement to put data in the PDB.
In the very early days, some people were really afraid to let go of their data because they thought they wouldn't get the full recognition, and/or they wouldn't be able to do all the derivative experiments.
Finally, especially because of the emergence of HIV in the 80s, people felt very strongly that the data had to be completely shared, and had to be in the PDB in order to develop new drugs. That was the background of the petitions that were done by Fred Richards and others. The International Union of Crystallography (IUCr) set up a committee to decide what data should go in the PDB. Part of the reason for the success in this was that the crystallographic community is extremely well organized. In the crystallographic community it was always understood that we had to have standards and procedures. It's maybe a part of the discipline, in order to do a structure, you have to be super organized. There's also the social influence of the early actors in structural biology. They were social thinkers themselves, like JD Bernal, and that had a big influence on the people who did the work. There was a very different kind of culture in the early crystallography community, compared with the culture that evolved in molecular biology.
How important do you think visualization was in the development of the PDB?
Helen: The beginnings of the PDB came partially as a result of my visiting MIT and seeing Project MAC, where they had one of the earliest computers used to do visualization; we were able to see the molecules, and that really had a big influence. After we convinced Walter Hamilton at Brookhaven to set up the PDB, we formed a project called CRYSNET. The idea of CRYSNET was to network the main computer at BNL to computers around the country that would have computer graphics. We were able to do all their calculations at Brookhaven using DARPANet, and then visualize the proteins on home computers. We actually got a grant in January of 1973 to do just that. Most structural biologists are really into visualization and they can't write a paper about a structure without figuring out how to show it and how to illustrate the different features. Visualization is extremely important and was part of the motivation of the PDB.
What do you think about CASP, were you involved with that?
Helen: I wasn't involved with CASP, except to help make sure that they got what they needed at the right time. In other words, the PDB had to cooperate closely with CASP, because we were being asked to hold back data long enough for the prediction to happen, without too much of a delay.
On one hand we were supposed to get the data out but on the other hand if the data were out then they wouldn't be able to properly test the methods. We had to work very closely with CASP to make sure the data were out to them at the right time.
Torsten Schwede, who is based in Switzerland, was very much involved with the CASP evaluations. He asked whether we would be willing to publish the sequences of the structures with a short delay of four or five days, so that they could use computational methods to do rapid predictions. He used a program called CAMEO which did automated structure predictions, and we had to work closely, and I would say cooperatively, so that we wouldn't compromise the PDB by holding back structures longer than absolutely necessary.
Another point I would like to make clear is that people often say, and I get really annoyed, they say: oh well the PDB data are special and that's why it worked. The only reason the PDB data are special is that we took the time and effort and energy to define everything that was in the PDB, to curate all the data. Anybody can do that if they're willing to do it, but many people will not take the time. If you do data curation for a living, people don't think it's much of anything, they think it's basically secretarial work, they have no idea how much is involved and so they really belittle the whole thing. Sometimes they do not treat the curators properly. We wouldn't have AlphaFold if it weren't for the curators.
What was your opinion of the importance of JCSG in the development of the field? JCSG was a joint center for structural genomics which had a large NIH grant, and also had a lot of industrial input, working on cloning, expression and structure of many proteins.
Helen: I was part of that. It was super important. It was part of an initiative supported by the NIH, the Protein Structure Initiative.
JCSG was one of the really excellent centers. I don't remember exactly how many centers there were, there might have been seven core centers and maybe 10 other kinds of centers. I think the initiative was super important and what it achieved was that in addition to getting the structures of a lot of proteins it improved the methodology for crystallography enormously, so that now many steps are done with robots. People said: oh if you do fast structure determination it's going to degrade the quality of the structures, but in fact the structures done by structural genomics centers including JCSG were of much higher quality than the usual structures. They were very well done and the tragedy in my opinion is that NIH ended the whole program abruptly and we lost a lot as a community, as a crystallographic community and as a scientific community. That was not a good decision on the part of NIH, because, if they had just let the initiative go on, we would have a whole lot more structures. They were producing huge numbers of structures of very high quality and improved the methodology enormously. They were really important, so I think it was tragic when that was ended.
We read that you were for many years at the Fox Chase Cancer Center, and I wonder if you ever met Nobel laureate Baruch (Barry) Blumberg there.
Helen: Yes, and he had a big influence on me.
I used to have lunch with him often. He believed that if you collect the data and put them in good form, someday you'll figure out what to do with them. We used to talk all the time about how you organize data. He used relational databases for all his blood samples. When we set up the PDB we used similar database technology.
He was a terrific guy. He got the Nobel prize for discovering the hepatitis B virus and then some people got really angry because he was not doing hypothesis-driven research. In fact, he was doing discovery-based research which eventually led to the creation of the hepatitis B vaccine.