Giovanni Paternostro and Guy Salvesen recently spoke with Sofia Serrano, a UW doctoral student in the Paul G. Allen School of Computer Science & Engineering who will soon start as an Assistant Professor in the Department of Computer Science at Lafayette College in Easton, PA; with Zander Brumbaugh, a master's student in the Allen School; and with Noah A. Smith, a professor in the Allen School who also works part-time at the Allen Institute for Artificial Intelligence. They recently published “Language Models: A Guide for the Perplexed,” a paper explaining large language models to a non-specialist audience.
- Can you briefly introduce large language models (LLMs)?
Sofia: These are statistical models that are trained on a lot of web text to predict the next word. Whenever they are prompted, models apply patterns they have picked up from their training data (all the web text they have been trained on) to determine what kind of word is likely to come next. Once a model has generated the first word, it does that again and again until it predicts where the sequence is likely to end. This is the high-level description.
What has made these models catch on is a set of capabilities that go beyond the fluency of the text they produce. If the training data are large enough, the models also capture world knowledge, or what can be called common sense. This implicit knowledge is needed to do well at next-word prediction. The models can shift to a different context with additional training, and in some cases without it.
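To make the predict-then-repeat loop concrete, here is a minimal, purely illustrative sketch in Python. It uses a tiny made-up corpus and simple word-pair counts rather than a neural network, so it is only a toy analogue of what real LLMs do at vastly greater scale; the corpus and function names here are hypothetical and not from the paper.

```python
import random
from collections import Counter, defaultdict

# Toy illustration of the idea Sofia describes: learn which words tend to
# follow which in the training text, then generate by repeatedly predicting
# a likely next word. Real LLMs use neural networks over vast web corpora;
# this bigram counter is only a sketch of the same prediction loop.
corpus = "the cat sat on the mat . the dog sat on the rug . <end>".split()

# Count how often each word follows each other word in the "training data".
next_word_counts = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    next_word_counts[current][nxt] += 1

def generate(prompt_word, max_words=10):
    """Repeatedly sample a likely next word until an end marker or length cap."""
    words = [prompt_word]
    while len(words) < max_words:
        candidates = next_word_counts.get(words[-1])
        if not candidates:
            break
        choices, weights = zip(*candidates.items())
        nxt = random.choices(choices, weights=weights)[0]
        if nxt == "<end>":
            break
        words.append(nxt)
    return " ".join(words)

print(generate("the"))  # e.g. "the cat sat on the rug ."
```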
Guy: Dictionaries also provide definitions of words using their context. Is this similar to what you described?
Sofia: The use of context has been important for a long time in language models, even before LLMs. But the basic idea of knowing a word by the company it keeps remains a core motivation that has spurred a lot of innovative work in this field.
- Can you comment on whether some of the LLMs' properties emerged unexpectedly when language models designed for simpler tasks like translation were scaled up?
Noah: I do not know for sure what everybody was thinking while they were developing these models, but from my point of view, language models have been around for a long time. They were used in systems for speech recognition and for machine translation to generate fluent output. They were essentially fluency modules. For translation, there are two goals when you are generating text: you want to be faithful to the input, to convey the same meaning as the input, and you want to be fluent in the output language, to make sense to someone who speaks that language. The language model was there to help with the fluency part. The language models of ten years ago were not as powerful as the current ones; nobody would have used them the way we use LLMs today. I have found the recent developments quite surprising, and I suspect everybody, even the people who first experimented with contemporary language models, was surprised to see how far you could go with a fluency machine.
Arguably the people who built models like GPT-3 were coming at this from a different perspective, a more entrepreneurial, machine learning perspective. Maybe they were not surprised, but anyone who took my classes would have been as surprised as I was.
- What do AI researchers still not fully understand about large language models?
Noah: There are different levels of understanding. We fully understand the math. There is no magic in the equations used for all the internal calculations a language model does when it predicts the next word. The structure of these equations was essentially designed to maximize throughput on the GPU hardware. On the other hand, we can observe models' behavior, and there has been a lot of experimentation on what happens if we manipulate the input in a certain way and how the output changes. We understand that predicting the next word is a good way of faking your way through a lot of things that look like intelligence. I do not mean that in a derogatory way; it is just a description of what happens when you have seen a lot of data.
But in the middle, regarding the mechanisms, there is a lot of debate about whether something more abstract is going on inside, beyond predicting the next word. We do not really have good tools for this; you cannot look at the weights on a particular artificial neuron, or at the activation of a neuron in the network, and make sense of it. The connectionist view implies that everything is interdependent. It is the entire network that produces the behavior, and we are at a loss in terms of good mathematical and empirical tools for making sense of what goes on in the middle. This is very exciting, but it is also frustrating, because it is in sharp contrast to the models we had before, where there was a tradition of not only understanding the parts but also knowing how to design the model to change its behavior and improve the quality of its output. Now we even have to debate what quality of output means. There is not just one score we can point to in order to guarantee that a model got better after we made some experimental change to it. There are many tradeoffs, and we have debates about what to evaluate, what the models are for, and the extent of their capabilities, and, as mentioned, we do not really know how they work inside. It is time for science!
Giovanni: This is something we also encounter in biology, when we study very complex systems. We know something about the components, genes and proteins within a cell or neurons within a brain, but we do not fully understand what determines the behavior of the systems.
Noah: These are good analogies. The brain might be a particularly apt one. However, what sometimes holds us back in our field is that when we talk about what these models do and how they work, the only analogies we have at hand are about human cognition. We fall back on language that describes what AI models do by analogy with what our minds do. We have a lot of verbs in English for what happens in our heads and, in my view, we use these terms too loosely; they do not have formal definitions. What is attention? Attention is a very complicated thing in humans, and we borrow that word to describe a very specific mechanism inside a neural network. Sofia has done research on this. Just using that word has led people to make assumptions about how the mechanism functions that are not always right. Analogies to humans are inevitable, since we have no other language for thinking about these things; we have to use these words, but they mislead us. Our papers are rife with imprecise reasoning because of these analogies with human intelligence. I think these models do not work at all the way our brains do.
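For readers curious about the "very specific mechanism" Noah refers to, here is a sketch of the scaled dot-product attention used in transformer models. The weighting-and-averaging below is the standard textbook form; the vectors are arbitrary made-up numbers, and real models apply this operation with learned parameters across many layers.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each output row is a weighted average of the rows of V, with weights
    given by how strongly each query matches each key (a softmax over scores)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax
    return weights @ V                       # blend the value vectors accordingly

# Three "words", each represented by a 4-dimensional vector (made-up numbers).
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(3, 4))
print(scaled_dot_product_attention(Q, K, V))
```

Despite the borrowed name, nothing here resembles the psychological notion of attention; it is just a particular way of mixing vectors.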
Sofia: There is another anecdote that helps to illustrate this point. I was talking to researchers in another discipline, and they told me that while reading papers about LLMs they kept seeing the term "reasoning" come up. They wanted to know its meaning within the LLM literature, but in reality the term is not used precisely, and there are different possible meanings. We latch onto terms that already have a lot of definitions attached within cognitive science, for example, without really intending a full analogy.
Guy: This is an interesting point. Should we use the English language to talk about the models and make analogies, or is there a better language that we could use?
Noah: We are using a part of the English language. We share some of it, but if you talk about biomedicine, I might not understand what you are saying, and you might not be familiar with the language I use to describe computational concepts. We all have sub-languages that we use in specific contexts. Part of the work of an academic discipline is to develop a good language to talk about the things that the community needs to reason about. In some fields that language has a lot of mathematics. The language we use to talk about the techniques of machine learning and AI has gone from being very mathematical just ten years ago to being much less so. The methods we use now are different, and it takes time to develop good nomenclature and good conventions for talking about these topics: partly because things move very fast, partly because a lot has happened in industry, where there is not a lot of pressure to communicate precisely to the outside world, and partly because in computer science we always have a level of description in the code, which is in a sense fully self-contained. Everything in a model can be explained in code, which is a kind of math. That alleviates pressure for people to always have good natural language communication. One of the things that our students need to learn is that we communicate both through the code and through what we say in the paper. The paper contains a mix of mathematics, English, and diagrams, and we must make sure that the description in the paper and the code match. This is actually a difficult skill. When people share the code, we often find that there is a lot missing in the paper. We are still figuring out how to talk about our field.
- Can LLMs interact and be integrated with other AI systems that complement them? An example of a different AI system that has been validated convincingly in biomedicine is AlphaFold. In previous interviews computational biologists have mentioned the possibility of using LLMs to extract information from the scientific literature (which is already too vast for human parsing) and to use it within other AI models.
Sofia: I think we are moving in that direction, but it is very important to be aware of the kinds of gaps that can arise between an LLM's output and the actual source of that information.
Giovanni: We are aware of so-called hallucinations where references given by LLMs do not actually exist, but it should be possible to find a way to validate these references within the AI systems.
Noah: I would add that beyond asking what the technology can do and how it can organize and present information, there is still an open question about the best way for humans to use these tools. Right now, the view that has been promoted by OpenAI is that it should be a chatbot and that we should have a conversation with it, as we would with another person. That is one type of interface, which is fine for some cases, like when we want directions while driving, but for a scientist who wants to understand the literature, a chatbot might not be the best choice. We must think more broadly about the way that the information is presented to users by the system. There is a huge amount of work still to be done to identify the best interface for scientific information. We do not know what that interface will look like in a few years, but I am fairly sure it will not be just chat; it will be something that has more access to your work, to other things you have written, and to other data, if appropriate, in a private manner. This is not even purely an AI question; it is perhaps even more relevant to the field of human-computer interaction. Interfaces that lend themselves well to human users' needs and also to incorporating the strengths of AI are a huge open question for research, and there is a lot of interest in exploring this.
Zander: Noah mentioned that AI could use your individual repository of literature. Some approaches to structuring a language model currently being explored in natural language processing research involve different kinds of augmented generation, where, for example, you can supply documents as context for your prompt. Another variant gives the LLM live access to the internet, so you could combine a traditional search engine, or information retrieval from a database, with an LLM, again providing context and reducing hallucinations. It could be that the answer to the scientific problems you mentioned will not be standalone LLMs, because they might never truly capture all of their training data accurately. Methods will probably be developed to ensure that true knowledge is elicited from an LLM.
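As a rough illustration of the retrieval-augmented setup Zander describes, here is a minimal sketch: fetch the most relevant documents first, then place them in the prompt so the model's answer can be grounded in them. Everything here (the tiny document store, the word-overlap scoring, and the call_llm stub) is a hypothetical placeholder rather than any particular system's API.

```python
# Minimal sketch of retrieval-augmented generation: retrieve relevant
# documents, then hand them to the language model as context.

documents = {
    "doc1": "AlphaFold predicts protein structures from amino acid sequences.",
    "doc2": "Language models are trained to predict the next word in text.",
}

def retrieve(query, k=1):
    """Rank documents by naive word overlap with the query (a stand-in for a
    real search engine or vector database)."""
    query_words = set(query.lower().split())
    scored = sorted(documents.items(),
                    key=lambda item: len(query_words & set(item[1].lower().split())),
                    reverse=True)
    return [text for _, text in scored[:k]]

def call_llm(prompt):
    """Placeholder for an actual language model call."""
    return f"[model answer grounded in the prompt below]\n{prompt}"

question = "What does AlphaFold do?"
context = "\n".join(retrieve(question))
prompt = f"Using only the context, answer the question.\nContext:\n{context}\nQuestion: {question}"
print(call_llm(prompt))
```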
Sofia: There are examples of ongoing research directions for scientific literature search; one of them is Semantic Scholar, which is being developed at the Allen Institute for AI. It is a machine-learning-based approach that is very different from interacting with a chatbot.
- We know that Noah works part-time for the Allen Institute for AI, one of the main nonprofit organizations developing AI systems. The largest AI models, including LLMs, are currently owned by a few big companies. How could these models be democratized (following the terminology you use in your paper)? Many scientists and historians are concerned that keeping models that play important scientific roles private and partially secret will decrease trust in science.
Noah: I would add that there are already concerning examples of the influence of private companies on fundamental science: for example, the company that produces the journal impact factors, and for-profit publishers that put up huge barriers to information access. I am proud to say that our field (natural language processing) has a fully volunteer-run journal that is completely open access with no publication costs. I wish this model were everywhere. If research is publicly funded, it should be shared with the entire public. These private companies are not incentivized to match the values that are shared by society. Letting OpenAI or other similar companies control a technology that is at the center of our work is very dangerous. I think these AI models need to be open; at the Allen Institute we are working on open models, as are others, and it is good for science to have multiple models to choose from. We put a lot of scientific data into our models; we really focus on applications for scientists. We also make public the dataset that we use to develop the models, which is very important for science, for trustworthiness, for increasing understanding of what the models do, and for a better appreciation of limitations and capabilities. None of the for-profit companies are ever going to show the data.
Giovanni: What you say seems very reasonable. However, if in the future there are legal restrictions on certain AI models for safety reasons, perhaps based on model size, it might still be useful to develop nonprofit AI models that are as transparent as possible and governed by the pursuit of the common good, rather than leaving the largest models only to private companies.
Noah: Many of the things people worry about with these models become less of a worry if the models are open, because openness allows us to do research. People talk about the hallucination problem, when models say things that are not true. I think the only way you are going to solve it is if a lot of researchers, including graduate students, have access to models and can study the problem, understand the factors that lead to it, and explore ways to mitigate it. This can only work if a lot of researchers, with different values from those of the companies that are making the big models, get access to them. Openness does not fix everything, but it increases the chance that we can make improvements. I would also add that regulating how researchers can work on these topics is really premature. In the US there is no history of having a lot of regulations about what people can do research on; it only happens in cases with the most extreme ethical risks, and I do not see anything like that here. This is not like a medication that could kill somebody.
Zander: I agree with Noah. Some recent concerns about risk have focused on models that can execute code and interact with their environment, but they can only interact with what you give them permission to access. We have a section in the paper discussing AI and regulation. Both the EU AI Act and the US Executive Order on AI cite specific thresholds of computational power beyond which models are considered a potential risk; this shows how hard it is to define risk directly, as many researchers would probably frame the problem of risk differently. These early attempts at legislation can quickly become outdated as the technology rapidly progresses and new hardware is developed.
Sofia: I would add another benefit of openness. We have mentioned that open models can spur additional research and be improved by it. Another benefit of democratizing models is improving our tools for AI literacy. From a public education perspective, the more we as scientists know about these models, the better equipped we are to communicate to the public how these models work, and therefore to increase public understanding of the kinds of patterns the models are likely to produce in their outputs. This can increase the public's preparedness to deal with possible negative effects of the models.
- Can you comment on our recent surveys of biomedical scientists and economists, especially the willingness of most of them to interact preferentially with AI systems that are more transparent and in whose control the scientific community participates?
As you mention in the paper, human interactions contribute to training the models.
Noah: One of the ways these models get better is by interacting with real people. That is part of how OpenAI is cornering the market: they have the best model right now, most people are using it, they are gathering data, and they can use those data to keep making the model better. We are trying to develop LLMs for the research community, and we think of it like building other research infrastructure, for example telescopes or deep-sea exploration facilities. These are research tools that many organizations can benefit from. We welcome advice and participation from scientists across many different disciplines, users who can not only help train the model but also contribute to designing how the interface should look. They can help us learn how researchers from different disciplines, from microbiology to political science, want to interact with these systems and how they want to access their literature. These are open questions. We want both to study the models and to develop them to provide the best service to scientists, and you cannot do that without scientists invested in the process.
Giovanni: Scientists are more likely to participate if trust is developed. Some might worry after having seen how OpenAI started as a nonprofit and then changed to a mainly for-profit organization, setting up a hybrid model.
Noah: I can point out that the Allen Institute for AI has a long track record of doing research for the public good. We are also part of a larger family of Allen Institutes doing nonprofit research, including research on biomedical problems.
Giovanni: Another advantage of wide participation by scientists is that they might suggest ideas that strengthen the rationale for funding projects like the one you propose. Both governments and philanthropic sources will respond if a sufficiently large portion of the scientific community shares this opinion.
As an additional point, would it be reasonable to say that if nonprofit AI systems elicit the participation of a large number of scientists willing to train them, they could potentially become better than the for-profit AI systems?
Noah: It is possible, but on the other hand the for-profit sector at the moment has access to a much larger amount of money that it can use to buy hardware. Hardware is not everything, and there are equally smart people in both the nonprofit and for-profit sectors, which have access to essentially the same scientific knowledge. The hardware is, however, a differentiating factor. For data, the picture is a bit more nuanced; Google has more data than anyone else, but quality is also important. The interaction data that OpenAI is gathering might turn out to be another differentiating factor. In my mind I do not frame this as a competition between for-profits and nonprofits. For-profits will follow advertising dollars and other sources of revenue, but we have different goals. We want to make scientific progress, we want to understand the models, we want to help humanity. We are driven by a different set of values. This does not mean that we are necessarily in conflict, but it means that the things we want to build are not the same. We can learn from for-profit scientists, we can talk to them, and many individual scientists in industry talk to colleagues outside their company, even if they do not share all the details. We try to influence them in order to avoid the worst possible outcomes, but fundamentally we have different goals, and we should not see this as a competition. The diversity of options will be beneficial for society.
Sofia: To reinforce the point that Noah made, I will add that several nonprofits are specifically interested in scientific use cases for AI models. If there are scientific concerns that are not tied to profit motives, the nonprofit side is likely to be much more responsive in developing solutions to address these concerns. Scientists can use multiple models, but this is a strength of the nonprofit side.
Guy: What is the limiting factor in developing better AI models?
Noah: The limiting factor is that we do not know what that means. AI is not one thing. We as humans pretend that we know what intelligence is and that we have a way to measure it, like an IQ test, but these are all tools that were created in society for practical purposes. They do not measure something that we are sure is a scientifically valid concept. Cognitive scientists will tell you that there are many kinds of intelligence that compete with each other. Intelligence is not one thing, and we do not know how to measure it. Similarly, researchers working in AI are not in agreement about what their ultimate objective is. I am not saying that if we clarified the goals we could just achieve them, but the lack of clarity definitely holds us back. As we said earlier, we do not even have great language for talking about the goals. We can make a lot of useful things, and this is how the field has progressed until now; the users have decided what sticks.
Guy and Giovanni: Dear Sofia, Zander and Noah, many thanks for a very interesting conversation!