Assessing Large Language Models: A Multidimensional View of the Elephant
Attempting to assess a large language model (LLM)’s capabilities and behaviors is like the parable of the blind men and the elephant. In this long-told story, people holding onto different parts of an elephant try to describe the whole animal based on their limited information. In our case, the LLM elephant is truly enormous, oddly shaped, and barely perceptible in its size and depth. Each person’s individual perception is necessarily limited in the face of what we’re dealing with, so we need to galvanize and coordinate a collective effort: many more people, with different perspectives, touching different parts of the LLM elephant in different ways, to really begin to characterize and understand the beast.
LLMs are incredibly general; the input and output space is bounded only by what can be described in language. This means there are not only unknown-unknown use cases, but also unknown-unknown effects (behaviors, impacts, risks, and other outcomes) within each of those use cases. Searching through the unbounded, combinatorially exploded space of possible inputs and outputs is incredibly difficult, as is keeping that knowledge up to date as LLMs are modified and social behaviors evolve. (Yes, the elephant changes shape, e.g. as new LLMs are trained, fine-tuned, or plugged into new applications.)
Our current methods of evaluating LLMs are generally limited to pre-deployment red-teaming by contractors or experts (trying to prompt the model to see where it generates undesirable outputs), and model evaluations (usually running the model through a dataset of prompts, with a mark scheme for its output answers). This doesn’t quite address the problem of there being unknown-unknown use cases that a userbase of millions could discover, but that a small number of AI lab red teamers haven’t thought to try. Nor are these metrics deeply multidimensional enough to capture real-world impact. Model outputs are meaningful insofar as they may interact with the world — as in, we care about impact rather than model outputs in isolation. And only knowing possible outputs from red-teaming or prompt-based evals in the lab won’t get us far enough in understanding impact. (At this point, we might even realize that the “elephant” isn’t just the LLM itself, it’s the LLM combined with society. The thing we care about understanding is how models interact with the world.)
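To make the second of these concrete: a prompt-based model evaluation is, at its core, a loop over a fixed dataset of prompts, with a mark scheme applied to each output. The sketch below is a minimal, hypothetical illustration in Python; the toy dataset, the exact-match mark scheme, and the query_model stand-in are assumptions for the example, not any particular lab’s pipeline.

```python
# Minimal sketch of a prompt-based eval: run a model over a fixed set of
# prompts and score each output against a simple mark scheme.
# `query_model`, the dataset, and the exact-match rule are illustrative
# placeholders, not a real benchmark or API.

from typing import Callable

def exact_match(output: str, expected: str) -> bool:
    """Mark scheme: the output counts as correct if it contains the expected answer."""
    return expected.lower() in output.lower()

def run_eval(dataset: list[dict], query_model: Callable[[str], str]) -> float:
    """Return the fraction of prompts the model answers acceptably."""
    correct = 0
    for item in dataset:
        output = query_model(item["prompt"])
        if exact_match(output, item["expected"]):
            correct += 1
    return correct / len(dataset)

if __name__ == "__main__":
    # Toy dataset; real benchmarks hold thousands of items and richer rubrics.
    dataset = [
        {"prompt": "What is the capital of France?", "expected": "Paris"},
        {"prompt": "How many legs does a spider have?", "expected": "eight"},
    ]
    fake_model = lambda prompt: "Paris." if "France" in prompt else "Eight legs."
    print(f"Score: {run_eval(dataset, fake_model):.2f}")
```

Even this toy version makes the limitation visible: the harness can only score behaviors that its authors thought to put in the dataset, which is exactly the unknown-unknown problem above.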
If each effort only manages to shine light on a small part of the elephant, we should make sure that our efforts approach it in very different ways, and capture many possible use cases. We need many more types of people involved in setting the agenda for what use cases to study, and we need to glean information from studying real-world use data (more on this later).
There are many different ways of knowing that we’ll need to employ to fully understand even one particular effect for one particular use case. For example, to understand the capacity of LLMs for deception, we need not only engineers probing models with prompts, but also controlled tests in which real people interacting with LLMs may actually be deceived, so that we can measure the extent of that deception. What about how risky LLMs are for creating bioweapons? Expert red-teaming can figure out whether eliciting the right information from the model is possible. But we also need ways of knowing how likely people are to try this, how likely they are to succeed (given different phrasings, model settings, models, or safeguards applied), how it compares to other methods of obtaining the same information (e.g. Google), how resilient external systems are to this threat (e.g. what safeguards biology labs have against people calling them up asking for chemicals to be made), and more.
Similarly, consider the emotional effects of the use case where people become involved with AI-generated romantic companions. Prodding models before deployment will reveal some limited possibilities, but will not be enough to really understand or make any judgments about the real-world effects we care about; you’ll need studies by psychologists running RCTs on people’s psychological states as they develop these relationships. You’ll probably also want analyses from companies like Replika of their real-world usage data. We need to assess impact more comprehensively.
These systems are already out in the wild — we need assessment strategies that track what impacts are happening right now, and that let us extrapolate from what we observe. In general, we need more people measuring the elephant with different tools, and creating varied metrics with sufficient coverage, to build a meaningfully multidimensional view. The current ecosystem’s focus on pre-release scrutiny of model outputs means that it seems to be proportionally underinvesting in methods that study realistic scenarios, as well as messy, real-world outcomes post-deployment.
Post-release scrutiny is really not easy, of course. Many parts of the real-world elephant are difficult for human senses to access and to measure. Firstly, we cannot see how people are using these models: it’s impossible to get a sense of billions of private interactions with ChatGPT, and there are obvious privacy issues with exposing data from any LLM API. People are also able to run performant models locally now, without any data being sent to a server.
Secondly, even if we hypothetically had complete access to all data on all servers, our evaluations aren’t there yet for metricizing that data into meaningful statistics. Even with full model access, evaluations are often brittle and shallow. Even with real-world usage data, it’s hard to create metrics for natural language that reliably test what we want them to test. There’s also the difficulty of inferring the context the LLM is being used in, and what role it is playing in that context.
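As a small, hypothetical illustration of that brittleness, the snippet below applies a naive substring mark scheme (an assumption for the example, not a claim about any deployed evaluation) to three answers: a semantically correct answer that is phrased differently from the reference gets marked wrong, while the rule has no way to notice the extra context in another.

```python
# Illustration of metric brittleness for natural language: an exact-match
# rule marks a correct but differently phrased answer as wrong.
# The scoring rule and the example answers are illustrative only.

def exact_match(output: str, expected: str) -> bool:
    return expected.lower() in output.lower()

expected = "eight"
answers = [
    "A spider has eight legs.",              # scored correct
    "Spiders have 8 legs.",                  # correct, but scored wrong
    "Eight, unless the spider is injured.",  # scored correct; context ignored
]

for answer in answers:
    verdict = "pass" if exact_match(answer, expected) else "fail"
    print(f"{answer!r}: {verdict}")
```

More robust metrics (human rating, model-graded rubrics, task-specific classifiers) help, but each brings its own reliability and coverage problems, which is part of why evaluation at real-world scale remains hard.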
One important approach to this problem is collective intelligence for sourcing assessments and evaluations. The task of creating the evaluations and metrics we need is well suited to a diverse, collective effort, since LLMs are so general that a few people cannot catch all use cases or effects—the intelligence of the masses is needed—and because it requires multidisciplinary collaboration on different metrics, many of which have to be collected by non-AI-industry actors. We need to start thinking about how to set up a system of sensors “in the wild” for advanced AI impacts; CIP and collaborators, including Prof. Geoff Mulgan, Prof. Thomas Malone, Joshua Tan and Lewis Hammond, have called for a Global AI Observatory as part of our thinking here.
This is a critical problem. We can’t figure out what to do if we don’t first know what’s going on—same as with any other domain. If we can’t evaluate—in the true, full sense of that word—it is difficult to set or track policies meaningfully. Creating a multidimensional view of the elephant, by leveraging many different people’s input and diverse sources of knowledge, is essential for creating accountability, standards and rules for AI governance.
CIP plans to tackle a piece of this in concert with the UK Foundation Model Task Force in the coming months — stay tuned for more.
Written by Saffron Huang.