Teaching Assistants: Generative Misinterpretation
We covered Yonathan Arbel and David Hoffman’s Generative Interpretation two years ago. It seems only right that we give equal time to Generative Misinterpretation, by James Grimmelmann, Benjamin Sobel & David Stein, collectively, the Authors. Let me be clear: as the unofficial official blog of the AALS Section on Contracts, this Blog cannot take sides in a debate among Section members. Speaking in my personal capacity, I am rooting for the techno-skeptics. Having long since passed the point when upgrades to the technology on which I rely improve or simplify my life, I won’t mind if the whole system crashes and burns and we are forced to go back to reading books and talking to one another.
The Authors state their bold conclusions in their abstract: large language models (LLMs) are not yet ready for use in judicial chambers. “The superficial fluency of LLM-generated text conceals fundamental gaps between what these models are currently capable of and what legal interpretation requires to be methodologically and socially legitimate.” In their introduction, the Authors describe generative interpretation as “Potemkin interpretation: an attractive facade with nothing behind it.” They acknowledge that LLMs have their uses, even in performing or contributing to certain legal tasks. But interpretation is different, and LLMs are not yet up to it.
In the body of the paper, after a literature review and a discussion of LLM technology, the Authors explore two challenges that LLMs face: a reliability gap and an epistemic gap. Those who advocate for using LLMs for interpretation in legal contexts promote AI as more reliable, more consistent, and more efficient than humans acting alone. Hoffman and Arbel make only modest claims as to the usefulness of LLMs, using the technology to settle discrete interpretive disputes rather than expecting it to generate an entire opinion. The Eleventh Circuit’s Judge Newsom and the D.C. Court of Appeals’ Judge Deahl have used LLMs in a similar manner to unlock the meaning of disputed terms or to test common-sense understandings of phrases. But others go beyond this limited advocacy of generative interpretation and praise, or even market, LLMs’ capabilities as adjudicators.
The Authors then proceed to poke holes in the LLM-enthusiasts’ claims. Beginning with the examples used in Generative Interpretation, the Authors show that different LLMs yield different results in resolving interpretive questions. The differences are not subtle, and they are outcome determinative. One chatbot will tell you that interpretation A is one hundred times more likely than interpretation B. Another will say that B is twice as likely as A. The Authors note that Professors Arbel and Hoffman explore different approaches to how one might use LLMs. In the abstract, this seems like good experimental methodology. The problem is that each approach is a choice, and that choice can change the outcome. Sometimes, the Authors observe, an embarrassment of riches is still an embarrassment. Because the output of LLMs turns on often subtle differences in the prompts fed into them, LLMs are not reliable guides to deciding cases, the Authors conclude.
In the next section of the paper, the Authors discuss and reject four justifications that LLM-proponents offer in defense of the use of LLMs to decide cases or issues. They dub this “the epistemic problem”: should judges turn to LLMs to provide authoritative guidance on legal issues? LLMs do not engage in deduction. They operate through prediction, not through reasoning to a conclusion. The more sober advocates of LLMs recognize their limitations; the true believers remove all doubts by hypothesis, proclaiming that their AI model is “a perfectly neutral arbiter and interprets words with perfect mathematical accuracy.”
The proponents make a more informal argument: because LLMs have trained on the entire Internet, they can offer meaningful insights into ordinary meaning. This argument, taken alone, proves little. “Mere exposure to large amounts of natural-language text,” the Authors observe, “does not automatically confer authority about linguistic meaning.”
The proof of the pudding is in the eating, and so the Authors look for empirical evidence that LLMs are effective interpreters of legal texts. The work on that question has only just begun, but the Authors believe that they have already shown that, at least for now, LLMs are not reliable, because they produce wildly varying answers in response to small changes in the inputs. Attempts to replicate human results in legal decisionmaking with LLMs have thus far not been especially successful. One LLM proponent, Adam Unikowsky, fed an LLM the briefs and records in 37 Supreme Court cases and asked it to decide them. The LLM reached the same result as the Court in 73% of those cases, which is not very impressive. Unikowsky claims to find the LLM opinions “more persuasive” than the actual opinions, but Unikowsky’s subjective assessment is not the appropriate standard.
The Authors then explain why the use of LLMs in legal interpretation cannot be justified on the ground that LLMs provide persuasive reasons for their conclusions. It is not enough for the Authors that LLMs can replicate the outcomes of legal opinions or reach the same conclusions as actual judges. Nor is it sufficient that the opinions LLMs produce seem reasonable. Finally, the Authors are skeptical that what LLMs do is reasoning rather than mere prediction, and predictability alone is not the main goal of adjudication.
Overall, I am less persuaded by Generative Misinterpretation than I had hoped to be. I think the Authors score some easy points by discussing together the scholars and judges who are using LLMs for discrete purposes and the full-throated LLM-enthusiasts who are trying to sell products. The former make reasoned arguments in defense of a limited deployment of LLMs in legal interpretation. The latter make some reasoned arguments but also engage in a lot of sales puffery that casts the entire project into doubt. I think the piece would have been more powerful if it were two pieces, one devoted to academic and judicial efforts to use LLMs for generative interpretation and another devoted to the more ambitious arguments put forward by people who are trying to monetize the technology.
More generally, I think the Authors subject generative interpretation to death by a thousand cuts. It cannot be justified on this basis alone, nor on that basis alone. But its defenders do not offer a single defense of the usefulness of LLMs; they offer several, and the whole is greater than the sum of the parts. The more I read, the more I wondered what it would take to persuade the Authors that generative interpretation is useful. It’s not enough that the models yield predictable results. It’s not enough that they replicate the conclusions of actual courts. It’s not enough that LLMs generate persuasive opinions. Fair enough. None of those things is enough by itself, but taken together, they get us close to something useful, and the burden shifts to the Authors to explain why we should not simply weigh the benefits of LLMs (efficiency, predictability, persuasiveness) against the costs.
It is important to consider the socio-political context in which we are operating. People are losing faith in courts because judges’ ideological biases seem to overdetermine outcomes. People might prefer a system that would not always yield the same result but where the differing results are the product of something other than political ideology. Dispute resolution always involves risk. People might prefer that the risk lie in factors beyond our control. Changing the inputs in trivial ways might change the outcome that the LLM produces, but legal realism has long taught that outcomes may depend on what the judge had for breakfast. So long as the person in charge of the inputs cannot purposefully manipulate inputs to produce a pre-determined result, the parties to the dispute might prefer that level of uncertainty to a world in which the outcome of their dispute turns on the political preferences of the adjudicator.
That said, the Authors are not trying to pour cold water on the entire generative interpretation enterprise. LLMs can be used to draft opinions on either side of an issue, which can help an adjudicator to identify the more persuasive arguments. LLMs can adjudicate simple issues efficiently. They can perform rote tasks the correctness of which is easily verifiable. LLMs can be used by parties, subject to adversarial refutation.
The Authors hold up two models from which the advocates of generative interpretation need to learn. Courts make use of survey evidence in trademark litigation. The courts treat this evidence with skepticism, wary of its weaknesses, but they still consider it. Advocates have also used corpus linguistics, and some courts have experimented with it. The work of persuading people that these approaches have value is, as the Authors note, laborious and contentious. Generative interpretation may be on a similar path towards widespread use in dispute resolution. The Authors caution that its advocates should not try to skip the necessary steps that will render their models more reliable and more trusted in the long run.