Abstract: |
We report on our test of the Large Language Model (LLM) ChatGPT (GPT) as a tool for generating evidence of the ordinary meaning of statutory terms. We explain why the most useful evidence for interpretation involves a distribution of replies rather than only what GPT regards as the single "best" reply. That motivates our decision to use GPT-3.5 Turbo instead of GPT-4 and to run each prompt 100 times. Asking GPT whether the statutory term "vehicle" includes a list of candidate objects (e.g., bus, bicycle, skateboard) allows us to test it against a benchmark: the results of a high-quality experimental survey (Tobia 2020) that asked over 2,800 English speakers the same questions. After learning which prompts fail and which works best (a belief prompt combined with a Likert-scale reply), we use the successful prompt to test the effects of "informing" GPT that the term appears in a particular rule (one of five possible) or that the legal rule using the term has a particular purpose (one of six possible). Finally, we explore GPT's sensitivity to meaning at a particular moment in the past (the 1950s) and its ability to distinguish extensional from intensional meaning. To our knowledge, these are the first tests of GPT as a tool for generating empirical data on the ordinary meaning of statutory terms. Based on our results, we offer five lessons for using LLMs to generate empirical evidence of the ordinary meaning of statutory terms, a step toward developing a set of best practices. Legal actors have good reason to be cautious, but LLMs have the potential to radically facilitate and improve legal tasks, including the interpretation of statutes.
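
The sketch below is not the authors' code; it is a minimal illustration of the sampling approach the abstract describes: querying GPT-3.5 Turbo repeatedly with a belief prompt and a Likert-scale response format and tallying the distribution of replies. The OpenAI Python SDK usage, the model identifier, and the prompt wording are all assumptions for illustration only.

# Illustrative sketch (not the authors' code): sampling a distribution of
# Likert-scale replies from GPT-3.5 Turbo for one candidate object.
# Assumes the OpenAI Python SDK (openai>=1.0) and an OPENAI_API_KEY in the
# environment; the prompt wording is hypothetical.
from collections import Counter

from openai import OpenAI

client = OpenAI()

PROMPT = (
    "A statute regulates 'vehicles'. On a scale from 1 (definitely not) "
    "to 7 (definitely yes), do you believe a bicycle is a vehicle? "
    "Reply with a single number."
)

def sample_replies(n: int = 100) -> Counter:
    """Query the model n times and tally the Likert-scale answers."""
    replies = Counter()
    for _ in range(n):
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",  # assumed model identifier
            messages=[{"role": "user", "content": PROMPT}],
            temperature=1.0,        # keep sampling variation rather than only the "best" reply
            max_tokens=5,
        )
        replies[response.choices[0].message.content.strip()] += 1
    return replies

if __name__ == "__main__":
    print(sample_replies(100))

Running the sketch would produce a frequency count over the scale points (e.g., how often the model answered "7" versus "4"), which is the kind of reply distribution the abstract argues is more informative than a single answer.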