Stanford University researchers found in a recent study that law professors preferred AI-generated contract law answers over those written by fellow professors about 75% of the time. In 2,918 blinded comparisons, 16 professors from 14 U.S. law schools selected Google's Gemini 2.5 Pro responses 75.92% of the time and NotebookLM responses 74.75% of the time over human instructor answers. The study tested whether large language models could align with professional standards of legal reasoning. It arrives as law schools and courts increasingly integrate AI tools into legal practice.
The participating professors came from schools including Stanford, Yale, New York University, the University of Chicago, Georgetown, UCLA, and the University of Virginia. They created 40 contract law questions covering legal doctrine, case law, hypotheticals, and policy issues. Researchers designed the evaluation to test AI capabilities in domains that require judgment rather than a single correct answer.
"Large language models (LLMs) are increasingly promoted as educational tutors, yet most evaluations focus on domains with a single ground truth," the researchers wrote. "Many disciplines, however, hinge on judgment: reasoning, weighing ambiguity, and reaching defensible conclusions. Law provides a sharp test."
Professors evaluated answer pairs in blinded comparisons, selecting the response they would rather give a student without knowing whether the answer came from AI or a human instructor.
Google's Gemini 2.5 Pro won 75.92% of its matchups against human instructors, while NotebookLM won 74.75% of the time. The researchers analyzed whether the results reflected broader professional consensus by examining agreement rates when professors evaluated the same answer pairs.
"Observed agreement exceeded the level expected if judgments were entirely idiosyncratic, indicating that the LLMs' success reflects alignment with common disciplinary criteria," the researchers wrote.
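The comparison the researchers describe, observed agreement against the level expected from idiosyncratic judgments, can be illustrated with a minimal Python sketch. This is not the study's method or data: the votes below are hypothetical, and the chance baseline assumes every vote is an independent draw at the overall AI-preference rate.

```python
from itertools import combinations


def observed_agreement(votes_by_pair):
    """Share of rater pairs that made the same choice, pooled over answer pairs."""
    agree = 0
    total = 0
    for votes in votes_by_pair:
        for a, b in combinations(votes, 2):
            agree += int(a == b)
            total += 1
    return agree / total


def chance_agreement(votes_by_pair):
    """Agreement expected if each vote were an independent draw at the pooled rate.

    Assumption for illustration only: two raters agree by chance when both
    pick AI (p * p) or both pick the human answer ((1 - p) * (1 - p)).
    """
    all_votes = [v for votes in votes_by_pair for v in votes]
    p = sum(all_votes) / len(all_votes)
    return p * p + (1 - p) * (1 - p)


# Hypothetical toy data, NOT from the study.
# Each inner list holds the votes of three raters on one answer pair:
# 1 = preferred the AI answer, 0 = preferred the human answer.
toy_votes = [
    [1, 1, 1],
    [1, 1, 0],
    [0, 0, 0],
    [1, 1, 1],
]

observed = observed_agreement(toy_votes)
expected = chance_agreement(toy_votes)
print(f"observed agreement: {observed:.3f}")
print(f"chance agreement:   {expected:.3f}")
```

When observed agreement is higher than the chance baseline, raters are converging on the same answers more often than independent coin flips at the same overall rate would, which is the pattern the researchers report.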
AI models outperformed human instructors across multiple categories, including recall questions about cases, code, or doctrine, as well as hypotheticals and policy discussions. To check whether the AI advantage stemmed from surface-level writing style rather than substantive content, the researchers analyzed lexico-syntactic features such as answer length, structural organization, reasoning nuance, legal anchors, confidence tone, clarity, and pedagogical support.
In a separate analysis of additional models, Anthropic's Claude Opus 4.7 ranked first, followed by OpenAI's ChatGPT 5.4 and Gemini 2.5 Pro. Every AI model evaluated outperformed human instructors on average.
AI-generated answers were flagged as harmful less often than those written by professors. Gemini recorded a 3.41% harmfulness rate and NotebookLM recorded 3.64%, compared with 12.06% for human instructors.
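For scale, the reported rates can be compared directly. The short Python sketch below uses only the three percentages quoted above; the variable names are illustrative, and the ratios are simple arithmetic rather than figures reported by the study.

```python
# Harmfulness rates as quoted in the article, in percent.
human_rate = 12.06
ai_rates = {"Gemini": 3.41, "NotebookLM": 3.64}

comparison = {}
for name, rate in ai_rates.items():
    comparison[name] = {
        # How many times more often human answers were flagged.
        "ratio": round(human_rate / rate, 2),
        # Absolute gap in percentage points.
        "gap_points": round(human_rate - rate, 2),
    }

for name, stats in comparison.items():
    print(f"{name}: human answers flagged {stats['ratio']}x as often, "
          f"a gap of {stats['gap_points']} percentage points")
```

On these figures, human-written answers were flagged roughly three and a half times as often as answers from either AI system.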
The researchers noted that the study did not measure whether answers matched each professor's individual teaching preferences. "While LLM responses are generally preferred over those of human instructors, our evaluation setting does not allow us to directly measure the extent to which instructor preferences are satisfied," the study stated. "It is at least theoretically possible that LLMs, although generally delivering stronger responses, still generate answers that are merely viewed as 'good enough.'"
The Los Angeles Superior Court began testing AI tools in March to help judges manage growing caseloads. Law schools are adding AI training programs as the legal profession integrates artificial intelligence.
"The potential benefits of these new technologies as a force multiplier in the practice of law just can't be ignored," Mississippi College School of Law Dean John P. Anderson told Decrypt. "Whether our students plan to be litigators or transactional attorneys, their future employers will expect familiarity with these AI tools. We want the firms hiring our students to be confident that every MC Law grad is competent in AI technologies."
Law firms continue to confront cases undermined by hallucinations and other AI-generated errors. In April, law firm Sullivan & Cromwell admitted to a U.S. bankruptcy court that a recent filing in a high-profile case contained fake citations generated by AI.
What percentage of the time did law professors prefer AI-generated answers over human-written answers in the Stanford study?
Law professors preferred AI-generated answers approximately 75% of the time in the Stanford study. Google's Gemini 2.5 Pro won 75.92% of its matchups against human instructors, while NotebookLM won 74.75% of the time across 2,918 blinded comparisons.
How did AI harmfulness rates compare to human instructor responses in the study?
AI-generated answers recorded lower harmfulness rates than human instructor responses. Gemini had a 3.41% harmfulness rate and NotebookLM had a 3.64% rate, compared with 12.06% for human instructors.
What AI tools is the Los Angeles Superior Court testing?
The specific tools have not been identified. The Los Angeles Superior Court began testing AI tools in March to help judges manage growing caseloads.