Figure 2 (page 6) screams problems. There's only 16 professors (3k comparisons each?!?!) and the professors are all over the place. That's very high variance, suggesting the study has no meaningful statistical power. Poor instructor 16 can't catch a break lol
There's also really clear bias given that the main results only feature Google models. Other models show up elsewhere, why not there?
I'm no lawyer, but I'm a pretty competent statistician and can confidently say this paper has a smell to it. I can't call it bullshit, but there are red flags all over
But is it a surprise law professors aren't great statisticians?
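One way to sanity-check the worry about the spread across the 16 professors is to treat each professor, not each comparison, as the unit of analysis. A minimal sketch, assuming made-up per-professor win rates (these numbers are NOT from the paper; they only illustrate the calculation):

```python
import math
import statistics

# Hypothetical per-professor win rates for the AI answer (NOT taken from the paper).
win_rates = [0.95, 0.90, 0.88, 0.85, 0.82, 0.80, 0.78, 0.76,
             0.75, 0.72, 0.70, 0.68, 0.65, 0.60, 0.55, 0.40]

def rater_level_interval(rates, t_crit):
    """Mean win rate and a t-based interval, using raters as the sample."""
    n = len(rates)
    mean = statistics.mean(rates)
    sd = statistics.stdev(rates)  # sample standard deviation across raters
    half_width = t_crit * sd / math.sqrt(n)
    return mean, mean - half_width, mean + half_width

# 2.131 is the two-sided 95% critical value of Student's t with 15 degrees of freedom.
mean, low, high = rater_level_interval(win_rates, t_crit=2.131)
print(f"mean={mean:.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

With thousands of comparisons per rater, a naive interval over all comparisons looks very tight, but the comparisons from one professor are not independent. The rater-level interval is the more honest one when the professors disagree this much.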
I don't have a similar intuition calibrated for what could go wrong when asking AI to draft a legal document. Some things seem harmless, e.g. drafting a will, but I don't really know; our legal system is notoriously rife with footguns.
Any lawyer who isn't using LLMs for research is behind the curve, though. They are unbelievable at finding niche cases you would never have found on your own. Previously it was a lot of exact search term matching, which is inherently useless for a lot of legal research. I need something that can search on vaguer terms, which AI can do incredibly well. Just check the results. I'm sure the LLMs from Lexis Nexis/Westlaw are probably better than the general purpose ones.
LLMs make fantastic paralegals. If you're doing any legal work, you should be using it, even if it's just to shoot ideas at. Have it play devil's advocate. My friend always has it play the other party's lawyer to see what all the counter-arguments are going to be.
Just like you would with software development. If you care about what you are creating, CHECK THE OUTPUT.
I liken it to me googling things as a sysadmin vs. Jane from accounting doing it. The non-tech end user is far more likely than I am, or than one of my help desk employees is, to make the problem worse or install something sketchy from the ad-riddled results.
I wouldn't trust myself to draft an important legal document using AI without the advice of a lawyer, much like I wouldn't really want to rely on my lawyer to use AI to write code for me.
> I think this is probably true for most skilled professions.
I agree, BUT I also find that it's easy for experts to atrophy quickly. When the AI is right 80/90% of the time it lulls you into overconfidence. I find those that are best and make the greatest use are the ones who remain skeptical but also use the tool. The same people who were already nuanced and picky before AI. The same people who already doubted and questioned their own work, and used that suspicion to help prevent them from having overconfidence in their own work. If you weren't willing to just "lgtm" with your own code, it's difficult to do that with AI.
(To be clear, I'm not saying perfectionists. Some might call them that because the picky people have higher standards, but a good expert has to also understand that perfection doesn't exist. That's often a driving force in the suspicion! This also tends to cause them to continually improve)
The danger of those mistakes creeping in also grows exponentially the farther a lawyer strays from their core legal expertise. There are a few statutes I know inside and out, and I can spot LLM analytical errors related to them in a split second, but once I venture out into domains where I am not an expert (but where I am nevertheless reasonably qualified to practice), it becomes much harder to spot drafting mistakes because I have not refreshed my own understanding of the law by reviewing the relevant cases or statutes as I would when drafting the analysis myself from scratch.
Yet that is exactly what a lot of C-Suiters (many of whom are lawyers), are doing.
i think devs overestimate their own role and underestimate others
i am seeing lawyers and doctors roll out their own software with AI
but we dont have their training and experience
I would imagine it's similar in law, in that it takes a lawyer or judge to know where the foot guns lie.
The time lag between drafting and "deployment" also makes for much less effective, much more expensive debugging loops. You can deploy your code to prod in seconds, see an error pop up in the logs, and immediately start debugging. But it will take at a minimum days and frequently as long as several years before an error in a contract or a court filing will be detected, and often the error is beyond correction at that point. Thus, the errors are both more difficult to detect and to resolve.
And the consequences of error are often much greater, both because they are not correctable and because a legal error may risk someone's life, liberty, or substantial property. Although that's not categorically the case, obviously bugs in certain safety critical systems can be as bad or even worse than legal mistakes. But in general, most software is lower stakes than most legal writing.
On the flip side, LLMs do seem to do a better job with basic style and structure for legal documents compared to code. Things like following IRAC format, citing assertions of law (although hallucination remains an issue), and writing comprehensible sentences. These would be the equivalents in code to best practices like good comments, cohesion, consistent use of design patterns, test coverage, clear variable names, DRY, etc. Although the better performance on those more qualitative metrics may just be because even the longest legal documents are typically simpler in structure and have fewer lines of text than a large, complex codebase. Or maybe it's because LLMs are trained on natural language text more than on code. Or because natural language is more forgiving than code, in that minor variation in diction or grammar is unlikely to have any significant effect on how the document is interpreted, whereas even single character errors in code can have enormous effects.
For murder that's not such a huge deal because the statutes are typically easy to track down and don't really differ all that much substantively, but once you get really into the weeds on something like commercial contracts it can be a huge pain to do cross-jurisdictional research.
And that's just a tiny, super obvious example of how impenetrable statutory law is, which isn't even the really pernicious problem. Case law is infinitely worse. It makes me absolutely furious how difficult legal research still is. The Westlaw/LexisNexis duopoly is a moral crime and wildly destructive to the quality of government in this country. Every single written court opinion should be publicly available for free on the internet in an easily searched format. It would cost practically nothing to achieve. We're talking about less text than Wikipedia hosts. Yet still many states make it almost impossible to access case law. Even though these cases are law. Binding law that we are supposed to follow, yet we cannot even easily access. It's insane, and largely perpetuated by the complacency of lawyers who can charge others for what should be free, the lobbying of the duopoly, and the incompetence of politicians.
If all of the laws were consistently available and stored in reasonable, consistent citation formats (I would settle for hyperlinking as a replacement for the rat's nest of wildly varying jurisdiction-specific citation systems), it would even be possible to introduce a form of unit testing for legal drafting that would allow us to automatically verify if the LLM hallucinated a citation.
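The "unit testing for legal drafting" idea could look something like the sketch below. It assumes a public, machine-readable index of real opinions exists; here that index is a hypothetical in-memory set, and the regex only recognizes the simple "<volume> U.S. <page>" form, which is a deliberate simplification of real citation formats.

```python
import re

# Hypothetical stand-in for a public, machine-readable index of real opinions.
KNOWN_CITATIONS = {
    "347 U.S. 483",  # Brown v. Board of Education
    "384 U.S. 436",  # Miranda v. Arizona
}

# Deliberately narrow pattern: "<volume> U.S. <page>" only (an assumption for this sketch).
US_REPORTS = re.compile(r"\b(\d{1,3}) U\.S\. (\d{1,4})\b")

def find_unverified_citations(draft, known):
    """Return every citation in the draft that is absent from the known index."""
    cited = [f"{vol} U.S. {page}" for vol, page in US_REPORTS.findall(draft)]
    return [c for c in cited if c not in known]

draft = (
    "Segregated schools are inherently unequal, 347 U.S. 483, "
    "and the rule was extended in 999 U.S. 123."
)
missing = find_unverified_citations(draft, KNOWN_CITATIONS)
print(missing)  # citations that could not be verified against the index
```

A check like this only catches citations that do not exist at all. It says nothing about whether a real case actually supports the proposition it is cited for, which still needs a human read.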
It also doesn't help that we (for what were at the time very good reasons) moved away from the system of legal writs that used to provide fairly standardized, almost "cut and paste" templates for legal filings. So now every legal document (filings, memos, contracts, court opinions, statutes) is drafted like a bespoke, artisanal creation with few strict structural or stylistic conventions. That makes automated interpretation much harder than it needs to be.
Such a document may not make a difference to the person that eventually will have died, but it can make or break the life of generations to come in countries that are so heavily optimized for dynasty building like the US.
Absolutely not harmless if you're the executor of an estate forced to deal with a screwed up AI will. I just handled my dad's estate this spring. It's a frustrating and confusing process even with the simplest of estates.
Median household net worth is in fact somewhere in the $100k-200k range, which is definitely something that could be meaningfully called an "estate." (Most of this tends to be the house, the median net equity in which is about $190k as of 2022).
Source: https://www2.census.gov/library/publications/2024/demo/p70br...
[1] This doesn't mean "homeowners," rather it's a recognition that assets for married or cohabitating couples are usually commingled.
An "estate" is a legal term for property, assets, and liabilities a person leaves behind upon their death. A family member is a top practitioner in the field of estate planning and resolution, and some of the messiest estates they have handled are pro-bono cases of exactly the type of people you would put in italicized "most people": poor, not really able to upkeep a house they inherited from a relative which hadn't had title properly transferred on a previous death because they didn't have money for an attny, now can't get a loan to fix the roof...
Yeah, if you are homeless, carless, and have only the clothes on your back and a shopping cart of stuff, you don't have an estate. Everyone in the middle class in the US has an estate. Much of the time it passes automatically to their spouse on death, but it's still an estate.
And if you are concerned about where it goes, get a GOOD attny. There are many bad ones hanging out their shingle as "Trust & Estate" attnys, and some of the next messiest cases are fixing problems made by those not-so-good attnys.
And NO, AI is not good enough.
e.g., https://www.npr.org/2026/04/03/nx-s1-5761454/penalties-stack...
can't get more foot gun than "well according to [fiction] it is a well established practice (that the defendant is guilty)"
One thing I learned, just bite the bullet and re-write the whole fucking will instead of making riders.
Piecing the will together from riders was terrible. All the clauses fell away as everyone got older. The final will could have been 8 pretty clear pages.
The other part that is hard is just knowing all of the things that happen with assets and a passing. Luckily we had another lawyer and financial folks to advise us. It was still a lot and not that easy to find details. This was pre-AI, which would have helped walk through his shit.
I don’t know if that’ll be true for long. I just had my colleague who’s a very competent engineer IMO hand me a frontier model vibed PR to review (after reviewing it himself, he claims) which contained random variable assignments, conditionals that do nothing, etc. He’d never do such a thing before. People become too comfortable and get confirmation bias as well.
Or worse, use historical data to determine the laws of today.
TL;DR It's never a good idea and it will bite you.
1. https://finance.yahoo.com/news/valve-wins-trial-against-pate...
In the framing of using LLMs as legal tutors, with the implication of lowering the cost of legal training, this seems like a socially-positive outcome. Furthermore, it feels kind of intuitive to me that any contemporary system operating with an LLM and access to legal reference material will be prepared to answer _student-originated questions_ comprehensively and with breadcrumbs or direct references to educational/source materials, as seems to have been found in the study.
The authors explicitly and intentionally emphasize that many legal questions require contextualization, as opposed to some discrete calculated answer. The result of the study implies that the LLM-based systems were capable of using what many of us here understand to be the "stochastic best-fit algorithmic generation" of a contemporary language model to adequately contextualize a student's question, providing insight into the trade-offs or complications implicit in the question, while then, critically, _meeting the professional standards of legal educators in explaining that complexity to a student_.
Realistically, I would hope this provides some confidence to readers of HN that they can actually ask a legal question to an LLM and expect the response will explain the complexity of the law in relation to the question. This is great news, and is likely the minimal pre-work any of us should do before actually consulting a lawyer, if time permits.
On the other hand, I do _not_ think that this study provides any indication that an LLM is prepared to actually provide direct legal counsel. Possibly in the same way that a legal textbook does not replace legal counsel, or perhaps more accurately, the same way that stumbling upon a legal case study for approximately the same situation you're in doesn't guarantee you'll have the same result.
This is a pretty limited introductory course based on what it says in the methods of the paper itself.
EDIT: just found out that Google is a major donor to HAI. So this research is at least partially funded by Google. Which is probably the reason the authors fail to declare no conflict of interest.
The quality of LLMs depends heavily on, among other things, how you word your questions.
Knowing the correct questions to ask is not something most students know how to do given that it tends to require a fair bit of pre-existing domain knowledge.
There are however LLM context building techniques that anchor completions in data structures that persist the structure of claims that support the conclusion contained in a completion. Lots of different patterns exist —organizing logic in language is a rich domain— but the one I’ve liked the most is something called a Claim Dependency Graph that models the relationships between atomic claims as graph edges.
There’s a whole suite of operations you can perform on these structures, and “reconstruct how you came to this conclusion” is absolutely one of them.
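As a rough illustration of the "reconstruct how you came to this conclusion" operation, here is a minimal sketch. The data layout (a dict mapping each claim to the claims it depends on) and the example claims are assumptions made for illustration, not a definition of how Claim Dependency Graphs are actually specified or stored.

```python
# Each claim maps to the claims it directly depends on (hypothetical content).
claim_graph = {
    "contract is enforceable": [
        "offer was made",
        "offer was accepted",
        "consideration exists",
    ],
    "offer was accepted": ["signed email dated after offer"],
    "offer was made": [],
    "consideration exists": [],
    "signed email dated after offer": [],
}

def reconstruct_support(graph, conclusion):
    """Return the claims supporting `conclusion`, dependencies listed first."""
    ordered, seen = [], set()

    def visit(claim):
        if claim in seen:  # already handled; also guards against cycles
            return
        seen.add(claim)
        for dep in graph.get(claim, []):
            visit(dep)
        ordered.append(claim)  # emit a claim only after everything it rests on

    visit(conclusion)
    return ordered

chain = reconstruct_support(claim_graph, "contract is enforceable")
for step in chain:
    print(step)
```

The point of keeping the structure outside the model is that the trace comes from the stored edges, not from asking the model to explain itself after the fact.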
Model interpretability work has advanced a lot. Arguably we already can explain AI decision-making better than human brains.
The point is familiar but there are good illustrations in the Atlantic article by a book editor. At first it seems abstract AI hate, but then she gets to the details. AI text cannot be edited. https://www.theatlantic.com/technology/2026/05/how-to-tell-a... or https://archive.ph/YJsGK
Asking the LLM in a way that makes it annotate its sources can greatly increase the pattern matching to closely simulate logic, just like in humans.
I understand the "why did you say this, not that" question; I have seen other ways of asking it which do not seem to trigger the LLM's over-response in the other direction.
If a person using the service is given inaccurate legal advice and acts on that advice, the person can't be charged with a crime, can't be given any civil penalties, etc., as long as the law in question is non-obvious.
Obviously if by some exploit, some fundamentally obvious crime (murder, theft, obvious fraud, etc.) is said to be legal, that wouldn't apply, but of course the service should try to prevent those kinds of exploits anyway.
Could limit this to something like business regulations to begin with, or even specifically for small businesses, or contracts within some time limit and dollar amount that would otherwise be coverable by small claims court, etc.
Julian Nyarko
Professor of Law
Co-Chair Stanford Law AI Initiative
Senior Fellow, Stanford Institute for Human-Centered AI (HAI)
LOL! I killed my Arch installation and was stuck at the GRUB prompt. Unwilling to brush up my rusty knowledge of GRUB syntax, I asked Gemini for help. The commands Gemini suggested would have wiped my hd...
Once Gemini was told that I was using BTRFS, the suggestion from Gemini looked a bit more sane, but still looked incorrect to me.
It was only after I informed Gemini that I was using an NVMe drive with BTRFS that it finally produced a sane command.
NotebookLM was considered slightly better than 2.5 Pro by the evaluators.
That's the problem: you never know when the 25% will deliver a true stink bomb, and that's not considering prompting. While a fair prompt/question may be considered objective, it's very easy to stray.
But imagine if a dev team didn’t have to go engineer -> product manager -> legal team to get a question answered on local data retention requirements. You could ship that much faster.
you can get away with anything
If the only purpose of asking a lawyer is transferring risk (aka cover your ass) while getting the same advice as an LLM, that’s slowing down delivery for purely bureaucratic reasons.
I’ve seen that mentality at big companies where everyone is scared to stick their neck out and be accountable for a decision. And nothing gets done. Drives me crazy.
But the people who move up are the people who take ownership and get shit done (and are right a lot).
(BTW, I have been at companies that were sued by regulators. They never really punish the individual(s) who were in the room when the decision was made. So your worry is kind of misplaced.)
THEN I find a human lawyer and give AI's answers to them and say "Can you find any errors in this? Can you improve it?" .
That way I think my legal bills should be smaller because the AI has already done most of the work. What do you think? Which LLM is best for legal work?
Please see attached contract we received from [counterparty]. ChatGPT says blah, blah and blah should be revised. What do you think? Is there anything else that we should change?
It's a bit like with doctors: you'll want a second opinion, if you can afford it.
Probably for important deals, detailed human review will be expected.
Maybe the real value-add will be the insertion of language that LLMs won't be able to figure out, but which will be favorable for the side that inserted them.
i do second phase on codex, by asking to download all pdfs and extract all text of laws it references. can repeat fully local research step.
after i ask gemini to find issues and criticize.
UPDATE: there are many legal skills on GitHub to try; I haven't used any yet
75% win rate seems pretty good!
Paper link: https://law.stanford.edu/wp-content/uploads/2026/06/salinas_...
There's been a lot of news stories about lawyers using AI, and then getting in trouble for citing hallucinated laws or cases. It doesn't matter if the AI response is "preferred" over the human one if it gets thrown out when put under the scrutiny of a real case.
I don't think there will be any such market for "non ai" law. If I'm involved with the legal system I just want out as quick as possible as cheap as possible.
A bit of extrapolation from the study, but not a crazy stretch.
But I could also see a world where that, too, is fed to models for hyper-local results.
Could be a way off, but I could see it.
Even the good ones will not step above and beyond what they are paid to do
but an AI ? it will and can go above and beyond
Given the number of responses the professors were asked to rate (200 each), they probably graded them the same way that bar exam responses are graded: quickly and superficially. Not surprising that LLMs achieved higher scores in this scenario, since they excel at producing superficially nice answers that don't hold up under scrutiny.
Also...unless statistics has changed in the past 2 decades, the math in the charts doesn't math. That's probably why they're leaving out the actual numerical data. I also wouldn't be surprised if we learn in the coming days that the charts were AI generated.
The inaccessibility of justice is a huge driver of inequality. Any tools which bridge this gap will help make a more just society.
I think, in the right hands, this could be huge.
The authors point out that this other metric was computed in prior work and incorrectly dismiss it as being not as good as winning percentage in head to head competitions. The cited prior work shows that the models fare poorly on that metric. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5166938
I'm getting more convinced. I mean, sure it makes dumb mistakes sometimes, but it's a particular set of self-serving mistakes, like commenting out tests in order to pass. We obv don't want this behavior but I wouldn't say it's dumb.
It'll be like the Turing test, which we just blew past years ago and no one cared. After all the hand-wringing about sentience and rights of the AI if it passes the Turing test, and now we just have AI bots running 24/7 writing slop.
How does everyone else feel?
He stands to make billions if enough people believe him — unless you also do, consider that you’re the mark. For example, if that was true, it would have to mean that AI companies either aren’t letting customers use the good models or are instructing them to frequently make errors which reveal a fundamental lack of reasoning ability.
Consider also that his wealth means he hasn’t had to defend an idea stringently since the 90s. I wouldn’t be surprised if he does think LLMs give deep answers because it often looks that way until you critically review the response and ask questions like what’s missing which require you to have a decent understanding of the problem domain.
I also think it’s easy to think that AI gives good answers if you don’t know the field well. In fields where I know the material, the answers are pretty variable and can be quite bad.
Humans have the advantage of perspective. We always lack some knowledge and answer broadly. This is bad if you have a particular goal in mind, but better if you're just generally learning, because you see more and learn to discriminate the correct from the wrong. And most importantly, being wrong is part of human ingenuity - because sometimes we turn something "obviously" wrong into something right.
Investor with vested interest in AI companies makes claim of reaching "AGI".
He is one of the last people to listen to about AGI. Unless the term "AGI" means something entirely different to him vs to independent researchers vs to CEOs, since the term has become entirely meaningless.
https://fortune.com/article/rise-in-elite-students-seeking-a...
and where they wanted to ban words such as "chief", "stupid", "karen" and "American"
https://reason.com/2022/12/21/stanford-elimination-harmful-l...
Stanford and its donors of course want to replace anyone but its administrators, so they cheer on such anti-intellectual nonsense.