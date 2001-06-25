from the GIGO dept.
Lawyers representing Anthropic recently got busted for using a false attribution generated by Claude in an expert testimony.
But that's one of more than 20 court cases containing AI hallucinations in the past month alone, according to a new database created by French lawyer and data scientist Damien Charlotin. And those were just the ones that were caught in the act. In 2024, which was the first full year of tracking cases, Charlotin found 36 instances. That jumped up to 48 in 2025, and the year is only halfway over. The database, which was created in early May, has 120 entries so far, going back to June 2023.
A database of AI hallucinations in court cases shows the increasing prevalence of lawyers using AI to automate the grunt work of building a case. The second oldest entry in the database is the Mata v. Avianca case which made headlines in May, 2023 when law firm Levidow, Levidow & Oberman got caught citing fake cases generated by ChatGPT.
The database tracks instances where an AI chatbot hallucinated text, "typically fake citations, but also other types of arguments," according to the site. That means fake references to previous cases, usually as a way of establishing legal precedent. It doesn't account for the use of generative AI in other aspects of legal documents. "The universe of cases with hallucinated content is therefore necessarily wider (and I think much wider)," said Charlotin in an email to Mashable, emphasis original.
"In general, I think it's simply that the legal field is a perfect breeding ground for AI-generated hallucinations: this is a field based on load of text and arguments, where generative AI stands to take a strong position; citations follow patterns, and LLMs love that," said Charlotin.
[...] That said, Charlotin said the penalties have been "mild" so far and the courts have put "the onus on the parties to behave," since the responsibility of checking citations remains the same. "I feel like there is a bit of embarrassment from anyone involved."
(Score: 5, Funny) by Tork on Monday June 02, @07:40PM (2 children)
Heh yeah, before all this lawyers had an impeccable reputation.

(Score: 4, Touché) by JoeMerchant on Tuesday June 03, @03:24AM
Before AI, they had legal secretaries and interns and flunkies who would do their research for them... results might actually be improved using 100% AI in many legal offices.

(Score: 2) by DannyB on Wednesday June 04, @02:18PM
Not all lawyers are bad. Some lawyers are hard working and honest. In fact, I would estimate that 99% of lawyers give the rest a bad name.
Holy back pains, this applies equally to lawyers having a very delicate back.
They tried to strike down Murphy's Law as unconstitutional, but then something went wrong in the process.
(Score: 4, Insightful) by SomeGuy on Monday June 02, @10:40PM (13 children)
So how does this compare to errors and intentional fudging prior to all of this?
If people do not want to do their job now, then were they doing it before?
(Score: 1) by anubi on Tuesday June 03, @12:01AM
"If people do not want to do their job ..."
Why do they still have that job?
(Score: 4, Interesting) by Anonymous Coward on Tuesday June 03, @12:04AM
> If people do not want to do their job now, then were they doing it before?
Somehow, it seems harder for a person to make up a realistic sounding citation than to look up and cut/paste a real citation. My guess is that these fake "AI" generated citations are a new thing that courts are going to have to adapt to. Perhaps right now someone is working on an exhaustive listing of cited cases, for use with reality-checking software?
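The "reality-checking software" idea can be sketched in a few lines. This is only a minimal illustration under loud assumptions: `KNOWN_CITATIONS` is a hypothetical stand-in for a real index built from court records or a commercial legal database, the regular expression covers only a simplified "volume reporter page" form, and `unverified_citations` is a made-up helper name. A citation missing from the index is flagged for a human to check, not declared fake.

```python
import re

# Hypothetical "known good" index. A real one would be built from court
# records or a commercial legal database, not hard-coded like this.
KNOWN_CITATIONS = {"550 U.S. 544", "556 U.S. 662"}

# Simplified pattern for "volume reporter page" citations, e.g. "550 U.S. 544".
# Real citation formats are far more varied than this.
CITATION_RE = re.compile(r"\b\d{1,3} (?:U\.S\.|F\.2d|F\.3d) \d{1,4}\b")


def unverified_citations(text, known=KNOWN_CITATIONS):
    """Return citations found in text that are absent from the known index."""
    found = CITATION_RE.findall(text)
    return [c for c in found if c not in known]


# The middle citation is a placeholder invented for this example.
brief = "See 550 U.S. 544 and 123 F.3d 4567, as well as 556 U.S. 662."
print(unverified_citations(brief))  # prints ['123 F.3d 4567']
```

A check like this only confirms that a citation exists in the index. Whether the cited case actually says what the brief claims still needs a person to read it.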
Pet peeve - sometime back I decided I wasn't going to anthropomorphize recent "AI" software. So I never use words like hallucinate (which is a human condition) when describing computer errors.
For the most hardcore version of this position, see David Parnas on actual software engineering. Here's a recent keynote address that rang true for me. I normally don't recommend long videos, but this one I found excellent and worth my time,
https://www.youtube.com/watch?v=YyFouLdwxY0 [youtube.com]
A cautionary video for anyone considering using "AI" for anything of consequence.
(Score: 5, Interesting) by Thexalon on Tuesday June 03, @01:48AM (10 children)
As long as the consequences are "everybody laughs, and the judge tells them to send it back right this time", more and more lawyers and firms will do this because it's cheaper and easier than paying a lawyer or paralegal to get it right. However, I'm reasonably confident the consequences were not always that mild, at least back in the day.
A story from the 1980's: My mother got chewed out in court in front of everybody by future Supreme Court justice David Souter because Souter thought she had gotten a citation incorrect in her brief. She promised to re-check it, did so, discovered she had been right all along, and sent in a photocopy of what she had cited complete with page number and highlighted the correct section. She got back an apology, I think signed by Souter himself, that explained that he had accidentally checked it against an older edition of the law books which had since been replaced by the right ones. She always regretted not keeping that letter as a souvenir.
(Since he died recently, I'm going to eulogize him a bit: The above story tells you a few things - he had checked, himself, whether the citation was right, cared about getting it right, and had the integrity to admit when he was the one who had been wrong. My mother described him as the best judge she ever argued in front of with a knack for cutting right through the BS from both sides with clear questions, and having read some SCOTUS opinions I'm pretty sure he was the best judge on the court during his time there. For example, when deciding Bush v Gore he was the only justice to write an opinion that I'm confident would have been identical had the positions of Bush and Gore been reversed. He by all accounts hated the degree to which politics decided SCOTUS cases rather than facts and law.)
For the judiciary to do its job well, we need more judges like that who actually give a damn about getting things like that correct. At every level, from traffic court all the way to SCOTUS. Unfortunately, judges like that have a tough time getting onto the bench, because in the case of appointed judges politicians often consider them "unreliable" (since they won't always decide in favor of the politician or their party) and in the case of elected judges voters tend to think they're "soft" when they penalize prosecutors for BS and help innocent people go free.
"Think of how stupid the average person is. Then realize half of 'em are stupider than that." - George Carlin
(Score: 2, Insightful) by anubi on Tuesday June 03, @04:18AM (9 children)
It takes a far bigger man to admit / apologize for his own mistakes than one that only finds things others screwed up on.
But blindly trusting an AI to do legal work would, to me, be akin to placing so much trust in a circuit simulator that I go directly to production, never bothering to build a prototype to make damned sure the physical device does what I had in mind.
Especially knowing they are prone to hallucinations.
"Prove all things; hold fast that which is good." [KJV: I Thessalonians 5:21]
(Score: 5, Insightful) by pTamok on Tuesday June 03, @11:21AM (8 children)
AIs (LLMs) are not 'prone to hallucinations'.
They hallucinate everything - it is how they work, and what we interpret as 'hallucination' is 'simply' the algorithm giving the calculated 'best' response, in each case. 'Best' is some maximisation function for the choice and arrangement of tokens. There is no thought involved. Often enough, the output is not commensurate with reality. For this reason, all the output cannot be trusted - it has to be checked thoroughly by experts.
If it were a paralegal or intern, you give strict instructions that sources/citations are to be checked against a 'known good' source, such as one of the commercial legal databases, or court records: the person knows that if they get it wrong, they can be fired. The AI (LLM) doesn't care. It has no 'skin in the game'. It has no emotions, it has no long term memory, it doesn't learn after the initial (expensive) training, it doesn't have household bills and food to buy. What are you going to do - turn it off? It can be reactivated just as easily.
I hope the 'AI' bubble will soon be recorded in the annals of history alongside Tulip Mania [wikipedia.org] and the South Sea Company [wikipedia.org].
Note: there are things that LLMs are good for. But they are not now, have never been, and never will be intelligent. The fact that people believe that they are tells you a great deal about human gullibility and capability for self-deception.
(Score: 2) by stormreaver on Tuesday June 03, @02:25PM
I agree with you completely. It's ironic that AI is one of the roads to idiocracy.
(Score: 4, Insightful) by shrewdsheep on Tuesday June 03, @03:56PM (6 children)
Our neuronal activation patterns are stochastic, much as the generation process of LLMs is. Therefore, I believe that your characterization is not helpful for understanding the problem. LLMs learn distributions of sentences (word chains, really). Apparently, referencing sources is what cannot be learned this way. References are single or few data points, and the available data is insufficient to capture them through distributions. This is what hallucinations are about: when there are only one or a few possible continuations, LLMs still smear out the distribution and produce incoherent results.
(Score: 2, Insightful) by pTamok on Tuesday June 03, @07:04PM (5 children)
Do you work with LLM software/hardware?
If you do, then I'd like you to expand on your statement that the generation process of LLMs is stochastic. As far as I know, it is not, although people do term LLMs as 'stochastic parrots'. LLMs are also not 'Markov chains on steroids'.
As I understand it, although stochastic techniques are used as part of the training process it is only part of the process.
All outputs of an LLM are equivalent: there is no 'hidden function' determining if they are truthful or not, or making any other distinction. As a result, there is no distinction between 'hallucination' and 'truth' - the output is the result of a token-processing exercise that generates, for want of a better term, a 'most likely' or 'best' output that meets the selection criteria defined by the underlying trained model. While you can tell an LLM that its output is at fault, and it will 'apologise', it does not update its model (this is far too computationally expensive) - if you are lucky it updates the context of the prompt (to be fair, contexts can be quite large), but once you exceed its context, it will continue to produce output not commensurate with reality as we see it.
While neural networks are based upon our models of physiological neuron networks, they are not functionally equivalent: there is much we still do not understand about neuron (and other cell) function in the brain, so to claim equivalence between an LLM or other neural-net-based model and the brain would be 'brave'.
If you want a vector of tokens describing a reference to be stable, then it has to be somehow protected in the learning/model training process (which is deliberately stochastic), which means that the training process has to be able to distinguish, ad hoc, which vectors are to be labile (and capable of 'summarisation') and which to be stable. I don't think this is currently achievable. The end result is that although the LLM may be well trained on the form of citations (lots of citations have the same format), it is likely to be unreliable regarding the content.
All the output is an hallucination that follows the form of written text (or image, or video), but the detailed content is 'fuzzy'. This is inherent to the training process.
Human memory is also 'fuzzy', yet we are capable of consistently producing text that is logical, factual, and with correct citations. There is more going on than we find in LLMs.
(Score: 2, Informative) by pTamok on Tuesday June 03, @09:15PM (1 child)
Background reading:
The Essential Guide to Tokenization for Large Language Models [tnt.studio]
(Score: 1) by pTamok on Tuesday June 03, @09:32PM
More on tokenizers
Association for Computational Linguistics: Findings of the Association for Computational Linguistics: NAACL 2024 - Tokenizer Choice For LLM Training: Negligible or Crucial? (DOI: 10.18653/v1/2024.findings-naacl.247 ) [aclanthology.org]
(Score: 1) by shrewdsheep on Wednesday June 04, @06:33AM (2 children)
LLMs are next token predictors (stochastic parrots). For every token to be produced a loaded die is thrown to choose from the permitted tokens (~ 10k) the one to put into the output (much like our own brain does). The "temperature" is a parameter controlling how loaded the die is. This is why the prompt is important as is the reinforcement fine-tuning (performed during training) to guide the emission of tokens. We discussed GPT-1 and GPT-2 on SN and back then I couldn't see how it could ever be useful. Yet, it turned out that LLMs do seem to generalize and to generate meaningful output on as yet unseen prompts. These can be called hallucinations as being pure inventions but when they interpolate previously seen knowledge, arguably, they can be seen as being valid. When LLMs veer off and make up wrong facts not based on input given at training, those are usually defined as hallucinations. All this happens as the result of a random walk in token sequence space.
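The "loaded die" and "temperature" description can be made concrete with a toy sketch. Everything here is an assumption for illustration: the three candidate tokens, their raw scores (`logits`), and the helper names `softmax_with_temperature` and `sample_token` are invented, and a real model scores a far larger vocabulary. The only point shown is how dividing the scores by the temperature changes how heavily the die is loaded.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(tokens, logits, temperature, rng):
    """Throw the 'loaded die': pick one token according to its probability."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]


# Hypothetical candidates and scores for a single step of generation.
tokens = ["v.", "et", "and"]
logits = [2.0, 1.0, 0.5]

cold = softmax_with_temperature(logits, 0.2)  # heavily loaded toward "v."
hot = softmax_with_temperature(logits, 5.0)   # close to a fair die

rng = random.Random(0)
print(cold, hot, sample_token(tokens, logits, 1.0, rng))
```

With these made-up numbers, the low temperature puts almost all of the probability on the top-scoring token, while the high temperature spreads it nearly evenly across all three.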
(Score: 3, Informative) by pTamok on Wednesday June 04, @11:09AM (1 child)
Thank you for your extended reply: I appreciate it. I have some comments.
LLMs don't predict. They generate the next token of output according to a set of rules acting upon parameters in a model populated by a corpus of information during the 'learning' phase. Given the same model and the same input prompt, you should get the same output, unless randomisation has been added in to the output process to make it 'seem more natural' - they are in fact deterministic.
Do you have a reference/citation for non-paywalled description of that working of the brain? I would be interested to read it.
Note that the emission of tokens is determined by the statistical distribution of tokens within the training corpus. The model is not updated in real-time, so you need to be aware of the training-data cut-off date if your query is about events after that cut-off date. If you are not aware, it can produce odd results.
When evaluating which token in a sequence to emit next, there will typically be several tokens with similar statistical likelihoods of being the successor. At this point you can 'randomize' which token from the set of 'most likely within a cut-off limit' comes next: but at each decision point the likelihoods for each are fixed in the model. I guess it would ruin the illusion if, every time you posed the same question in a prompt, you got exactly the same answer back. Is this randomization 'creativity' or obfuscation?
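A rough sketch of that decision point, assuming a fixed table of next-token likelihoods. `NEXT_TOKEN_PROBS` and its values are invented for illustration, and `greedy` and `sample_top_k` are hypothetical helper names. Always taking the most likely token gives the same answer every time, while sampling from only the top `k` candidates is one common way of randomizing within a cut-off.

```python
import random

# Hypothetical fixed likelihoods at one decision point. In a real model these
# come from the trained parameters and the prompt so far.
NEXT_TOKEN_PROBS = {"Cir.": 0.40, "Ct.": 0.35, "App.": 0.20, "banana": 0.05}


def greedy(probs):
    """Always emit the single most likely token (fully repeatable)."""
    return max(probs, key=probs.get)


def sample_top_k(probs, k, rng):
    """Pick randomly among the k most likely tokens, weighted by likelihood."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    candidates = [token for token, _ in top]
    weights = [p for _, p in top]
    return rng.choices(candidates, weights=weights, k=1)[0]


rng = random.Random()
print(greedy(NEXT_TOKEN_PROBS))                # always the top entry
print(sample_top_k(NEXT_TOKEN_PROBS, 2, rng))  # one of the two likeliest
```

The likelihood table never changes between calls. The only source of variation is the random draw among the candidates that survive the cut-off.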
Interpolation means 'making up data that is not there' - usually using a method people regard as sensible e.g. on graphs. Interpolations are dependent on the assumptions used in choosing the interpolation method, both explicit and implicit. All LLM output is determined by the statistical properties of the training corpus (this is an over-simplification, but it will do for now) - all the output is in fact 'interpolated' from the data in the corpus. It is all equivalent: there is no 'non-hallucinated' data - it is all 'made up' according to the statistics of the training corpus.
LLMs don't make up wrong facts - they make up facts all the time: it just so happens that the training corpora used contain predominantly correct facts, so outputs tend to be predominantly correct. This is why 'sanitising' training corpora is so important. There is very little semantic difference between a conspiracy theory reported as fact by someone who believes it and the truth reported by someone else, so it is no surprise that LLM output will contain conspiracy theories reported as fact, and other output that does not correspond with reality.
A random walk in token sequence space will produce gibberish, because that will ignore positional encoding: the way in which the data analysis process attempts to capture semantic meaning within the model. The necessary simplifications required to reduce the model to a manageable size mean that fine differences in semantic meanings are lost, and they cannot be regenerated on output. Sometimes, the fine details are important, and can materially alter the meaning as interpreted by humans.
The choice of initial tokenization process populating the model influences the output.
The positional encoding method chosen will inevitably fail to encode all the relationships within the training corpus. This throws away semantic information.
Reduction/simplification of the model to make it manageable in size also throws away semantic information.
Tokenisation of the prompt is also subject to variation according to the procedure chosen.
The model has a cut-off date for data in the training corpus. Some LLMs can interrogate 'the Internet' and other sources to answer prompts, but the information obtained is not added to the model, as that is too computationally expensive. The model cut-off date can be important.
There is no process within LLMs for evaluating the truth or falsehood of the output. The output is in accordance with the encoded semantics within the model, modified by whatever 'system prompt' and other prompts that are used to generate output.
(Score: 2, Insightful) by pTamok on Wednesday June 04, @08:51PM
A gift that keeps on giving: LLMs are unreliable when it comes to arithmetic.
https://wandering.shop/@oli@olifant.social/114625905690269861 [wandering.shop]
This is not news (see the discussion), but the interesting point I picked up was:
An LLM is a fixed 'snapshot' of the training corpus - it does not get updated or modified by answering prompts. If you expect to get different answers from the same prompt, where does the difference come from? The way this is handled is by mixing in some 'randomness', usually by using a PRNG, which can be seeded by the current time, to ensure that the same prompt gives different (but hopefully closely related) answers each time.
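The role of the PRNG seed can be shown with a toy generator. This is a sketch under stated assumptions: `VOCAB` and `generate` are invented for illustration, and a uniform random choice stands in for the model's fixed, prompt-dependent likelihoods. What it demonstrates is only that a fixed seed reproduces the same output, while a seed taken from the clock will usually give a different one.

```python
import random
import time

# Hypothetical tiny vocabulary. The uniform choice below is a stand-in for
# the fixed likelihoods a trained model would supply.
VOCAB = ["the", "court", "held", "that", "see", "also"]


def generate(seed, length=5):
    """Produce a token sequence whose only source of variation is the seed."""
    rng = random.Random(seed)
    return [rng.choice(VOCAB) for _ in range(length)]


same_a = generate(seed=1234)
same_b = generate(seed=1234)               # identical to same_a
time_seeded = generate(seed=time.time_ns())  # usually differs run to run

print(same_a == same_b)
print(time_seeded)
```

Nothing about the "model" differs between the calls. Only the seed does.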
People, on the other hand, don't fill up their memories in their youth, then rely on answering questions by only using what they learned before a cut-off date. If you met someone who did, would you trust them to give you correct answers to questions?