With 1000 rows and 100 samples and markdown-kv, I got these scores:
- gpt-4.1-nano: 52%
- gpt-4.1-mini: 72%
- gpt-4.1: 93%
- gpt-5: 100%
I was so surprised by gpt-5 getting 100% that I ran it again with 1000 samples. It got 999 correct, and one wrong.
To reproduce it yourself, clone the repo, add a .env file with OPENAI_API_KEY, `uv sync`, and then run:
uv run inspect eval evals/table_formats_eval.py@table_formats_markdown_kv --model openai/gpt-5 --limit 100
Update: Also, number of rows makes a massive difference, unsurprisingly; at 100 rows, gpt-4.1-nano scores 95%+ for both markdown-kv and csv. Both model and record count seem to matter a lot more than format.
uv run inspect eval evals/table_formats_eval.py@table_formats_csv --model openai/gpt-5 --limit 100
uv run inspect eval evals/table_formats_eval.py@table_formats_json --model openai/gpt-5 --limit 100
uv add google-genai
uv run scripts/run_benchmarks.py --models google/gemini-2.5-pro --formats markdown_kv --limit 100
And add GOOGLE_API_KEY=<your-key-here> to a file called .env in the repo root. Unfortunately I started getting "quota exceeded" almost immediately, but it did give 6/6 correct answers before it crapped out.
100 samples:
- gemini-2.5-pro: 100%
- gemini-2.5-flash: 97%
> accuracy: 60%
Not to mention that the least poorly performing format is probably the stupidest way to encode tabular data, beating even XML. But I guess that's the new normal because we're trying to shoehorn conversational AI models into every use case rather than, say, training finetunes that are better at particular tasks. (Yes, of course you can't train finetunes when the model is a proprietary black box on someone else's computer.) Something about hammers and nails…
To explain the 60% a bit more...
With small amounts of input data, the accuracy is near 100%. As you increase the size of the input data, the accuracy gradually decreases.
For this test, I intentionally chose an input data set large enough that the LLM would score in the region of 50% accuracy (with variation between formats) in order to maximise the discriminative power of the test.
It looks to me that the most concise way of representing each of these tables was CSV, followed by a standard markdown table. The token count appears to be 1/2 or 1/3 of the other options. For experiments not in mice (GPT-4.1-nano) but in larger models, or with substantial context aside from the data table itself, my guess is that preserving context might be higher value than the higher LLM legibility of markdown-KV.
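If you want to check the token-count intuition yourself, here's a rough sketch (my own, not from the benchmark) using the `tiktoken` tokenizer and the example record shown further down the thread; exact numbers depend on the model's tokenizer, and CSV amortises its header over the whole table rather than per record:

```python
# Rough comparison of token counts for one record in two renderings (illustrative only).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

csv_text = (
    "id,name,age,city,department,salary,years_experience,project_count\n"
    "1,Charlie A0,56,New York,Operations,67896,7,1\n"
)
markdown_kv_text = (
    "## Record 1\n"
    "id: 1\nname: Charlie A0\nage: 56\ncity: New York\n"
    "department: Operations\nsalary: 67896\n"
    "years_experience: 7\nproject_count: 1\n"
)

print("csv tokens:        ", len(enc.encode(csv_text)))
print("markdown-kv tokens:", len(enc.encode(markdown_kv_text)))
```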
Interesting.
On your section "Limitations and Areas for Further Study",
What I'd be curious on future work would be,
- changing the order of the data on each table type
- changing the order of the questions
I'm curious to know whether what it fails on stays the same, whether it changes depending on location, whether it's a bias. Is it always a specific question? Is it always a specific value? Is it always question #x (or around question #x)? Does it tend towards x or y on types of questions?
Good idea
Where do you eat?
A) floor
B) table
C) dirt
In this case, the questions asked have an answer. The bias would then be on the order of the input data. It’s different enough that it triggered my curiosity.
## Record 1
```
id: 1
name: Charlie A0
age: 56
city: New York
department: Operations
salary: 67896
years_experience: 7
project_count: 1
```
Which makes sense to me because the problem with formats like CSV and regular markdown tables is that it is too easy for the model to mistakenly associate a value in a row with the wrong header. Explicit key/value formats like this, or YAML or JSON objects, make that a lot less likely.
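For contrast, here's the same record as CSV (my illustration): the model has to rely on position alone to know that 7 is years_experience and not project_count.

```
id,name,age,city,department,salary,years_experience,project_count
1,Charlie A0,56,New York,Operations,67896,7,1
```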
Then I realized from the table that XML used about 50% more tokens (~75K vs ~50K) for similar accuracy, and for the first time felt a kind of sympathy for the LLM…
> We only tested OpenAI’s GPT-4.1 nano.
As you can see it's near 100% recall across all formats for a good chunk of frontier models, with a few (curiously, mostly Claude) failing at basic prompt adherence ("Return just the number") but still returning the right answers. The major failures are from Mistral Medium, Llama Maverick, Llama 3 70b Instruct, Mistral Nemo, Gemma 3 12b It, GPT 4o/4.1 Mini etc.
Based on these limited tests, here's the leaderboard on formats FWIW:
- CSV: 84.25%
- Markdown Table: 82.65%
- YAML: 81.85%
- JSON Lines (jsonl): 79.85%
- Markdown key-value: 79.83%
- Pipe-delimited: 79.45%
- Natural language summary: 78.65%
- JSON: 77.73%
- HTML table: 75.80%
- XML: 73.80%
IMO the biggest takeaway really is: use the best model you can reasonably afford, and then the format chosen will matter less. The cheapest 100%-coverage models are Gemini 2.5 Flash and Deepseek Chat V3.1, FWIW. However, if you have no control over the model, then use CSV or Markdown Table, as these have the highest chance of success.
The MAJOR issue that we might not want to admit is that there are a thousand confounders that prevent any meaningful canonical learning here. Crucially: the data within the tabular structure itself matters HUGELY. The scary probabilistic nature of LLMs means the very subject of your queries can affect how the query is run, which is quite absurd from an I/O/computing purity perspective. This is why tooling is so important. Enable the LLM to write and execute code safely, and you don't need to worry about such free-prose frailties.
Could they be referring to this?
"Advanced version of Gemini with Deep Think officially achieves gold-medal standard at the International Mathematical Olympiad" https://deepmind.google/discover/blog/advanced-version-of-ge...
If you're curious, check out how mathematicians like Robert Ghrist or Terence Tao are using LLMs for math research, both have written about it online repeatedly (along with an increasing number of other researchers).
Apart from assisting with research, their ability on e.g. math olympiad problems is periodically measured and objectively rapidly improving, so this isn't just a matter of opinion.
Yes LLMs suck at calculating stuff. However they can manipulate equations and such, and sometimes impressively so.
I've been stunned by how many smart people talk so casually about how because LLMs aren't perfect, they therefore have no value. Do they just forget that nothing in the world is perfect, and the values of things are measured in degrees?
Not to mention now you have the compounded problem of your mistakes plus the calculator’s mistakes.
Yes, it's not 1%, but the argument is about them being imperfect devices. It's not a horrible thing to start with the presumption that calculators are not perfect.
But you seem to have missed the main point I was making. See? Another error. They're everywhere! ;)
Ah, but whose error? ;)
You really could’ve done without this bit.
To hopefully clarify a bit...
I intentionally chose input data large enough that the LLM would be scoring in the region of 50% accuracy in order to maximise the discriminative power of the test.
The author didn’t see much more than 60% accuracy, which is not very useful for many (most?) real-world tasks.
LLMs are expensive. Spending tokens to do something in bulk that is well suited to existing tools and algorithms is wasteful and slow. And the main reason is that, using LLMs, the original author reported only a 60% success rate for the task. Why spend many times more time, money, and energy to use an LLM on a well-understood preparatory task that it sucks at, when you can get much better results more cheaply with off-the-shelf tools and feed their results to the LLM for its unique value?
Check out LLMWhisperer from Unstract: it preserves table and layout fidelity when converting documents for LLM use. You can try it on complex PDFs or forms here: https://pg.llmwhisperer.unstract.com (no signup needed)
Layout preservation upstream often improves downstream accuracy more than choosing between CSV, JSON, or Markdown. Find more details here: https://unstract.com/llmwhisperer/
> Performance Optimization: Reducing processing overhead while maintaining accuracy
What on earth does it mean that this “optimized performance”? This is nonsensical content. Performance wasn’t even measured; accuracy was. You can tell this was AI generated because “Reducing processing overhead while maintaining accuracy” would likely be true for a perf optimization, but it has no meaning whatsoever in the context of the article.
This really throws into question whether I can take the rest of the article and data seriously.
"Write a function to find years of experience by name? Return just the number, e.g. '12'."
It works much better, and it can single-shot many of the processing requirements just from type definitions it can infer from the data.
This way it's easier to stick to tabular formats that have easy reading libraries: JSON with TypeScript/JavaScript, and maybe CSV with Python...
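As a rough illustration of what that looks like in practice (my sketch, not the commenter's code; it assumes the table has already been parsed into a list of dicts shaped like the markdown-kv record shown above):

```python
from typing import Optional

def years_of_experience_by_name(records: list[dict], name: str) -> Optional[int]:
    """Return years_experience for the first record whose name matches, else None."""
    for record in records:
        if record.get("name") == name:
            return record.get("years_experience")
    return None

# Example with the record from the thread:
records = [
    {"id": 1, "name": "Charlie A0", "age": 56, "city": "New York",
     "department": "Operations", "salary": 67896,
     "years_experience": 7, "project_count": 1},
]
print(years_of_experience_by_name(records, "Charlie A0"))  # -> 7
```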
```
System Sales(a)
Number of Units (in Millions)
────────────────────────────────────────────────────────────────────────
KFC Division                      31,981    $ 34,452
Taco Bell Division                 8,757      17,193
Pizza Hut Division                20,225      13,108
Habit Burger & Grill Division        383         713
YUM                               61,346    $ 65,466
```
I'm seeing pretty good success with extracting data out of 10-Qs, which are formatted like this by default, using the `edgartools` library's default `filing.text()` method.
They can transform information in tables, but information is lost due to that lack of understanding.
I experimented with an approach that uses the LLM to generate a bespoke transformation machine: a series of transform steps for extracting key data from large data sets.
https://tombers.github.io/oblique-angles/ai/education/2025/0...
Most existing pipelines address this by preprocessing the table into a linearized 1D string before passing it to the LLM — a question-agnostic step that may lose structural information.
Instead, one could retain the original table form and, when a question is asked, feed both the question and the original table (as an image) directly into the VLM. This approach allows the model to reason over the data in its native 2D domain, providing a more natural and potentially more accurate solution.
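A minimal sketch of that setup (mine, not the commenter's; it assumes the OpenAI Python SDK, a vision-capable model such as gpt-4o, and that the table has already been rendered to table.png):

```python
# Ask a vision model the question directly against an image of the table,
# instead of a linearized text dump.
import base64
from openai import OpenAI

client = OpenAI()

with open("table.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "What is Charlie A0's years_experience? Return just the number."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```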
Much more important than citation-farming a paper on 1% improved performance.
I once tried to get Claude and ChatGPT to build me an Excel financial model; it failed pretty hard. The models seem to lose track of where they are in a table.
Anything below 100% is actually pretty useless when it comes to stats.
You can confirm it's doing the right thing by reviewing the code it wrote.
I've been using pandas on-and-off for over a decade and I still haven't come close to doing that.
The tool uses an LLM to write code to parse the data and conduct the analysis to return back to the LLM. Otherwise, we found pumping raw table data into an LLM is just not reliable, even if you go to the effort of conducting analysis on smaller chunks and merging the results.
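A minimal sketch of that pattern (my own; `generate_analysis_code` is a hypothetical placeholder for whatever LLM call you use): the model writes pandas code against the schema, we execute it in a small namespace, and only the compact result goes back to the LLM, which also makes it easy to review the generated code as mentioned above.

```python
import pandas as pd

def generate_analysis_code(question: str, schema: str) -> str:
    """Hypothetical placeholder: ask your LLM to write pandas code that answers
    `question` against a DataFrame named `df` and assigns the answer to `result`."""
    raise NotImplementedError

def answer_with_code(df: pd.DataFrame, question: str) -> object:
    schema = ", ".join(f"{col} ({dtype})" for col, dtype in df.dtypes.astype(str).items())
    code = generate_analysis_code(question, schema)
    # Review `code` before running it; only the DataFrame is exposed to it.
    namespace = {"df": df}
    exec(code, {"pd": pd}, namespace)
    return namespace["result"]  # small, structured answer returned to the LLM
```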
https://ochagavia.nl/blog/configuration-files-are-user-inter...
https://news.ycombinator.com/item?id=45291858 (135 comments)
Why would anyone trust the output of an LLM if it is barely better than guessing and much, much worse than humans?
GPT-5 shows more impressive numbers, but for that particular task, the precision should be 100% - always. No matter how large the data set is or in which format. Why are we doing this?
* Multiple tasks vs 1
* O3/o3-mini + 4o/4o-mini instead of nano
* Extra credit: Inside a fixed cost/length reasoning loop
Ex: does the md-kv benefit disappear with smarter models that you'd typically use, and thus just become a 2-3x cost?
As markdown kv performs so well, I am now curious about TOML.
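For reference, Record 1 from the example above would look something like this as a TOML array-of-tables entry (my rendering; TOML wasn't one of the benchmarked formats as far as I can tell):

```toml
[[records]]
id = 1
name = "Charlie A0"
age = 56
city = "New York"
department = "Operations"
salary = 67896
years_experience = 7
project_count = 1
```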
The odd ones to me are HTML, which uses th and td to make index-based rows but somehow did better than JSON, and XML, which is like JSON with even more syntactic noise, placing better than INI. If I had to guess, I'd say it's because vast amounts of the web were in the training set.
The context I used in the test was pretty large. You'll see much better (near 100%) accuracy if you're using smaller amounts of context.
[I chose the context size so that the LLM would be scoring in the ballpark of 50% accuracy (with variation between formats) to maximise the discriminative power of the test.]
Y. Sui, M. Zhou, M. Zhou, S. Han, and D. Zhang, “Table Meets LLM: Can Large Language Models Understand Structured Table Data? A Benchmark and Empirical Study,” in Proceedings of the 17th ACM International Conference on Web Search and Data Mining, Merida Mexico: ACM, Mar. 2024, pp. 645–654. doi: 10.1145/3616855.3635752.
C. Pang, Y. Cao, C. Yang, and P. Luo, “Uncovering Limitations of Large Language Models in Information Seeking from Tables,” arXiv:2406.04113, Jun. 2024. doi: 10.48550/arXiv.2406.04113.
This should have been a python script.
How much of the current peak of the Gartner Hype Cycle should just be python scripts?
Or in this case gpt-4.1-nano
This has made me chuckle several times - thanks!