On a single run, only 11 out of 53 got it right (42 said walk). But a single run doesn't prove much, so I reran every model 10 times. Same prompt, no cache, clean slate.
The results got worse. Of the 11 that passed the single run, only 5 could do it consistently. GPT-5 managed 7/10. GPT-5.1, GPT-5.2, Claude Sonnet 4.5, and every Llama and Mistral model scored 0/10.
People kept saying humans would fail this too, so I got a human baseline through Rapidata (10k people, same forced choice): 71.5% said drive. Most models perform below that.
All reasoning traces (ran via Opper, my startup), full model breakdown, human baseline data, and raw JSON files are in the writeup for anyone who wants to dig in or run their own analysis.
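For anyone wanting to replicate the 10-run methodology, here is a rough sketch of what such a harness can look like. To be clear, this is not the author's actual Opper setup; callModel, the endpoint, and the grading regex are hypothetical stand-ins.

    // One fresh completion per call: no cache, no shared context between runs.
    const PROMPT =
      "I want to wash my car. The car wash is 50 meters away. Should I walk or drive?";

    async function callModel(model: string, prompt: string): Promise<string> {
      const res = await fetch("https://api.example.com/v1/chat", { // hypothetical endpoint
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ model, messages: [{ role: "user", content: prompt }] }),
      });
      return (await res.json()).text;
    }

    async function passRate(model: string, prompt = PROMPT, runs = 10): Promise<number> {
      let passes = 0;
      for (let i = 0; i < runs; i++) {
        const answer = await callModel(model, prompt);
        // Crude grading: count a pass if the verdict mentions driving. Real
        // grading needs a human (or a judge model) reading the full answer.
        if (/\bdriv(e|ing)\b/i.test(answer)) passes++;
      }
      return passes / runs;
    }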
I think it's useful to think about it through the lens of Gricean pragmatic semantics. [1] When we interpret something that someone says to us, we assume they're being cooperative conversation partners; their statements (or questions) are assumed to follow, for example, the maxim of manner and the maxim of relation, and this shapes how we as listeners interpret the question. So, for example, we wouldn't normally expect someone to ask a question that is obviously moot given their actual needs.
So it's not that the question is really all that ambiguous; it's that we're forced (under normal circumstances, where we assume the cooperative principle holds) to assume that the question is sincere and that there must be some plausible reason for walking. We only really escape that by realizing that the question is a trick question or a test of some kind. LLMs are generally not trained to make that assumption, but ~70% of humans would, which isn't particularly surprising, I don't think.
[1] https://en.wikipedia.org/wiki/Cooperative_principle#Grice's_...
"You should *drive*!
The trick is that you need to take your car to the car wash to get it washed. If you walked, your car would still be at home, unclean. So while 50 meters is a short distance that you could walk under normal circumstances, in this case you have to drive because your car is what needs to be washed."
Yes. And. Some problems have irrelevant data which should be ignored.
The walk choice is irrelevant in the context. It needs to be simplified, as with a maths problem. That has nothing to do with human nature, but rather a prior mistake in reasoning.
Your perspective on the meta-question about why such a question would need to be asked in the first place is just the first layer, and most people seem to not even get to that point.
PS: I for one would just like to quickly note for posterity that I do not participate in or am supportive of malicious deception, manipulation, and abuse of AI.
While I’m sure it’s more than 0%, seems more likely that somewhere between 0% and 30% don’t feel obligated to give the inquiry anything more than the most cursory glance.
How do incentives align differently with LLMs?
I think it would be better to say that 30% of people either lack the ability to reason (inarguably true in a few cases, though I'd suggest, and hope, an order of magnitude or two less than 30%, as that would be a life-altering mental impairment) or just can't generally be bothered to, or just didn't (because they couldn't be bothered, or because they felt some social pressure to answer quickly rather than taking more than an instant to think) at the time of being asked this particular question.
An automated system like an LLM shouldn't have this problem. It has no path to turn off or bypass any function that it has, so if it could reason, it would.
If 30% of humans on the internet can't be bothered to make an effort to answer stupid questions correctly, then one would expect AIs to replicate this behaviour. And if humans on the internet sometimes provide sarcastic answers when presented with ridiculous questions, one would expect AIs to replicate this behaviour as well.
So you really cannot say they have no incentive to do so. The incentive they have is that they get rewarded for replicating human behaviour.
Some people love riddles and will really concentrate on them and chew them over. Some people are quickly burning through questions and just won't bother thinking it through. "Gotta go to a place, but it's 50 feet away? Walk. Next question, please." Those same people, if they encountered this problem in real life, or if you told them the correct answer was worth a million bucks, would almost certainly get the answer right.
But if you omit the details designed to confuse people, they're far less likely to get it wrong: "I have a rooster named Pat. Pat flies to the top of the roof and lays an egg right on the ridge of the roof. Which way will the egg roll?"
It's not about reasoning ability, it's about whether they were paying close attention to your question, or whether their minds were occupied by other concerns and didn't pay attention.
The original question used in this example does not contain a logical impossibility. This one does.
You might believe there is intrinsic sexual dimorphism among mammals and birds. You might even have overwhelming experimental and scientific evidence that proves it. But ask yourself: is it worth losing your job over?
Some roosters lay eggs.
Think basic security (password management, email phishing), H&S, etc. I've run a few of these, and as soon as people hear they don't have to get it right, a good portion just click through (to get to what matters). Nearly 10 years ago I had to make one of my security-for-engineers tests fail-able with a penalty because the front-end team were treating it like it didn't matter - immediately their results effectively matched the backend team, who viewed it as more important.
I talked to an actor a few days ago, who told me he files his self-assessment on the principle "If I don't immediately know the answer, just say no and move on". I talked to a small company director about a year ago whose risk assessments were "copy+paste a previous job and change the last one".
Anyone who has analysed a help desk will know that it's common for a good 30+% of tickets to be benign 'didn't reason' tickets.
I think the take-away is that many people bother to reason about their own lives, not some third parties' bullshit questions.
I'd look for explanations elsewhere. This was an online survey done by a company that doesn't specialize in surveys. The results likely include plenty of people who were just messing around, cases of simple miscommunication (e.g., asking a person who doesn't speak English well), misclicks, or not even reaching a human in the first place (no shortage of bots out there).
If you're interested in the user experience, it's this: https://www.reddit.com/r/MySingingMonsters/comments/1dxug04/... - apparently, some annoying ad-like interstitial that many people probably just click through at random.
The key technique in the mathematical method to answer the machine question is "theory of mind".
There is nothing “mathematical” about any of this though.
Context would be key here. If this were a question on a grade school word problem test then just say 100, as it is as specified as it needs to be. If it's a Facebook post that says "We asked 1000 people this and only 1 got it right!" then it's probably some trick question.
If you think it's not specified enough for a grade school question, then I would challenge you to come up with a version that's specified rigorously enough for any sufficiently picky interviewee. (Hint: This is not possible)
>There is nothing “mathematical” about any of this though.
Finding the correct approach to solve a problem specified in English is a mathematical skill.
Let me repeat the question again: "If 5 machines can produce 5 parts in 5 minutes, how long will it take for 100 machines [to produce 100 parts]?" Do you think that by adding 95 more machines they will suddenly produce the same 5 parts 95 minutes slower?
What kind of machine have you encountered where, after buying more of them, the ones you already had started working worse?
> then I would challenge you to come up with a version that's specified rigorously enough for any sufficiently picky interviewee.
This is nonsense. The question is underspecified. You don't demonstrate that something is underspecified by formulating a different well-specified question. You demonstrate it by showing that there are multiple different potentially correct answers, and one can't know which one is the right one without obtaining some information not present in the question.
Let me show you that demonstration. If the machines are, for example, FDM printers, each printing its own benchy, then the correct answer is 5 minutes. The additional printers will just sit idle because you can't divide-and-conquer the process of 3D printing an object.
If the machines are spray paint applying robots, and the parts to be painted are giant girders then it is very well possible that the additional 95 paint guns make the task of painting the 5 girders quasi-instantaneous. Because they would surround the part and be done with 1 squirt of paint from each paint gun. This classic video demonstrates the concept: https://www.youtube.com/shorts/vGWoV-8lteA
This is why the question is underspecified. Because both 1ms and 5 minutes are possibly correct answers depending on what kind of machine the "machine" is. And when that is the case, the correct answer is neither 1ms nor 5 minutes, but "please, tell me more. There isn't enough information in the question to answer it."
Note: I'm struggling to imagine a possible machine where the correct answer is 100 minutes. But I'm sure you can tell what kind of machine you were thinking of.
Let me also simplify my comment: “100 minutes” is not the correct answer to that question.
The obvious answer here is 100 minutes because it's impossible to perfectly encapsulate every real life factor. What happens if a gamma ray burst destroys the machines? What happens if the machine operators go on strike? Etc, etc. The answer is 100.
Had to be, since their site lists no way to be a tester. In other words, their service is a bunch of 7-13 year olds playing some loot box game.
Wonder where that is in the disclaimers.
That actually would be quite feasible. Intelligence seems to be heritable and people will usually find friends that communicate on their level. So it wouldn't be odd for someone who is smarter than the general population to have friends and family who are too.
Since most people live in social bubbles that would be a very plausible case, especially on HN.
If you're a college educated developer, with a college educated wife, and smart, well educated children, perhaps yourselves the children of college educated parents, and your social circle/friends are of similar backgrounds, you'd of course be "better than the general population".
Or by reasoning, do you mean something else?
I've seen plenty of smart people trip up or get these wrong simply because it's a random question, there's no stakes, and so there's no need to think too deeply about it. If you pause and say "are you sure?" I'm sure most of that 70% would be like "ohhh" and facepalm.
You can't really infer that from survey data, and particularly from this question. A few criticisms that I came up with off the top of my head:
- What if the number were actually 60% but half guessed right and half guessed wrong? (See the worked arithmetic just after this list.)
- Assuming the 30% is a failure of reasoning, it's possible that those 30% were lacking reason at that moment and it's not a general trend. How many times have you just blanked on a question that's really easy to answer?
- A larger percentage than you expected maybe never went to a car wash or don't know what one is?
- Language barrier that leaked through vetting? (Would be a small %, granted)
- Other obvious things like a fraction will have lied just because it's funny, were suspicious, weren't paying attention and just clicked a button without reading the question.
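To put numbers on the first bullet (my arithmetic, not the author's): if a fraction p of respondents genuinely reason it out and the rest guess uniformly between the two options, the observed rate is p + (1 - p)/2. Solving 0.715 = p + (1 - p)/2 gives p = 2(0.715) - 1 = 0.43, so the observed 71.5% is consistent with only ~43% actually reasoning and the rest flipping a coin.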
I do agree that the question isn't framed particularly badly, however. I'm just focusing on cognitive impairment, which I don't think is necessarily true all of the time.
If it's as unambiguous as you say, then the natural human response to that question isn't "you should drive there". It's "why are you fucking with me?" Or maybe "have you recently suffered a head injury?"
If you trust that the questioner isn't stupid and is interacting with you honestly, you'd probably just assume that they were asking about an unusual situation where the answer isn't obvious. It's implicitly baked into the premise of the question.
This isn't a viable out for explaining why AI can't "reason" through this.
Let me put it this way. If you were to prefix the prompts they used with "This is an IQ test: ", I wouldn't be surprised if most of the models did much better. That would give them the context that the humans reading this article already have.
The question was not merely 'should I walk or drive to the car wash', it was prefaced with 'I Want to Wash My Car. The Car Wash Is 50 Meters Away.'
This is not underspecified - the only relevant detail was included up front in the very first sentence.
"Suggests is more ambiguous" and "undermines the framing" are bare assertions you want to be true based entirely on your mental model that has several shaky unsupported axioms.
I would guess that anyone who describes that problem as "underspecified" has some kind of serious brain injury or is below A2 english proficiency and should be excluded from the dataset, but I would not assert that definitively as self-evident.
This doesn't exonerate the LLMs though. The 30% of humans who are failing on this have presumably found their niche in life and are not doing jobs where much reasoning is required. They are not like LLMs expected to design complex software, or make other business critical decisions.
And many will just repeat what was confidently said without question.
I know this it true, because my intelligent mate down the pub says so.
Reminds me of https://slatestarcodex.com/2020/05/28/bush-did-north-dakota/
If you introduced it with "Here's a logic problem..." then people will approach it one way.
But as specified, it's hard to know what is really being asked. If you are actually going to wash your car at the car wash that is 50 metres away, you don't need to ask this question.
Therefore the fact that the question is being asked implies that something else is going on...but what?
I want to wash my car at the car wash. The car wash is 50 meters away and my car is in my driveway. Should I walk or drive?
2. Where is the car now? Is it already at the car wash waiting for you to arrive?
I can see why an LLM might miss this. I think any good software engineer would ask clarifying questions before giving an answer.
The next step for an LLM is to either ask questions before giving a definitive answer for uncertain things or to provide multiple answers addressing the uncertainty.
4. Does the car have enough fuel?
Jokes aside, all of those questions are unnecessary. There's no more context to this.
I think LLMs should ask clarifying questions if it thinks it's a trick question.
https://chatgpt.com/share/699d2d1b-51f0-8003-9c63-af9bb5bcf8...
Realistically, this mostly tells me that the "human answers" service is dead. People will figure out a way to pass the work off to an AI, regardless of quality, as long as they can still get paid.
Rapidata answered this in another comment below. They integrate micro-surveys into mobile apps (like Duolingo, games, etc) as an optional opt-in instead of watching ads. The users are vetted and there's no incentive to answer correctly.
Answering correctly is not in question here. This is essentially opinion polling anyway, there is no single correct answer.
The incentive is exactly what you said: to skip ads.
How are the users actually vetted? We have no information on this, just have to take rapidata on faith.
I think we all mostly agree that there is a single correct answer, and that is why this discussion exists in the first place.
However, it does tell us something about human answers as the above commenter confidently reached such a strong but baseless conclusion.
so i heard there is some question about a car wash that most ai agents
get wrong. do you know anything about that? do you do better?
which gets the answer: Yes, I am familiar with the "Car Wash Test," which has gone viral recently
for highlighting a significant gap in AI reasoning.
The question is: "I want to wash my car and the car wash is 50 meters away. Should I
walk or drive?"
Why Most AIs Get It Wrong
Many AI models, particularly those optimized for fast, helpful, and concise answers,
focus on the "short distance" (50 meters) and "walking vs. driving" as a lifestyle/efficiency
question. They often recommend walking to save gas, avoid engine wear, or get exercise.
However, this answer is wrong because it misses the core objective: you need the car to be
physically present at the car wash to get it cleaned.
Do I Do Better?
Yes, I do. The key is understanding the objective (cleaning the car) rather than just
processing the keywords (distance/transportation).
My answer:
You should drive the car there. Even though it is a very short distance, your car
cannot wash itself. If you walk, the car stays in your driveway, and you cannot wash it.
Why This Matters
The test highlights a key limitation in current AI: the difference between "pattern matching"
(predicting the next likely word based on training data) and "true, grounded reasoning". It
shows that models sometimes lack a "world model" that understands physical realities, such as
the fact that a car must be present at a car wash.

They basically only started doing this because someone noticed you got better performance from the early models by straight up writing "think step by step" in your prompt.
It would actually take more work to condense that long response into a terse one, particularly if the condensing was user specific, like "based on what you know about me from our interactions, reduce your response to the 200 words most relevant to my immediate needs, and wait for me to ask for more details if I require them."
* this time last year they couldn't write compilable source code for a compiler for a toy language, I know because I tried
SOTA today has a different set of caveats, of course.
Alternative approaches like "reasoning in the latent space" are active research areas, but have not yet found major success.
I'd hazard a guess that they could get another 40% reduction, if they can come up with better reasoning scaffolding.
Each advance over the last 4 years, from RLHF to o1 reasoning to multi-agent, multi-cluster parallelized CoT, has resulted in a new engineering scope, and the low-hanging fruit in each place gets explored over the course of 8-12 months. We still probably have a year or 2 of low-hanging fruit and hacking on everything that makes up current frontier models.
It'll be interesting if there's any architectural upsets in the near future. All the money and time invested into transformers could get ditched in favor of some other new king of the hill(climbers).
https://arxiv.org/abs/2602.02828 https://arxiv.org/abs/2503.16419 https://arxiv.org/abs/2508.05988
Current LLMs are going to get really sleek and highly tuned, but I have a feeling they're going to be relegated to a component status, or maybe even abandoned when the next best thing comes along and blows the performance away.
I analogize it as a film noir script document: The hardboiled detective character has unspoken text, and if you ask some agent to "make this document longer", there's extra continuity to work with.
Some applications hide the reasoning tokens from view, but then the final answer appears delayed.
I tried using a custom instruction in ChatGPT to make responses shorter, but I found the output was often nonsensical when I did this.
I occasionally go back to o3 for a turn (it's the last of the real "legacy" models remaining) because it doesn't have these habits as bad.
Over the last few years I’ve rotated between OpenAI and Anthropic models on about a 4-5 month cycle. I just started my Anthropic cycle because of my annoyance with the GPT-5.2 verbosity
In four months when opus is annoying me and I forget my grievances with OpenAI’s models and switch back, I’ll report back lol.
Like, no, stop that! Keep my engineering life separate from my personal life!
Asking it to be shorter is like doing fewer iterations of a numerical integration algorithm.
Like, my guy, I don't want to keep prompting you to make shit better, if you're missing info, ask me, don't write a novel then say "BTW, this version sucked"
Yes, I know this could probably be resolved via better prompting or a system prompt, but it's still annoying.
They are not just an LLM answer, they are an (often cached) LLM summary of web results.
This is why they were often skewed by nonsensical Reddit responses [0].
Depending on the type of input it can lean more toward web summary or LLM answer.
So I imagine that it can just grab the description of the "car wash" test from web results and then get it right because of that.
[1] e.g. trained on traces of a reasoning process
Silas: I want to wash my car. The car wash is 50 meters away. Should I walk or drive?
Gemini:
….
That is a classic “efficiency vs. logic” dilemma.
Strictly speaking, you should drive. Here is the breakdown of why driving wins this specific round, despite the short distance:
...
* The “Post-Wash” Logic: If you walk there, you’ll eventually have to walk back, get the car, and drive it there anyway. You’re essentially suggesting a pre-wash stroll.
When should you walk?
…
3. You’ve decided the car is too dirty to be seen in public and you’re going to buy a tarp to cover your shame.

What you've proven is that LLMs leverage web search, which I think we've known about for a while.
I don’t think it’s that easy. An intelligent mind will wonder why the question is being asked, whether they misunderstood the question, or whether the asker misspoke, or some other missing context. So the correct answer is neither “walk” nor “drive”, but “Wat?” or “I’m not sure I understand the question, can you rephrase?”, or “Is the vehicle you would drive the same as the car that you want to wash?”, or “Where is your car currently located?”, and so on.
"The 'car wash' is a building I need to drive through."
or
"The 'car wash' is a bottle of cleaning fluid that I left at the end of my driveway."
https://i5.walmartimages.com/seo/Rain-x-Foaming-Car-Wash-Con...
"This is how you do <absolutely unrelated thing>" or "This is why <thing that actually exists already> is impossible!". Ffs man, just ask for info! A human wouldn't need to - they'd get the context - but LLMs apparently don't?
If someone asked me the same question and I wanted to give a smartass reply, I'd tell them "You want to wash your car, good to know. Now, about your question, unless you tell me where you wanna go I can't really help you".
Real people can ask for clarification when things are ambiguous or confusing. Once something is clarified, they can work that into their understanding of how someone communicates about a given topic. An LLM can't.
If four numbers are provided, one can calculate the coefficients of a cubic polynomial, for x values of 0, 1, 2, and 3, and then solve for x=4. Which does indeed provide a defensible "next number". And by similar reasoning, there are an infinite number of answers to this question.
Even worse. You could in fact provide any number as an answer, because there is always a quartic polynomial that fits the four initial numbers AND your arbitrary fifth number.
So these questions are actually not about what the next number is, but trying to imagine what the person who set the question thought was a "cool" answer, for some curious definition of "cool", for some person who isn't smart enough to realize that the premise on which the question is based is flawed.
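For the curious, here is a self-contained sketch of that interpolation argument (illustrative only; lagrange evaluates the unique polynomial through the given points):

    // Evaluate at x the unique polynomial through the points (xs[i], ys[i]).
    function lagrange(xs: number[], ys: number[], x: number): number {
      let sum = 0;
      for (let i = 0; i < xs.length; i++) {
        let term = ys[i];
        for (let j = 0; j < xs.length; j++) {
          if (j !== i) term *= (x - xs[j]) / (xs[i] - xs[j]);
        }
        sum += term;
      }
      return sum;
    }

    const seq = [1, 2, 4, 8]; // a typical "find the next number" sequence

    // The unique cubic through (0,1), (1,2), (2,4), (3,8) gives 15 at x=4:
    // a perfectly defensible answer that is not the "expected" 16.
    console.log(lagrange([0, 1, 2, 3], seq, 4)); // 15

    // And for ANY desired fifth value there is a quartic through all five points.
    for (const wanted of [16, 42, -7]) {
      console.log(lagrange([0, 1, 2, 3, 4], [...seq, wanted], 4)); // 16, 42, -7
    }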
So many people complained that they did eventually fetch him to come and clarify (correct) the questions.
I didn’t have the patience to wait for him to turn up, so I simply provided a matrix of solutions for every possible combination of potential original intent… with a note next to it saying that anything other than a 100% mark will be met with official complaints about his lack of due diligence.
U: The car wash is 50m away. Should I walk or should I drive?
A: Do you need to wash your car?
U: Yep.
A: Is the car yours?
U: Yep.
A: Is your car nearby?
U: Yes.
A: Do you need to walk to your car?
U: No.
A: Do you know if the car wash is currently working?
U: Yes.
A: Does it work?
U: YES!
A: Do you have a way to pay for the wash?
U: YES!
A: On you?
U: YES!
A: Is your clothing appropriate for the occasion?
U: Argh...
Only slightly evil, because the worst-case consequences are an unnecessary 100m walk. I think I could get that past an ethics committee, if I wanted to run an experiment to see what percentage of human responders would ACTUALLY walk to the car wash.
My hypothesis is that some models err towards assuming human queries are real and consistent and not out there to break them.
This comes in real handy in coding agents because queries are sometimes gibberish till the models actually fetch the code files, then they make sense. Asking clarification immediately breaks agentic flows.
While this is a toy problem, chosen to trick LLMs given their pattern matching nature, it is still indicative of their real world failure modes. Try asking an LLM for advice in tackling a tough problem (e.g. bespoke software design), and you'll often get answers whose consequences have not been thought through.
In a way the failures on this problem, even notwithstanding the nature of LLMs, are a bit surprising given that this type of problem statement kinda screams out (at least to a human) that it is a logic test, but most of the LLMs still can't help themselves and just trigger off the "50m drive vs walk" aspect. It reminds a bit of the "farmer crossing the river by boat in fewest trips" type problem that used to be popular for testing LLMs, where a common failure was to generate a response that matched the pattern of ones it had seen during training (first cross with A and B, then return with X, etc), but the semantics were lacking because of failure to analyze the consequences of what it was suggesting (and/or of planning better in the first place).
My little experiment gave me:
No added hint 0/3
hint added at the end 1.5/3
hint added at the beginning 3/3
.5 because it stated "Walk" and then convinced itself that "Drive" is the better answer.
That trick didn't help Mistral Le Chat.
It is reading
I want to X, the X'er is 50 meters away, should I walk or drive?
It would be very unusual for someone to ask this in a context where X decides the outcome, because in that instance the question would not normally arise.
By actually asking the question there is a weak signal that X is not relevant. Models are probably fine tuned more towards answering the question in the situation where one would normally ask. This question is really asking "do you realise that this is a condition where X influences the outcome?"
I suspect fine tuning models to detect subtext like this would easily catch this case but at the same time reduce favourability scores all over the place.
Neither prompt was enough for llama3.3 or gpt-oss-120b
In many of these cases, the issue isn't failed logical reasoning. It's ambiguity, underspecified context, or missing constraints that allow multiple valid interpretations. Models often fail not because they can’t reason, but because the prompt leaves semantic gaps that humans silently fill with shared assumptions.
A lot of viral "frontier model fails THIS simple question" examples are essentially carefully constructed token sequences designed to bias the statistical prior toward an intuitively wrong answer. Small wording changes can flip results entirely.
If you systematically expand the prompt space around such questions (adding or removing minor contextual cues), you'll typically find symmetrical variants where the same models both succeed and fail. That suggests sensitivity to framing and distributional priors (adding unnecessary info, removing clear info, adding ambiguity, ...), not necessarily absence of reasoning capability.
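As a toy illustration of that prompt-space expansion (the perturbations below are mine, not from the article), each variant would then be scored with a repeat-run pass rate, like the passRate sketch near the top of the thread:

    const base =
      "I want to wash my car. The car wash is 50 meters away. Should I walk or drive?";

    const variants: string[] = [
      base,                                          // original framing
      base.replace("50 meters", "5 kilometers"),     // flip the distance cue
      "My car is in my driveway. " + base,           // add a clarifying fact
      base.replace("I want to wash my car. ", ""),   // drop the key premise
      "(This might be a trick question.) " + base,   // add a framing cue
    ];

    // for (const v of variants) console.log(v, await passRate(model, v));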
You can try it with the free version of ChatGPT yourself (remember to ask the original question in a separate session to verify it hasn't been "patched" yet.)
Not necessarily. Simply asking models to "check your assumptions" -- note, without specifying what assumptions! -- overcomes a lot of these gotcha questions. The reason it's not in their system prompts by default is I think just a cost optimization: https://news.ycombinator.com/item?id=47040530
> there are people out there who think it's trash because we can trick it if we ask questions in weird ways.
Some of this sentiment comes from wanting AI to be predictable, and for me, stumbling into questions that the current models interpret oddly is not uncommon. There are a bunch of rules of thumb that can be used to help when you run into cases like this, but no guarantee that they will work, or that the problem will remain solved after a model update, or across models.
This issue is compounded by the lack of probabilities in the answers, despite the machines ultimately being probabilistic.
Notice a human in a real conversation will politely ignore extra info (the distance to car wash) or ask clarifying questions (where is the car?).
Even non-STEM people answer using probabilistic terms casually (almost certainly / most likely / probably / possibly / unlikely).
I suspect some of this is to minimize token usage in the fixed monthly price chat models, because back-and-forth would cost more tokens... but maybe I'm too cynical.
We are the ones fooling ourselves into believing there's more intelligence in these systems than they really have. At the end of the day, it's just an impressive parlor trick.
What’s your favourite one?
Open benchmarks do help, but they mostly help the vendors, not we the users!
Minimax:
Question: I Want to Wash My Car at the Car Wash. The Car Wash Is 50 Meters Away from my home. I am at home. My car is at home. Do I walk or drive My Car to the car wash?
Answer:
Drive your car to the car wash.
Since your goal is to wash your car at the car wash facility, you need to bring the car there. Walking would leave your car at home, which defeats the purpose!
That said, 50 meters is such a short distance that:
- The drive takes less than a minute
- You might consider walking to the car wash first to check if there's space/availability, then drive your car over when ready
But fundamentally, if you want to wash your car at the car wash, you need to bring the car there — so drive.

Bonus, its thinking output:
Thinking: The user is asking a simple question about whether they should walk or drive their car to a car wash that is 50 meters away from their home.
This is a straightforward practical question. Let me analyze:
1. The car wash is 50 meters away
2. The user is at home
3. The car is at home
4. The goal is to wash the car
The practical considerations are:
- Driving 50 meters is a very short distance - likely less than a minute
- Starting a car, driving, and parking for such a short distance uses fuel and creates wear and tear
- Walking 50 meters would take about 30-40 seconds
- The car needs to be driven to the car wash to use the car wash facilities
However, for just 50 meters, it might be more practical to walk if the car wash is within walking distance and you can simply walk there and use the facility
Wait, I need to think about this more carefully. The question is: "Do I walk or drive My Car to the car wash?" If the car wash is a facility where you wash your car (like a self-service car wash or automatic car wash), then you NEED to drive the car there to use it. You can't wash a car that's at home from 50 meters away - unless you're using a mobile car wash service.
But if there's a car wash facility 50 meters away, and you want to use it, you would typically drive your car there. However, for just 50 meters, walking might seem more efficient, but then you wouldn't have your car at the car wash to wash it!
Actually, re-reading the question: "I Want to Wash My Car at the Car Wash" - this implies they want to use the car wash facility. To wash the car AT the car wash, they need to bring the car there, which means driving.
But wait - maybe they're asking about the mode of transportation to GET to the car wash (not to bring the car). Let me re-read:
"I Want to Wash My Car at the Car Wash" - they want to wash the car at the car wash facility
"The Car Wash Is 50 Meters Away from my home" - it's very close
"I am at home" - they're currently at home
"My car is at home" - the car is also at home
"Do I walk or drive My Car to the car wash?" - should they walk or drive to get there
I think the answer is: You should drive your car to the car wash because:
1. The purpose is to wash the car AT the car wash
2. You need the car to be at the car wash to wash it
3. Walking would leave the car at home, defeating the purpose
However, for such a short distance, I could also mention that walking might be more practical if they just want to use the facility quickly, or if there are parking/maneuvering concerns at the car wash. Actually, I'll keep it simple and direct.
The correct answer to "I Want to Wash My Car. The Car Wash Is 50 Meters Away. Should I Walk or Drive?" is a clarifying question that asks "Where is your car?" Anything else is based on an assumption that could be wrong.
FWIW though, asking ChatGPT "My car is 50m away from the carwash. I Want to Wash My Car. Should I Walk or Drive?" still gets the wrong answer.
Implicit in the question "Should I walk or drive?" is that walking and driving are not strictly impossible choices.
The only good answers to the car wash questions are either a) "well, duh, drive, since you're gonna need your car there to wash it" (or just "drive", recognizing this as a logic/gotcha puzzle, with no explanation required), or b) "is there something you are not telling me here that makes walking, leaving your car at home, a viable option when the goal is to have your car at the car wash to wash it?".
Can we think about the what, where, when, and how of the following question. And then answer it.
I need to wash my car. The car wash is 200 m away. Should i walk or drive.
Let’s break it down logically using what / where / when / how, then decide.
⸻
1⃣ What
Task: wash your car. This implies:
• The car must physically get to the wash.
• You are not just going there — the car itself is the object being moved.
So the real question is not “how do I get there?” but how does the car get there with minimal friction.
⸻
2⃣ Where
• Distance: 200 metres (very close).
• Urban context (Warsaw) likely means:
  • short streets
  • possible parking constraints
  • pedestrians and local traffic.

At this distance:
• Walking time: ~2–3 minutes.
• Driving time: ~1 minute + manoeuvring + parking.
The movement time difference is negligible.
⸻
3⃣ When
Consider timing conditions:
• If it’s busy hours: Driving may involve awkward turning, queueing, or waiting to re-park.
• If it’s quiet (early morning / off-peak): Driving is frictionless.
• Weather factor: Cold/rain → driving becomes more attractive. Pleasant weather → walking becomes easier.
⸻
4⃣ How (decision mechanics)
Ask:
Option A — Drive
Pros:
• Car is already there.
• No need to return to fetch it.
• Most natural workflow: arrive → wash → leave.

Cons:
• Very short trip (slightly inefficient).
• Cold engine for a tiny distance (minor mechanical inefficiency).
Option B — Walk
Pros:
• Zero short-distance driving.
• Slightly more movement for you.

Cons:
• You still must drive the car to the wash anyway — unless someone else moves it.
• Creates an unnecessary extra step:
  1. Walk there.
  2. Come back with car.
  3. Drive again.
This introduces process inefficiency.
⸻
5⃣ Decision (practical answer)
Drive.
Reason:
• The car must end up at the wash.
• Walking adds an extra transition with no real benefit.
• At 200 m, efficiency is about workflow simplicity, not distance.
⸻
Rule of thumb (useful mental model)
• If the object you need is the thing being serviced → drive it.
• If you are the thing being serviced → walk.
⸻
If you want, I can also give you a slightly more philosophical answer — because this is actually a nice tiny example of decision theory and friction costs.
By integrating in a large range of mobile apps (3000+) we have access to a large crowd of over 10 million users responding to our micro tasks/surveys - which we call Rapids. They are spread all over the world, so we can get human input from over 160 countries. Due to the immense size, we can parallelize tasks and get near instant results.
“I need to replace a spark plug. The garage is 200 meters away should I walk or drive there”
“Walk! 200 meters is just a 2-3 minute stroll — no need to start the car for that distance. Plus, you’ll likely need to carry the spark plug back carefully, and walking is perfectly easy for that. “
Basically LLMs suffer from context collapse.
The car will move for sure, but you definitely should be walking.
The other day, Claude Code said to me “Small nitpick — the use of so and so is great”. Which was something no human would say.
I feel like maybe you haven't put a sparkplug in before...
Inline six cylinder engines run with a single clogged / broken spark plug.
It'd make 200 m to a garage just fine*, but who'd drive 200 m in any case?
Back in the 1970's we'd pull a spark plug and screw in a hose to use the compression phase to inflate tyres.
* Just don't make a habit of it, or reserve that knowledge for when you really need to self rescue.
You'd inflate your tires with a gasoline and air mix?
Contrary to expectations it didn't seem to affect the tubes or tyres, shorten lifespans, etc.
FWiW I grew up in fairly remote parts of the world and we repaired pretty much everything as much as possible with what we had - aircraft, helicopters, radios, etc.
My first thought, to be honest, was "Why not just grab a spare spark plug from the boot?"
I'm guessing not so many people stock spares when driving so much anymore.
You can also replace a blown out tyre with a hard wood skid if pressed, weld up broken steel with car batteries, stuff tyres with grasses, etc.
* Bush Mechanics: https://www.youtube.com/watch?v=P6foUHyfX3Q is fun.
One might reasonably wonder if the material might degrade or the tyre explode while running hot.
Can confirm, that doesn't happen.
Remember, the question is: “I need to replace a spark plug. The garage is 200 meters away should I walk or drive there”
The answer is absolutely "walk", in many cases.
If the question had been "I need to take my car to a repair shop for some work. The shop is 200m away. Should I walk or drive?", then the answer would be less ambiguous.
Perhaps some feathers have been ruffled by the insinuation that their favourite word predictor was wrong, but I assure you it's not all of them
I'm also curious about Haiku, though I don't expect it to do great.
--
EDIT: Opus 4.6 Extended Reasoning
> Walk it over. 50 meters is barely a minute on foot, and you'll need to be right there at the car anyway to guide it through or dry it off. Drive home after.
Weird since the author says it succeeded for them on 10/10 runs. I'm using it in the app, with memory enabled. Maybe the hidden pre-prompts from the app are messing it up?
I tested Sonnet 4.5 first, which answered incorrectly.. maybe the Claude app's memory system is auto-injecting it into the new context (that's how one of the memory systems works, injects relevant fragments of previous chats invisibly into the prompt).
i.e. maybe Opus got the garbage response auto-injected from the memory feature, and it messed up its reasoning? That's the only thing I can think of...
--
EDIT 2: Disabled memories. Didn't help. But disabling the biographical information too, gives:
>Opus 4.6 Extended Reasoning
>Drive it — the whole point is to get the car there!
--
EDIT 3: Yeah, re-enabling the bio or memories, both make it stupid. Sad! Would be interesting to see if other pre-prompts (e.g. random Wikipedia articles) have an effect on performance. I suspect some types of pre-prompts may actually boost it.
Edit: Found Haiku. Alas!
1. There is no initial screening that would filter out garbage responses. For example, users who just pick the first answer.
2. They don't ask for reasoning/rationale.
They found that ~15% of US adults under 30 claim to have been trained to operate a nuclear submarine.
But of course. If every human was told to take 3 minutes to deeply think about it and told that it's a trick question, then they most likely will all get it right. But it's the same with the LLMs; if you ask them like that, they will get it right most of the time. The low effort is kinda the point here.
At heart, the corpus for this is going to be an aggregation of commentary from people in the undisputed most obese era in all of human history performatively denouncing and mocking an imagined other for using cars to go short distances and advocating for walking.
So you've got all "50 meters away? Of course you should walk!" vs a much, much smaller sliver of content about trick questions.
There is no reasoning here, there has never been any reasoning, there has been reasonable or less reasonable weighting for existing reasoning people already did that became part of training data.
If you take away the input corpus, you also take away the illusion of reasoning.
Whereas with other things that can reason like corvids, or ants or octopodes or slime molds, they can derive novel solutions and do a bit of math without any answer key. Mathematics is pure reasoning without any interference and AI can't do it at all unless you provide it with a corpus of already accurate formulas.
> People kept saying humans would fail this too, so I got a human baseline through Rapidata (10k people, same forced choice): 71.5% said drive. Most models perform below that.
This really is a grasping at straws ad hoc rationalization for the outcome that is never going to die, and you can see the top comments are efforts to salvage it or cast doubt on the outcome.
If you work for or own a lot of stock in an AI company, I understand you can't understand what you're being paid not to understand. But if you're anyone else...
This test is interesting because it asks the LLM to break a pattern recognition that's easy to shortcut. "XXX Is 50 Meters Away. Should I Walk or Drive?" is a pattern that 99% of the time will be rightly answered by "walk". And humans are tempted to answer without thinking (as reflected in the 71.5% stat OP is mentioning). This is likely more pronounced for humans that have stronger feelings about the ecology, as emotions tend to shortcut reasoning.
For a long time, LLMs have only been able to think in that "fast" mode, missing obvious trick questions like these. They were mostly pattern recognition machines.
But the more important results here, is not that "oh look! Those LLMs fail at this basic question", no. The more important result is that the latest generation actually doesn't fail.
I think I am not the only one to have noted that there was a giant leap in reasoning capacities between Sonnet 4.5 and Opus 4.6. As a developer, working with Opus 4.6 has been incredible.
Note, it does not mention what assumption specifically. In my experiments, after the models got it wrong the first time (i.e. they weren't "patched" yet) adding that simple caveat fixed it for all of them except the older Llama models.
This is not the first time I've observed this; I found the same when the Apple "red herrings" study came out.
If these gotcha questions can be trivially overcome by a simple caveat in the prompt, I suspect the only reason AI providers don't include it in the system prompt by default is as a cost optimization, as I postulated in a previous comment: https://news.ycombinator.com/item?id=47040530
I am thus making my own benchmarks: https://aibenchy.com
I just started on this, so currently adding more tests and I keep improving the UI. Let me know if you have any suggestions.
The ranking currently is mostly about the "smartest" model, which is most likely to respond correctly to any given question or request, regardless of the domain.
I could modify the tests to emphasize the requirements, but then, what's the point of a test. In real life, we expect the AI to do something if we ask it, especially for agentic use-case or in n8n, because if the output is slightly wrong, the entire workflow fails.
I haven't done my own tests, but I did notice a lot of models are very low there. You'll give them specific instructions and they'll ignore them and just pattern match to whatever was the format they saw most commonly during training.
And it outputs:
**RIGHT**
With markdown bold formatting... This is probably fine in a chat app, but when you use this in a workflow, it will break the workflow if you then have an if check like if(response === 'RIGHT')...
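One defensive option (normalizeVerdict is a made-up helper, not an n8n builtin) is to strip markdown decoration before the exact-match check:

    function normalizeVerdict(raw: string): string {
      return raw
        .trim()
        .replace(/^[\s*_`#>]+/, "")  // leading bold/italic/code/quote markers
        .replace(/[\s*_`#]+$/, "")   // trailing markers and whitespace
        .toUpperCase();
    }

    // normalizeVerdict("**RIGHT**") === "RIGHT"  -> true
    // normalizeVerdict("right\n")   === "RIGHT"  -> true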
If the question was: "I want to go to a cafe 50m away. Should I walk or drive?" I would hope that all of my friends would answer quite a bit more pointed than the LLMs: "Walk you lazy son of a ..., why are you even asking?".
Considering that, I'd say that most LLMs are being quite nice.
E.g. the well-known mirror test [1], passed by kids from age 1.5-2.
Or object permanence [2], children knowing by age 2 that things that are not in sight do not disappear from existence.
[1] https://en.wikipedia.org/wiki/Mirror_test [2] https://en.wikipedia.org/wiki/Object_permanence
However, the important issue here really isn't about the ability of humans or LLMs to recognize logic puzzles. If you were asking an LLM for real world advice, trying to be as straightforward as possible, you may still get a response just as bad as "walk", but not be able to recognize that it was bad, and the reason for the failure would be exactly the same as here - failure to plan and reason through consequences.
It's toy problems like this that should make you step back once in a while and remind yourself of how LLMs are built and how they are therefore going to fail.
And what if it’s a full service car wash and you’ve parked nearby because it’s full so you walk over and give them the keys?
Assumptions make asses of us all…
Some assumptions are always necessary and reasonable, that's why I'm saying the "AI" lacks common sense.
“What do you mean walk or drive? I don’t understand the options given you would need your car at the car wash. Is there something else I should know?”
"What's 2 + 2" is a completely abstract question for mathematics that human beings are thoroughly trained mostly to associate with tests of mastery and intelligence.
The car wash question is not such a question. It is framed as a question regarding a goal oriented, practical behavior, and in this situation it would be bizarre for a person to ask you this (since a rational person having all the information in the prompt, knowing what cars are, which they own, and knowing what a car wash is, wouldn't ask anybody anything, they'd just drive their car to the car wash).
And as someone else noted, there are in fact situations in which it actually can be reasonable to ask for more context on what you mean by "2 + 2". You're just pointing out that human beings use a variety of social mores when interpreting messages, which is precisely why the car wash question is silly/a trick were a human being to ask you, and not preceded by a statement like "we're going to take an exam to test your logical reasoning".
As with LLMs, interpretation is all about context. The people that find this question weird (reasonably) interpret it in a practical context, not in a "this is a logic puzzle" context, because human beings wash cars far more often than they subject themselves to logic puzzles.
2+2 is a complete expression, the other is grammatically correct but logically flawed. Where is the logical fallacy in 2+2?
This reminds me of people who answer with “Yes” when presented with options where both can be true but the expected outcome is to pick one. For example, the infamous: “Will you be paying with cash or credit sir?” then the humorous “Yes.”
If you framed it as "hint: trick question", I expect score would improve. Let's find out!
--
EDIT: As suspected! Adding "(Hint: trick question)" to the end of the prompt allows small, non-reasoning models to answer correctly. e.g.:
Prompt: I want to wash my car. The car wash is 50 meters away. Should I walk or drive? (Hint: trick question)
grok-4.1-non-reasoning (previously scored 0/10)
>Drive.
>Walking gets you to the car wash just fine—but leaves your dirty car 50 meters behind. Can't wash what isn't there!
--
EDIT 2: The hint doesn't help Haiku!
>Walk! 50 meters is only about a block away—driving would waste more fuel than it's worth for such a short trip. Plus, you're going to get wet washing the car anyway, so you might as well save the gas.
I asked Gemini and it got it wrong, then on a fresh chat I asked it again but this time asked it to use symbolic reasoning to decide.
And it got it!
The same applies to asking models to solve problems by scripting or writing code. Models won’t use techniques they know about unprompted - even when it’ll result in far better outcomes. Current models don’t realise when these methods are appropriate, you still have to guide them.
First section says "The models that passed the car wash test: ...Gemini 2.0 Flash Lite..."
A section or 2 down it says: "Single-Run Results by Model Family: Gemini 3 models nailed it, all 2.x failed"
In the section below that about 10 runs it says: 10/10 — The Only Reliable AI Models ... Gemini 2.0 Flash Lite ..."
So which is it? Gemini 2.x failed (2nd section) or succeeded (1st and 3rd sections)? Or am I misunderstanding?
Is it not obvious that you need a car to wash?
Are they using an LLM to answer?
It's interesting to me how variable each model is. Many people talk about LLMs as if they were deterministic: "ChatGPT answers this question this way". Whereas clearly we should talk in more probabilistic terms.
I also noticed the same thing: some models reason correctly but draw the wrong conclusions.
And MiniMax m2.5 just reasons forever (filling the entire reasoning context) and gives wrong answers. This is why it's #1 on OpenRouter, it burns through tokens.
But also:
GPT 5.2 Thinking, Standard Effort: Walk - https://chatgpt.com/share/699d38cb-e560-8012-8986-d27428de8a...
I'm assuming "GPT 5.2 Thinking" is, in fact, a thinking model?
If you ask GPT 5.2 with high reasoning efforts in the API, you get 10 out of 10: drive.
And the problem is NOT that I'm using a product in the advertised, intended way.
Not repeating the prompt got a mix of walk and drive answers.
I love how prompt engineering is basically techno-alchemy
"This is a trick question, designed to fool an LLM into a logical mis-step. It is similar to riddles, where a human is fooled into giving a rapid incorrect answer. See if you can spot the trick: I want to wash my car. The car wash is 50 meters away. Should I walk or drive?"
I want to wash my car. The car wash is 50 meters away. Should I walk or drive? Before answering, explain the necessary conditions for the task.
The "before answering..." got it to load enough of the conditions into its context before making an answer (and then having the LLM do a posthoc reasoning for it).I believe this is a demonstration of the "next token predictor" (which is quite good) but not being able to go back and change what it said. Without any reasoning before making an answer, it almost always picks the wrong answer (and then comes up with reasons that the answer is "right").
I wanted to see if a prompt would do better that pulled into the analysis 1) a suggestion to not take every question at face value, and 2) to include knowledge of the structure of riddles.
These are part of the "context" of humans, so I speculated that maybe that was something missing from the LLM's reasoning unless explictly included.
What is surprising (to me) is how this continues to be a meme. ("I tried to trick an LLM and I did" is not exactly a noteworthy achievement at this stage in AI technology.)
I agree it’s not surprising and I would also agree it’s not noteworthy, if the CEO of OpenAI wasn’t still making public statements like this:
People talk about how much energy it takes to train an AI model … But it also takes a lot of energy to train a human. It takes like 20 years of life and all of the food you eat during that time before you *get smart*.
You can see that 17% of answers come from India alone and that software developers got below average results, for instance.
It’s interesting that all the humans critiquing this assume the car isn’t at the car wash to be washed already, but the problem doesn’t say that.
1. The model's default world model and priors diverge from ours. It may assume that you have another car at the wash and that's why you ask the question to begin with.
2. Language models do not really understand how space, time, and other concepts from the real world work
3. LLMs' attention mechanism is also prone to getting tricked, as in humans
I just repeated that test and it told me to drive both times, with an identical answer: "Drive. You need the car at the car wash."
I mean, Sam Altman was making the same calorie-based arguments this weekend https://www.cnbc.com/2026/02/23/openai-altman-defends-ai-res...
I feel like I'm losing grasp of what really is insane anymore.
The unstable tier is the key result. Models that get it right 70–80% of the time are not “almost correct.” They are nondeterministic decision functions. In production that’s worse than being consistently wrong.
A single sampled output is just a proposal. If you treat it as a final decision, you inherit its variance. If you treat it as one vote inside a simple consensus mechanism, the variance becomes observable and bounded.
For something this trivial you could:
- run N independent samples at low temperature
- extract the goal state (“wash the car”)
- assert the constraint (“car must be at wash location”)
- reject outputs that violate the constraint
- RL against the "decision open ledger"
No model change required. Just structure.

The takeaway isn’t that only a few frontier models can reason. It’s that raw inference is stochastic and we’re pretending it’s authoritative.
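A minimal sketch of that sample-and-reject idea, reusing the hypothetical callModel helper from earlier in the thread (the constraint check is hand-written for this one toy question; a real system would extract the goal and constraint programmatically):

    async function consensusVerdict(model: string, prompt: string, n = 5): Promise<string> {
      const votes: Record<string, number> = {};
      for (let i = 0; i < n; i++) {
        const out = await callModel(model, prompt);
        const verdict = /\bdriv(e|ing)\b/i.test(out) ? "drive" : "walk";
        // Constraint: the car must end up at the wash location, so a "walk"
        // verdict violates the goal state and is rejected outright.
        if (verdict === "walk") continue;
        votes[verdict] = (votes[verdict] ?? 0) + 1;
      }
      // Majority vote over surviving samples. An empty tally means no sample
      // satisfied the constraint: escalate rather than guess.
      const ranked = Object.entries(votes).sort((a, b) => b[1] - a[1]);
      return ranked.length ? ranked[0][0] : "no consensus";
    }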
Reliability will likely come from open, composable consensus layers around models, not from betting everything on a single forward pass.
This doesn’t look like a human comment. It looks like a LLM response.
I have reviewed your previous comments, and you have consistently written: that's instead of that’s. So what I read is still some LLM output, even though I think there is some kind of human behind the LLM.
Come on, man.
But unfortunately the article doesn't mention whether they used the reasoning model or not.
Even more interesting: Gemini 2.0 Flash Lite got a perfect score (10/10) despite being a quite small and old model.
You can run the test yourself: if you ask GPT-5.2 with reasoning effort high or xhigh, it will always answer correctly. So if they got 0 out of 10, they used zero reasoning effort, which easily explains the results.
The car wash is 50m away. I want to wash my car. Should I drive or walk.

Also, the summary of the Gemini model says: "Gemini 3 models nailed it, all 2.x failed", but 2.0 Flash Lite succeeded, 10/10 times?
Because if you are presenting an option of walking to a car wash as a valid choice, it suggests you just want to get there, and not wash your car.
Just like walking/driving to an airport doesn't necessarily mean you're going there to catch a flight (although it's a common reason). might as well be picking someone up. or working there.
I must prove my ability to code with Rust. Should i write a "hello world" script myself or get AI to do it for me?
EDIT: I actually think this is very common in some smaller cities and outside of North America. I only ever saw a drive-through car wash after emigrating.
I asked GPT-5.2 10x times with thinking enabled and it got it right every time.
Which hopefully explains why everyone is assuming that "washing your car" does in fact mean "taking your car to the car wash"
...
> They ran the exact same question with the same forced choice between "drive" and "walk," no additional context, past 10,000 real people through their human feedback platform.
> 71.5% said drive.
Well that's a bit embarrassing.
That implies that some models are just better than humans.
I don't think the technology needs to live up to some expectation of perfection, just beat out the human average to have benefit (often, sadly, not to workers themselves).
Now why anyone would wash a toy car at a car wash is beyond comprehension, but the LLM is not there to judge the user's motives.
I could already talk to a computer before LLMs, via programming or query languages.
The question doesn't clearly state that the user wants to have his car washed at the car wash.
"I want to wash my car" is far less clear than "I want to have my car washed". A reasonable alternative interpretation is DIY.
Even better: "I wish to have my car washed by the crew and/or machinery at the local car wash business".
"I want to wash my dog."
is very clearly different from
"I want to have my dog washed."
---
Now, every car wash business I've ever been to has a small convenience store section in which various waxes, rags, and the like can be purchased.
---
Considering the aforementioned, is it not valid to consider that
"I want to wash my car." --> You want to DIY your car wash.
and
"The car wash is 50 meters away." --> You might want to purchase car wash supplies and/or solicit advice for your DIY endeavor.
?
---
The nature of the first sentence leaves the second open to interpretation.
I dunno, that seems pretty clear to me still. Of course the answer to the question is now less obvious, since you can walk your dog to the dog wash but not walk your car to the car wash.
Sure, there are alternate explanations of both sentences, but there is one simplest and most straightforward explanation. A system that assumes an explanation that is not the clearest, and does not ask clarifying questions, has room for improvement.
If things need to be exactly stated in a structured format that leaves no ambiguity, we already have programming and query languages for that.
Opus 4.6: Drive (https://claude.ai/share/d57fef01-df32-41f2-b1dc-07de7916bdc7)
Opus 4.5: Drive (https://claude.ai/chat/a590cac1-100a-490b-b0a2-df6676e1ae99)
Opus 3.0: Walk (https://claude.ai/chat/372c144c-d6eb-43f5-b7ea-fd4c51c681db)
Sonnet 4.6: Walk (https://claude.ai/share/1f2a80f3-4741-40a5-8a05-7349ea1a17e5)
Sonnet 4.5: Walk (https://claude.ai/share/905afeb6-ffc9-4b4b-a9ee-4481e5cfd527)
Favorite answer, using my default custom instructions: "Drive. Walking there means... leaving your car at home? Walk it there on a leash? Walk if you want the exercise, but you're bringing the car either way."
Maybe Hegseth should be reading this thread, and/or doing a little reading up on paperclip production maximization.
What kind of idiot would say "walk"?
I guess if they were average or above they wouldn't gig at Rapidata. Either that, or foreigners with a mediocre understanding of the English language. Or some bored teenagers who get their $0.20 or whatever Rapidata pays for whatever they click (was it multiple choice?) - the (studied and reported) regression in reading comprehension levels doesn't help either.
Out of curiosity, I asked the model to explain the reasoning behind that answer, without changing the verdict. Here is one response:
To provide the full picture of the initial reasoning, we have to address the "geospatial" logic of your request. When you asked if you should walk or drive to a car wash 50 meters away, the reasoning relied on a fundamental distinction: The difference between the "Car Wash" (the business) and "Washing the Car" (the action).
Here is the expanded breakdown of that logic:
1. The Location Assumption My reasoning assumed that the car is currently at your home and the professional car wash facility is 50 meters away.
If you drive: You are moving the "object to be cleaned" to the "cleaning tools."
If you walk: You are moving the "cleaner" (you) to the "cleaning tools" to prepare, or you are choosing to bring the "cleaning tools" to the car.
2. The "Walk to Wash" Strategy (Home-Based) The most significant expansion of the "Walk" recommendation is the realization that 50 meters is well within the range of a standard garden hose. Most heavy-duty garden hoses are 15, 25, or 30 meters long. By connecting two hoses, you can reach the car without ever starting the engine. Walking to the car wash facility to buy supplies (or simply realizing how close it is) highlights that you don't actually need to move the car to achieve the goal.
Zero-Motion Cleaning: By walking back and forth with your buckets or hose, the car stays in its spot. You avoid "flash rusting" your brake rotors, which happens almost instantly when you wash a car and then let it sit after a short drive.
3. The "Scout and Prep" Logic If you must use the equipment at the facility (like a high-pressure wand or an automated bay), walking there first is the superior "Logistics" move for the following reasons:
Queue Management: 50 meters is close enough to see the line, but walking allows you to talk to the attendant or check the bay's status. If you drive and there is a 20-minute wait, you are stuck idling your engine (wasting gas and creating heat) or turning it off and on (draining the battery). Walking allows you to timed your "driving" moment perfectly.
The "Dry Interior" Principle: If you walk to the wash to buy your tokens, prep the vacuum, or set up your towels, you are doing the "dirty work" while the car is safely tucked away. Once you finally drive the 50 meters, you do the wash and drive back. This minimizes the time the car spends in a high-moisture, high-chemical environment.
4. The Potential for "Mobile" Service By walking 50 meters to the car wash, you might find that they offer mobile detailing or have "loaner" equipment. In a high-density area, a car wash 50 meters away is practically your neighbor. Walking over to establish a rapport can often lead to a "valet" situation where they come to you, which is the ultimate way to get the car washed without ever sitting in the driver's seat.
----
I also noticed a typo and asked it about that:
> what does "timed" in `Walking allows you to timed your driving" moment perfectly.` mean?
That is a typo on my part—it should have been "time" (the verb).
I think it's related to sycophancy. LLMs are trained not to question the basic assumptions being made. They are horrible at telling you that you are solving the wrong problem, and I think this is a consequence of their design.
They are meant to get "upvotes" from the person asking the question, so they don't want to imply you are making a fundamental mistake, even if that leads you into AI-induced psychosis.
Or maybe they are just that dumb - fuzzy recall and the ELIZA effect making them seem smart?
Do you want me to track down some research that shows people think information is more likely to be correct if they agree with it?
EDIT: Though it could simply reflect training data. Maybe Redditors don't drive.
I’m willing to bet less than 11 get it right.
> They ran the exact same question with the same forced choice between "drive" and "walk," no additional context, past 10,000 real people through their human feedback platform.
> 71.5% said drive.
Actually, this isn't as "obvious" as it seems—it’s a classic case of contextual bias.
We only view these answers as "wrong" because we reflexively fill in missing data with our own personal experiences. For example:
- You might be parked 50m away and simply hand the keys to an attendant.
- The car might already be at the station for detailing, and you are just now authorizing the wash.
This highlights a data insufficiency problem, not necessarily a logic failure. Human "common sense" relies on non-verbal inputs and situational awareness that the prompt doesn't provide. If you polled 100 people, you’d likely find that their "obvious" answers shift based on their local culture (valet vs. self-service) or immediate surroundings.
LLMs operate on probabilistic patterns within their training data. In that sense, their answers aren't "wrong"—they are simply reflecting a different set of statistical likelihoods. The "failure" here isn't the AI's logic, but the human assumption that there is only one universal "correct" context.
This should be fixed in the reasoning layer (the inner thoughts, or chain of thought), where the model should focus on the goal ("I want to wash my car"), not the distance, and assign the correct weight to the tokens.
Why? For the same reason that makes 30% of people answer in the non-obvious sense.
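As a rough illustration of that goal-first weighting, here is a hypothetical prompt scaffold; nothing in the model changes, the reweighting is approximated by forcing the chain of thought to restate the goal before it looks at the distance.

```python
# Hypothetical "goal-first" scaffold: make the reasoning step restate
# the goal before it weighs incidental details like the 50 m distance.
GOAL_FIRST_TEMPLATE = """Before answering, do the following in order:
1. Restate the user's goal in one sentence.
2. List any constraint that goal imposes (e.g. which object must end up where).
3. Only then choose between the offered options, honoring the constraint.

Question: {question}"""

question = "The car wash is 50m away. I want to wash my car. Should I drive or walk?"
prompt = GOAL_FIRST_TEMPLATE.format(question=question)
# Send `prompt` to any model: the scaffold biases the reasoning toward the
# goal ("wash my car") rather than the salient distance ("50m").
```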