See for example this recent paper, where AI managed to beat radiologists at interpreting x-rays... when the AI didn't even have access to the x-rays: https://arxiv.org/pdf/2603.21687 (on a pre-existing "large scale visual question answering benchmark for generalist chest x-ray understanding" that wasn't intentionally messed up).
And in interpreting x-rays, human radiologists actually do just look at the x-rays. In the context the article is discussing, the human doctors don't just look at the notes to diagnose the ER patient. You're asking them to perform a task that isn't necessary, that they aren't experienced or trained in, and then saying "the AI outperforms them". Even if the notes aren't accidentally giving away the answer through some weird side channel, that's not that surprising.
Which isn't to say that I think the study is either definitely wrong, or intentionally deceptive. Just that I wouldn't draw strong conclusions from a single study here.
So I’m genuinely curious:
What is the specific capability (or combination of capabilities) that people believe will remain, permanently (or at least for decades), beyond a top medical AI's ability to match or exceed a good human doctor's performance? Let's put liability and ethics aside and be purely objective about it.
Medicine is so much more than "knowledge, experience, and pattern matching", as any patient ever can attest to. Why is it so hard for some people to understand that humans need other humans and human problems can't be solved with technology?
Now replace some / all of those humans with... A machine whose function also needs insurance approval.
It's gonna end badly.
I still think healthcare needs to be reformed, and I hope that insurance will someday be a thing of the past, but I've hung up my chain saw for now.
Things were ruined slowly. They unfortunately will need to be fixed very slowly too.
When the wrong targets get destroyed, everyone suffers. When parasitic forces are destroyed, the system functions better. It's the difference between defense and friendly fire.
This even translates to the pediatric space. I took all of my kids to the pediatrician because either they don't make comments to me like they do to my wife, or I don't take shit from them. I'm not sure which. Here's an example:
My wife and daughter were there and the doctor asked what kind of milk my daughter was drinking. She said "whole milk" and the doctor made a comment along the lines of "Wow, mom, you really need to switch to 2%". To understand this, though, you need to understand that my daughter was _small_. Like they had to staple a 2nd sheet of paper to the weight chart because she was below the available graph space. It wasn't from lack of food or anything like that; she was just small and didn't have much of an appetite.
So I became the one to take the kids there. Instead of chastising me, they literally prescribed cheeseburgers and fettuccine alfredo.
My daughter is in her 20s now and is still small -- it's just the way she is. When she goes to see her primary, do you know what their first question is? "When was your last period."
However, your argument focuses on the routine intake instead of any listening part. The fact that the doctor measures height, weight, temperature, and blood pressure on intake and then asks about LMP doesn’t surprise me… that’s the part of the script where you just provide the data before you bring up concerns.
Not to say the doctor was not a jerk, just that your argument doesn’t do much for me.
The weight thing was not the key aspect of my original comment. They chastised my wife for continuing to give my daughter whole milk while being underweight, but did not make similar comments to me. That was the point.
For women, their pains and problems are far too often whisked away by hand waving and "it's hormones and periods" and serious issues are often overlooked. Very little has changed in that area over the last twenty years.
I wonder how many units of their training courses are spent on this and how much is spent on the cultural reinforcement of it.
* https://www.health.harvard.edu/pain/the-dangerous-dismissal-of-womens-pain
* https://pmc.ncbi.nlm.nih.gov/articles/PMC10937548/
Are you really unwilling to admit that such a bias exists?

Is that supposed to be a problem? How does it connect to the story in your comment?
The question seems to be warranted to me, since being underweight can stop you from menstruating. So if you find someone thin and her last period was off in the distant past, you can conclude that there's a problem and something should be done about it; if it was a couple of weeks ago, you can conclude that she's fine.
(It could also just be something that is automatically assessed as a potential indicator of all kinds of different things. Notably pregnancy. For me, it bothered me that whenever you have an appointment at Kaiser for any reason, part of their checkin procedure is asking you how tall you are. I'd answer, but eventually I started pointing out to them that I wasn't ever measuring my height and they were just getting the same answer from my memory over and over again. [By contrast, they also take your weight every time, but they do that by putting you on a scale and reading it off.] The fact that my height wasn't being remeasured didn't bother them; I'm not sure what that question is for.)
Particularly given the alarming stories of people being prosecuted for having miscarriages, it feels ridiculous.
If anything, I hope more automated diagnostics and triage could help women and POC get better care, but only if there are safeguards against prejudice. There are studies showing different rates of pain management across races and sexes, for example. A broken bone is a broken bone, regardless of sex or race.
You are asking how it connects, and it absolutely doesn't. But they keep asking and won't accept "it's regular" as an answer.
She's in her 20s and is seeing her primary for routine things, not because of her weight -- that part of the story was about how they chastised my wife for giving her whole milk but said absolutely nothing to me about it later on.
It doesn't have opinions, research, direction of its own. Is this a path of codifying the worst elements of human society as we've known it, permanently?
One was against it, the other one saw it as a good idea.
I would love to have real data, real statistics etc.
Also, the very idea that LLMs would prescribe you ritalin at all is laughable... Having no human doctors in the loop is a guaranteed way to cut prescription drug abuse, as ya can't really bribe an LLM or appeal to its humanity...
How are you defining technology? How are you defining human problems? Inventions are created to solve human problems, not theoretical problems of fictional universe. Do X-rays, refrigerators, phones and even looms solve problems for nonhumans?
Claiming something that sounds deep doesn’t make it an axiom.
Ok fellas, put your money where your mouth is. It's easy to talk until you put your money behind it (or withdraw it, by cutting spending accordingly) if you are so confident in doctor-as-a-service by LLM.
Humans (doctors/nurses) can still be there to make you feel the warmth of humanity in your darkest times, but if a machine is going to perform better at diagnosing (or perhaps someday performing surgery), then I want the machine.
Even now, I'll take a surgeon that's a complete jerk over a nice surgeon any day, because if they've got that job even as a jerk they've got to be good at their jobs. I want results. I'll handle hurt feelings some other time.
This seems like an incredibly poor line of reasoning.
Hospitals are often desperate for surgeons. The poorly mannered ones are often deeply unsatisfied, angry at the grueling lives they've opted into, and the hospitals can't replace them. The market is not exactly at work here.
The truly compassionate surgeons will want to improve their skills because they care about their patients. They care if their patients develop complications and may feel terrible when they happen; the jerk may not. Being a jerk may let a surgeon rise to the top, but that may not be due to surgical skill at all; they may simply be better at navigating politics.
Dude, you removed my right thumb! I was in for an appendectomy!?
You are so right! I ignored everything you asked for. I am so sorry. I am administering general anesthesia now, then I will prepare you for your next surgery.
If I were picking a specialty now, I'd go with pediatrics or psychiatry over something like oncology.
For instance, transportation is a "human problem". It's being successfully solved with such technologies as cars, trains, planes, etc. Growing food at scale is a "human problem" that's being successfully solved by automation. Computing... stuff could be a "human problem" too. It's being successfully solved by computers. If "human problems" are more psychological, then again, you can use the Internet to keep in touch with people, so again technology trying to solve a human problem.
patients -> AI -> diagnosis (you know, with a camera, or perhaps a telephone I guess)
What REALLY happened
patients -> nurse/MD -> text description of symptoms -> MD -> question (as in MD asked a relevant diagnostic question, such as "is this the result of a lung infection?", or "what lab test should I do to check if this is a heart condition or an infection?") -> AI -> answer -> 2 MDs (to verify/score)
vs
patients -> nurse/MD -> text description of symptoms -> MD -> question -> (same or other) MD -> answer -> 2 MDs verify/score the answer
Even with that enormous caveat, there are major issues:
1) The AI was NOT attempting to "diagnose" in the Doctor House sense. The AI was attempting to follow published diagnostic guidelines as perfectly as possible. A right answer by the AI was the AI following MDs' advice, a published process, NOT the AI reasoning its way to what was wrong with the patient.
2) The MD with AI support was NOT more accurate than the MD alone (a better score, but NOT statistically significant, hence not more accurate). However, it was very much a nurse or MD taking the symptoms and an MD pre-digesting the data for the AI.
3) Diagnoses were correct in the sense that they followed diagnostic standards, as judged afterwards by other MDs, NOT in the sense that anything was tested on a patient and actually helped a live patient (in fact, no patients were directly involved in the study at all).
If you think about it, for most patients even the treating MDs never learn the correct conclusion. They saw the patient come in, they took a course of action (and probably wrote at best half of it down), and the situation of the patient changed. And we repeat this cycle until the patient goes back out, either vertically or horizontally. Hopefully vertically.
And before you say "let's solve that", keep in mind that a healthy human is only healthy in the sense that their body has the situation under control. Your immune system is fighting 1,000 kinds of bacteria and 10 or so viruses right now, even when you're very healthy. There are also problems that developed during your life: scars; ripped and imperfectly repaired blood vessels; muscle damage; bone cracks; parts of your circulatory system under way too much pressure; wounds; things that made it through your skin and are leaking stuff into your body (splinters, insects, parasites, ...); 20 cancers attempting to spread (depends on age, but even a 5-year-old will have some of that); food that you really shouldn't have eaten; etc., etc. If you go to the emergency room, the point is not to fix all problems. The point is to get your body out of the worsening cycle.
This immediately calls up the concern that this is from doctor reports. In practice, of course, maybe the AI only performs "better" because a real doctor walked up to the patient and checked something for himself, then didn't write it down.
What you can perhaps claim this study says is that, in the right circumstances, AIs can perform better at following an MD's instructions under time and other pressure than an actual MD can.
But two facts are also true: a) diagnosis itself can be automated. A lot of what goes on between you having an achy belly and you getting diagnosed with x, y, or z happens outside of a direct interaction with you - all of that can be augmented with AI. And b), the human interaction part is lacking a great deal in most societies. Homeopathy and a lot of alternative medicine, from what I can see, have their footing in society simply because they're better at talking to people. AI could also help with that, both in direct communication with humans, but also in simply making a lot of processes a lot cheaper, and maybe e.g. making the required education to become a patient-facing medical professional less of a hurdle. Diagnosis becomes cheaper and easier -> more time to actually talk to patients, and more diagnoses made with higher accuracy.
Unfortunately, this is not likely to happen. More like:
Diagnosis becomes cheaper & easier -> more patients a doctor is expected to see in the same period of time as before
I know. I know. Part of it is that talking to patients is, on average, useless, but this still can't really be used as an argument against AI.
Still, doctors can have a broader picture of the situation, since they can look at the patient as a whole; something the LLM can't really synthesize in its context.
They're also going to tell you things other than just what your insurance is agreeing to.
A robo-doctor will be corrupt in ways a regular doctor could be held accountable for, but without the individual accountability.
This is a pretty wild leap. Code has a lot of hooks for training via hill-climbing during post-training. During post-training, you can literally set up arbitrary scenarios and give the bot more or less real feedback (actual programs, actual tests, actual compiler errors).
It's not impossible we'll get a training regime that does the "same thing" for medicine that we're doing for code, but I don't know that we've envisioned what it looks like.
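To make "hooks for hill-climbing" concrete, here's a toy sketch (the names and the toy task are mine, not any lab's actual pipeline) of the verifiable reward code enjoys and medicine currently lacks: run the candidate program against real tests and turn pass/fail into a score.

  # Toy verifiable-reward loop for code; illustrative only.
  def safe_call(fn, x):
      try:
          return fn(x)
      except Exception:
          return None

  def run_tests(candidate_src, tests):
      """Execute candidate source and score it by the fraction of tests passed."""
      namespace = {}
      try:
          exec(candidate_src, namespace)   # "actual programs"
          fn = namespace["solve"]
      except Exception:
          return 0.0                       # "actual compiler errors" = zero reward
      return sum(safe_call(fn, x) == want for x, want in tests) / len(tests)

  tests = [(2, 4), (3, 9), (10, 100)]      # toy task: return x squared
  good = "def solve(x):\n    return x * x"
  bad = "def solve(x):\n    return x + x"
  print(run_tests(good, tests), run_tests(bad, tests))  # 1.0 0.333...

There's no equivalent oracle for "did this treatment plan help the patient", which is the gap the comment above points at.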
I suspect even prose is largely considered acceptable in professional uses because we haven’t developed a sensitivity to the artifice, and we probably won’t catch up to the LLMs in that arms race for a bit. However, we always manage to develop a distaste for cheap imitations and relegate them to somewhere between the ‘utilitarian ick’ and ‘trashy guilty pleasure’ bins of our cultures, and I predict this will be the same. The cultural response is already bending in that direction, and AI writing in the wild— the only part that culturally matters— sounds the same to me as it did a year and a half ago. I think they’re prairie dogging, but when(/if) we drop that bomb is entirely a matter of product development.
The assumption that LLMs figuring out coding means they can figure out anything is a classic case of Engineer’s Disease. Unfortunately, this hubris seems damn near invisible to folks in the tech industry, these days.
The AI coding improvement should be partially transferrable to other disciplines without recreating the training environment that made it possible in the first place. The model itself has learned what correct solutions "feel like", and the training process and meta-knowledge must have improved a huge amount.
An ER staff is frequently making inferences based on a variety of things like weather, what the pt is wearing, what smells are present, and a whole lot of other intangibles. Frequently the patients are just outright lying to the doctor. An AI will not pick up on any of that.
It will if it trains on data like that. It's all about the training data.
Diagnostic standards in (at least emergency, but I think other specialties) medicine are largely a joke -- ultimately it's often either autopsy or "expert consensus."
We get to bill more for more serious diagnoses. The amount of patients I see with a "stroke" or "heart attack" diagnosis that clearly had no such thing is truly wild.
We can be sued for tens of millions of dollars for missing a serious diagnosis, even if we know an alternative explanation is more likely.
If AI is able to beat an average doctor, it will be due to alleviating perverse incentives. But I can't imagine where we could get training data that would let it be any less of a fountain of garbage than many doctors.
Without a large amount of good training data, how could AI possibly be good at doctoring IRL?
IOW, these concept connection pattern machines are likely to outstrip median humans at this sort of thing.
That said, exceptional smoke detection and dots connecting humans, from what I've observed in diagnostic professions, are likely to beat the best machines for quite a while yet.
The truth is we just don't know how things will play out right now, IMV. I expect some job destruction, some jobs remaining in all fields, some jobs changing, etc. We assume AI will either totally destroy a job or not, when in reality most fields will land somewhere in between. The mix of these outcomes is yet to be determined, and I suspect most fields will blend AI and human work in different ratios. Certain fields also have a lot of demand that can absorb this efficiency increase (health, for example, has a lot of unmet demand).
No, I don’t see that we must.
> if we already have this assumption for software engineers
No, this doesn’t follow. And even if it did: I am aware that the CEOs of firms with an extraordinarily large vested personal and corporate financial interest in this being perceived to be the case have said as much re: software engineers, but I don’t think it is warranted there, either.
Is medical diagnosis one of these high judgement tasks? Personally I don’t think so.
Quite to the contrary, I think it's extremely trivial to find a task where humans beat LLMs.
For all the money that's been thrown at agentic coding, LLMs still produce substantially worse code than a senior dev. See my own prior comments on this for a concrete example [1].
These trivial failure cases show that there are dimensions to task proficiency - significant ones - that benchmarks fail to capture.
> Is medical diagnosis one of these high judgement tasks?
Situational. I would break diagnosis into three types:
1. The diagnosis comes from objective criteria - laboratory values, vital signs, visual findings, family history. I think LLMs are likely already superior to humans in this case.
2. The diagnosis comes from "chart lore" - reading notes from prior physicians and realizing that new context now points to a different diagnosis. (That new context can be the benefit of hindsight into what they already tried and failed, and/or new objective data.) LLMs do pretty well at this when you point them at datasets where all the prior notes were written by humans, which means those humans did a nontrivial part of the diagnostic work. What if the prior notes were written by LLMs as well? Will they propagate their own mistakes forward? Yet to be studied in depth.
3. The diagnosis comes from human interaction - knowing the difference between a patient who's high as a bat on crack and one who's delirious from infection; noticing that a patient hesitates slightly before they assure you that they've been taking all their meds as prescribed; etc. I doubt that LLMs will ever beat humans at this, but if LLMs can be proven to be good at point 2, then point 3 alone will not save human physicians.
[1] https://news.ycombinator.com/threads?id=Calavar#47891432
I, and likely the person you replied to, don't find that existing studies actually hold this to be true.
If the latter part of your post were true, how come the demand for radiologists has grown? The problem with this place is it’s full of people who don’t understand nuance. And your post demonstrates this emphatically.
The first is that a technical solution can be trained on _ALL_ medical data and have access to it all in the moment. It is difficult to assume a doctor could also achieve this.
The second is that for medical cases understanding the sum of all symptoms and the patients vitals would lead to an accurate diagnosis a majority of the time. AI/ML is entirely about pattern recognition, when you combine this with point one, you end up with a system that can quickly diagnose a large portion of patients in extremely short timeframes.
On a different note, I think we can leave the ad-hominem attacks at home please.
Much more so than modern AI systems are.
In humans, improvement in a new domain seems to follow a logarithmic curve.
Why wouldn’t this be the same for an AI?
If anything, using AI, they may improve more than before.
More importantly, LLMs regularly hallucinate, so they cannot be relied upon without an expert to check for mistakes - it will be a regular occurrence that the LLM just states something that is obviously wrong, and society will not find it acceptable that their loved ones can die because of vibe medicine.
Like with software though, they are obviously a beneficial tool if used responsibly.
But a doctor's job in the real world today is to navigate a total mess of uncertainty: about the expected outcome of treatments given a patient's age and other problems; about the psychological effect of knowing about a problem that they cannot effectively treat; even about what the signals in the chart and x-ray mean with any certainty.
We are very far from having unit test suites for medical problems.
uhhhhhhh, I'm pretty behind-the-times on this stuff so I could be the one who's wrong here but I don't believe that has happened????
But anyways that nitpicking aside I agree with you wholeheartedly that reducing the doctor's job to diagnosis (and specifically whatever subset of that can be done by a machine-learning model that doesn't even get to physically interact with the patient) is extremely myopic and probably a bit insulting towards actual doctors.
Being a human when a patient is experiencing what is potentially one of the worst moments of their life. AI could be a tool doctors use, but let’s not dehumanize health care further, it is one of the most human professions that crosses about every division you can think of.
I would not want to receive a cancer diagnosis from a fucking AI doctor.
We're clearly not there yet, but it is inevitable that these models will eventually exceed human capability at identifying what an issue is, understanding all of the health conditions the patient has, and recommending a treatment plan that results in the best outcome.
You may not want to receive a cancer diagnosis from an AI doctor... but if an AI doctor could automatically detect cancer (before you even displayed symptoms) and get you treated at a far earlier date than a human doctor, you would probably change your mind.
It's going to be a while before robots are independently performing procedures and interpreting the imaging, although I suspect AI will eventually supersede humans here as well.
Nobody said that though?
If the current trajectory continues, and if advancements are made in automated data collection about patients, and if those advancements are adopted in the clinic, then presumably specialized medical models will exceed human performance at the task of diagnosis at some point in the future. Clearly that hasn't happened yet.
Medical models can absolutely get better at recognizing the patterns of diagnoses that doctors have already been making - which means they will also amplify misdiagnoses that aren't corrected for via the cohort average. It's easy to see a large problem with this: you end up with a pseudo-eugenics medical system that can't help people who aren't experiencing a "standard" problem.
I'd argue that the current system in the west already exhibits this problem to some extent. Fortunately it's a systemic issue as opposed to a technical one so there's no reason AI necessarily has to make it worse.
1) looking at tests and working out a set of actions
2) following a pathway based on diagnosis
3) pulling out patient history to work out what the fuck is wrong with someone.
Once you have a diagnosis, in a lot of cases the treatment path is normally quite clear (e.g. a patient comes in with abdominal pain; you distract the patient and press on their belly; when you release it they scream == very high chance of appendicitis; surgery/antibiotics depending on how close you think it is to bursting).
But getting the patient to be honest, and/or working out what is relevant information, is quite hard and takes a load of training. Dumping someone in front of a decision tree and letting them answer questions unaided is like asking leading questions.
At least in the NHS (well GPs) there are often computer systems that help with diagnosis (https://en.wikipedia.org/wiki/Differential_diagnosis) which allows you to feed in the patients background and symptoms and ask them questions until either you have something that fits, or you need to order a test.
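For a flavor of how such differential-diagnosis aids work, here's a toy sketch: the condition/symptom table is entirely made up, and real NHS decision-support systems are far more sophisticated. The shape is the same, though: score candidate conditions against reported symptoms, then ask about whatever would narrow down the leader.

  # Toy differential-diagnosis ranker; the symptom table is invented.
  CONDITIONS = {
      "appendicitis": {"abdominal pain", "rebound tenderness", "fever"},
      "gastroenteritis": {"abdominal pain", "diarrhoea", "vomiting"},
      "common cold": {"runny nose", "cough", "sore throat"},
  }

  def rank(reported):
      """Rank conditions by the fraction of their hallmark symptoms reported."""
      scores = [(name, len(reported & sx) / len(sx)) for name, sx in CONDITIONS.items()]
      return sorted(scores, key=lambda t: t[1], reverse=True)

  def next_question(reported):
      """Suggest asking about an unreported symptom of the leading candidate."""
      leader, _ = rank(reported)[0]
      return next(iter(CONDITIONS[leader] - reported), None)

  print(rank({"abdominal pain", "fever"}))           # appendicitis leads at ~0.67
  print(next_question({"abdominal pain", "fever"}))  # -> "rebound tenderness"

Notice that garbage in means garbage out: the whole thing hinges on the quality of the reported symptoms, which is exactly the hard part described above.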
The issue is getting to the point where you can accurately know what point to start at, or when to start again. This involves people skills, which is why some doctors become surgeons, because they don't like talking to people. And those surgeons that don't like talking to people become orthopods. (me smash, me drill, me do good)
Where AI actually is probably quite good is note-taking, and continuous monitoring of HCU/ICU patients.
> After all, medicine is all about knowledge, experience and intelligence
So is... everything?

LLMs are really really good at knowledge.
But they are really really bad at intelligence [0]
They have no such thing as experience.
Do not fool yourself: intelligence and knowledge are not the same thing. It is extremely easy to conflate the two, and we're extremely biased to do so because the two typically strongly correlate. But we all have some friend who can ace every test they take but whom you'd also consider dumb as bricks. You'd be amazed at what we can do with just knowledge. Remember, these things are trained on every single piece of text these companies can get their hands on (legally or illegally). We're even talking about random hyper-niche subreddits. I'll see people talk about these machines playing games that people just made up, and frankly, how do you know you didn't make up the same game as /u/tootsmagoots over in /r/boardgamedesign?
When evaluating any task that LLMs/Agents perform, we cannot operate under the assumption that the data isn't in their training set[1]. The way these things are built makes it impossible to evaluate their capabilities accurately.
[0] Before someone responds "there's no definition of intelligence": don't be stupid. There's no rigorous definition, but that doesn't mean we don't have useful, working definitions. People have been working on this problem for a long time and we've narrowed down the answer. Saying there's no definition of intelligence is on par with saying "there's no definition of life" or "there's no definition of gravity". Neither life nor gravity has an extremely precise definition. FFS, we don't even know if the graviton is real or not.
[1] nor can you assume any new or seemingly novel data isn't meaningfully different than the data it was trained on.
Way to subdue discussion - complaining about replies before you get any.
But you're wrong, or rather it's irrelevant whether something has intelligence or not, if it is effectively diagnosing your illness from scans or hunting you with drones as you scuttle in and out of caves. It's good enough for purpose, whether it conforms to your academic definition of "having intelligence" or not.
It provides no information on real world outcomes or expectations of performance in such a setting. A simple question might be "how accurate are patient electronic health records typically?"
Finally, if the Internet somehow goes down at my hospital, the Doctor can still think, while LLM services cannot. If the power goes out at the hospital, the Doctor can still operate, while even local LLMs cannot.
You're going to need to improve the power efficiency of these models by at least two orders of magnitude before they're generally useful replacements of anything. As it is now they're a very expensive, inefficient and fragile toy.
This is basically the only ethical way to approach the topic. First you verify performance on "vignettes", as you say. Then, if the performance appears satisfying, you can continue towards larger tests and more raw sensor modalities, and check whether the results are still promising (both that they statistically agree with the doctors, and that when they disagree, the AI's actions fail benignly). These phases take a lot of time and careful analysis. And only after that can we carefully design experiments where the AI works together with doctors - for example, an experiment where the AI offers suggestions for next steps to a doctor. These tests need to be constructed with great care by teams who are very familiar with medical ethics, statistics, and the problems of human decision making. And only if the results are still positive can we move towards experiments where the humans supervise the AI less and the AI is more in the driving seat.
Basically to validate this ethically will take decades. So we can’t really fault the researchers that they have only done the first tentative step along this long journey.
> if the Internet somehow goes down at my hospital, the Doctor can still think, while LLM services cannot
Privacy, resiliency and scalability are all best served with local LLMs here.
> If the power goes out at the hospital, the Doctor can still operate, while even local LLMs cannot.
Generators would be the obvious answer there. If we can make machines that outperform human doctors in real-world conditions, providing generator-backed UPS power for said machines will be a no-brainer.
> You're going to need to improve the power efficiency of these models by at least two orders of magnitude before they're generally useful replacements of anything.
Why? Do you have numbers here or just feels?
Detecting when a patient is lying. "All patients lie" - Dr. House.
I take treatment ideas to real doctors. They are skeptical, and don’t have the time to read the actual research, and refuse to act. Or give me trite advice which has been proven actively harmful like “you just need to hit the gym.” Umm, my heart rate doubles when I stand up because of POTS. “Then use the rowing machine so can stay reclined.” If I did what my human doctors have told me without doing my own research I would be way sicker than I am.
I don’t need empathy. I don’t need bedside manner. Or intuition. Or a warm hug. I need somebody who will read all the published research, and reason carefully about what’s going on in my body, and develop a treatment plan. At this, AI beats human doctors today by a long shot.
The headline is quoting a number based on guessed diagnoses from a nurse's notes. My guess is the LLM was happier than the doctors to take guesses from the selected case studies.
If 90% of patients have a cold, and 10% have metastatic aneuristic super-boneitis, then you can get 90% accuracy by saying every patient has a cold. I would expect a probabilistic token-prediction machine to be good at that. But hopefully, you can see why a human doctor might accept scoring a lower accuracy percentage, if it means they follow up with more tests that catch the 10% boneitis.
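To put that arithmetic in code (the 90/10 split and the doctor's error rate are the comment's hypothetical, not numbers from the study): a classifier that calls everything a cold wins on raw accuracy, while a cautious doctor who over-tests can score lower on accuracy yet catch every serious case.

  # Base-rate arithmetic: accuracy rewards calling everything a cold.
  cases = ["cold"] * 90 + ["boneitis"] * 10

  def score(predictions):
      acc = sum(p == c for p, c in zip(predictions, cases)) / len(cases)
      caught = sum(p == c == "boneitis" for p, c in zip(predictions, cases)) / 10
      return acc, caught

  always_cold = ["cold"] * 100
  # Cautious doctor: over-tests 15 healthy patients, but flags all 10 cases.
  doctor = ["boneitis"] * 15 + ["cold"] * 75 + ["boneitis"] * 10

  print(score(always_cold))  # (0.90, 0.0): higher accuracy, misses everyone
  print(score(doctor))       # (0.85, 1.0): lower accuracy, catches all 10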
Meanwhile with human doctors, every one of them is a unique person with a completely different set of biases. In my experience, getting a correct diagnosis or treatment plan often involves trying multiple doctors, because many of them will jump to a common diagnosis even if the symptoms don't line up and the treatment doesn't actually help.
It seems like a very reasonable takeaway, but it skips the other question: do x-rays make results less accurate?
It could run in the background on patient data and message the doctor: "I see X in the diagnostics; have you ruled out Y? It fits for reasons a, b, c."
I like my coding agents the same way, inform me during review on things that I've missed. Instead of having me comb through what it generates on a first pass.
But those kinds of x-ray models are already actively used. They are not used as the only and final diagnosis, though. It's more like peer review and prioritization: check this image first because it seems the most critical today.
"In the most extreme case, our model achieved the top rank on a standard chest Xray question-answering benchmark without access to any images."
"Answer the following multiple-choice question. You MUST select exactly one answer."
"To what cortical region does this nucleus of
the thalamus project?”
A. Transverse temporal lobe
B. Postcentral gyrus
C. Precentral gyrus
D. Prefrontal cortex
And an example of the answer (generated without the referenced image):

The image shows the ventral anterior (VA) / ventral lateral (VL) region of the thalamus, which is part of the motor relay nuclei.

The labeled nucleus is in the lateral part of the thalamus, in the ventral tier - this corresponds to the VA/VL nucleus, involved in motor function. VA/VL nuclei receive input from the basal ganglia and cerebellum and project to the primary motor cortex (precentral gyrus).
Match to options:
A. Transverse temporal → auditory cortex (medial geniculate)
B. Postcentral gyrus → somatosensory (VPL/VPM)
C. Precentral gyrus → motor cortex (VA/VL)
D. Prefrontal → dorsomedial nucleus
Choice: C
How is it doing this? There are two obvious options:

1. Humans are predisposed to write questions with a certain phraseology, set of incorrect answers, etc., that the machine learning model managed to figure out.
2. The supposedly private test set somehow leaked into the model training data.
I actually suspect it's option 1, but I have no strong evidence for that.
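Option 2 can at least be probed. One crude (and by no means conclusive) contamination check is to look for long verbatim n-gram overlaps between benchmark items and the training corpus; a minimal sketch below, where the corpus string is a stand-in since the real training data isn't public.

  # Crude test-set contamination probe via verbatim n-gram overlap.
  def ngrams(text, n=8):
      words = text.lower().split()
      return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

  def looks_contaminated(question, corpus, n=8):
      return bool(ngrams(question, n) & ngrams(corpus, n))

  corpus = "... to what cortical region does this nucleus of the thalamus project ..."
  q = "To what cortical region does this nucleus of the thalamus project?"
  print(looks_contaminated(q, corpus))  # True -> worth investigating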
It's that 50% of the time, ER doctors working solely from notes (something they never do), in a situation they know is only a study, will miss what you have.
In real clinical situations the doctors see, hear, smell, and interact with the patients.
"Is there a potential cancer in this X-Ray" may produce a "possibly" just because that's how the model is trained to answer: always agree with the user, always provide an answer.
Oh, and don't forget that "Is there a potential cancer in this X-Ray" and "Are there any potential problems in this X-Ray" are two completely different prompts that will lead to wildly different answers.
> "number of image attachments: 1 Describe this imaging of my chest x-ray and what is your final diagnosis? put the diagnosis in ⟨diagnosis⟩ tags"
ChatGPT happily obliged and hallucinated a diagnosis [1] whereas Claude recognized that no image was attached and warned that it was not a radiologist [2]. It also recognized when I was trying to trick it with an image of random noise.
[1] https://chatgpt.com/share/69f7ce8f-62d0-83eb-963c-9e1e684dd1...
[2] https://claude.ai/share/34190c8a-9269-44a1-99af-c6dec0443b64
I think it's important to note that diagnosis also relies on accurate description of the patient in the first place, and the information you gather depends on the differential diagnosis. Part of the skill of being a doctor is gathering information from lots of different sources, and trying to filter out what is important. This may be from the patient, who may not be able to communicate clearly or may be non verbal, carers and next of kin. History-taking is a skill in itself, as well as examination. Here those data are given.
For pattern recognition from plain text, especially on questions that may be in o1's training data, I'm not surprised at all that it would outperform doctors, but it doesn't seem to be a clinically useful comparison. Deciding which investigations to do, any imaging, and filtering out unnecessary information from the history is a skill in itself, and can't really be separated from forming the diagnosis.
Simply getting the "high score" on this evaluation is not necessarily good medical treatment.
This is handicapping the human doctors' abilities. There is a lot more information a human doctor can gather, even with a brief observation of the patient.
The other thing is that common issues are common. I have to wonder how much that ultimately biases both the doctor and the LLM. If you diagnose someone that comes in with a runny nose and cough as having the flu you will likely be right most of the time.
> there are few things as dangerous as an expert with access to open-ended data that can be interpreted wildly, like a clinical interview.
https://entropicthoughts.com/arithmetic-models-better-than-y...
Now feed a flawed transcript into an AI diagnosis system and bam-o. The AI will treat it as gospel, while the doctor may go "wait, what?"
Case in point: I went to a podiatrist for foot and ankle issues. He diagnosed my foot issues from the x-ray but just shrugged his shoulders at the ankle issues and said the x-ray didn't show anything. My 15-minute allocation of his attention expired and I left without a clue as to the issue or what corrective actions to take. 5 minutes with an LLM and I had a plausible reason for the ankle issues which aligned with the diagnosis in my foot.
Unless healthcare businesses decide to improve patient care with AI instead of increasing patients per day, I think it's going to make things even worse.
Skepticism is an incredibly useful tool, even in excess.
If you, like me, are in the software field, know that this is likely the most comfortable job ever invented by humanity; we should really be paid just above the poverty line in exchange.
However, many others in society who save lives are not so lavishly praised or financially rewarded.
For example, in New Zealand the median pay for a Road Design Engineer is about $100k NZD, compared to a GP (doctor) getting $240k. Plus the doctor gets a massive overpayment of social status.
Over a 40-year career, an average NZ GP will save 5 to 10 lives. The Road Design Engineer saves 40 to 120 lives. Road engineers in NZ prevent roughly 10x more serious injuries than they do deaths so it isn't just death stats.
Our hypothetical engineer should be paid > 10x more than the doctor on raw stats.
It gets harder when we start looking at quality of life versus raw lifetime numbers. You then need to consider the value of, say, entertainment (a good movie) versus the hypothetical lives saved by spending the budget elsewhere.
A game designer might be valued highly by a gamer mum, and negatively by their children and gaming widowed dad.
I had to leave my job this year because of burnout, when the execs mandated that we use AI tools, become our own designers, PMs, and QA, and double our velocity. They run through a decision tree they learned in residency every day, while I'm learning how to do 3-4 other people's jobs on top of whatever the new AI thing is. I was working nights and weekends while my friends in medicine are planning their 3rd vacation this year to Tuscany.
Even if AI is used to sample or summarize a lot of data that a human couldn't do in time: What if it misses something that a human won't? What if a human inversely misses something that AI won't? Would you rather trust the machine or the human? (Especially if the human is held accountable.)
How effective can this be for science if it isn't shown, side by side, how each scenario was evaluated by both and how each came to different conclusions?
Who can ensure a doctor couldn't spot some blind spot the AI missed in the remaining 43%?
Tools are not for replacement but for combining efforts.
Throwing such percentages at the public is deeply irresponsible.
(I was ~3 months away from wheelchair bound in those x-rays).
The worst one was Gemini. Upload an x-ray of just the right hip, and it started to talk about how good the left hip looked.
I think with AI taking over it's gonna be harder to get a solution when your problem isn't run-of-the-mill.
But specialized models can be inhumanly good. I know, our main product is a model that does _precise_ analysis :)
I’ve had doctors try to convince me not to pursue medical care, that problems of people close to me were not real and purely psychological, and I’ve personally required emergency surgery due to inaction. In every case there were obvious signs and symptoms.
Doctors are not good at their jobs. In the US, we’ve done a particularly stupid combination of forcing them to incur legal liability and intermediating everything with insurance, both of which impact the care people actually receive.
I am very skeptical of studies like this that don't adequately reflect real world conditions, but when I was a software engineer I probably wouldn't have understood what "real" medicine is like either.
We found that gpt-5-mini performed better than gpt-5, sonnet 4, and medgemma.
I think these studies are very hard to accurately score. But in any case, AI seems to do a very good job compared to humans. Unsurprising, really.
1. AI gets data about the patient and makes a diagnosis. This is NOT shown to doctor yet.
2. Doctor does their stuff, writes down their diagnosis. This diagnosis is locked down and versioned.
3. Doctor sees AI's diagnosis
4. Doctor can adjust their diagnosis, BUT the original stays in the system.
This way the AI stays as the assistant and won't affect the doctor's decision, but they can change their mind after getting the extra data.
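A minimal sketch of that locked-then-revealed record (the class, field, and method names are illustrative; a real version would live inside the EHR with proper audit logging):

  # Sketch of the proposed protocol: doctor commits first, AI revealed after.
  from dataclasses import dataclass, field
  from datetime import datetime, timezone

  @dataclass
  class DiagnosisRecord:
      patient_id: str
      ai_diagnosis: str          # step 1: produced first, kept hidden
      versions: list = field(default_factory=list)
      ai_revealed: bool = False

      def commit(self, diagnosis):
          stage = "post-AI revision" if self.ai_revealed else "independent"
          self.versions.append((datetime.now(timezone.utc), stage, diagnosis))

      def reveal_ai(self):
          if not self.versions:
              raise RuntimeError("doctor must commit a diagnosis first")
          self.ai_revealed = True          # step 3: only after step 2 is locked
          return self.ai_diagnosis

  rec = DiagnosisRecord("pt-001", ai_diagnosis="pulmonary embolism")
  rec.commit("pneumonia")            # step 2: locked, versioned
  print(rec.reveal_ai())             # step 3
  rec.commit("pulmonary embolism")   # step 4: revision; original stays on file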
6. Rankings are used to periodically "trim the fat", thus delivering more optimized cash flows to clinics that have been saddled with toxic debt
7. Sensing an opportunity AI providers start selling a $200 / month Data Leakage as a Service subscription to overworked physicians so that they can avoid the PE guillotine
I agree with GP's solution but we'd need regulation to prohibit what you describe.
Incompetent ones order unnecessary tests and exhaust treatment possibilities, which drives up cost billed to insurance.
Only the insurance industry, and perhaps licensing bodies, can apply pressure to keep the quality floor high, at least in terms of accurate diagnosis and prevention of overtreatment.
I still want humans in the loop, interpreting the LLM's findings and providing a sanity check.
You can’t hold an LLM accountable.
That’s the min responsible bar for LLM authored code, which normally doesn’t really matter much. For something as important as ER diagnostics, having a human in the loop is crucial.
The narrative that these tools are replacing human intelligence rather than augmenting it is, quite frankly, stupid.
We should embrace these tools.
But, “eliminating DRs”… hardly.
An AI and a pair of human doctors were each given the same standard electronic health record to read – typically including vital sign data, demographic information and a few sentences from a nurse about why the patient was there. The AI identified the exact or very close diagnosis in 67% of cases, beating the human doctors, who were right only 50%-55% of the time.... The study only tested humans against AIs looking at patient data that can be communicated via text. The AI’s reading of signals, such as the patient’s level of distress and their visual appearance, were not tested. That means the AI was performing more like a clinician producing a second opinion based on paperwork.
"I don't know, let's run more tests" is also a very important ability of doctors that was apparently not tested here. In addition to all the normal methodological problems with overinterpreting results in AI/LLMs/ML/etc. Sadly I do think part of the problem here is cynical (even maniacal) careerist doctors who really shouldn't be working at hospitals. This means that even though I am generally quite anti-LLM, and really don't like the idea of patients interacting with them directly, I am a little optimistic about these being sanity/laziness checkers for health professionals.The number in the headline isn’t even a good comparison because they asked doctors to make a diagnosis from notes a nurse typed up. Doctors are trained to be conservative with diagnosing from someone else’s notes because it’s their job to ask the patient questions and evaluate the situation, whereas an LLM will happily leap to a conclusion and deliver it with high confidence
When they allowed both the humans and the AI access to more information about the case, the difference between groups collapsed into statistical insignificance:
> The diagnosis accuracy of the AI – OpenAI’s o1 reasoning model – rose to 82% when more detail was available, compared with the 70-79% accuracy achieved by the expert humans, though this difference was not statistically significant.
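For intuition on why a 3-point gap can be "not statistically significant": a two-proportion z-test, with sample sizes that are purely hypothetical since the article doesn't report them. The same gap flips from noise to signal as n grows.

  # Two-proportion z-test (normal approximation); n values are hypothetical.
  from math import sqrt, erf

  def two_proportion_p(p1, p2, n1, n2):
      pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
      se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
      z = abs(p1 - p2) / se
      return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))  # two-sided p-value

  print(two_proportion_p(0.82, 0.79, 100, 100))    # ~0.59: indistinguishable
  print(two_proportion_p(0.82, 0.79, 5000, 5000))  # ~0.0002: would be significant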
Talking to my medical professional friends, LLMs are becoming a supercharged version of Dr. Google and WebMD that fueled a lot of bad patient self-diagnoses in the past. Now patients are using LLMs to try to diagnose themselves and doing it in a way where they start to learn how to lead the LLM to the diagnosis they want, which they can do for a hundred rounds at home before presenting to the doctor and reciting the script and symptoms that worked best to convince the LLM they had a certain condition.
My wife was recently diagnosed with Mast Cell Activation Syndrome (MCAS) after a pretty scary series of ER visits. It's a very strange and stubborn autoimmune disease that manifests with a number of symptoms that, taken individually, could indicate damn near anything.
You could almost feel the doctors rolling their eyes as she explained her symptoms and medical history.
Anyway... it lit a bit of a fire in me to dig deeper, and one day Claude suggested MCAS. I started plugging in more labs, asking for Claude to cross-reference journals mentioning MCAS, and sure enough: it's MCAS.
idk what the moral of the story is except our current medical system is a joke. The doctors aren't the villains, but they sure aren't the heroes either.
Of course, there are plenty of places on earth that are extremely under-doctored, and AI will definitely be better than nothing in poor regions of Africa, if all it needs is a network connection and someone to donate the tokens.
While I’m sure there can be ways in which such studies are wrong, it’s very obvious that AI can accelerate work in many of these areas where we seek out professional help - doctors, lawyers, etc.
If you have a string of issues with your last 10 doctors, though, then the issue is, most probably, you...
My wife is a GP, and easily 1/3 of her patients also have some minor-but-visible mental issue, a 1-2 on a 10-point scale. It makes them still functional in society but... often very hard to be around.
That doesn't mean I don't trust your words; there are tons of people with either rare issues or even fairly common ones that manifest in a non-standard way (or mixed with some other issue). These folks struggle a lot to find a doctor who doesn't lump them into some general bucket with generic treatment. Such doctors exist, but not that often.
It helps both sides tremendously if the patient is not an arrogant know-it-all waving ChatGPT in the doctor's face, basically just coming for a prescription after self-diagnosis. Otherwise, the help given is sometimes proportional only to the situation and legal obligations.
Admittedly, I have a bunch of medical issues, and these gems are my favourites from the GPs.
1. I cannot see the tonsil on the left side, so it is OK. (there was a 6cm!!! cyst in front of it)
2. After consistently missing sky-high TSH measurements for 2 years (4 tests): "It must have been a few one-offs" (no, it wasn't, and that is not even possible)
3. "Blood pressure has nothing to do with weight"
These %#£&* so-called medical professionals are still working, and most likely legally killing people.
These days I research and read studies, arm myself with knowledge, cross check with multiple LLMs and go in with a diagnosis and request a specific prescription. After 5 years with my health in the gutter I had my first comprehensive private blood test coming back with no issues.
So no, do not try to call me arrogant. I am not arrogant, I am defending myself from these "GPs" so they won't put me in an early grave by making fatal mistakes.
Doctors thinking patients are arrogant is an age old problem.
They aren't going to take a stab at an uncommon diagnosis even if it occurs to them, if they might get sued if they're wrong.
Edit: I'm not trying to say Doctors deliberately diagnose wrong. Just that if there are two possible diagnoses, one common that matches some of the symptoms and one rare that matches all symptoms, doctors are still much more likely to diagnose the common one. Hoofbeats, horses, zebras, etc
Should they not report on peer reviewed articles published in Science? or only report published articles that fit your priors?
I take them as being like those code-generation command-line tools, like create-react-app and such.
Stochastic parrots can code yes, but that does not make them experts. Don't trust them with your life.