Last update 28 NOV 2023
There seem to be thousands of peer-reviewed medical papers since the 2000s about using AI software to screen X-rays and CT/MRI scans for tuberculosis, pneumonia, COVID, and other conditions, and they almost always glowingly report that the AI beats recently graduated radiologists and matches the performance of experienced ones.
But now there are a growing number of papers that report that those glowing results fail to reproduce when brought to the real world. Turns out it's because often the AIs were cheating by latching on to irrelevant aspects of the training images that happened to correlate with the disease only in the training dataset, but not in the real world, as summarized in this article.
This means the AIs are completely useless for real-world diagnosis. Yet somehow they not only made it past peer review, but were showered with funding and praise as the next medical revolution.
In this document, we will examine some of these failures in detail to find out what happened, then see what bigger lessons we should learn from this experience.
Some of the examples of these "shortcuts" (aka "spurious correlations") are pretty shocking. Each paper below presents a different category of problem with AI image screening (there's not just one AI problem)...
The first paper found that a melanoma-detection AI was keying on the surgical ink markings that dermatologists draw near suspicious lesions. When the researchers modified healthy images to include the same ink markers, the AI started diagnosing those images as melanoma too!
Winkler JK, Fink C, Toberer F, et al. Association Between Surgical Skin Markings in Dermoscopic Images and Diagnostic Performance of a Deep Learning Convolutional Neural Network for Melanoma Recognition. JAMA Dermatol. 2019;155(10):1135–1141. doi:10.1001/jamadermatol.2019.1735
This heat map from the paper shows the ink marks around the lesions (top row), and then shows how much attention (bottom row, red means more) the AI was giving to the ink marks relative to the skin lesion itself when making a diagnosis:
The next paper found a different melanoma shortcut, this time involving the colour calibration charts that appear in some dermoscopy photos:

...colour calibration charts (elliptical, coloured patches) [that] occur only in benign images and not in malignant ones. Our methodology artificially inserts those patches and uses inpainting to automatically remove patches from images to assess the changes in predictions. We find that our standard classifier partly bases its predictions of benign images on the presence of such a coloured patch. More importantly, by artificially inserting coloured patches into malignant images, we show that shortcut learning results in a significant increase in misdiagnoses, making the classifier unreliable when used in clinical practice.
Nauta M, Walsh R, Dubowski A, Seifert C. Uncovering and Correcting Shortcut Learning in Machine Learning Models for Skin Cancer Diagnosis. Diagnostics. 2022; 12(1):40. https://doi.org/10.3390/diagnostics12010040
The next papers looked at AIs for detecting pneumonia in chest X-rays, whose glowing results failed to generalize to scans from new hospitals. Turns out it was because each hospital, for economic or other reasons, had different percentages of diseased patients, and the AI was not actually recognizing the disease, but rather which hospital and department the scan came from, and was thus able to guess the probability of disease much better!
Patients who receive X-rays from a portable X-ray machine (the kind used in a hospital ICU) are more likely to have pneumonia than patients imaged on the fixed machines found in non-emergency clinics, and the AI learned to recognize, essentially 100% of the time, whether an X-ray came from the portable machine, not whether the patient had pneumonia.
Note that the medical staff did not tell the AI which hospital/department/machine each scan came from—the data was not labeled that way. Instead the AI was sneakily finding correlations in the image pixels with particular hospitals or scanning machines instead of (or more strongly than) correlations with actual disease. This all happened without the medical staff being aware of it.
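To see how little it takes for this to happen, here is a toy NumPy sketch (my own illustration, not from any of the papers). The "scans" contain no disease signal at all, only a faint hospital-specific brightness offset that happens to correlate with the labels, and a plain logistic-regression "diagnoser" still scores almost perfectly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, side = 400, 16

# Toy dataset: scans with label 1 come from "Hospital A" (high disease
# prevalence) and carry a faint global brightness offset; label-0 scans
# come from "Hospital B". There is NO actual disease signal in the pixels.
labels = rng.integers(0, 2, n)
images = rng.normal(0.0, 1.0, (n, side * side))
images += labels[:, None] * 0.3          # hospital-specific offset only

# Train a plain logistic-regression "diagnoser" by gradient descent.
w, b = np.zeros(side * side), 0.0
for _ in range(500):
    z = np.clip(images @ w + b, -30, 30)  # avoid exp overflow
    p = 1 / (1 + np.exp(-z))
    w -= 0.1 * (images.T @ (p - labels)) / n
    b -= 0.1 * np.mean(p - labels)

preds = (images @ w + b) > 0
print("training accuracy:", np.mean(preds == labels))  # near-perfect
```

The model's near-perfect score comes entirely from the hospital fingerprint, which is exactly why it would collapse on scans from a hospital with different prevalence.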
Zech JR, Badgeley MA, Liu M, Costa AB, Titano JJ, et al. (2018) Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study. PLOS Medicine 15(11): e1002683. https://doi.org/10.1371/journal.pmed.1002683

Here again we can see a heat map showing which parts of the image the AI was using to diagnose pneumonia... and it appears to have found the lung disease pneumonia in the shoulder bone protrusion known as the acromion process, hmm. In reality, the AI was obsessing over the mark that the portable X-ray machine leaves over in the shoulder:
This NPR interview also talks about the AI identifying the portable X-ray machines using text around the edge of the image:
When they realized that the AI was using the specific scanner machine as a big factor in the diagnosis, they decided to check how well the AI could simply identify the hospital of each medical image. They found that the AI was able to identify which hospital the image came from with 100% accuracy, and even if they cropped the image down to a tiny 8x8-pixel speck of the original image, the AI could still identify which hospital the image came from with 99.6% accuracy!
These results further show that hospital-specific signal is deeply encoded in chest x-rays (enough that models can pick up on it even from a very small patch of an image), explaining why CNNs trained for disease prediction are so prone to learning hospital-label shortcuts.
Compton R, Zhang L, Puli AM, Ranganath R (2023). When More is Less: Incorporating Additional Datasets Can Hurt Performance By Introducing Spurious Correlations.

This result is really important, because it means that the AI can and does latch onto non-medically-relevant features of the image that humans (even medical experts) can neither see nor remove.
So solving this AI problem is not as simple as just "cropping out the ink markers and text on the edge of the image."
This means it is critical for Medical AI systems to be able to clearly show the medical staff which source image features it is using to come up with each diagnosis (so-called Explainable AI (XAI)), so that a human can verify that the features are medically meaningful at every stage of the medical tool development process.
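One of the simplest XAI techniques behind heat maps like the ones above is occlusion sensitivity: hide one patch of the image at a time and watch how much the model's score drops. Here is a minimal sketch (my own; the toy "model" is a hypothetical stand-in that deliberately keys on a corner marker, the way the shortcut learners above keyed on laterality markers):

```python
import numpy as np

def occlusion_map(predict, image, patch=4):
    """Slide a neutral patch over the image and record how much the
    model's score drops at each location. Big drops mark the regions the
    model actually relies on; if those regions are ink marks or border
    text rather than the organ, the model is shortcut-learning."""
    h, w = image.shape
    base = predict(image)
    heat = np.zeros_like(image)
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            masked = image.copy()
            masked[y:y+patch, x:x+patch] = image.mean()
            heat[y:y+patch, x:x+patch] = base - predict(masked)
    return heat

# Hypothetical shortcut model: it only looks at the top-left "marker".
predict = lambda img: img[0:4, 0:4].mean()
img = np.random.default_rng(1).normal(0, 1, (16, 16))
img[0:4, 0:4] = 5.0                      # bright corner marker
heat = occlusion_map(predict, img)
print(heat.argmax() // 16, heat.argmax() % 16)  # → 0 0 (the marker corner)
```

The resulting heat map lights up only on the marker, immediately exposing that the "diagnosis" has nothing to do with anatomy.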
XAI is something that the current big AI companies could build, but they often deny it is possible, because admitting XAI is possible would blow the lid off the millions and billions in profits they make from their ongoing copyright-infringement free-for-all over human-created text and images in systems like ChatGPT and DALL-E.
While our saliency maps sometimes highlight the lung fields as important (Fig. 2a), which suggests that our model may take into account genuine COVID-19 pathology, the saliency maps concerningly also highlight regions outside the lung fields that may represent confounds. The saliency maps frequently highlight laterality markers (Fig. 2a and Supplementary Fig. 5), which differ in style between the COVID-19-negative and COVID-19-positive datasets, and similarly highlight arrows and other annotations that are uniquely found in the GitHub-COVID data source (Supplementary Fig. 6), which aligns with a previous study finding that ML models can learn to detect pneumonia based on spurious differences in text on radiographs.
DeGrave AJ, Janizek JD, Lee S-I. AI for radiographic COVID-19 detection selects shortcuts over signal. medRxiv 2020.09.13.20193565; doi: https://doi.org/10.1101/2020.09.13.20193565

Here again we see a heat map showing the regions of the image that most affected the AI's COVID diagnosis (right column; more red means more significant). Note that the most red areas are generally not in the lung at all, including our friend the "laterality marker" in the top image:
In another example we see the AI focusing on bits of text at the edge of the image:
They then used GANs to have the AI generate variations on real medical images, in effect asking "show me how this image would look if the patient actually had/didn't have COVID," and:
the generative networks frequently add or remove laterality markers and annotations (Fig. 2b, solid red boxes), reinforcing our observation from saliency maps that these spurious confounds also enable ML models to differentiate the COVID-19 positive and COVID-19 negative radiographs. The generative networks additionally alter the radiopacity of image borders (Fig. 2b, dashed red boxes), supporting our previous assertion that systematic, dataset-level differences in patient positioning and radiographic projection provide an undesirable shortcut for ML models to detect COVID-19.
Makino, T., Jastrzębski, S., Oleszkiewicz, W. et al. Differences between human and machine perception in medical diagnosis. Sci Rep 12, 6877 (2022). https://doi.org/10.1038/s41598-022-10526-z
It turns out that even with "proper" training, the AI still tends to underdiagnose "under-served populations such as female patients, Black patients, or patients of low socioeconomic status." Once again the AI is just reflecting biases in the training dataset: it can tell (without being told) which training image comes from which hospital, sex, or race, and it reproduces the biases that the human doctors who produced the training labels hold against those groups:
Seyyed-Kalantari, L., Zhang, H., McDermott, M.B.A. et al. Underdiagnosis bias of artificial intelligence algorithms applied to chest radiographs in under-served patient populations. Nat Med 27, 2176–2182 (2021). https://doi.org/10.1038/s41591-021-01595-0
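One concrete way to surface this kind of bias is to audit the model's error rates per patient subgroup. Below is a minimal sketch (my own illustration; the function and data are hypothetical, not from the paper) that computes the false-negative rate, i.e. missed diagnoses, for each group:

```python
import numpy as np

def fnr_by_group(y_true, y_pred, group):
    """False-negative rate per subgroup: of the patients who truly have
    the disease, what fraction does the model miss? Large gaps between
    groups are the underdiagnosis bias the paper above measures."""
    out = {}
    for g in np.unique(group):
        sick = (group == g) & (y_true == 1)
        out[str(g)] = float(np.mean(y_pred[sick] == 0)) if sick.any() else float("nan")
    return out

# Hypothetical audit data: the model misses far more true positives
# in group "B" than in group "A".
y_true = np.array([1, 1, 1, 1, 1, 1, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0])
group  = np.array(["A", "A", "A", "B", "B", "B", "A", "B"])
print(fnr_by_group(y_true, y_pred, group))  # → {'A': 0.0, 'B': 1.0}
```

An overall accuracy number would hide this completely; the per-group breakdown is what exposes it.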
Oakden-Rayner L, Dunnmon J, Carneiro G, Ré C. Hidden Stratification Causes Clinically Meaningful Failures in Machine Learning for Medical Imaging.
Roberts, M., Driggs, D., Thorpe, M. et al. Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans. Nat Mach Intell 3, 199–217 (2021). https://doi.org/10.1038/s42256-021-00307-0
Why does this keep happening? Because the AI is doing exactly what it was designed to do: find correlations between the image pixels and the diagnostic outcomes in the training set.
But, despite the name, AI is not intelligent: it finds those correlations without regard to whether or not they are medically relevant. AI "knows" nothing about medicine, or disease, or the importance of human life. It just finds the quickest shortcut that matches the training inputs to the diagnoses, without regard to what that shortcut "means."
Instead, the real problem is us. There are a variety of root issues:
#1. Irrational exuberance: The biggest problem is that various involved parties, from radiologists to hospital managers to programmers to tech bros, really really want to believe that AI is the magic bullet (in terms of specificity/sensitivity, cost savings, glorifying the tech for its own sake, and pushing tech IPOs, respectively) and so fail to critically analyze the tech.
#2. Humans cannot remove all confounds: The second problem is that the shortcuts persist even when you make the "obvious" training corrections, such as manually masking out all the pen marks, annotations, text, and other parts of the scanned image other than the body organ being tested. (From reading other papers, a lot of hospitals are masking right now; there are even papers about using AI again to properly mask out the non-lung pixels in lung images!) The problem persists because it isn't just the marker rings or the text at the edges of the film that cause it. The hapless AI is insatiably hungry for correlations, relevant or not, and it will just as easily find them in human-invisible, scanner-specific, hospital-specific, nurse-specific noise present in every pixel of the image, as we discovered above. So "just cleaning up the training data" will never be enough.
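Here is a toy NumPy sketch (my own illustration, with made-up numbers) of why masking is not enough: even after everything outside the "lung" region is zeroed out, a faint scanner-specific noise pattern inside the organ region is still enough to identify the scanner almost every time:

```python
import numpy as np

rng = np.random.default_rng(0)
side = 32
lung_mask = np.zeros((side, side), dtype=bool)
lung_mask[8:24, 8:24] = True                # pretend this square is the lung

# Two fictional scanners each leave a faint fixed sensor-noise pattern on
# every pixel they capture -- including the pixels inside the lung mask.
pattern_a = rng.normal(0, 0.2, (side, side))
pattern_b = rng.normal(0, 0.2, (side, side))

def scan(scanner_pattern):
    """Simulated radiograph: anatomy noise plus the scanner fingerprint,
    then 'properly' masked so that only lung pixels survive."""
    img = rng.normal(0, 1, (side, side)) + scanner_pattern
    img[~lung_mask] = 0
    return img

def which_scanner(img):
    """A trivial matched filter: correlate the surviving lung pixels with
    each scanner's fingerprint and pick the better match."""
    a = np.sum(img[lung_mask] * pattern_a[lung_mask])
    b = np.sum(img[lung_mask] * pattern_b[lung_mask])
    return "A" if a > b else "B"

correct = sum(which_scanner(scan(pattern_a)) == "A" for _ in range(200))
correct += sum(which_scanner(scan(pattern_b)) == "B" for _ in range(200))
print("scanner identified correctly:", correct, "out of 400")
```

If a ten-line matched filter can do this through the mask, a deep network with millions of parameters certainly can.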
The AI just finds correlations, including ones humans cannot see, and without regard to whether the correlations are medically useful. When the AI "finds cancer," it is essentially a coincidence unless humans carefully evaluate the training sets, and as the last paper above shows, the currently available training sets (that all the programmers and tech bros use) are crap. Several of the papers above offer checklists of what researchers need to do to avoid "spurious correlations," but the checklists are incomplete and not always practical.
#3. AI Is not Explainable: So that leads to the third problem, which is that doctors and hospital managers don't demand Explainable AI (XAI) when they should be demanding it. They don't even know that's something they can ask the tech bros for.
And that connects with Google/OpenAI/etc. continuing to claim that "there's no technical way to map ChatGPT and DALL-E results back to the original copyrighted works"—bullshit! Of course there is. They just don't want to admit it can be built, because then they miss out on billions in profits off free copyright infringement.
In the medical imaging context, XAI is AI that should directly show the radiologist which parts of the image, and which features in those parts, led to the AI making a particular diagnosis, as seen in several of the papers above. XAI is fully doable from a programming standpoint. The basic design of neural networks allows for this (the fact that everything is weight-based makes it eminently practical to trace a diagnosis back to the training set, though one needs to store more state about the network to make it work).
The result of XAI will be that many doctors will see the AI emperor has no clothes and is using meaningless features to find cancer and other diseases.
Whether in the domain of ChatGPT/DALL-E or in the medical context, big tech has a vested interest in making us believe that XAI is impossible.
#4. AI perpetuates existing biases while shifting blame: As if 1-3 are not enough, the fourth problem is a non-technological one that has plagued medicine for centuries: tools which can perfectly diagnose a disease in one group (hospital, geographic, economic, ethnic, whatever) are often miserably bad at helping different groups or wider sets of patients, and the AI is completely oblivious to this. Since white male upper-class US-based tech bros can sell their products to US hospitals based on white male middle-class Christian patient results, not only do people not notice the misdiagnoses, they don't care. AI hides and perpetuates these long-standing injustices under a false veneer of tech "objectivity."
So you would think "OK, just give the AI a bigger training set and that will eliminate the biases." But when you do that, as the various papers above explain, the AI either keeps reflecting the old biases or picks up fresh spurious correlations from the added datasets (that is the "When More is Less" result above).

Either way, we need to slow the fuck down in our frenzy to hump the medical AI dream before more people get killed.
While helpful, these checklists are necessarily insufficient, because the AI finds "spurious correlations" in noise that medical professionals can neither see nor remove.
To really solve the problem, medical AI needs both Explainable AI and a massive ongoing investment of expert human labor to verify that every feature the AI relies on is medically meaningful.
As a reality-check, consider ChatGPT and DALL-E and related services. The tech giants who created these services spent billions of dollars and years of labor with huge teams of (often underpaid) humans to check and rate tens or hundreds of thousands of AI responses. And that is for services that do not serve a life-or-death role like medical AI does. It was that costly just for Big Tech to create a system that wouldn't go Nazi all the time or hand out recipes for anthrax. And the systems still frequently "hallucinate" fabricated information!
But before dumping another billion dollars into medical AI, it's critical to ask another question nobody is asking: do we actually want to use AI?
But a huge problem that happens with every new technology is that when tech types and the public get in a frenzy about the latest hot tech trend, with new click-garnering news hottakes coming out every day and venture capitalists showering any company using that tech with billions of dollars, we tend to assume that this new sliced bread can, and should, be used for all applications.
For example, the narrative we often see for any application of AI is:
After the big tech companies poured enough money in, AIs were able to beat humans at Chess, then Go, then ace the SAT/GRE/MCAT and pass the Bar. So, that means if big tech pours enough money into medical diagnosis, then AI will be a good tool for that too.

or

Since a "properly-trained" AI can beat the MCAT, that means obviously a properly-trained AI would be able to diagnose diseases in medical images, if we just spent enough development resources.

or

ChatGPT is already pretty good at medical diagnosis, so we just need to invest a bit more to clean up those "few edge cases" where it gives the wrong answer.

This is an incorrect and extremely dangerous way to think. Just because we have a hammer doesn't mean everything is a nail.
One application might be well-suited for AI and other applications might not.
When AI aces the MCAT, we literally don't care why—even if AI succeeded for some spurious reason like repeated MCAT questions/question patterns. We're just in it for the lolz, and/or a few hundred million more in funding.
When AI makes a medical diagnosis, lives literally depend on the correlations found by a medical AI during training being medically relevant, and AI makes it much harder than pre-AI methods to tell what features of the data are even generating a given diagnosis, and how to fix it when the AI makes dangerous errors.
The Explainable AI (XAI) movement is trying to help solve this major shortcoming of AI, and we need to help increase awareness of XAI and demand XAI in medical applications. And we mentioned above that tech bros and hospital administrators need to invest a ton more labor than they do currently so that medical expert humans can check and improve the AI model's performance and safety.
But even with Explainable AI and sufficient funding, we still might be using the wrong, less effective and less cost-effective tool for the job. We might waste massive resources creating a barely-enough medical AI when other non-AI technologies could do the job better and cheaper.
So it's not clear that even billion-dollar AI is a good and safe tool for the job. I'm not saying it is or it isn't: I'm saying that right now we are blindly assuming that AI is our new Jesus and failing to even consider if there might be a safer and better way to create non-human medical assistants to do specific tasks.
That is because we are ignoring a basic reality of different applications:
Recent AIs became famous because they provide shockingly good average-case performance for a class of non-life-or-death problems that previous technologies weren't as good at.
But for medicine, we're talking about people's health and survival.
Recent AIs have been good at gnarly problems requiring "predictive analysis" for problems that have many dimensions of inputs where it's not obvious how to combine those inputs to produce the desired kind of output.
But AIs are not the only tool for this. They are simply one recent tool with good average-case performance and absolutely abysmal, unpredictable, and uncharacterizable worst-case performance.
For life-or-death applications like medicine, it would be better to have a system with worse average-case performance that in return offers upper/lower bounds on worst-case performance. Since we actually understand what such systems are doing inside their computer code (unlike AI), we can reason about them and come up with meaningful limits that can save lives, like "this system will generate an error rate of less than 0.0001% for this range of conditions, but its accuracy becomes dangerously low when X or Y or Z happens in the input, so don't use it for that."
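As a sketch of what such an explicit guarantee can look like (my own illustration, not from the text): a Hoeffding bound turns "we observed k errors in n held-out cases" into a worst-case error-rate ceiling that holds with a stated probability, though only for inputs drawn from the same distribution as the test cases:

```python
import math

def error_rate_upper_bound(errors, n, delta=1e-4):
    """Hoeffding upper confidence bound on the true error rate, given
    `errors` mistakes observed on `n` independent held-out cases.
    Holds with probability at least 1 - delta, and ONLY for the input
    distribution the test cases were drawn from; it says nothing about
    shifted data (new scanners, new hospitals, new populations)."""
    return errors / n + math.sqrt(math.log(1 / delta) / (2 * n))

# E.g. 3 errors observed in 10,000 validated cases:
print(f"{error_rate_upper_bound(3, 10_000):.4f}")  # → 0.0218
```

Note how honest the bound forces you to be: 3 errors in 10,000 cases sounds like 0.03%, but the guarantee you can actually certify at this confidence level is about 2.2%, and tightening it requires far more validated cases.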
It's great that ChatGPT can do such amazing things, but its technology is not suitable for every application.
Some doctors point to articles claiming human medical error is the third leading cause of death (though these clickbait-friendly headlines are highly dubious when you look at the details). They then note that ChatGPT, when given a proper prompt with the relevant patient symptoms (itself a big ask for the general public), consistently gives more accurate diagnoses than the average human doctor, especially in countries where doctor training is weaker.
But even if we ignore the problems with the premise and the small sample size, these doctors are only looking at ChatGPT's average-case performance.
It's likely that ChatGPT will occasionally tell the patient to do something truly destructive and fatal. Something that the AI-hopeful doctors never noticed during their anecdotal tests. And, importantly, something that their fellow doctors, even those with worse average-case diagnostic skills than ChatGPT, would never tell a patient to do.
The harsh reality we need to face is: these errors are a fundamental and unfixable property of AI's quite reckless approach to solving problems.
The technical people who make AI have no understanding whatsoever of medicine, or cats, or chemistry, or biological weapons, or of any particular application. They just built a gigantic correlation-finding machine that is as unaware of what is important, or what is safe, or what is ethical, as a mechanical cash register.
AI is, fundamentally, a cheat. When it doesn't work, we computer programmers don't know why, so we throw in more layers/nodes/parameters, and sometimes it works better, but we still don't know why.
We can throw more billions at AI to reduce the frequency of its absurd and extreme failures, but since we do not and cannot understand how AI works inside, we will never be able to eliminate them completely or even characterize when and how often they will happen.
You never get something for nothing. Whenever people take shortcuts, we need to pay the price (in this case, unbounded worst-case performance).
If you don't believe me, you can watch countless interviews with the leaders and technical experts at OpenAI, Google, Meta, and other companies where they readily admit this reality of AI. Many of them admit it is their greatest fear as AI becomes more and more powerful.
So I think it's important to push back on AI as the magic bullet for all of today's problems, and to be clear on the dangers of AI's lack of bounds on worst-case performance (which is improved, but not completely solved, by Explainable AI).
AI is fundamentally a black box, while other methods are not.
There are many, many excellent other ways of solving gnarly problems like medical image analysis that have existed for a long time and that allow us to clearly and confidently reason about when they will work and when they will fail, and how to fix them if they do fail. They might have worse average-case performance than currently-fashionable AI, but in the end it is quite likely that they will offer a balance of performance that saves a lot more lives.
We must at least try to see which is better—not blindly follow the latest AI tech fad off a cliff.
Oh, and for those hospital managers—it will also save a lot of cost and a lot of lawsuits too.
It took me a really long time to find viable Google search keywords, because there are so many breathless pro-AI papers that it is actually hard to find keywords that match only the critical papers. These finally worked:

medical/tuberculosis/COVID/... AI image "shortcut"
medical/tuberculosis/COVID/... AI image "spurious correlation"
"tuberculosis" "spurious correlation" by AI using wrong feature
There must be 400 papers that are just "systematic review of techniques for machine learning ____ diagnosis" always finding ONLY glowing results, none of them seeming to question whether there are spurious correlations.
|Copyright||All text and images copyright 1999-2023 Chris Pirazzi unless otherwise indicated.|