Some medical takeaways from Nobel laureate Daniel Kahneman’s Noise: A Flaw in Human Judgment:
The large role of noise in error contradicts a commonly held belief that random errors do not matter, because they “cancel out.” This belief is wrong. If multiple shots are scattered around the target, it is unhelpful to say that, on average, they hit the bull’s-eye.
I like this line. Some radiologists, for example, over-call questionable findings while others are too cavalier and miss subtle features. They do not cancel out.
In Noise, Kahneman breaks noise down into three big categories: Level Noise, Pattern Noise, and Occasion Noise (each with its own causes and its own mitigation strategies). A toy sketch of the decomposition follows the list.
- Level noise: The deviation of a single judge from the average judge. For example, some teachers are tough graders.
- Pattern noise: A judge’s deviation tied to a specific case or situation. For example, a teacher who is generally an easy grader but really, really likes Oxford commas and grades more harshly than average when students fail to use them.
- Occasion noise: Variability related to random irrelevant/undesirable factors (weather, time of day, mood, recent performance of a local sports franchise). For example, a teacher grades more harshly when finishing up their work from home.
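To make the three-way split concrete, here is a quick toy simulation. This is my own sketch rather than anything from the book: twenty hypothetical teachers grade fifty essays on five occasions, and the error in each grade is decomposed into level, pattern, and occasion components (the counts and spreads are made up purely for illustration).

```python
import numpy as np

# Toy illustration (not from the book): 20 teachers grade 50 essays on 5 occasions.
rng = np.random.default_rng(0)
n_teachers, n_students, n_occasions = 20, 50, 5

true_score = rng.normal(75, 10, size=n_students)                  # what an ideal grader would give
level = rng.normal(0, 4, size=n_teachers)                         # tough vs. easy graders
pattern = rng.normal(0, 3, size=(n_teachers, n_students))         # teacher-essay quirks (e.g., Oxford commas)
occasion = rng.normal(0, 2, size=(n_teachers, n_students, n_occasions))  # mood, time of day

grades = (true_score[None, :, None] + level[:, None, None]
          + pattern[:, :, None] + occasion)
error = grades - true_score[None, :, None]      # deviation from the "true" grade

level_noise = error.mean(axis=(1, 2)).var()                                    # spread of teacher averages
pattern_noise = (error.mean(axis=2) - error.mean(axis=(1, 2))[:, None]).var()  # teacher-essay interaction
occasion_noise = (error - error.mean(axis=2, keepdims=True)).var()             # day-to-day wobble

print(f"level noise:    {level_noise:5.1f}")
print(f"pattern noise:  {pattern_noise:5.1f}")
print(f"occasion noise: {occasion_noise:5.1f}")
print(f"total error variance: {error.var():5.1f}  (sum of the three in this balanced toy)")
```

In a balanced setup like this, the three components add up exactly to the total error variance, which is the book’s larger point: noise has structure you can measure, not just a vague sense that “people disagree.”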
Some doctors prescribe more antibiotics than others do. Level noise is the variability of the average judgments made by different individuals. The ambiguity of judgment scales is one of the sources of level noise. Words such as likely or numbers (e.g., “4 on a scale of 0 to 6”) mean different things to different people.
A massive problem, to be sure, and the reason why radiology trainees hate reading degenerative spine cases (no matter how you grade neural foraminal stenosis, it feels like you’re always “wrong”).
When there is noise, one physician may be clearly right and the other may be clearly wrong (and may suffer from some kind of bias). As might be expected, skill matters a lot. A study of pneumonia diagnoses by radiologists, for instance, found significant noise. Much of it came from differences in skill. More specifically, “variation in skill can explain 44% of the variation in diagnostic decisions,” suggesting that “policies that improve skill perform better than uniform decision guidelines.” Here as elsewhere, training and selection are evidently crucial to the reduction of error, and to the elimination of both noise and bias.
Algorithms are powerful, but for those who assume that checklists and knee-jerk medicine can deliver equivalent outcomes: apparently not.
There is variability in radiologists’ judgments with respect to breast cancer from screening mammograms. A large study found that the range of false negatives among different radiologists varied from 0% (the radiologist was correct every time) to greater than 50% (the radiologist incorrectly identified the mammogram as normal more than half of the time). Similarly, false-positive rates ranged from less than 1% to 64% (meaning that nearly two-thirds of the time, the radiologist said the mammogram showed cancer when cancer was not present). False negatives and false positives, from different radiologists, ensure that there is noise.
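To keep the definitions straight, here is a tiny sketch with made-up counts, chosen only to roughly match the extremes quoted above, showing how a false-negative rate and a false-positive rate are computed per reader and how widely they can differ:

```python
# Hypothetical counts for three imaginary mammography readers (not data from the study).
readers = {
    # reader: (missed cancers, total cancers, false alarms, total normal exams)
    "Reader A": (1, 40, 5, 960),
    "Reader B": (12, 40, 190, 960),
    "Reader C": (22, 40, 610, 960),
}

for name, (fn, cancers, fp, normals) in readers.items():
    fn_rate = fn / cancers   # called "normal" when cancer was present
    fp_rate = fp / normals   # called "cancer" when none was present
    print(f"{name}: false-negative rate {fn_rate:.0%}, false-positive rate {fp_rate:.0%}")
```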
The massive amount of noise in diagnostic medicine is one of several reasons why “AI” is so enticing. Essentially no one chooses their radiologist, and radiologists are often an out-of-sight/out-of-mind commodity. With our fee-for-service system, corporatized profit-seeking, and a worsening radiologist shortage, it seems, at least anecdotally, that quality may be falling. These factors combine to make AI tools look even better by comparison.
Later, they go on:
Pattern noise also has a transient component, called occasion noise. We detect this kind of noise if a radiologist assigns different diagnoses to the same image on different days.
This definitely happens. Consistency is hard.
A separate study discussed in the book illustrates another human foible, occasion noise related to the time of day:
But another study, not involving diagnosis, identifies a simple source of occasion noise in medicine—a finding worth bearing in mind for both patients and doctors. In short, doctors are significantly more likely to order cancer screenings early in the morning than late in the afternoon. In a large sample, the order rates of breast and colon screening tests were highest at 8 a.m., at 63.7%. They decreased throughout the morning to 48.7% at 11 a.m. They increased to 56.2% at noon—and then decreased to 47.8% at 5 p.m. It follows that patients with appointment times later in the day were less likely to receive guideline-recommended cancer screening.
How can we explain such findings? A possible answer is that physicians almost inevitably run behind in clinic after seeing patients with complex medical problems that require more than the usual twenty-minute slot. We already mentioned the role of stress and fatigue as triggers of occasion noise (see chapter 7), and these elements seem to be at work here. To keep up with their schedules, some doctors skip discussions about preventive health measures. Another illustration of the role of fatigue among clinicians is the lower rate of appropriate handwashing during the end of hospital shifts. (Handwashing turns out to be noisy, too.)
Taking a human factors engineering approach, we know that both patients and doctors will be better off in a system designed with human limitations in mind: not another deluge of interrupting EHR reminders to ignore, but low-friction, actionable prompts delivered at a useful moment in the clinical encounter, with schedules built to allow real-time documentation without running behind. Wouldn’t that be something?
Concerning metrics:
Focusing on only one of them might produce erroneous evaluations and have harmful incentive effects. The number of patients a doctor sees every day is an important driver of hospital productivity, for example, but you would not want physicians to focus single-mindedly on that indicator, much less to be evaluated and rewarded only on that basis.
See: Goodhart’s Law and patient satisfaction.
Discussion of job interviews and candidate selection has obvious parallels with the residency selection process:
If a candidate seems shy and reserved, for instance, the interviewer may want to ask tough questions about the candidate’s past experiences of working in teams but perhaps will neglect to ask the same questions of someone who seems cheerful and gregarious. The evidence collected about these two candidates will not be the same.
One study that tracked the behavior of interviewers who had formed a positive or negative initial impression from résumés and test scores found that initial impressions have a deep effect on the way the interview proceeds. Interviewers with positive first impressions, for instance, ask fewer questions and tend to “sell” the company to the candidate.
This is an incredibly on-point summary of how most institutions conduct interviews. Candidates who are good on paper and not painfully awkward during the initial pleasantries basically get a pass. Even when candidates are asked real questions, their answers are often interpreted through those pre-formed opinions. This focus on “selling the program” might even be reasonable if the metrics and data that programs receive were actually helpful in predicting residency success.
Kahneman and his team offer a lot of advice in the book on how to conduct better interviews. Some of it, I suspect, is too inefficient and awkward for the residency process, but what a lot of programs do (subjectively grade an applicant on a few broad metrics during a committee meeting and then pretend the process is objective) is a bit of a farce.
Summary: highly recommended reading.