By Jan Bowers, contributing writer
Dermatologists could be forgiven for feeling a little blindsided when a Stanford University study published in Nature (2017;542(7639):115-18) garnered widespread publicity in the health care and consumer media. A typical headline: “Stanford’s Artificial Intelligence Is Nearly as Good as Your Dermatologist” (Fortune, Jan. 26, 2017). Other diagnosticians have been feeling the heat as well. In an article entitled “A.I. vs. M.D.,” published April 3, 2017, in The New Yorker, computer scientist and deep learning expert Geoffrey Hinton said radiologists are “like Wile E. Coyote…You’re already over the edge of the cliff but you haven’t looked down.” And Healthcare IT News put it even more bluntly with a story in its May 15, 2017, issue entitled “Machine learning will replace human radiologists, pathologists, maybe soon.”
Is deep learning, a subset of machine learning (which is itself a branch of artificial intelligence), an imminent threat to the livelihood of these specialists? Dermatologists who are knowledgeable about the technology say no, but maintain it will certainly have an impact on the practice of medicine. “I don’t think it’s even a question that at some point, computers will be as good as today’s best diagnosticians, in terms of looking at something and classifying it,” said Allan C. Halpern, MD, chief of the dermatology service at Memorial Sloan Kettering Cancer Center. “I don’t know if that will happen in two years or 10 years. But the real question is, what’s the best way to implement this in clinical care that allows us to find bad actors without doing a lot of harm? The only way the profession and the public at large come out ahead is if we embrace this as an opportunity and not a threat.”
From dog breeds to skin lesions
The impetus for the Stanford study came from a dermatologist who was intrigued by the pioneering research into machine learning taking place in Stanford’s computer science department. “They were studying the ability of algorithms to look at images of dogs and identify the breed,” said Roberto Novoa, MD, clinical assistant professor of dermatology and pathology at Stanford University School of Medicine. “I was amazed. These were very specific, like distinguishing a Belgian Malinois from a German shepherd, and the algorithms performed better than a human who had spent 200 hours studying the breeds. So I thought, maybe it could learn to do this for skin cancer.”
Dr. Novoa reached out to Sebastian Thrun, an adjunct professor in the Stanford Artificial Intelligence Laboratory, who had become interested in applying machine learning technology to cancer diagnosis. “I met with his graduate students, who had started compiling Internet images from 2,000 disease categories, but weren’t quite sure what to make of them.” The algorithm that would eventually learn to diagnose skin cancer was pretrained in image recognition using 1.28 million images from Google’s ImageNet database of everyday objects, said Brett Kuprel, a PhD student in Thrun’s laboratory and a co-author of the Nature study. The algorithm, also known as a convolutional neural network, “is more like human intuition than like memory,” Kuprel explained. “It has a lot of parameters that are like tuning knobs that can be adjusted. When it looks at a raw image, it guesses what it is. At first, it’s completely random, and when we tell it it’s wrong (say, if it sees a cat and says it’s equally likely to be a dog, cat, or turtle) it automatically tunes the knobs to increase the probability assigned to the correct label, which is cat.” The more raw data it’s fed, the more sophisticated the algorithm becomes.
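The process Kuprel describes corresponds to what practitioners call transfer learning followed by supervised fine-tuning. A minimal sketch of that idea appears below, written in PyTorch; note that it is illustrative only. The Stanford team used Google’s Inception v3 architecture, and the model, learning rate, and class count shown here are stand-ins rather than the study’s actual configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from a network pretrained on ImageNet's 1.28 million images of
# everyday objects, then repurpose it for skin lesions (the study used
# Google's Inception v3; ResNet-18 here is just a compact stand-in).
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Swap the final layer so the network scores skin-disease classes
# instead of ImageNet's 1,000 object classes (class count illustrative).
NUM_DISEASE_CLASSES = 757
model.fc = nn.Linear(model.fc.in_features, NUM_DISEASE_CLASSES)

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

def training_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One 'knob-tuning' step: guess, measure the error against the
    correct label, and nudge every parameter toward a better guess."""
    optimizer.zero_grad()
    logits = model(images)          # the network's current guess
    loss = loss_fn(logits, labels)  # how far the guess is from the truth
    loss.backward()                 # compute how each knob should move
    optimizer.step()                # turn the knobs slightly
    return loss.item()
```

Each call to `training_step` is one round of the guess-and-correct loop Kuprel describes: the loss rises when the network assigns low probability to the correct label, and the gradient update shifts the parameters to raise that probability next time.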
The interdisciplinary research team set out to test the algorithm’s ability to distinguish keratinocyte carcinomas from benign seborrheic keratoses, and malignant melanomas from benign nevi. “We gathered datasets of biopsy-proven images from the Internet, including a large one from the International Skin Imaging Collaboration (ISIC) and a separate set from the University of Edinburgh,” said Dr. Novoa. The 130,000 images, representing more than 2,000 different skin diseases, were divided into separate sets used for training and validation (testing). Their brightness, contrast, and size were automatically adjusted to improve the algorithm’s ability to analyze them, Dr. Novoa said.
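For readers curious what that preparation step looks like in practice, the snippet below sketches a training/validation split and automatic brightness, contrast, and size adjustments using torchvision. The specific jitter values, the folder path, and the 299-pixel input size (Inception v3’s default) are assumptions for illustration, not figures reported by the study.

```python
import torch
from torchvision import transforms, datasets

# Hypothetical preprocessing: the article says brightness, contrast, and
# size were adjusted automatically; these particular settings are guesses.
preprocess = transforms.Compose([
    transforms.Resize((299, 299)),                        # Inception v3 input size
    transforms.ColorJitter(brightness=0.1, contrast=0.1),
    transforms.ToTensor(),
])

# Images organized one folder per disease label (path is illustrative).
dataset = datasets.ImageFolder("skin_images/", transform=preprocess)

# Hold out a slice of the data for validation, never shown during training.
n_val = len(dataset) // 10
train_set, val_set = torch.utils.data.random_split(
    dataset, [len(dataset) - n_val, n_val]
)
```

Keeping the validation images entirely out of training is what makes the later accuracy comparison meaningful: the algorithm is graded on lesions it has never seen.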
The team recruited 21 board-certified dermatologists to test their skills in evaluating lesions from a single image only, without context or clinical information. The dermatologists and the algorithm examined two sets of standard digital images (135 epidermal and 130 melanocytic) and one set of 111 melanocytic dermoscopy images. “We asked the dermatologists two questions: whether a lesion was malignant or benign, and whether they would biopsy or not,” said Dr. Novoa. “The algorithm was only looking at malignant vs. benign; its output was expressed as a probability.” The researchers plotted the results on a sensitivity-specificity curve and found that “the algorithm performed on par with the majority of dermatologists, and did better than the average dermatologist,” Dr. Novoa said.
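The comparison works because a probability output can be turned into many possible yes/no answers: sweeping a decision threshold over the algorithm’s probabilities traces out the full sensitivity-specificity curve, while each dermatologist’s binary malignant/benign calls reduce to a single point plotted against that curve. A small sketch with synthetic data (not the study’s results) shows the mechanics:

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)

# Synthetic stand-ins: y_true marks biopsy-proven malignant (1) vs.
# benign (0); y_prob is the algorithm's malignancy probability.
y_true = rng.integers(0, 2, size=200)
y_prob = np.clip(0.5 * y_true + rng.normal(0.3, 0.2, size=200), 0.0, 1.0)

# Each threshold on y_prob yields one (sensitivity, specificity) pair;
# together the pairs trace the algorithm's curve.
fpr, tpr, _ = roc_curve(y_true, y_prob)
sensitivity, specificity = tpr, 1.0 - fpr

# A clinician's binary calls collapse to a single point on the plot.
derm_calls = rng.integers(0, 2, size=200)       # stand-in for one dermatologist
derm_sens = np.mean(derm_calls[y_true == 1])    # malignant cases called malignant
derm_spec = np.mean(1 - derm_calls[y_true == 0])  # benign cases called benign
```

An algorithm “performs on par” with a dermatologist, in the study’s terms, when that clinician’s point falls on or below the algorithm’s curve.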
That said, they learned that the algorithm could be fooled in unexpected ways. “For example, if we had a ruler in the image, the algorithm was much more likely to call it malignant,” he noted. “Why is that? Because on average, in our dataset, lesions with rulers were being measured and monitored by dermatologists, and were more likely to be malignant. The algorithm is looking at the whole image and will take whatever clues it can find. It can be biased by features like the ruler, and you won’t know it.” Another image that might trip up the algorithm would be that of an unusual combination like a benign nevus colliding with a seborrheic keratosis, which could closely mimic a melanoma, “but you may not know that until you’ve collected a lot of those images.”
A second component of the study, less widely publicized, tested the algorithm’s ability to correctly place a clinical image from any one of 2,032 diseases into one of nine broad categories in a taxonomy constructed by the dermatologists on the research team. The categories included benign and malignant melanocytic, epidermal, and dermal lesions; inflammatory conditions; genodermatoses; and cutaneous lymphoma. Two dermatologists also agreed to take the test. “Both the dermatologists and the algorithm got it right around 50 to 55 percent of the time,” said study co-author Justin M. Ko, MD, MBA, clinical associate professor of dermatology at Stanford. “That’s much harder than choosing cancer or no cancer. It’s also not as clinically relevant. But the performance was really remarkable, and allows for interesting applications beyond cancer classification.”
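One way to grade a fine-grained classifier against such coarse categories is to roll its per-disease probabilities up the taxonomy. The sketch below assumes a simple disease-to-category mapping; the disease and category names are illustrative, not the paper’s exact taxonomy.

```python
# Hypothetical mapping from each fine-grained disease class to one of
# the nine top-level categories (2,032 entries in the actual study).
TAXONOMY = {
    "melanoma": "malignant melanocytic",
    "benign nevus": "benign melanocytic",
    "basal cell carcinoma": "malignant epidermal",
    # ... one entry per disease class
}

def coarse_probabilities(fine_probs: dict[str, float]) -> dict[str, float]:
    """Sum each disease's predicted probability into its parent
    category; the highest-scoring category is the coarse prediction."""
    coarse: dict[str, float] = {}
    for disease, p in fine_probs.items():
        category = TAXONOMY[disease]
        coarse[category] = coarse.get(category, 0.0) + p
    return coarse
```

Under this scheme the algorithm is marked correct when the top-scoring coarse category matches the one the dermatologists assigned to the image.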
The next stage
Without minimizing their accomplishment, Dr. Ko admits that “the hype is ahead of the reality. We set up an artificial construct — it’s like we took clinicians and we tied both hands behind their backs. You can’t look at their skin or other moles, you can’t ask them where they grew up, you can’t do any of the things we do in 30 seconds when we walk in the room and have that conversation. With that clinical context, the clinician would be superior.” A next step, he said, would be to feed the algorithm that kind of clinical information. That’s possible with the current deep learning technology, he noted, but “it’s going to take a long time. It will be a very, very resource-intensive effort, to get from where we are to there.”
In the meantime, Dr. Ko is testing the algorithm in his clinical practice “with a huge grain of salt, both in explaining to my patients what I’m doing and also with myself. I want to understand where we need to get better at this, and what are the usability issues. Like, does it matter how far away from the lesion you are; does the angle matter? The testing was based on textbook images, and those are different from what you would get in a real-world clinical setting.” Dr. Novoa said the next steps would include validating the algorithm in a prospective clinical trial at Stanford, expanding the trial to other institutions, and making the technology widely available once it’s validated.
Long-term role in clinical practice
Both Dr. Novoa and Dr. Ko envision their machine learning technology as expanding access to dermatology via a smartphone app targeted to patients. “It’s capable of analyzing smartphone images now, but we don’t know yet what are the limits of its accuracy,” said Dr. Novoa. “Just because it can do it doesn’t mean it’s getting it right.” Dr. Ko agreed, noting that “if we don’t do this right, if we released this and everyone starts downloading the app, we might falsely reassure people who potentially have something dangerous that it’s not dangerous. Or, we could have people diagnosing skin cancer who have nowhere to go to get it taken care of. How do they get surgery? Will we overburden the system?” Disseminating the technology in a careful, controlled way is a process that will unfold over years, Dr. Ko said, “and that’s one reason I don’t think physicians have much to worry about.”
Dr. Halpern shares Dr. Ko’s concern that widespread adoption of a machine learning app by consumers could unleash a flood of real and potential skin cancers into the health care system, especially as the population ages. And dermatologists, rather than being put out of business, will be busier than ever. “By moving technology closer to patients, there’s no question that skin cancer will become an even bigger issue; we’re talking about millions of cases a year,” Dr. Halpern said. “What’s not clear is what percentage of cancer cases can be left alone. Assuming that there are a lot of cases that right now go undiagnosed, if all of a sudden artificial intelligence can bring all those into the health care sphere, it’ll be enormous.” He added that as dermatologists are faced with a burgeoning number of potential skin cancers, they will increasingly turn to new and emerging non-invasive technologies “that will hopefully achieve diagnoses without innumerable biopsies.” Dr. Halpern serves on the organizing committee for the ISIC Challenge. Now in its third year, the challenge invites developers worldwide to use dermoscopic images in the public domain to build algorithms targeted to the detection of melanoma.
A former AAD president with a background in computer science believes that machine learning and other new technologies will dramatically change the way dermatologists diagnose pigmented lesions in the next five to 10 years. “There’s a bunch of technologies out there — mole-mapping, temporal serial imaging, MelaFind, confocal microscopy, optical coherence tomography,” said Darrell S. Rigel, MD, clinical professor of dermatology at New York University Medical Center. “They’re all attacking the problem in a different way, but it’s the same problem. You have a patient who walks in with 100 moles, which ones do you biopsy? Or, can you obviate the biopsy? A couple of these technologies will win, and they will take us to the next level, beyond a dermatoscope.” Dr. Rigel pointed to a potential regulatory issue for machine learning algorithms that hampered the development of MelaFind: the algorithm learns and changes continuously as it absorbs new data, but the U.S. Food and Drug Administration requires device developers to re-submit their devices for approval if they’re changed significantly. “Although MelaFind was designed to get better and better, the FDA made it almost impossible to take advantage of that capability,” Dr. Rigel said. An article published online by The Atlantic (www.theatlantic.com/technology/archive/2017/10/algorithms-future-of-health-care/543825/) explores the regulatory situation for machine learning in depth and notes that the FDA is working with developers to update the agency’s regulatory processes for new digital health technologies.
Dr. Ko and Dr. Halpern reiterated that machine learning is an opportunity for dermatologists, not a threat. “We have a responsibility to be thinking about how to use this to genuinely improve the delivery of care,” said Dr. Halpern. “We need to understand that there are real threats associated with this, but they’re not to the dermatologist per se; they’re to making mistakes of implementing it ineffectively. In the right hands, used under the right models, this should upscale the value of the dermatologist while improving early detection and preventing some of the unnecessary surgery that’s going on. But that’s a really tall order.” Dr. Ko pointed out that “detection of the melanoma is great, but as we know, medicine is not just about diagnosis. Getting that patient through the stages of counseling, treatment, follow-up, that’s the doctoring. There will soon be things that do better at the rote stuff. And we should embrace that. I would gladly welcome a virtual clinical assistant to help me augment and extend the care I can deliver.”