
Can AI help diagnose depression? It's a long shot

At the moment, machine intelligence is just as subjective as human intelligence

Alejandra Canales

Neuroscience and Biochemistry

University of Wisconsin - Madison

Despite being one of the most common mental disorders, depression is still not well understood in either research or clinical practice. Not all patients present with the same symptoms, which can make it a difficult illness to diagnose. While scientists are hopeful that artificial intelligence can bring some order to the jumble of subjective criteria used to diagnose and treat depression, computational studies to date still have limitations that have held up the application of machine learning methods in the clinic.

"Diagnostic heterogeneity," meaning the broad, non-specific symptoms patients present, has been a long-standing criticism of the American Psychiatric Association's diagnostic tool, the DSM-5, as well as the various scales used to measure depression severity. For example, the DSM-5 allows for a high degree of symptom overlap across multiple disorders, which means that a certain combination of symptoms could be diagnosed as two different disorders, a situation clinicians call comorbidity. However, two individuals could also share the same diagnosis with little, if any, symptom overlap. This raises concerns about the validity of saying they have the same condition, especially since finding the right treatment for an individual with depression is done on a trial-and-error basis that can take months.

In psychiatric research, machine learning algorithms are being used to better define depression and to make predictions about which patients might respond to a given treatment. By "mining" large datasets, researchers have been trying to find biomarkers, measurable biological indicators, of depression. The thought is that researchers could teach a computer to identify patterns in data from patient-reported surveys, demographic data, cognitive assessments, and even neuroimaging studies that correlate blood oxygenation levels with brain activity in specific regions.

To do this, scientists first input a subset of patient data and adjust their algorithm until it reliably distinguishes depressed patients from healthy control subjects or, in the case of treatment outcomes, responders from non-responders. They can then figure out which features in the data best help the computer "learn," make sure that their algorithm incorporates only those features, and validate their method by testing how accurately it makes predictions about the rest of the patients, whose data it has not yet seen.
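The workflow described above can be sketched in a few lines of Python. This is only an illustrative toy, not the method of any study mentioned here: the data are synthetic random numbers, the "classifier" is a simple nearest-centroid rule chosen for brevity, and every name (`make_synthetic_cohort`, `select_features`, `run_demo`, and so on) is hypothetical. What it tries to show is the order of operations: split the subjects, pick features using the training subset only, then score the model on subjects it has never seen.

```python
import random
import statistics


def make_synthetic_cohort(n_subjects, n_features, n_informative, rng):
    """Build fake data. Label 1 = 'patient', 0 = 'control' (synthetic only)."""
    rows, labels = [], []
    for i in range(n_subjects):
        label = i % 2
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        for j in range(n_informative):
            row[j] += 1.5 * label  # assumed effect size, purely illustrative
        rows.append(row)
        labels.append(label)
    return rows, labels


def class_mean(rows, labels, feature, cls):
    """Mean of one feature over the subjects in one class."""
    return statistics.fmean(r[feature] for r, y in zip(rows, labels) if y == cls)


def select_features(rows, labels, k):
    """Keep the k features whose class means differ most."""
    gaps = [
        (abs(class_mean(rows, labels, f, 1) - class_mean(rows, labels, f, 0)), f)
        for f in range(len(rows[0]))
    ]
    gaps.sort(reverse=True)
    return sorted(f for _, f in gaps[:k])


def fit_centroids(rows, labels, features):
    """Average each selected feature within each class."""
    return {
        cls: [class_mean(rows, labels, f, cls) for f in features]
        for cls in (0, 1)
    }


def predict(row, features, centroids):
    """Assign the class whose centroid is closest (squared distance)."""
    dist = {
        cls: sum((row[f] - c) ** 2 for f, c in zip(features, centroid))
        for cls, centroid in centroids.items()
    }
    return 1 if dist[1] < dist[0] else 0


def run_demo(seed=0):
    rng = random.Random(seed)
    rows, labels = make_synthetic_cohort(200, 20, 3, rng)

    # Shuffle, then hold out 30% of subjects for validation.
    order = list(range(len(rows)))
    rng.shuffle(order)
    train_idx, test_idx = order[:140], order[140:]
    train_rows = [rows[i] for i in train_idx]
    train_labels = [labels[i] for i in train_idx]

    # Feature selection and fitting see the training subset only.
    features = select_features(train_rows, train_labels, k=3)
    centroids = fit_centroids(train_rows, train_labels, features)

    # Accuracy is measured on the held-out subjects.
    hits = sum(predict(rows[i], features, centroids) == labels[i] for i in test_idx)
    return features, hits / len(test_idx)


if __name__ == "__main__":
    chosen, accuracy = run_demo()
    print("selected features:", chosen)
    print("held-out accuracy:", accuracy)
```

Because the synthetic groups are built to differ on three features, the held-out accuracy here is high by construction. Real patient data offers no such guarantee.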

This approach has yielded some promising results. Several neuroimaging studies have claimed to have found subtypes of depression. Most recently, scientists from the Okinawa Institute of Science and Technology Graduate University showed that they could identify depression subtypes based on the regions in the brain that have altered activity levels in depressed patients (compared to healthy subjects), combined with the patients' scores on a survey called the Childhood Abuse and Trauma Scale. Other studies have attempted to tease apart the qualities in patients who respond to treatment with antidepressants or cognitive behavioral therapy.

[Image: a woman's eye in sharp focus with the rest of her faded out. Caption: Each person experiences depression differently, and responds to different treatments. Photo by Mathieu Stern on Unsplash]

However, like any lab experiment, computational studies can be poorly designed, limiting the credibility of their findings. As scientists led by Russell Poldrack from Stanford University argue, many studies have too few patients to confidently see an effect in the data, and replication studies have been few and far between. This can magnify other problems with machine learning algorithms. For example, scientists do not necessarily have standardized software packages for these computational studies or even standard methods for processing and analyzing images (in the case of neuroimaging studies). This leaves room for intentional research misconduct, an inadequate understanding of statistics, or software errors to drive a study's results, leading to misleading conclusions. These mistakes can also lead researchers to overestimate the predictive power of an algorithm. Now that conversations about reproducibility are happening more openly, more researchers are pre-registering their studies and publishing their datasets and code to address these concerns.
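One common way a small study can overestimate predictive power is by letting the test subjects influence the choice of features. The Python sketch below is a toy illustration under stated assumptions, not a reconstruction of any published analysis: the data are pure random noise with arbitrary labels, the classifier is a simple nearest-centroid rule, and all function names (`top_features`, `loo_accuracy`, `compare`) are hypothetical. It compares leave-one-out accuracy when features are picked using every subject (the "leaky" version) against picking them afresh inside each fold.

```python
import random
import statistics


def mean_gap(rows, labels, f):
    """Absolute difference between the two class means for feature f."""
    m0 = statistics.fmean(r[f] for r, y in zip(rows, labels) if y == 0)
    m1 = statistics.fmean(r[f] for r, y in zip(rows, labels) if y == 1)
    return abs(m1 - m0)


def top_features(rows, labels, k):
    """Indices of the k features with the largest class-mean gap."""
    ranked = sorted(
        range(len(rows[0])),
        key=lambda f: mean_gap(rows, labels, f),
        reverse=True,
    )
    return ranked[:k]


def nearest_centroid_predict(train_rows, train_labels, features, row):
    """Predict the class whose training centroid is closest to row."""
    dists = []
    for cls in (0, 1):
        members = [r for r, y in zip(train_rows, train_labels) if y == cls]
        centroid = [statistics.fmean(m[f] for m in members) for f in features]
        dists.append(sum((row[f] - c) ** 2 for f, c in zip(features, centroid)))
    return 0 if dists[0] <= dists[1] else 1


def loo_accuracy(rows, labels, k, leaky):
    """Leave-one-out accuracy.

    leaky=True picks features once using ALL subjects, including the one
    being tested. leaky=False repeats the selection inside every fold.
    """
    shared = top_features(rows, labels, k) if leaky else None
    correct = 0
    for i in range(len(rows)):
        train_rows = rows[:i] + rows[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        features = shared if leaky else top_features(train_rows, train_labels, k)
        guess = nearest_centroid_predict(train_rows, train_labels, features, rows[i])
        correct += guess == labels[i]
    return correct / len(rows)


def compare(n_subjects=20, n_features=500, k=10, repeats=5, seed=0):
    """Average leaky and honest accuracy over several noise-only datasets."""
    rng = random.Random(seed)
    leaky_scores, honest_scores = [], []
    for _ in range(repeats):
        # No feature carries any real information about the labels.
        rows = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
                for _ in range(n_subjects)]
        labels = [i % 2 for i in range(n_subjects)]
        leaky_scores.append(loo_accuracy(rows, labels, k, leaky=True))
        honest_scores.append(loo_accuracy(rows, labels, k, leaky=False))
    return statistics.fmean(leaky_scores), statistics.fmean(honest_scores)


if __name__ == "__main__":
    leaky, honest = compare()
    print("features chosen on all subjects:", leaky)
    print("features chosen inside each fold:", honest)
```

Since the data contain no signal at all, any accuracy well above chance in the leaky version is an artifact of the analysis. Increasing `n_subjects` in this toy setup should shrink that artifact, which is one reason small samples are a particular concern.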

Researchers have other factors to consider when it comes to data from patient surveys and demographics. While easier and less costly to administer in the clinic than a brain scan, patient-reported data assumes that the patient is reliably reporting symptoms and their severity. Socio-demographic factors also need to be carefully considered to avoid bias. For example, how does the finding from researchers at Yale University that race, education levels, and employment status are among the top predictors of remission of depressive symptoms following antidepressant treatment really help patients?

Additionally, it is no secret that individuals experience depression differently, so regardless of the data source it is not clear yet whether the patient cohorts included in all of these computational studies are representative enough for results to be broadly applicable to all depressed individuals.

Until these bigger issues surrounding reproducibility and feasibility get sorted out, artificial intelligence will likely not make psychiatric evaluation less subjective.

Peer Commentary

We ask other scientists from our Consortium to respond to articles with commentary from their expert perspective.

Lily Toomey


Curtin University

This was a really interesting piece! I’ve read quite a few studies that talk about successfully using machine learning and AI to predict psychiatric and treatment outcomes for disorders such as schizophrenia, but I haven’t come across it so much for depression. However, you raised some interesting points regarding replicability. I was left wondering how big the methodological discrepancy actually is between studies. Would there be a best practice way of standardising the field in your opinion to prevent these differences or is it too complex to have a standardised way of analysing the data? Also, would you suggest that we should be screening for depression subtypes and/or socioeconomic status of participants in order to limit sampling bias for depression studies?

Alejandra Canales responds:

Thanks so much for your feedback! There’s definitely literature describing the discrepancy between studies. For a very recent example, this manuscript describes how different analysis workflows using the same fMRI dataset affect the results. That’s definitely where preregistering studies and sharing software/code would really help standardize analysis. Resources like OpenNeuro and NeuroVault are steps in that direction, and it seems like they’re catching on.

I find the idea of different depression subtypes interesting since it could be a neat way to explain all that heterogeneity. I’m not sure how often, if ever, this is done in practice in the clinic, especially since these results need further validation. The findings related to socioeconomic status aren’t surprising, but screening for socioeconomic status could further limit access to treatment options. As it is, we already know, at least in the U.S., that clinical trials are not representative enough.

Dori Grijseels


University of Sussex

This great piece makes a really important point about using AI in healthcare, and especially in this specific case of depression. We often think of AI as a magic tool that can do anything, but in the end, it needs to be trained by people, so it is only ever as good as the people who make and train it. Just using an algorithm instead of judgements by specialists doesn’t necessarily make the diagnosis less subjective. I think this article is a must-read for anyone interested in AI, and especially its applications to healthcare. Thanks for a great piece!

Kelsey Lucas

Physiology, Marine Biology, and Ecology

University of Michigan

As a depression sufferer who has bounced between treatments, I’m intrigued by the idea of using machine learning to improve the diagnostic process. I expected that there would be a lot of complexity here just because symptoms are so diverse and any stats modeling approach has pitfalls. But I was surprised to see how much more complicated this particular challenge is: not only are there biological and methodological issues, but the intersectionality you’ve covered at the end is a major barrier to success. And it’s a great example of why we need diversity in science; we need voices from groups typically underserved by science, medicine, etc. to address these problems. We’ll need a multi-disciplinary, multi-faceted approach that addresses science as well as economics, etc., to get to a successful, automated diagnostic system, and I’d be curious to learn more about partnerships that do this.

David Baranger


University of Pittsburgh

This is an awesome piece! I 100% agree with concerns about the reproducibility of results. One thing to add: it’s pretty well-established that the reported performance of these AI methods decreases with larger samples. A small sample might pick up on what looks like signal but is actually noise, while larger samples typically show that the ability of AI to predict depression is still abysmally low. Without direct replication, it’s impossible to tell how well any of them truly perform. That is where your excellent comments on software/data availability, and on the degree to which researchers are willing to share their methods, come in. If the entire point of this research is to improve diagnosis and treatment, what benefit is there to methods that can’t be reproduced or used by others?

Marnie Willman


University of Manitoba Bannatyne and National Microbiology Laboratory

I always find it funny that if we don’t understand something fully, the first solution brought to the table is often “well, let’s get a machine to do it.” While AI is a wonderful technology, it has to be told exactly what to do, and machine learning (as designed by humans) is far from perfect. Depression is a complex, little-understood disorder with a lot of factors at play, many of which we still don’t know. I think you raise some very good points about replicability and about the ability of this technology to actually function as a clinical or diagnostic tool. Very eloquently said, and certainly a conversation researchers and engineers are going to have to keep having until we get it right.