
Dozens of AI disease-prediction models were trained on dubious data

Source: NatureView Original
Science · April 15, 2026


Credit: Marko Nikolic/Alamy

Dubious data sets are being used to train artificial-intelligence models that are designed to predict people’s risk of stroke and diabetes, researchers report in a preprint on medRxiv. Some of the models seem to have been used in clinical settings, although it’s not clear whether this has led to flawed diagnoses. At least two journals are investigating studies that used these data sets.

Adrian Barnett, a statistician at the Queensland University of Technology in Brisbane, Australia, and his colleagues identified 124 peer-reviewed papers that report training machine-learning models on one of two open-access health data sets, neither of which provides much information about where its data came from.

An analysis revealed multiple oddities that would not be expected for data from real people, leading Barnett and his colleagues to suspect that the data could have been fabricated. “It was an enormous surprise to come across something like that,” Barnett says.

At least two of the models have been used in hospitals in Indonesia and Spain. One has also been documented in a medical-device patent application filed in 2024, and two are publicly available web tools that allow people to check their risk level by uploading information about themselves.

“Prediction models trained on provenance-unknown data have no place in clinical decision-making. They are intrinsically unreliable,” says Soumyadeep Bhaumik, a public-health researcher at the George Institute for Global Health in Sydney, Australia. If the tools do not use real-world data, they are likely to make incorrect predictions and lead clinicians to make inappropriate decisions, such as prescribing treatments unnecessarily or not prescribing them when it is needed, he says.

Institutions and funders must insist that researchers disclose the source of data used to train AI models for medical applications, and journals should reject papers that fail this requirement, says Bhaumik. Barnett says that the data sets flagged in the study should now be taken down to prevent further studies from using them.

Data sharing

The two data sets investigated in the study, which has not yet been peer reviewed, were uploaded to Kaggle, a platform that developers can use to access data sets for building machine-learning models.

The first, labelled Stroke Prediction Dataset, was uploaded with the description “11 clinical features for predicting stroke events”. It contains health information from 5,110 people, including data on risk factors such as history of heart disease, marital status, average blood glucose level and body mass index (BMI). But when the researchers plotted the average blood glucose level against participant identifiers, they found several irregularities.

One was that very few data points were missing, by contrast with real data, which tend to have gaps because some participants miss follow-ups, leave the study or die, says Barnett. “No data set collected in the real world is fully complete,” he says.
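The completeness check Barnett describes can be sketched in a few lines. This is an illustrative example, not the researchers' actual code, and the field names and values are hypothetical: it simply measures what fraction of each field is missing, since a clinical column with zero gaps is a red flag.

```python
# Illustrative sketch (hypothetical data): auditing completeness of a
# health data set. Real-world cohorts almost always have gaps from
# drop-outs, missed follow-ups and deaths, so a fully complete column
# deserves scrutiny.

records = [
    {"id": 1, "avg_glucose": 85.2, "bmi": 24.1},
    {"id": 2, "avg_glucose": 110.4, "bmi": None},  # a realistic gap
    {"id": 3, "avg_glucose": 99.1, "bmi": 27.5},
    {"id": 4, "avg_glucose": 76.3, "bmi": 22.0},
]

def missing_fraction(rows, field):
    """Fraction of records in which `field` has no value."""
    return sum(r[field] is None for r in rows) / len(rows)

for field in ("avg_glucose", "bmi"):
    print(field, missing_fraction(records, field))
```

On a genuine cohort of thousands of people, every clinical field would normally show at least some missingness; a suspicious data set is one where this fraction is exactly zero across the board.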

Barnett and his colleagues found that 104 research articles had used this data set to create stroke-prediction models, including one used in a hospital in Indonesia and one that was tested on a handful of people. A third study, from the United States, suggests that the model is being deployed in a “local heart clinic”.

The stroke data set was uploaded by Federico Soriano Palacios, a data scientist in Madrid, and has been downloaded more than 288,000 times. In the discussion section of the data set on Kaggle, Palacios states that the data come from a confidential source and that they should be used only for educational purposes. Palacios did not respond to Nature’s questions about where the data came from.

More unreliable data

The second data set, labelled Diabetes prediction data set, is described as “A Comprehensive Dataset for Predicting Diabetes with Medical & Demographic Data”. It includes information on 100,000 people, including their BMI, smoking history and blood glucose levels. But Barnett’s team found that the data included just 18 discrete values for blood glucose across all of those supposed participants, which Barnett says is impossible given the huge variety that exists among people. The team also says that it identified thousands of values that seemed to be duplicated.
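The two anomalies the team reports for this data set — too few distinct glucose values and duplicated records — are easy to test for. The sketch below uses synthetic numbers (the pool of eight glucose values and the BMI range are invented for illustration) to show both checks.

```python
# Illustrative sketch (synthetic data): counting distinct blood-glucose
# values and duplicated records. 100,000 real patients would show far
# more variation than a handful of discrete glucose readings.
from collections import Counter
import random

random.seed(0)
glucose_pool = [80, 85, 90, 95, 100, 126, 140, 160]  # suspiciously few levels
rows = [(random.choice(glucose_pool), round(random.uniform(18, 40), 1))
        for _ in range(100_000)]

# How many distinct glucose readings appear across all "participants"?
distinct_glucose = len({g for g, _ in rows})

# How many rows are exact copies of an earlier row?
duplicate_rows = sum(n - 1 for n in Counter(rows).values() if n > 1)

print("distinct glucose values:", distinct_glucose)
print("duplicated rows:", duplicate_rows)
```

With only a small pool of possible values, exact duplicates are guaranteed by the pigeonhole principle — which is why thousands of seemingly duplicated participants in a supposedly real cohort pointed the team towards fabrication.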

Barnett and his team found 21 studies that had used this data set to make diabetes-prediction models: none of the models has so far been used in clinical settings. One study used both data sets.

The diabetes data set was uploaded by Mohammed Mustafa, a data engineer in Chennai, India, who states on Kaggle that the data came from aggregated electronic health records. In response to a user’s question in the discussion section, Mustafa notes that “due to confidentiality reasons or other restrictions, I’m unable to disclose the specific source of the diabetes prediction data set”.
