Top 27 Free Healthcare Datasets for Machine Learning

Alexandra Quinn | October 31, 2023

Machine Learning is revolutionizing the world of healthcare. ML models can help predict patient deterioration, optimize logistics, assist with real-time surgery and even determine drug dosage. As a result, medical personnel are able to work more efficiently, serve patients better and provide higher quality healthcare.

When developing and training machine learning models for healthcare, open and free datasets are an essential starting point for data scientists and engineers, but they can be hard to come by. Here are 27 excellent open datasets for healthcare machine learning:

General Healthcare, Medical and Life Sciences Datasets

1. WHO

Global Health Observatory (GHO) resources by the WHO (World Health Organization). The GHO includes data sets and reports from 194 countries on a wide variety of topics. Health topics include mortality, child nutrition, water and sanitation, HIV/AIDS, health systems, injuries, and more.

2. DHS Program 

Medical datasets from the DHS (Demographic and Health Surveys) Program spanning multiple topics. The datasets cover countries around the globe, both individually and in cross-country comparisons, and are based on surveys, biomarker testing and geographic data.

3. HealthData.gov

The official US government healthcare website, which includes multiple datasets of the US population. Dataset topics range from COVID-19 to health equity, and more.

4. Life Science Database Archive

A life science dataset from Japan, gathered by life scientists over long periods of time. Includes datasets about organs, antigens, chemicals and more.

5. Data.gov.au

The official source of Australian open government data. Includes all Australian datasets, healthcare and beyond.

6. Kent Ridge Biomedical Datasets

Datasets from the biomedical field. This database includes journal-published data.

7. CDC WONDER

A US CDC (Centers for Disease Control and Prevention) database named WONDER (Wide-ranging Online Data for Epidemiological Research). Contains public health information around topics like mortality, natality, cancer, vaccinations and more. 

8. openFDA

APIs and raw download access to structured datasets courtesy of the FDA. Data includes adverse events of drug use, drug product labeling and recall enforcement reports.
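
As a rough illustration of the API access mentioned above, the sketch below builds a query URL for the openFDA drug adverse event endpoint and optionally fetches it using only the Python standard library. The `drug/event` endpoint and the `search` and `limit` parameters follow openFDA's public query syntax, but the helper names and the example search term are illustrative assumptions, so check the openFDA documentation before relying on them.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

OPENFDA_BASE = "https://api.fda.gov"  # public openFDA API host


def build_openfda_url(endpoint: str, search: str, limit: int = 5) -> str:
    """Build a query URL for an openFDA endpoint (hypothetical helper)."""
    query = urlencode({"search": search, "limit": limit})
    return f"{OPENFDA_BASE}/{endpoint.strip('/')}.json?{query}"


def fetch_openfda(url: str, timeout: float = 10.0) -> list:
    """Fetch a URL and return the 'results' list from the JSON response."""
    with urlopen(url, timeout=timeout) as response:
        payload = json.load(response)
    return payload.get("results", [])


if __name__ == "__main__":
    # Example search term is an assumption chosen for illustration only.
    url = build_openfda_url(
        "drug/event", 'patient.drug.medicinalproduct:"ASPIRIN"', limit=1
    )
    print(url)
    # Uncomment to make a live request (requires network access):
    # for report in fetch_openfda(url):
    #     print(report.get("safetyreportid"))
```

Keeping URL construction separate from the network call makes the query easy to inspect and test without hitting the API.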

9. The Big Cities Health Inventory Data Platform

150,000 data points from 35 large US cities across 120 health-related metrics pertaining to life expectancy and deaths, access to health services, mental health and substance use, chronic health conditions, infectious diseases, maternal and child health, violence and injury, demographics, physical and built environment, social and economic factors, and poisoning.

10. OECD Health Statistics 2023

Comparable statistics on health and health systems across OECD countries.

Image Datasets for Life Sciences, Healthcare and Medicine

11. OASIS

OASIS (Open Access Series of Imaging Studies) provides neuroimaging datasets of the brain. The healthcare dataset currently contains 1,098 subjects across 2,168 MR sessions and 1,608 PET sessions.

12. OpenNeuro

716 public datasets from 27,482 participants with MRI, PET, MEG, EEG and iEEG data.

13. ADNI

The Alzheimer’s Disease Neuroimaging Initiative (ADNI) free public dataset provides MRI and PET images, genetics, cognitive tests, and CSF and blood biomarkers collected by Alzheimer’s researchers. The data covers Alzheimer’s disease patients, subjects with mild cognitive impairment and elderly controls.

14. DeepLesion

A large set of more than 32,000 CT images from the US National Institutes of Health Clinical Center.

Mortality and Diseases Datasets

15. Human Mortality Database

A global database with mortality rates and population estimates for developed countries, including Spain, Canada, Czechia, the US, Japan, Ireland and more.

16. Chronic Disease Data

An open dataset by the US CDC (Centers for Disease Control and Prevention). The dataset contains 124 indicators of chronic disease data, collected from various states and territories.

17. CHDS

Datasets by CHDS (Child Health and Development Studies) that help investigate how health and disease are passed on between generations. This includes genes as well as social, personal, and environmental factors.

Genome Datasets

18. 1000 Genomes Project

Datasets from the international collaboration that produced the most detailed catalog of human genetic variation. The datasets include SNPs, structural variants and haplotype context. Since the project has been completed, the data is now available without embargo.

Hospital and Healthcare Services Datasets

19. Healthcare Cost and Utilization Project (HCUP)

An official US Department of Health and Human Services website, intended for identifying, tracking, and analyzing trends in healthcare utilization, access, charges, quality, and outcomes.

20. Medicare

337 datasets from Medicare providers in the US. Includes data about facilities, doctors, hospitals, suppliers, rehabilitation, and more.

21. OECD Hospital Performance

National and hospital-level data on 30-day acute myocardial infarction (AMI) mortality, using linked and unlinked data along with hospital characteristics.

22. FY 2024 HHS Contingency Staffing Plan

Contingency plans for HHS operations in the absence of appropriations, expected to lead to 60% of HHS employees being retained and 40% furloughed.

Cancer Datasets

23. CT Medical Images

A dataset of CT images for examining trends associated with contrast use and patient age.

24. Broad Institute Cancer Program Datasets

Datasets related to tumor types, cells, gene expression patterns and more.

25. SEER Cancer Incidence

Datasets from the US National Cancer Institute related to race, gender, and age.

X-Ray Datasets

26. COVID-19 X-Ray Dataset 

6500 images of AP/PA chest X-Rays with pixel-level polygonal lung segmentations. 517 of the cases have COVID-19.
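
With 517 COVID-19 cases among 6500 images, the positive class makes up only about 8% of the data, which is worth accounting for when training a classifier. The sketch below computes inverse-frequency class weights using the same "balanced" heuristic that scikit-learn applies (total / (number of classes × class count)). It assumes the counts quoted above are exact and that each case corresponds to one image, which may not hold for the actual dataset; the function and label names are illustrative.

```python
def balanced_class_weights(counts: dict) -> dict:
    """Return inverse-frequency weights: total / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


# Counts taken from the description above (assumed exact for illustration).
counts = {"covid": 517, "non_covid": 6500 - 517}

prevalence = counts["covid"] / sum(counts.values())
weights = balanced_class_weights(counts)

print(f"COVID-19 prevalence: {prevalence:.1%}")
print({label: round(weight, 2) for label, weight in weights.items()})
```

The resulting weights can be passed to a loss function or sampler so that errors on the rare COVID-19 class count for more than errors on the majority class.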

27. NIH Database of 100,000 Chest X-Rays

112,000+ chest X-ray images from 30,000+ unique patients, labeled using NLP. According to community comments, some of the labels appear to be inconsistent.
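
Because the dataset holds several images per patient on average, a random image-level split can place the same patient in both the training and test sets and inflate evaluation scores. The sketch below shows one way to split by patient instead. The record format, the IDs and the `split_by_patient` helper are hypothetical stand-ins, not the dataset's actual metadata schema.

```python
import random
from collections import defaultdict


def split_by_patient(records, test_fraction=0.2, seed=0):
    """Split (image_id, patient_id) pairs so no patient appears in both sets."""
    by_patient = defaultdict(list)
    for image_id, patient_id in records:
        by_patient[patient_id].append(image_id)

    patients = sorted(by_patient)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(patients)
    n_test = max(1, round(len(patients) * test_fraction))
    test_patients = set(patients[:n_test])

    train = [img for p in patients if p not in test_patients for img in by_patient[p]]
    test = [img for p in patients if p in test_patients for img in by_patient[p]]
    return train, test, test_patients


# Toy records standing in for real metadata (hypothetical IDs).
records = [(f"img_{i:03d}.png", f"patient_{i % 10}") for i in range(40)]
train_images, test_images, held_out = split_by_patient(records)
print(len(train_images), len(test_images), sorted(held_out))
```

Splitting on patient IDs rather than image IDs keeps every image of a held-out patient out of training.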

The Future of Healthcare Data for Machine Learning

The pressure of the global pandemic, strained public resources and a growing population all demand greater efficiency from healthcare institutions and governments. ML can help bridge that gap by providing technological solutions that improve the efficiency and quality of healthcare processes.

This has been demonstrated in many cases. For example, Sheba hospital used MLOps to deliver real-time AI services to the hospital floor: predicting and mitigating complications such as COVID-19 patient deterioration, aiding decision making during surgery, and orchestrating and optimizing the patient’s journey.