Top 22 Free Healthcare Datasets for Machine Learning
Alexandra Quinn | July 13, 2022
Machine learning is revolutionizing healthcare. ML models can help predict patient deterioration, optimize logistics, assist surgeons in real time and even determine drug dosage. As a result, medical personnel can work more efficiently, serve patients better and provide higher-quality healthcare.
Open and free datasets are an essential starting point for data scientists and engineers who develop and train machine learning models for healthcare, but they can be hard to come by. Here are 22 excellent open datasets for healthcare machine learning:
General Healthcare, Medical and Life Sciences Datasets
1. WHO
Global Health Observatory (GHO) resources by the WHO (World Health Organization). The GHO includes data sets and reports from 194 countries on a wide variety of topics. Health topics include mortality, child nutrition, water and sanitation, HIV/AIDS, health systems, injuries, and more.
2. DHS Program
Medical datasets from the DHS (Demographic and Health Surveys) Program spanning multiple topics. These datasets include data from around the globe, covering both individual countries and cross-country comparisons. They are based on surveys, biomarker testing and geographic data.
3. HealthData.gov
The US government's official open health data site, which hosts multiple datasets on the US population. Dataset topics range from COVID-19 to health equity.
4. Life Science Database Archive
An archive of life science datasets from Japan, gathered by life scientists over long periods of time. Includes datasets about organs, antigens, chemicals and more.
5. Data.gov.au
The official source of Australian open government data. Includes all Australian datasets, healthcare and beyond.
6. Kent Ridge Biomedical Datasets
Datasets from the biomedical field. This database includes journal-published data.
7. CDC WONDER
A US CDC (Centers for Disease Control and Prevention) database named WONDER (Wide-ranging Online Data for Epidemiological Research). Contains public health information on topics such as mortality, natality, cancer, vaccinations and more.
Image Datasets for Life Sciences, Healthcare and Medicine
8. OASIS
OASIS (Open Access Series of Imaging Studies) provides neuroimaging datasets of the brain. The dataset currently contains 1,098 subjects across 2,168 MR sessions and 1,608 PET sessions.
9. OpenNeuro
716 public datasets from 27,482 participants with MRI, PET, MEG, EEG and iEEG data.
10. ADNI
The Alzheimer’s Disease Neuroimaging Initiative (ADNI) free public dataset provides MRI and PET images, genetics, cognitive tests, and CSF and blood biomarkers collected by Alzheimer’s researchers. The data covers Alzheimer’s disease patients, subjects with mild cognitive impairment and elderly controls.
11. DeepLesion
A very large set of more than 32,000 CT images from the US National Institutes of Health Clinical Center.
Mortality and Diseases Datasets
12. Human Mortality Database
A database of mortality rates and population estimates for developed countries, including Spain, Canada, Czechia, the US, Japan, Ireland and more.
13. Chronic Disease Data
An open dataset by the US CDC (Centers for Disease Control and Prevention). The dataset contains 124 indicators of chronic disease data, collected from various states and territories.
14. CHDS
Datasets by CHDS (Child Health and Development Studies) that help investigate how health and disease are passed on between generations. This includes genes as well as social, personal, and environmental factors.
Genome Datasets
15. 1000 Genomes Project
Datasets from the international collaboration that enabled completing the most detailed catalog of human genetic variation. The datasets include SNPs, structural variants and haplotype context. Since the project has been completed, the data is now available without embargo.
Hospital and Healthcare Services Datasets
16. Healthcare Cost and Utilization Project (HCUP)
An official US Department of Health and Human Services website, intended for identifying, tracking, and analyzing trends in healthcare utilization, access, charges, quality, and outcomes.
17. Medicare
337 datasets from Medicare providers in the US. Includes data about facilities, doctors, hospitals, suppliers, rehabilitation, and more.
Cancer Datasets
18. CT Medical Images
A dataset of CT images for examining trends in image data associated with contrast use and patient age.
19. Broad Institute Cancer Program Datasets
Datasets related to tumor types, cells, gene expression patterns and more.
20. SEER Cancer Incidence
Cancer incidence datasets from the US National Cancer Institute, broken down by race, gender, and age.
X-Ray Datasets
21. COVID-19 X-Ray Dataset
6500 images of AP/PA chest X-Rays with pixel-level polygonal lung segmentations. 517 of the cases have COVID-19.
22. NIH Database of 100,000 Chest X-Rays
112,000+ chest X-ray images from 30,000+ unique patients, with labels extracted by NLP. Community comments on the dataset report some discrepancies among the labels.
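Because labels in a dataset like the NIH chest X-rays are generated automatically, it can be worth checking a sample against manual review before training. The following is a minimal sketch of one way to do that, not a method prescribed by the dataset authors. It assumes each image has a pipe-separated string of findings (for example "Cardiomegaly|Effusion"); the image names, label values and the `reviewed_labels` sample are hypothetical and exist only for illustration.

```python
from collections import Counter


def parse_findings(raw):
    """Split a pipe-separated label string into a set of findings."""
    return frozenset(part.strip() for part in raw.split("|") if part.strip())


def disagreement_by_label(nlp_labels, reviewed_labels):
    """Count, per finding, how often the two label sources differ."""
    counts = Counter()
    for image_id, raw in nlp_labels.items():
        if image_id not in reviewed_labels:
            continue  # only compare images that were manually reviewed
        auto = parse_findings(raw)
        manual = parse_findings(reviewed_labels[image_id])
        # Symmetric difference: findings present in exactly one source.
        for finding in auto ^ manual:
            counts[finding] += 1
    return counts


# Hypothetical NLP-derived labels (assumed format, made-up values).
nlp_labels = {
    "img_001.png": "Cardiomegaly|Effusion",
    "img_002.png": "No Finding",
    "img_003.png": "Infiltration",
}

# Hypothetical manually reviewed labels for the same images.
reviewed_labels = {
    "img_001.png": "Cardiomegaly",
    "img_002.png": "No Finding",
    "img_003.png": "Infiltration|Atelectasis",
}

disagreements = disagreement_by_label(nlp_labels, reviewed_labels)
print(dict(disagreements))
```

Findings with a high disagreement count are candidates for relabeling or for exclusion from the training targets.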
The Future of Healthcare Data for Machine Learning
The global pandemic, strained public resources and a growing population are putting pressure on healthcare institutions and governments to become more efficient. ML can help bridge that gap by providing technological solutions that improve the efficiency and quality of healthcare processes.
This has been demonstrated in many cases. For example, Sheba hospital used MLOps to deliver real-time AI services to the hospital floor: predicting and mitigating complications such as COVID-19 patient deterioration, aiding decision making during surgery, and orchestrating and optimizing the patient’s journey.