Machine Learning is revolutionizing the world of healthcare. ML models can help predict patient deterioration, optimize logistics, assist with real-time surgery and even determine drug dosage. As a result, medical personnel are able to work more efficiently, serve patients better and provide higher quality healthcare.
When developing and training machine learning models for healthcare, open and free datasets are an essential starting point for data scientists and engineers, and they can be hard to come by. Here are 22 excellent open datasets for healthcare machine learning:
General Healthcare, Medical and Life Sciences Datasets
Global Health Observatory (GHO) resources by the WHO (World Health Organization). The GHO includes data sets and reports from 194 countries on a wide variety of topics. Health topics include mortality, child nutrition, water and sanitation, HIV/AIDS, health systems, injuries, and more.
2. DHS Program
Medical datasets from the DHS (Demographic and Health Surveys) Program spanning multiple topics. These datasets include data from around the globe, both from individual countries and from cross-country comparisons. They are based on surveys, biomarker testing and geographic data.
The official US government healthcare website, which includes multiple datasets of the US population. Dataset topics range from COVID-19 to health equity, and more.
A life science dataset from Japan, gathered by life scientists over long periods of time. Includes datasets about organs, antigens, chemicals and more.
The official source of Australian open government data. Includes all Australian datasets, healthcare and beyond.
Datasets from the biomedical field. This database includes journal-published data.
7. CDC WONDER
A US CDC (Centers for Disease Control and Prevention) database named WONDER (Wide-ranging Online Data for Epidemiological Research). Contains public health information around topics like mortality, natality, cancer, vaccinations and more.
Image Datasets for Life Sciences, Healthcare and Medicine
OASIS (Open Access Series of Imaging Studies) provides neuroimaging data sets of the brain. The healthcare dataset currently contains 1098 subjects across 2168 MR sessions and 1608 PET sessions.
716 public datasets from 27,482 participants with MRI, PET, MEG, EEG and iEEG data.
The Alzheimer’s Disease Neuroimaging Initiative (ADNI) free public dataset provides MRI and PET images, genetics, cognitive tests, and CSF and blood biomarkers collected by Alzheimer’s researchers. The data covers Alzheimer’s disease patients, subjects with mild cognitive impairment and elderly controls.
11. Deep Lesion
An enormous image set of more than 32,000 CT images from the US National Institutes of Health Clinical Center.
Mortality and Diseases Datasets
A global database with mortality and population estimation rates in developed countries, including Spain, Canada, Czechia, the US, Japan, Ireland and more.
An open dataset by the US CDC (Centers for Disease Control and Prevention). The dataset contains 124 indicators of chronic disease data, collected from various states and territories.
Datasets by CHDS (Child Health and Development Studies) that help investigate how health and disease are passed on between generations. This includes genes as well as social, personal, and environmental factors.
Datasets from the international collaboration that enabled completing the most detailed catalog of human genetic variation. The datasets include SNPs, structural variants and haplotype context. Since the project has been completed, the data is now available without embargo.
Hospital and Healthcare Services Datasets
An official US Department of Health and Human Services website, intended for identifying, tracking, and analyzing trends in healthcare utilization, access, charges, quality, and outcomes.
337 datasets from Medicare providers in the US. Includes data about facilities, doctors, hospitals, suppliers, rehabilitation, and more.
A dataset of CT images for examining trends in relation to contrast and patient age.
Datasets related to tumor types, cells, gene expression patterns and more.
Datasets from the US National Cancer Institute related to race, gender, and age.
6500 images of AP/PA chest X-Rays with pixel-level polygonal lung segmentations. 517 of the cases have COVID-19.
112,000+ chest X-ray images from 30,000+ unique patients, labeled with NLP. According to community comments, there appear to be some discrepancies among the labels.
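Since labels mined with NLP can be noisy, one common sanity check before training is to manually review a small sample and measure how well the two sets of labels agree. The sketch below computes Cohen's kappa for that purpose. It is only an illustration: the label values and the two lists are hypothetical and do not come from any of the datasets above, and `cohen_kappa` is a helper written for this example, not part of a library.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two label lists, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(labels_a)
    # Fraction of items on which the two sources agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each source's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both sources use a single identical label; agreement is trivial.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical sample: NLP-derived labels vs. labels from a manual review.
nlp_labels = ["effusion", "normal", "normal", "effusion", "normal", "normal"]
manual_labels = ["effusion", "normal", "effusion", "effusion", "normal", "normal"]

kappa = cohen_kappa(nlp_labels, manual_labels)
# Indices where the two sources disagree, worth inspecting by hand.
disagreements = [
    i for i, (a, b) in enumerate(zip(nlp_labels, manual_labels)) if a != b
]
print(f"kappa={kappa:.3f}, disagreements at {disagreements}")
```

A kappa close to 1 suggests the mined labels are reliable for that class, while a low value is a signal to relabel or to down-weight the affected examples.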
The Future of Healthcare Data for Machine Learning
The pressure of the global pandemic, strained public resources and a growing population require efficiency from healthcare institutions and governments. ML can help bridge that gap by providing technological solutions that improve the efficiency and quality of healthcare processes.
This has been exemplified in many cases. For example, Sheba hospital used MLOps to deliver real-time AI services to the hospital floor, such as predicting and mitigating complications like COVID-19 patient deterioration, aiding decision making during surgery, and orchestrating and optimizing the patient’s journey.