27 Best Free Human Annotated Datasets for Machine Learning

Alexandra Quinn | December 15, 2023

Successfully training AI and ML models relies not only on large quantities of data, but also on the quality of their annotations. Data annotation accuracy directly impacts the accuracy of a model and the reliability of its predictions. This is where human-annotated datasets come into play. Human-annotated datasets offer a level of precision, nuance, and contextual understanding that automated methods struggle to match.

In this blog post, we bring you the top 27 free human-annotated datasets you can use for your model training and evaluation. To cater to a wide variety of needs, these free datasets for machine learning cover a diverse set of categories and use cases:

  • Sentiment analysis
  • Language and docs
  • For ethical AI use
  • Industries
  • Images and videos

When training and developing your models, don’t neglect the final phase: deploying your LLM (or other type of model). Use MLOps solutions to keep the process automated, streamlined, scalable and iterative, which helps your model make it into production successfully. To learn more about how to build and scale your pipelines, click here.

What are Human-annotated Datasets?

Human-annotated datasets are data records that have been annotated by humans. This means that humans have added information, like labels or tags. For example, humans can provide inputs for categorization, sentiment analysis, bounding boxes for images, etc.

Human annotation helps advance ML and AI model training and evaluation. Because annotations provide the ground truth, algorithms can learn patterns from them and make better predictions on new, unseen data. As such, human annotation is an important step in building successful AI and ML systems.
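
To make the idea of "ground truth" concrete, here is a minimal sketch of one common way to turn several annotators' labels into a single label per record: a simple majority vote. The records, field names and the `majority_label` helper are hypothetical and are not taken from any dataset in this list.

```python
from collections import Counter


def majority_label(labels):
    """Return the most common label, or None if there is no clear winner."""
    counts = Counter(labels).most_common()
    if not counts:
        return None  # no annotations at all
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie between the top labels; needs further review
    return counts[0][0]


# Hypothetical records: each text was labeled by three annotators.
records = [
    {"text": "Great battery life", "labels": ["positive", "positive", "neutral"]},
    {"text": "Arrived late and broken", "labels": ["negative", "negative", "negative"]},
    {"text": "It is a phone", "labels": ["neutral", "positive", "negative"]},
]

# Map each text to its aggregated label (None marks disagreement).
ground_truth = {r["text"]: majority_label(r["labels"]) for r in records}
```

Records that come back as `None` are the ones where annotators disagreed. In practice these are often sent for another round of review rather than dropped.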

Now let’s dive into the datasets themselves:

Category #1: Sentiment Analysis

1. Customer Reviews Dataset

A dataset containing more than 1,000 customer reviews and social media posts that are classified by sentiment.

Get the dataset here.

2. Stock Sentiment Analysis Dataset

A dataset with 1,000 stock market tweets with labeled sentiments.

Get the dataset here.

3. Brand Sentiment Analysis Dataset

A dataset including 1,000 online conversations that capture users’ sentiment towards products.

Get the dataset here.

4. HumAID (Human-Annotated Disaster Incidents Data)

A dataset for crisis informatics containing ~77K human-labeled tweets across 19 disaster events that happened between 2016 and 2019. Categories include: caution and advice, displaced people and evacuations, don't know can't judge, infrastructure and utility damage, injured or dead people, missing or found people, not humanitarian, other relevant information, requests or urgent needs, rescue volunteering or donation effort, and sympathy and support.

Get the dataset here.

5. GoEmotions

58,000 Reddit comments labeled with 27 emotion categories (12 positive, 11 negative and 4 ambiguous) plus “neutral”.

Get the dataset here.

 

Category #2: Language & Docs

6. Search Evaluation Dataset

A dataset with search queries, the intent associated with each search query, URL results and a quality rating.

Get the dataset here.

7. Question-Answering Dataset

A dataset that includes questions and answers from real webpages, news articles and texts.

Get the dataset here.

8. HANNA (Human-ANnotated NArratives for ASG evaluation)

1,056 stories that were generated from 96 prompts from the WritingPrompts dataset. Each story was annotated by three raters on six criteria: relevance, coherence, empathy, surprise, engagement and complexity.

Get the dataset here.
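
When each story carries ratings from several raters, a typical first step is to average the scores per criterion. The sketch below shows that step for the six HANNA criteria. The rating values, the dictionary layout, the 1-5 scale and the `average_ratings` helper are assumptions for illustration, not the dataset's actual file format.

```python
from statistics import mean

# The six criteria named in the dataset description.
CRITERIA = ["relevance", "coherence", "empathy", "surprise", "engagement", "complexity"]


def average_ratings(ratings):
    """Average each criterion across all raters for one story."""
    return {c: mean(r[c] for r in ratings) for c in CRITERIA}


# Made-up ratings from three raters for a single story (1-5 scale assumed).
story_ratings = [
    {"relevance": 4, "coherence": 3, "empathy": 2, "surprise": 1, "engagement": 3, "complexity": 2},
    {"relevance": 5, "coherence": 3, "empathy": 3, "surprise": 2, "engagement": 3, "complexity": 2},
    {"relevance": 3, "coherence": 3, "empathy": 4, "surprise": 3, "engagement": 3, "complexity": 2},
]

scores = average_ratings(story_ratings)
```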

9. DocLayNet: A Human-Annotated Dataset for Document-Layout Analysis

A dataset for document layout analysis in COCO format, which can be used for PDF conversions. The dataset contains 80,863 manually annotated pages from a large number and variety of data sources, with layout elements on each page labeled using 11 distinct classes.

Get the dataset here.
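
For readers new to the COCO format mentioned above, the sketch below shows its general shape: a list of images, a list of categories, and a list of annotations whose `bbox` is `[x, y, width, height]`. The tiny inline sample (file name, sizes, boxes and the two class names) is made up for illustration, and the helper names are hypothetical.

```python
import json
from collections import Counter

# Tiny made-up sample in COCO layout; real annotation files are far larger.
coco_json = """
{
  "images": [{"id": 1, "file_name": "page_0001.png", "width": 1025, "height": 1025}],
  "categories": [{"id": 1, "name": "Title"}, {"id": 2, "name": "Text"}],
  "annotations": [
    {"id": 10, "image_id": 1, "category_id": 1, "bbox": [100, 50, 800, 60]},
    {"id": 11, "image_id": 1, "category_id": 2, "bbox": [100, 150, 800, 300]},
    {"id": 12, "image_id": 1, "category_id": 2, "bbox": [100, 480, 800, 200]}
  ]
}
"""


def count_classes(coco):
    """Count how many annotations each category name has."""
    names = {c["id"]: c["name"] for c in coco["categories"]}
    return Counter(names[a["category_id"]] for a in coco["annotations"])


def to_corners(bbox):
    """Convert COCO [x, y, width, height] to (x_min, y_min, x_max, y_max)."""
    x, y, w, h = bbox
    return (x, y, x + w, y + h)


coco = json.loads(coco_json)
class_counts = count_classes(coco)
first_box = to_corners(coco["annotations"][0]["bbox"])
```

Checking the class counts like this before training is a quick way to spot class imbalance in a layout dataset.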

Category #3: For Ethical AI Use

10. Toxicity Dataset

A large dataset of social media toxicity. This dataset includes hateful speech from Twitter, Facebook, YouTube, Reddit, and more. This dataset can be used for ensuring ethical and fair AI responses.

Get the dataset here.

11. Fake News Dataset

A dataset of fake news in social media posts. The dataset can be used to promote fairness and help prevent the spreading of false information.

Get the dataset here.

12. Facebook Hate Speech Dataset

A collection of hate speech posts on Facebook, including racism, sexism, political attacks, insults, threats and violence. The dataset can be used to help filter hate speech and ensure ethical responses.

Get the dataset here.

13. RLHF Dataset to Reduce Harm

This dataset includes data about helpfulness and harmlessness and contains harmful dialogues. It can be used for training preference models for subsequent RLHF training or for understanding how crowdworkers red-team models.

Get the dataset here.

 

Category #4: Industries

14. Human Resources - Resumes and Job Categorization Dataset

A dataset with CVs. Each resume is classified with the job title, category and other parameters.

Get the dataset here.

15. Finance - Credit Card Transactions Dataset

A dataset containing credit card transactions. The transactions are classified according to intent and financial category.

Get the dataset here.

16. Cybersecurity - Email Spam Dataset

The dataset contains real spam and non-spam emails, including whether or not they were caught by Gmail's spam filters. It can be used to identify fraudulent emails.

Get the dataset here.

 

Category #5: Images and Videos

17. Humans in 3D

A dataset of annotated people, including annotations of joints, eyes, ears, nose, shoulders, elbows, wrists, hips, knees, ankles, 3D pose, a visibility boolean, upper clothes, lower clothes, dress, socks, shoes, hands, gloves, neck, face, hair, hat, sunglasses, bag, occluder, and body type (male, female or child).

Get the dataset here.

18. Scanned Images and OCR Texts from Medieval Documents

Scanned images and OCR texts that are reprints from the Hussite era. Annotations include layout analysis, OCR evaluation and language identification.

Get the dataset here.

19. GAN Images

A dataset containing 600 fake images and 400 real images that were evaluated based on eight attributes. The real images are from the ImageNet dataset and the fake images were generated by a generative adversarial net.

Get the dataset here.

20. Semantic Segmentation of Radishes

Annotated images of radishes collected during the spring of 2017.

Get the dataset here.

21. ERA (Event Recognition in Aerial videos) Dataset

2,864 aerial videos with labels from 25 classes. The dataset can be used to help capture dynamic events at scale.

Get the dataset here.

22. ExoNet: Wearable Camera Images of Human Locomotion Environments

Approximately 923,000 human-annotated images from 5.6 million RGB images of indoor and outdoor real-world walking environments. The dataset can be used to help develop robotic leg prostheses and exoskeletons, humanoids, autonomous legged robots, powered wheelchairs, and other mobility assistive technologies.

Get the dataset here.

23. YouTube8M-MusicTextClips

4,000 high-quality human-written text descriptions of audio from YouTube8M video clips. Each file contains the video_id, start time, end time and text description. The dataset is intended mainly for evaluation.

Get the dataset here.
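
As a rough sketch of how records with a video_id, start time, end time and text description might be loaded, the example below parses a small CSV with Python's standard library. The header names (`video_id`, `start_s`, `end_s`, `text`), the sample rows and the `load_clips` helper are assumptions. Check the actual files for their real column names and format.

```python
import csv
import io

# Made-up rows; the header names are assumptions, not the dataset's schema.
sample = """video_id,start_s,end_s,text
abc123,30,40,"Slow piano melody with soft strings"
xyz789,120,130,"Fast drums and distorted guitar"
"""


def load_clips(handle):
    """Read clip rows and add the clip duration in seconds."""
    clips = []
    for row in csv.DictReader(handle):
        start, end = float(row["start_s"]), float(row["end_s"])
        clips.append({
            "video_id": row["video_id"],
            "start_s": start,
            "end_s": end,
            "duration_s": end - start,
            "text": row["text"],
        })
    return clips


clips = load_clips(io.StringIO(sample))
```

To read from disk instead, pass an open file handle (for example `open(path, newline="")`) in place of the `io.StringIO` object.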

24. Video Sub-shot Segmentation Evaluation

A dataset that annotates sub-shot segmentations for 33 single-shot videos. Sub-shots were divided according to video activity type. Each entry includes the video ID, filename, YouTube video name, URL and video frame rate.

Get the dataset here.

25. Fruit Annotations

Bounding box annotations of 11 classes: apple, avocado, capsicum, mango, orange, cantaloupe, strawberry, blueberry, cherry, kiwi and wheat. (Wheat is included even though it is not a fruit.)

Get the dataset here.

26. HAM (Human-annotated Mappings)

A dataset for molecular graph partitioning. The dataset contains mappings of 1,206 organic molecules with fewer than 25 heavy atoms.

Get the dataset here.

27. Relative Human

In-the-wild RGB images of multiple people with rich human annotations. Annotations include depth layers (the relative depth ordering between all people in the image), age group classification (adults, teenagers, kids, babies), gender, bounding box and 2D pose.

Get the dataset here.

 

Deploying Trained Models

MLOps pipelines enable and enhance the deployment of trained models by automating and streamlining the process, while eliminating technical and operational silos. This includes preparing the model for deployment, optimizing for performance and compatibility, real-time data processing, serving models through scalable serverless functions, CI/CD, and monitoring and management tools. This comprehensive approach ensures models are not only deployed efficiently but also remain secure, compliant, and high-performing. To see how you can bring your trained models to life, click here.