28 Best Free NLP Datasets for Machine Learning [UPDATED]

Alexandra Quinn | July 29, 2025

Natural language processing (NLP) is a field of AI that enables machines to understand, interpret, and generate human language in a way that is both meaningful and contextually relevant. Recently, ChatGPT and similar applications have created a surge in consumer and business interest in NLP, and many organizations are now trying to incorporate NLP into their offerings.

To help with these efforts, we’ve compiled a list of top free NLP datasets that data scientists and other data professionals can use as a starting point for training their models.

The list is divided into a number of groups and types:

Q&A
Reviews and Ratings
Sentiment Analysis
Synonyms
Emails
Long-Form Content
Audio
Words & Definitions
Dialogue and Engagement

You can use these datasets for a number of use cases, like creating personal assistants, automating customer service, translating between languages, and more. The sky's the limit!

When planning how to train your NLP models, it's important to start with the end in mind: deployment. To learn more about how to build and scale your NLP pipelines, click here.

Now let’s dive into the list:

Q&A

1. Stanford Question Answering Dataset (SQuAD)

A reading comprehension dataset, comprising pairs of questions and answers based on Wikipedia articles.

Get the dataset here.
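To make the "pairs of questions and answers" structure concrete, here is a minimal sketch that flattens SQuAD-style JSON into one row per answer. It assumes the nested layout used by the SQuAD v1.1 release (`data` → `paragraphs` → `qas` → `answers`); the sample text itself is made up for illustration and is not taken from the dataset.

```python
import json

# Made-up sample that mimics the SQuAD v1.1 JSON layout (not real dataset content).
SAMPLE = """
{
  "version": "1.1",
  "data": [
    {
      "title": "Example_Article",
      "paragraphs": [
        {
          "context": "The river flows north through the valley.",
          "qas": [
            {
              "id": "q1",
              "question": "Which direction does the river flow?",
              "answers": [{"text": "north", "answer_start": 16}]
            }
          ]
        }
      ]
    }
  ]
}
"""


def flatten_squad(raw):
    """Yield one dict per (question, answer) pair found in the nested JSON."""
    for article in raw["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    start = answer["answer_start"]
                    # answer_start is a character offset into the context passage.
                    span = context[start:start + len(answer["text"])]
                    yield {
                        "question": qa["question"],
                        "answer": answer["text"],
                        "span_matches": span == answer["text"],
                    }


rows = list(flatten_squad(json.loads(SAMPLE)))
for row in rows:
    print(row["question"], "->", row["answer"])
```

Checking that the offset really points at the answer text is a cheap sanity test to run before training an extractive model. A question with an empty `answers` list simply produces no rows in this sketch.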

2. Jeopardy Questions

A JSON file with 216,930 Jeopardy questions, answers and additional data like the air date.

Get the dataset here.
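As a rough illustration of working with a JSON file like this one, the sketch below parses a list of clue records and filters them by air date. The field names (`category`, `air_date`, `question`, `answer`) and the ISO date format are assumptions about the file layout, and the two records are invented for the example, so check them against the actual download.

```python
import json
from datetime import date

# Invented records; field names and date format are assumptions about the real file.
SAMPLE = """
[
  {"category": "SCIENCE", "air_date": "2004-12-31",
   "question": "This gas makes up most of Earth's atmosphere",
   "answer": "Nitrogen"},
  {"category": "GEOGRAPHY", "air_date": "1999-05-10",
   "question": "This is the largest ocean on Earth",
   "answer": "The Pacific"}
]
"""


def parse_clues(text):
    """Load clue records and convert the air date string into a date object."""
    clues = json.loads(text)
    for clue in clues:
        clue["air_date"] = date.fromisoformat(clue["air_date"])
    return clues


def aired_since(clues, cutoff):
    """Keep only clues that aired on or after the cutoff date."""
    return [clue for clue in clues if clue["air_date"] >= cutoff]


clues = parse_clues(SAMPLE)
recent = aired_since(clues, date(2000, 1, 1))
for clue in recent:
    print(clue["category"], "|", clue["question"], "->", clue["answer"])
```

For the real file you would replace `SAMPLE` with the contents read from disk, for example `open(path, encoding="utf-8").read()`.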

3. The WikiQA Corpus

Question and answer pairs that link to Wikipedia pages with the answer. The dataset comprises 3,047 questions and 29,258 sentences, of which 1,473 are labeled as answer sentences.

Get the dataset here.
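The question, sentence, and answer-sentence counts quoted above can be recomputed from a labeled file. The sketch below does this for a tab-separated layout with `QuestionID`, `Question`, `Sentence`, and `Label` columns; those column names are an assumption about how the corpus is distributed, and the rows are invented for the example.

```python
import csv
import io

# Invented rows; the column names are an assumption about the distributed TSV files.
SAMPLE_TSV = (
    "QuestionID\tQuestion\tSentence\tLabel\n"
    "Q1\thow are caves formed\tCaves form when rock dissolves in water.\t1\n"
    "Q1\thow are caves formed\tMany caves are popular with tourists.\t0\n"
    "Q1\thow are caves formed\tSome caves contain underground lakes.\t0\n"
    "Q2\twhat is a glacier\tA glacier is a slow-moving mass of ice.\t1\n"
    "Q2\twhat is a glacier\tGlaciers are found on several continents.\t0\n"
)


def corpus_stats(tsv_text):
    """Count distinct questions, candidate sentences, and labeled answer sentences."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    question_ids = set()
    sentences = 0
    answer_sentences = 0
    for row in reader:
        question_ids.add(row["QuestionID"])
        sentences += 1
        if row["Label"] == "1":
            answer_sentences += 1
    return {
        "questions": len(question_ids),
        "sentences": sentences,
        "answer_sentences": answer_sentences,
    }


stats = corpus_stats(SAMPLE_TSV)
print(stats)
```

Because only a small share of sentences carry a positive label, it is worth checking this class balance before choosing an evaluation metric for answer sentence selection.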

4. Amazon Question/Answer Data

Around 1.4 million answered questions about products sold on Amazon.

Get the dataset here.
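A minimal sketch of grouping answered questions by product is shown below. It assumes each record is stored on its own line as a Python-style dict literal with `asin`, `question`, and `answer` keys; both the storage format and the key names are assumptions to verify against the download, and the sample records are invented.

```python
import ast
from collections import defaultdict

# Invented records; the one-dict-literal-per-line format and key names are assumptions.
SAMPLE_LINES = [
    "{'asin': 'B000000001', 'question': 'Does it include batteries?', "
    "'answer': 'Yes, two AA batteries are in the box.'}",
    "{'asin': 'B000000001', 'question': 'How long is the cable?', "
    "'answer': 'About one meter.'}",
    "{'asin': 'B000000002', 'question': 'Is it dishwasher safe?', "
    "'answer': 'Top rack only.'}",
]


def parse_records(lines):
    """Parse each non-empty line into a dict without executing arbitrary code."""
    for line in lines:
        line = line.strip()
        if line:
            # literal_eval only accepts Python literals, unlike eval.
            yield ast.literal_eval(line)


def group_by_product(records):
    """Collect (question, answer) pairs under each product identifier."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["asin"]].append((record["question"], record["answer"]))
    return dict(grouped)


by_product = group_by_product(parse_records(SAMPLE_LINES))
for asin, pairs in by_product.items():
    print(asin, len(pairs), "answered questions")
```

If the files turn out to be strict JSON lines instead, swapping `ast.literal_eval` for `json.loads` is the only change this sketch needs.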

5. Elementary Science Questions

Explanation graphs for elementary science exam questions in the US.

Reviews and Ratings

6. Yelp Open Dataset

7. Amazon Reviews

8. Amazon Fine Food Reviews

Sentiment Analysis

9. Movie Review Dataset

10. Movie and Finance Review Dictionaries

11. Sentiment140 (Twitter-based)

12. Twitter US Airline Sentiment

Synonyms

13. WordNet

Emails

14. Enron Email Dataset

Long-Form Content

15. 20 Newsgroups

16. arXiv Papers

17. The Blog Authorship Corpus

18. Legal Case Reports

19. One Week of Global News Feeds

20. Federal Contracts

21. Common Crawl

Audio

22. LibriSpeech

23. Noisy Speech Database

24. TIMIT

25. Free Spoken Digit Dataset

Words & Definitions

26. Urban Dictionary Words And Definitions

Dialogue and Engagement

27. Movie Dialog

28. Jokes

[July 2025 Update: Even more datasets]

Multilingual Mobile App Reviews

Climate Action Social Media Global Trends

Fake and Real News Dataset

Financial News Market Events

Binary Classification Dataset for Crime Headline

Jerome Powell Press Release Q&A

AI vs Human Content Detection

Human vs AI Generated Essays

4M LLM text logs dataset

Voice Search AI Conversational Queries

Scaling NLP Pipelines
