UCI Machine Learning Repository: A collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms.
Focus: General machine learning
OpenML: A platform for sharing datasets, algorithms, and experiments, facilitating collaboration and reproducibility in machine learning.
Focus: General machine learning
Kaggle Datasets: A platform offering a collection of datasets for machine learning projects, spanning NLP, computer vision, and other areas, alongside pre-trained models.
Focus: General machine learning
TensorFlow Datasets: Provides ready-to-use datasets for machine learning, exposing them as tf.data.Dataset objects or NumPy arrays. It handles downloading, preprocessing, and versioning, which supports efficient input pipelines and reproducible research.
Focus: TensorFlow applications, benchmarking, data preprocessing, model training
Google Research Datasets: Offers large-scale datasets across computer science fields like machine learning, NLP, computer vision, and robotics, supporting research in image recognition, language modeling, and autonomous systems.
Focus: Computer vision, robotics, NLP, simulation, modeling, and multimedia studies
ImageNet: A large image dataset organized by the WordNet hierarchy, focusing on nouns. Each "synset" (a set of synonymous words naming one concept) is represented by hundreds of annotated images. It serves as a widely used computer vision benchmark and data source for tasks such as object categorization.
Focus: Computer vision
Common Crawl: Provides a massive, open-access archive of web data collected monthly since 2008. It includes raw web pages, metadata, and text extracts, and is hosted on AWS and academic clouds. The data is free to use and is widely applied in AI, NLP, and analytics.
Focus: Web archive, NLP
GitHub Datasets: A large collection of datasets shared by the user community, covering various domains and formats, e.g., awesome-public-datasets (awesomedata, GitHub, version 20230903, https://github.com/awesomedata/awesome-public-datasets).
Focus: Multi-disciplinary
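
As a minimal sketch of the TensorFlow Datasets entry above, the following Python snippet shows one way to load a dataset as a tf.data.Dataset and convert it to NumPy arrays, with an optional version pin for reproducibility. It assumes the tensorflow-datasets package is installed. The helper names dataset_spec and load_as_numpy, the choice of "mnist", and the split slice are illustrative assumptions, not part of the list itself.

```python
def dataset_spec(name, version=None):
    """Build a TFDS dataset spec such as "mnist:3.*.*" (the version pin is optional)."""
    return f"{name}:{version}" if version else name


def load_as_numpy(name="mnist", version=None, split="train[:1%]"):
    """Load a slice of a dataset and return (iterable of NumPy examples, metadata).

    Hypothetical helper for illustration; calling it downloads the dataset
    on first use.
    """
    # Imported lazily so this module can be imported without TFDS installed.
    import tensorflow_datasets as tfds

    # tfds.load returns a tf.data.Dataset; with_info=True also returns a DatasetInfo.
    ds, info = tfds.load(
        dataset_spec(name, version),
        split=split,
        as_supervised=True,  # yield (input, label) tuples instead of dicts
        with_info=True,
    )
    # tfds.as_numpy turns the tf.data.Dataset into an iterable of NumPy arrays.
    return tfds.as_numpy(ds), info
```

Calling load_as_numpy("mnist", version="3.*.*") would fetch the data on first use and then yield (image, label) pairs as NumPy arrays. The returned info.version records which dataset version was actually loaded, which is useful to log alongside experiment results.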