
Navigating the Dataset Landscape: Essential Resources for Scientific Research

This resource is designed to supplement instructional sessions and serve as a reference for researchers, students, and faculty members interested in discovering and utilizing datasets in the sciences and engineering fields.

Discipline-Specific Repositories

UCI Machine Learning Repository: A collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms.
Focus: General machine learning

OpenML: A platform for sharing datasets, algorithms, and experiments, facilitating collaboration and reproducibility in machine learning.
Focus: General machine learning

Kaggle Datasets: A platform offering community-contributed datasets for machine learning projects in NLP, computer vision, and other areas, alongside pre-trained models.
Focus: General machine learning

TensorFlow Datasets: Provides ready-to-use datasets for machine learning, exposing them as tf.data.Datasets or NumPy arrays. It streamlines downloading, preprocessing, and versioning, ensuring efficient pipelines and reproducible research.
Focus: TensorFlow applications, benchmarking, data preprocessing, model training 
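
As a rough illustration of what "ready-to-use" means here, the sketch below loads a small slice of a dataset and iterates over it as NumPy arrays. It assumes the tensorflow-datasets package is installed; the dataset name "mnist" and the helper names percent_split and load_examples are illustrative choices, not part of the library.

```python
def percent_split(split: str, percent: int) -> str:
    """Build a TFDS split string selecting the first `percent` percent of a split."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    return f"{split}[:{percent}%]"


def load_examples(name: str, split: str):
    """Return an iterable of (input, label) NumPy pairs for a TFDS dataset."""
    # Imported here so the helper above works without TensorFlow installed.
    import tensorflow_datasets as tfds

    # as_supervised=True yields (input, label) tuples instead of feature dicts.
    dataset = tfds.load(name, split=split, as_supervised=True)  # a tf.data.Dataset
    return tfds.as_numpy(dataset)  # the same records, exposed as NumPy arrays


if __name__ == "__main__":
    # "mnist" is only an example; the first call downloads and caches the data.
    for image, label in load_examples("mnist", percent_split("train", 1)):
        print(image.shape, label)
        break
```

Requesting only a small percentage of a split keeps a first experiment quick; passing the plain split name (for example "train") loads the whole split.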

Google Research Datasets: Offers large-scale datasets across computer science fields like machine learning, NLP, computer vision, and robotics, supporting research in image recognition, language modeling, and autonomous systems.
Focus: Computer vision, robotics, NLP, simulation, modeling, and multimedia studies

ImageNet: A large image dataset organized according to the WordNet hierarchy, covering nouns only. Each "synset" (a WordNet node grouping synonymous terms for one concept) is represented by hundreds of annotated images. It serves as a widely used benchmark for computer vision tasks such as object categorization.
Focus: Computer vision

Common Crawl: A massive, open-access archive of web data collected regularly since 2008. It provides raw web pages, metadata, and text extracts, hosted on AWS and academic cloud platforms. The data is free to use and is widely used for AI, NLP, and analytics work.
Focus: Web archive, NLP

GitHub Datasets: A large collection of datasets shared by the user community, covering various domains and formats. Example: awesomedata, awesome-public-datasets, GitHub, version 20230903, https://github.com/awesomedata/awesome-public-datasets.
Focus: Multi-disciplinary