Naem Azam
Web Developer, Software Developer, Linux Admin, Researcher


Top Open Source Free Computer Vision Datasets

Computer vision is a field of artificial intelligence that trains computers to interpret and understand the visual world. Using digital images from cameras and videos and deep learning models, machines can accurately identify and classify objects — and then react to what they “see.”

COVID-19 X-Ray Dataset (V7)

This is V7’s original dataset, containing 6,500 AP/PA chest X-ray images with pixel-level polygonal lung segmentations. Of these, 517 are COVID-19 cases.

Each image contains:

  • Two “Lung” segmentation masks
  • A tag for the type of pneumonia (viral, bacterial, fungal, healthy/none)
  • If the patient has COVID-19, additional tags stating age, sex, temperature, location, intubation status, ICU admission, and patient outcome.

Lung annotations are polygons following pixel-level boundaries. You can export them in COCO, VOC, or Darwin JSON formats. Each annotation file contains a URL to the original full resolution image and a reduced size thumbnail.
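To make the polygon annotations a bit more concrete, here is a minimal sketch that takes one polygon in the flat `[x1, y1, x2, y2, ...]` layout used by COCO-style segmentation fields and derives its bounding box and area. The function name and the sample coordinates are illustrative assumptions, not part of V7's export tooling.

```python
from typing import List, Tuple


def polygon_bbox_and_area(flat: List[float]) -> Tuple[Tuple[float, float, float, float], float]:
    """Return ((x, y, width, height), area) for a flat COCO-style polygon."""
    if len(flat) < 6 or len(flat) % 2 != 0:
        raise ValueError("expected at least three (x, y) pairs")
    xs, ys = flat[0::2], flat[1::2]
    # Shoelace formula: sum the cross products of consecutive vertices.
    twice_area = 0.0
    for i in range(len(xs)):
        j = (i + 1) % len(xs)
        twice_area += xs[i] * ys[j] - xs[j] * ys[i]
    bbox = (min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys))
    return bbox, abs(twice_area) / 2.0


# Hypothetical rectangular outline, 4 px wide and 3 px tall.
bbox, area = polygon_bbox_and_area([0, 0, 4, 0, 4, 3, 0, 3])
print(bbox, area)
```

A real lung polygon would have many more vertices, but the same two calculations apply.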

CIFAR-10 & CIFAR-100

CIFAR-10 and CIFAR-100 are labeled subsets of the 80 Million Tiny Images dataset. They were collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton.

CIFAR-10 contains 60,000 32x32 color images in 10 classes (animals and real-life objects), with 6,000 images per class. The dataset is split into 50,000 training images and 10,000 test images. The classes are mutually exclusive, with no overlap.

CIFAR-100 consists of 100 classes containing 600 images each. There are 500 training images and 100 testing images per class.
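The numbers for the two CIFAR sets are easy to sanity-check. The sketch below recomputes the totals from per-class counts. It assumes the CIFAR-10 split is balanced across classes (5,000 training and 1,000 test images each), and the dictionary layout and function name are purely illustrative.

```python
# Per-class counts; the CIFAR-10 per-class split assumes a balanced split.
CIFAR_SPECS = {
    "CIFAR-10": {"classes": 10, "train_per_class": 5000, "test_per_class": 1000},
    "CIFAR-100": {"classes": 100, "train_per_class": 500, "test_per_class": 100},
}


def split_totals(spec: dict) -> tuple:
    """Return (train_total, test_total, overall_total) for one dataset spec."""
    train = spec["classes"] * spec["train_per_class"]
    test = spec["classes"] * spec["test_per_class"]
    return train, test, train + test


for name, spec in CIFAR_SPECS.items():
    print(name, split_totals(spec))
```

Both datasets come out at the same overall size; CIFAR-100 simply spreads the images over ten times as many classes.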


ImageNet

ImageNet is one of the most popular image databases, with more than 14 million hand-annotated images.

This database is organized according to the WordNet hierarchy (currently only the nouns), in which hundreds and thousands of images depict each node of the hierarchy. Object-level annotations provide a bounding box around the (visible part of the) indicated object.


Kinetics-700

It is a large video dataset consisting of 650,000 clips covering 700 human action classes.

The videos include human-object interactions, such as playing instruments, and human-human interactions, such as hugging. Each action class has at least 700 video clips. Each clip lasts about 10 seconds and is annotated with a single action class.


MNIST

It’s a large database of handwritten single digits containing 60,000 training images and 10,000 testing images.

It was released in 1999 and is used for classification tasks.
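Handwritten-digit sets of this kind are commonly distributed as IDX binary files (the original MNIST image files are). As a rough sketch, and assuming that layout (four big-endian 32-bit integers: magic number 2051, image count, rows, columns), the header can be read like this. The function name is hypothetical and the header bytes below are synthetic rather than read from a real download.

```python
import struct


def parse_idx_image_header(raw: bytes) -> tuple:
    """Return (count, rows, cols) from the first 16 bytes of an IDX image file."""
    if len(raw) < 16:
        raise ValueError("IDX image header needs at least 16 bytes")
    magic, count, rows, cols = struct.unpack(">IIII", raw[:16])
    if magic != 2051:  # 2051 marks an unsigned-byte, 3-dimensional IDX file
        raise ValueError(f"unexpected magic number: {magic}")
    return count, rows, cols


# Synthetic header describing 60,000 images of 28x28 pixels.
header = struct.pack(">IIII", 2051, 60000, 28, 28)
print(parse_idx_image_header(header))
```

With a real file, the pixel data would follow the header as `count * rows * cols` unsigned bytes.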


LSUN

LSUN (Large-scale Scene Understanding) contains close to one million labeled images for each of 10 scene categories and 20 object categories.

For training data, each category contains between roughly 120,000 and 3,000,000 images. The validation data includes 300 images, and the test data has 1,000 images for each category.



IMDB-Wiki

It is one of the largest publicly available datasets of human faces labeled with gender, age, and name.

It contains 523,051 images in total, with 460,723 face images from 20,284 celebrities from IMDb and 62,328 from Wikipedia.


MS COCO

The MS COCO (Microsoft Common Objects in Context) dataset consists of 328K images. It contains annotations for object detection, keypoint detection, panoptic segmentation, stuff image segmentation, captioning, and dense human pose estimation.
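For object detection, COCO-style annotation files are JSON documents with `images`, `annotations`, and `categories` lists, where each annotation points at an image and a category by id and stores its box as `[x, y, width, height]`. Here is a small sketch of grouping annotations per image. The toy dictionary stands in for a loaded JSON file and the function name is my own.

```python
from collections import defaultdict


def annotations_by_image(coco: dict) -> dict:
    """Map image_id -> list of (category_name, bbox) pairs."""
    names = {cat["id"]: cat["name"] for cat in coco["categories"]}
    grouped = defaultdict(list)
    for ann in coco["annotations"]:
        grouped[ann["image_id"]].append((names[ann["category_id"]], ann["bbox"]))
    return dict(grouped)


# Toy stand-in for a parsed annotation file (values are made up).
toy = {
    "images": [{"id": 1, "file_name": "a.jpg"}, {"id": 2, "file_name": "b.jpg"}],
    "categories": [{"id": 1, "name": "person"}, {"id": 18, "name": "dog"}],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 1, "bbox": [10, 20, 30, 40]},
        {"id": 11, "image_id": 1, "category_id": 18, "bbox": [5, 5, 50, 25]},
        {"id": 12, "image_id": 2, "category_id": 1, "bbox": [0, 0, 8, 16]},
    ],
}

index = annotations_by_image(toy)
print(index[1])
```

For real work you would more likely reach for an existing COCO loader, but the underlying structure is this simple.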

Labeled Faces in the Wild

It is a large-scale database of 13,000 face photographs designed for facial recognition tasks. Each face has been labeled with the person’s name.


Cityscapes

Cityscapes is a database containing a diverse set of stereo video sequences recorded in street scenes from 50 different cities. The images were captured over time in various light conditions and weather.

The Cityscapes dataset includes semantic, instance-wise, and dense pixel annotations for 30 classes grouped into 8 categories. It provides fine pixel-level annotations for 5,000 frames and coarse annotations for a further 20,000 frames.
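The "30 classes grouped into 8 categories" structure is handy when you want coarser labels. The sketch below writes out the grouping as I understand it from the Cityscapes label definitions and remaps a tiny grid of class names to categories. Treat the class list as an assumption to verify against the official definitions; the function name and the sample grid are hypothetical.

```python
from typing import Dict, List

# Assumed class grouping; verify against the official Cityscapes label list.
CATEGORY_TO_CLASSES: Dict[str, List[str]] = {
    "flat": ["road", "sidewalk", "parking", "rail track"],
    "human": ["person", "rider"],
    "vehicle": ["car", "truck", "bus", "on rails", "motorcycle",
                "bicycle", "caravan", "trailer"],
    "construction": ["building", "wall", "fence", "guard rail", "bridge", "tunnel"],
    "object": ["pole", "pole group", "traffic sign", "traffic light"],
    "nature": ["vegetation", "terrain"],
    "sky": ["sky"],
    "void": ["ground", "dynamic", "static"],
}

# Invert the mapping so each class name looks up its category.
CLASS_TO_CATEGORY = {
    cls: cat for cat, classes in CATEGORY_TO_CLASSES.items() for cls in classes
}


def to_categories(label_grid: List[List[str]]) -> List[List[str]]:
    """Replace every class name in a 2D grid with its category name."""
    return [[CLASS_TO_CATEGORY[name] for name in row] for row in label_grid]


tiny = [["road", "car"], ["sky", "person"]]
print(to_categories(tiny))
```

Real label images store integer ids per pixel rather than names, so in practice the lookup would go from id to category id, but the idea is the same.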


LabelMe-12-50k

This dataset contains 50,000 JPEG images (40,000 for training and 10,000 for testing) with 12 classes. The images are extracted from LabelMe.

Classes include objects such as a car, a person, a tree, or a keyboard. 50% of the images in the training and testing set show a centered object, while the remaining 50% show a randomly selected region of a randomly selected image (“clutter”).

This dataset can be used for object recognition.


Places

The Places dataset consists of 2.5 million images, each with a category label, across 205 scene categories. There are more than 5,000 images per category. It has been used to train CNNs and can be used for scene recognition tasks.

Places2 (365-Standard)

This is another dataset contributed by MIT. There are 1.8 million images from 365 scene categories. The dataset contains 50 images per category in the validation set and 900 in the testing set. The Places2 database can be used for scene recognition and for learning generic deep scene features for visual recognition.


Visual Genome

It is a large dataset and knowledge base of 108,077 images with annotated objects, attributes, and their relationships.

Stanford Dogs

This dataset has been built using images and annotations (class labels, bounding boxes) from ImageNet. It is a large-scale dataset containing 20,580 images of 120 breeds of dogs from around the world, with one category per breed.

Stanford Cars

This dataset contains 16,185 images and 196 classes of cars. The data is split into 8,144 training images and 8,041 testing images, with each class divided roughly 50-50 between the two sets.

You have to download the images and their class labels and bounding boxes separately.

Cat Dataset

The CAT dataset includes over 9,000 cat images with annotated facial features. There are annotations of the cat’s head with nine points for each image: two for eyes, one for the mouth, and six for the ears.
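A nine-point annotation like this is typically stored as a point count followed by x/y pairs on a single line. The following sketch parses such a line under two assumptions that you should check against the dataset's own documentation: that the line starts with the number 9, and that the points come in the order eyes, mouth, then three points per ear. The point names and the sample line are made up for illustration.

```python
# Assumed point order; confirm against the dataset's documentation.
POINT_NAMES = [
    "left_eye", "right_eye", "mouth",
    "left_ear_1", "left_ear_2", "left_ear_3",
    "right_ear_1", "right_ear_2", "right_ear_3",
]


def parse_cat_annotation(line: str) -> dict:
    """Parse 'N x1 y1 ... xN yN' into a {point_name: (x, y)} dictionary."""
    values = [int(v) for v in line.split()]
    count, coords = values[0], values[1:]
    if count != len(POINT_NAMES) or len(coords) != 2 * count:
        raise ValueError("expected 9 points, i.e. 18 coordinates")
    return {
        name: (coords[2 * i], coords[2 * i + 1])
        for i, name in enumerate(POINT_NAMES)
    }


# Made-up annotation line in the assumed layout.
sample = "9 175 160 239 162 199 199 149 121 137 78 166 93 281 101 312 96 296 133"
points = parse_cat_annotation(sample)
print(points["left_eye"], points["mouth"])
```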


CelebA

CelebFaces Attributes Dataset (CelebA) is a large-scale face attributes dataset with more than 200,000 celebrity images, each with 40 attribute annotations. The annotations include 10,177 unique identities and five landmark locations per image.

The dataset can be used as training and test sets for face detection, face attribute recognition, and landmark (or facial part) localization.

Face Mask Detection

This dataset contains 853 images belonging to 3 classes, along with their bounding boxes in the PASCAL VOC format. The classes are “with mask”, “without mask”, and “mask worn incorrectly”.
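PASCAL VOC annotations are XML files with one `<object>` element per box, each holding a `<name>` and a `<bndbox>` with `xmin`, `ymin`, `xmax`, and `ymax`. Here is a minimal sketch of reading them with the standard library. The sample XML, the file name in it, and the exact spelling of the class labels are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the PASCAL VOC layout (not taken from the dataset).
SAMPLE_XML = """<annotation>
  <filename>example.png</filename>
  <size><width>400</width><height>300</height><depth>3</depth></size>
  <object>
    <name>with_mask</name>
    <bndbox><xmin>10</xmin><ymin>20</ymin><xmax>60</xmax><ymax>80</ymax></bndbox>
  </object>
  <object>
    <name>without_mask</name>
    <bndbox><xmin>100</xmin><ymin>40</ymin><xmax>150</xmax><ymax>95</ymax></bndbox>
  </object>
</annotation>"""


def read_voc_boxes(xml_text: str) -> list:
    """Return a list of (label, (xmin, ymin, xmax, ymax)) tuples."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.findall("object"):
        bndbox = obj.find("bndbox")
        corners = tuple(
            int(bndbox.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        boxes.append((obj.findtext("name"), corners))
    return boxes


boxes = read_voc_boxes(SAMPLE_XML)
print(boxes)
```

For files on disk, `ET.parse(path).getroot()` would replace the `ET.fromstring` call.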

Fire and Smoke Dataset

It is a dataset with more than 7000 unique images in HD resolution.

It consists of early fire and smoke images captured using mobile phones in real-world scenarios. The images were captured under a wide variety of lighting conditions and weather. This dataset can be used for fire and smoke recognition, detection, plus anomaly detection.

It also contains various domestic scenes, including garbage and field crop burning, as well as domestic cooking, etc.

FloodNet Dataset

This dataset consists of high-resolution UAS imagery with detailed semantic annotations of the damage caused by hurricanes.

The data was collected with a small UAS platform, DJI Mavic Pro quadcopters, after Hurricane Harvey. The whole dataset has 2,343 images, divided into training (~60%), validation (~20%), and test (~20%) sets.
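To get a feel for what a roughly 60/20/20 split of 2,343 images looks like, here is a small sketch that shuffles image indices with a fixed seed and cuts them into three parts. It does not reproduce the dataset's official split; the function name, the seed, and the rounding choice are my own assumptions.

```python
import random


def split_indices(n: int, train_frac: float = 0.6, val_frac: float = 0.2, seed: int = 0):
    """Shuffle range(n) deterministically and return (train, val, test) index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = round(n * train_frac)
    n_val = round(n * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test


train, val, test = split_indices(2343)
print(len(train), len(val), len(test))
```

Because 2,343 does not divide evenly, the three parts can only approximate the stated percentages, which matches the "~" in the description.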
