Image Datasets
- In order to load a dataset of images, the files should be separated into folders according to their labels. For instance, images of dogs should be in a folder named "dog", while images of cats should be in a folder named "cat".
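For example, a layout like the following would define two classes, cat and dog (the folder and file names here are merely illustrative):
path/to/data/cat/cat001.jpg
path/to/data/cat/cat002.jpg
path/to/data/dog/dog001.jpg
path/to/data/dog/dog002.jpg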
- Once the images are correctly organized (as described above), the datasets.ImageFolder() method from the torchvision package can be used to load the image dataset. To learn more about this method, see the documentation: http://pytorch.org/docs/master/torchvision/datasets.html#imagefolder
- Having the dataset ready, a data loader can be created, making it possible to fetch batches of images and matching labels of a given size. Ideally, this data loader shuffles the data at the start of every epoch, so that the network doesn't pick up artefacts from the order of the examples. The data loader object can then be used, for instance, in a for loop or through an iterator (a quick sanity check of a loaded batch is shown right after the example below). You can learn more about data loaders in this documentation: http://pytorch.org/docs/master/data.html#torch.utils.data.DataLoader
- Transformations can be applied to the data when loading it. For instance, since all the images should have the same size, transformations like transforms.Resize() and transforms.CenterCrop() can be applied. The images also need to be converted to tensors, through the transforms.ToTensor() operation. All desired transformations can then be pipelined using transforms.Compose(). To see all the available transformations, visit the documentation: http://pytorch.org/docs/master/torchvision/transforms.html
import torch
from torchvision import datasets, transforms

# Setting the transformations to apply to the data when loading it.
# In this example, the images are resized so that the smallest edge becomes 255,
# then center-cropped to 224 x 224 and finally converted to tensors.
# (The pipeline is named "transform" so it doesn't shadow the transforms module.)
transform = transforms.Compose([transforms.Resize(255),
                                transforms.CenterCrop(224),
                                transforms.ToTensor()])
# Loading the data
dataset = datasets.ImageFolder('path/to/data', transform=transform)
# Defining a data loader to be able to iterate through batches
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)
# Looping through the data, getting a batch on each iteration
for images, labels in dataloader:
    pass
# Get one batch
images, labels = next(iter(dataloader))
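As a quick sanity check, the shapes of a fetched batch and the labels inferred by datasets.ImageFolder() can be inspected (the shapes and class names in the comments assume the illustrative layout above):
# Each batch is a tensor of shape (batch size, channels, height, width)
print(images.shape)  # e.g. torch.Size([32, 3, 224, 224])
print(labels.shape)  # e.g. torch.Size([32])
# ImageFolder assigns an integer label to each folder name, in alphabetical order
print(dataset.classes)       # e.g. ['cat', 'dog']
print(dataset.class_to_idx)  # e.g. {'cat': 0, 'dog': 1}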
Data Augmentation
- A common strategy for training neural networks is to introduce randomness in the input data itself. For example, you can randomly rotate, mirror, scale, and/or crop your images during training. This will help your network generalize as it's seeing the same images but in different locations, with different sizes, in different orientations, etc.
- It's also typical to normalize the data, i.e. subtract the mean and divide by the standard deviation, using the transforms.Normalize() transformation. Subtracting the mean centers the data around zero, and dividing by the standard deviation squishes the values: with a mean and standard deviation of 0.5 per channel (as in the example below), a pixel value of 0 maps to (0 - 0.5)/0.5 = -1 and a value of 1 maps to (1 - 0.5)/0.5 = 1, so the result lies between -1 and 1. Normalizing helps keep the network weights near zero, which in turn makes backpropagation more stable. Without normalization, networks tend to fail to learn.
- These random operations don't directly change the dataset size (e.g. a dataset of 32 images still yields 32 images after the transformations). However, as training passes through different epochs, the images can differ: in each epoch, each image may or may not be altered by each random transformation. As such, after all the training epochs, the total number of unique images the model has seen is larger than the original dataset size, as the quick check after the code below illustrates.
# Training transformations: random rotation, crop and horizontal flip introduce
# variety; conversion to tensor and normalization are applied at the end
train_transforms = transforms.Compose([transforms.RandomRotation(30),
                                       transforms.RandomResizedCrop(224),
                                       transforms.RandomHorizontalFlip(),
                                       transforms.ToTensor(),
                                       transforms.Normalize([0.5, 0.5, 0.5],
                                                            [0.5, 0.5, 0.5])])
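To see the per-epoch randomness described above in action, the same image can be passed through the pipeline twice; the two results will almost certainly differ, since each pass samples new random parameters (the image path below is just a placeholder):
from PIL import Image
img = Image.open('path/to/data/dog/dog001.jpg')  # placeholder path
first = train_transforms(img)
second = train_transforms(img)
# Each call draws a new rotation angle, crop and flip decision,
# so the two tensors are almost certainly different
print(torch.equal(first, second))  # expected: False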
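Note that random augmentation is normally applied to the training set only; a common convention (not part of the original example) is to give the validation/test data a deterministic pipeline, so that measurements are reproducible:
# Deterministic transformations for validation/test data: no randomness,
# but the same normalization as used for training
test_transforms = transforms.Compose([transforms.Resize(255),
                                      transforms.CenterCrop(224),
                                      transforms.ToTensor(),
                                      transforms.Normalize([0.5, 0.5, 0.5],
                                                           [0.5, 0.5, 0.5])])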
Validation data
- A validation dataset is useful to reliably measure the model's performance on unseen data (data which wasn't used in training) and, based on that, choose the best version of the model (a sketch of such a loop follows the example below). Due to this, overfitting can be detected and, at least, reduced.
- In the example below, the CIFAR10 dataset is loaded and data loaders are set up for the train, validation and test subsets.
import torch
import numpy as np
from torchvision import datasets
import torchvision.transforms as transforms
from torch.utils.data.sampler import SubsetRandomSampler
# number of subprocesses to use for data loading
num_workers = 0
# how many samples per batch to load
batch_size = 20
# percentage of training set to use as validation
valid_size = 0.2
# convert data to a normalized torch.FloatTensor
transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
# choose the training and test datasets
train_data = datasets.CIFAR10('data', train=True, download=True, transform=transform)
test_data = datasets.CIFAR10('data', train=False, download=True, transform=transform)
# obtain training indices that will be used for validation
num_train = len(train_data)
indices = list(range(num_train))
np.random.shuffle(indices)
split = int(np.floor(valid_size * num_train))
train_idx, valid_idx = indices[split:], indices[:split]
# define samplers for obtaining training and validation batches
train_sampler = SubsetRandomSampler(train_idx)
valid_sampler = SubsetRandomSampler(valid_idx)
# prepare data loaders (combine dataset and sampler)
train_loader = torch.utils.data.DataLoader(train_data, batch_size=batch_size,
                                           sampler=train_sampler, num_workers=num_workers)
valid_loader = torch.utils.data.DataLoader(train_data, batch_size=batch_size,
                                           sampler=valid_sampler, num_workers=num_workers)
test_loader = torch.utils.data.DataLoader(test_data, batch_size=batch_size,
                                          num_workers=num_workers)
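With these loaders in place, a common pattern (sketched below, not part of the original example) is to measure the validation loss after each epoch and keep the weights that achieve the lowest value; model, criterion and optimizer are assumed to be defined elsewhere:
n_epochs = 30  # illustrative value
valid_loss_min = float('inf')  # lowest validation loss seen so far
for epoch in range(n_epochs):
    # training pass
    model.train()
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    # validation pass: no gradient tracking needed
    model.eval()
    valid_loss = 0.0
    with torch.no_grad():
        for images, labels in valid_loader:
            loss = criterion(model(images), labels)
            valid_loss += loss.item() * images.size(0)
    valid_loss /= len(valid_sampler)
    # keep the version of the model that does best on the validation set
    if valid_loss < valid_loss_min:
        torch.save(model.state_dict(), 'model.pt')
        valid_loss_min = valid_loss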