In the previous post, we discussed how to train a segmentation model in TensorFlow. This post covers how to balance a skewed dataset while training a segmentation model in TensorFlow; the same technique also works for imbalanced data in classification problems. Let us first recall our segmentation problem.

1. Problem Description and Dataset

We want to solve a nail semantic segmentation problem: for each image, we want to predict the segmentation mask of the nails it contains.

(Figure: example images and their corresponding nail masks.)

Our data is organized as follows:

├── Images
│   ├── 1
│       ├── first_image.png
│       ├── second_image.png
│       ├── third_image.png
│   ├── 2
│   ├── 3
│   ├── 4
├── Masks
│   ├── 1
│       ├── first_image.png
│       ├── second_image.png
│       ├── third_image.png
│   ├── 2
│   ├── 3
│   ├── 4

We have two folders, Images and Masks. Each folder has four sub-folders 1, 2, 3, 4, corresponding to four distribution patterns of nails. Images contains the input images and Masks contains the labels, i.e. the segmentation masks of the input images.

We download the data from this link and put it in data_root, for example:

data_root = "./nail-segmentation-dataset"

2. Data Preparation

Similar to the training pipeline of the previous post, we want a CSV file that stores the paths of the images and masks:

index    images
1        path_first_image.png
2        path_second_image.png
3        path_third_image.png
4        path_fourth_image.png

For that, we use the make_csv_file function in the data_processing.py file.
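
The actual implementation of make_csv_file lives in the repository; the sketch below only illustrates what it could look like, assuming the Images/<folder>/<file>.png layout shown above and that each mask shares the relative path of its image:

import glob
import os

import pandas as pd


def make_csv_file(data_root: str, split: str = "train") -> None:
    # Sketch only: collect every image path relative to data_root. Masks live under
    # the Masks folder with the same relative path, so storing the image path is enough.
    image_paths = sorted(glob.glob(os.path.join(data_root, "Images", "*", "*.png")))
    relative_paths = [os.path.relpath(path, data_root) for path in image_paths]
    df = pd.DataFrame({"images": relative_paths})
    os.makedirs(os.path.join(data_root, "csv_file"), exist_ok=True)
    df.to_csv(os.path.join(data_root, "csv_file", f"{split}.csv"), index=False)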

What more do we need in order to balance the data? Recall that our image data is split across sub-folders, and the distribution of segmentation coverage differs a lot from one folder to another. The number of images per folder is also very different (the data is skewed):

Folder    Number of images
0         749
1         144
2         126
3         52
4         34
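
For reference, these counts can be reproduced directly from train.csv; a small sketch, assuming the Images/<folder>/<file>.png path layout used above:

import pandas as pd

train = pd.read_csv("./nail-segmentation-dataset/csv_file/train.csv")
# The sub-folder name is the second component of each image path.
print(train["images"].apply(lambda x: x.split("/")[1]).value_counts())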

We want to split this data frame into several smaller data frames, one per folder. To do that we use:

import pandas as pd


def split_data_train(data_root: str) -> None:
    r"""
    Split the train CSV file into one CSV file per sub-folder.
    The purpose of this step is to build a balanced dataset later.
    """
    path_csv = f"{data_root}/csv_file/train.csv"
    train = pd.read_csv(path_csv)
    # The second component of each image path is the sub-folder name.
    train["type"] = train["images"].apply(lambda x: x.split("/")[1])
    for i in train["type"].unique().tolist():
        df = train.loc[train["type"] == i]
        df.to_csv(f"{data_root}/csv_file/train{i}.csv", index=False)
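
Calling it once after the train CSV file has been created writes one CSV file per sub-folder:

split_data_train(data_root="./nail-segmentation-dataset")
# Writes ./nail-segmentation-dataset/csv_file/train0.csv, ..., train4.csv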

We now have five new CSV files, train0.csv, train1.csv, train2.csv, train3.csv, and train4.csv. We will use these files in the next step.

We will reuse everything from the previous post (DataLoader, model, mixed precision, logger). We only need to change how we load the datasets and how we balance the data while loading them.

3. How to Define the DataLoader

Note that all the functions we use below were defined in the previous post.

For more details, you can find the source code on GitHub.

We first define the paths of the five CSV files and load the image and mask paths:

train0_csv_dir = f"{data_root}/csv_file/train0.csv"
train1_csv_dir = f"{data_root}/csv_file/train1.csv"
train2_csv_dir = f"{data_root}/csv_file/train2.csv"
train3_csv_dir = f"{data_root}/csv_file/train3.csv"
train4_csv_dir = f"{data_root}/csv_file/train4.csv"

train0_dataset = load_data_path(data_root, train0_csv_dir, "train")
train1_dataset = load_data_path(data_root, train1_csv_dir, "train")
train2_dataset = load_data_path(data_root, train2_csv_dir, "train")
train3_dataset = load_data_path(data_root, train3_csv_dir, "train")
train4_dataset = load_data_path(data_root, train4_csv_dir, "train")
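
Here load_data_path comes from the previous post; we assume it returns a tuple whose first element is the list of image paths (the steps_per_epoch computation at the end of this post relies on the same structure). A quick sanity check of the per-folder sizes:

# Assumes each *_dataset is a tuple whose first element is the list of image paths.
for name, dataset in zip(
    ["train0", "train1", "train2", "train3", "train4"],
    [train0_dataset, train1_dataset, train2_dataset, train3_dataset, train4_dataset],
):
    print(name, len(dataset[0]))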

Using tf_dataset, we create the five datasets. Note that we do not batch at this step; we will combine these datasets with sampling weights and batch them once we have the whole dataset.

A nice side effect of this method is that we can use a different augmentation for each sub-dataset. For example, we could apply train_transform to the first dataset and valid_transform to the second one (see the sketch after the code block below).

train0_loader = tf_dataset(
    dataset=train0_dataset,
    shuffle=False,
    batch_size=None,
    transforms=train_transform(),
    dtype=dtype,
    device=args.device,
)
train1_loader = tf_dataset(
    dataset=train1_dataset,
    shuffle=False,
    batch_size=None,
    transforms=train_transform(),
    dtype=dtype,
    device=args.device,
)
train2_loader = tf_dataset(
    dataset=train2_dataset,
    shuffle=False,
    batch_size=None,
    transforms=train_transform(),
    dtype=dtype,
    device=args.device,
)
train3_loader = tf_dataset(
    dataset=train3_dataset,
    shuffle=False,
    batch_size=None,
    transforms=train_transform(),
    dtype=dtype,
    device=args.device,
)
train4_loader = tf_dataset(
    dataset=train4_dataset,
    shuffle=False,
    batch_size=None,
    transforms=train_transform(),
    dtype=dtype,
    device=args.device,
)
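
As mentioned above, nothing forces us to use the same augmentation for every sub-dataset. A variant that applies lighter, validation-style augmentation to one of them could look like this (a sketch; valid_transform is the function from the previous post):

train1_loader = tf_dataset(
    dataset=train1_dataset,
    shuffle=False,
    batch_size=None,
    transforms=valid_transform(),  # lighter augmentation for this sub-dataset only
    dtype=dtype,
    device=args.device,
)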

We then shuffle and repeat each dataset:

data_loaders = [
    train0_loader.apply(tf.data.experimental.shuffle_and_repeat(100000, count=epochs)),
    train1_loader.apply(tf.data.experimental.shuffle_and_repeat(100000, count=epochs)),
    train2_loader.apply(tf.data.experimental.shuffle_and_repeat(100000, count=epochs)),
    train3_loader.apply(tf.data.experimental.shuffle_and_repeat(100000, count=epochs)),
    train4_loader.apply(tf.data.experimental.shuffle_and_repeat(100000, count=epochs)),
]
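
Note that tf.data.experimental.shuffle_and_repeat is deprecated in recent TensorFlow releases; a roughly equivalent formulation using the core tf.data API (same buffer size and repeat count) would be:

data_loaders = [
    loader.shuffle(100000).repeat(epochs)
    for loader in [train0_loader, train1_loader, train2_loader, train3_loader, train4_loader]
]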

Next we compute the sampling weights. Here we want each dataset to be sampled with equal probability, so that on average every batch contains the same number of samples from each sub-dataset:

weights = [1 / len(data_loaders)] * len(data_loaders)
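
With five datasets this gives weights = [0.2, 0.2, 0.2, 0.2, 0.2], so every sub-dataset is sampled with the same probability regardless of how many images it contains. For comparison, weights proportional to the folder sizes in the table above would simply reproduce the original imbalance:

# Per-folder image counts from the table in Section 2.
counts = [749, 144, 126, 52, 34]
unbalanced_weights = [c / sum(counts) for c in counts]
# [0.678, 0.130, 0.114, 0.047, 0.031] -- folder 0 would dominate almost every batch.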

We use tf.data.experimental.sample_from_datasets to balance the data.

The inputs of tf.data.experimental.sample_from_datasets are:

  • datasets: A non-empty list of tf.data.Dataset objects with a compatible structure.
  • weights: (Optional.) A list or Tensor of len(datasets) floating-point values, where weights[i] represents the probability of sampling from datasets[i], or a tf.data.Dataset object where each element is such a list. Defaults to a uniform distribution across datasets.

tf.data.experimental.sample_from_datasets returns:

  • A dataset that interleaves elements from datasets at random, according to weights if provided, otherwise with uniform probability.

train_loader = tf.data.experimental.sample_from_datasets(data_loaders, weights=weights, seed=None)

We now have a train_loader with balanced data. We only need to batch it before feeding the data into the model.

train_loader = train_loader.batch(batch_size)
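
To make the sampling behavior concrete, here is a minimal, self-contained toy example (independent of the nail data) that interleaves two sources; in newer TensorFlow versions the same function is also exposed as tf.data.Dataset.sample_from_datasets:

import tensorflow as tf

# Two tiny sources: a stream of zeros and a stream of ones.
zeros = tf.data.Dataset.from_tensor_slices(tf.zeros([1000], dtype=tf.int32))
ones = tf.data.Dataset.from_tensor_slices(tf.ones([1000], dtype=tf.int32))

# Sample from both with equal probability, exactly as we do with the five loaders above.
balanced = tf.data.experimental.sample_from_datasets([zeros, ones], weights=[0.5, 0.5])

print(next(iter(balanced.batch(10))).numpy())  # e.g. [0 1 1 0 0 1 0 1 1 0]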

Once we have train_loader, we define valid_loader and the model in the same way as in the previous post. Finally, we fit the model.

    history = model.fit(
        train_loader,
        steps_per_epoch=steps_per_epoch,
        epochs=epochs,
        validation_data=valid_loader,
        callbacks=callbacks,
    )

where steps_per_epoch is (roughly) the total number of training images divided by the batch size. Since Keras cannot reliably infer an epoch length from the sampled, repeated dataset, we compute it ourselves:

    steps_per_epoch = (
        int(
            (
                len(train0_dataset[0])
                + len(train1_dataset[0])
                + len(train2_dataset[0])
                + len(train3_dataset[0])
                + len(train4_dataset[0])
            )
            / batch_size
        )
        + 1
    )
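
As a worked example, with the folder counts from Section 2 and a hypothetical batch_size of 16 (the actual value comes from the training configuration):

total_images = 749 + 144 + 126 + 52 + 34              # 1105 training images in total
batch_size = 16                                       # illustrative value only
steps_per_epoch = int(total_images / batch_size) + 1  # 70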

For more details, you can find the source code on GitHub.

In the next post, we will cover how to train a deep learning model in PyTorch Lightning.