Hands-on Tutorials

Why we need to build a Live CNN Training Dashboard?
Introduction
Prerequisites
System description
How to create an environment and start training?
Conclusion
References

Why we need to build a Live CNN Training Dashboard?

When I studied in mathematical lyceum, my teacher taught me that the best way to understand something is to visualize it. For example, we had a wooden board, plasticine, and metal wire to be able to visualize stereometry problems. It helped a lot to develop visual thinking and skills in solving challenging tasks.

I truly believe that real data scientists should understand algorithms and have a feeling on how to improve it if something works not fine. Especially in the area of deep learning. In my mind, the best way to develop these skills is to see how the model is trained, what happens when you change hyperparameters. This is the reason why I want to share how to build a simple dashboard for CNN live training with the opportunity to tune a few hyperparameters online.

There is common knowledge that if we choose too big learning rate, we will see how our loss function explodes (our model will not converge); if we choose too small learning rate, the training process can last too long. What about dropout? There is an opinion that dropout reduces overfitting. I get to check everything myself even if I believe, because to know and to believe are different things.

Below is the short demo of my dashboard. Red dots on loss function & accuracy plots represent the training dataset, blue dots represent the test dataset.

Introduction

Dashboard displays the following statistics:

loss function value in time;
accuracy in time;
distribution of activation maps values for the last step;
history of hyperparameters changes (table);

For this task, I am using AlexNet architecture to classify images on 10 classes: Alaskan malamute, baboon, echidna, giant panda, hippo, king penguin, llama, otter, red panda, and wombat. Images are downloaded from the ImageNet. I will not go into details in this post, but you can explore file get_dataset.py. During training, the following parameters can be tweaked:

optimizer;
This parameter determines the algorithm we use to optimize our model. I use only Adam and SGD with Nesterov momentum. If you want to understand the optimization technique more, I encourage you to watch a video from Stanford here. There are many fantastic details about optimization.
learning rate;
This parameter determines how fast we are moving down the slope when we are updating weights. For basic gradient descent formula for weights updates look like this: w := w — lr * dw.
weight decay;
For our case it is simply L2 regularization: R(W) = SUM(W * W). It is considered that weight decay does not make a lot of sense in the context of CNN, but you can see it yourself how it works live. You can read some description of L1 and L2 regularization techniques here.
dropout;
Common regularization strategy for neural network. The idea is randomly set some neurons to zero on each training step. The hyperparameter is the probability to drop each neuron. Common value is 0.5 (50%). We can choose any integer value from 20 to 80. (in %) More details can be watched in the same video that I shared for optimizer.

Script can be easily changed to add additional functionality.

Prerequisites

I assume that you understand what is CNN and have basic knowledge of the following:

PostgreSQL (to store real-time data);
Dash (to build dashboard, https://plotly.com/dash/);
PyTorch (to build CNN models);

System description

There are four main parts of the system: dataset, model, database, and dashboard/UI. These parts interact with each other to successfully run the system. Firstly I will describe each of these parts and after that, I will give a short description of how they interact with each other.

Dataset

For this exercise, I use a dataset from the ImageNet that contains the following ten classes: Alaskan malamute, baboon, echidna, giant panda, hippo, king penguin, llama, otter, red panda, and wombat. To download all images from ImageNet, I can run python board.py from the following location: ../cnn_live_training.

Firstly, I have to find classes ids and save them to some variable:

The ImageNet stores URLs to images. Some URLs/images might not exist anymore. To get these URLs based on class id, I use the following function:

To download all images I use a loop where I download image by image. Below is the function to download image by URL:

The full version of the code can be seen in the file get_dataset.py. You can easily change these classes to other classes or you can even change the ImageNet to your custom dataset.

Model

For the training, I am using by default the AlexNet architecture with Adam or SGD with Nesterov momentum optimizer. Optionally, the VGG16 can be chosen. Models can be imported either from the file models.py or from torchvision.models. The second option has the opportunity to use pre-trained weights. Dataset preparation happens in the file data_preparation.py. The training process happens in the file train.py.

I don’t have the goal to explain in this article how to build a pipeline for training CNN that is why I am not going into detail in this part. But I am happy to recommend the amazing course CS231n from Stanford and particularly HW2(Q4), where you can learn step by step how to build this pipeline. This homework can be found here.

Database

Before running the system, we have to create dl_playground DB in PostgreSQL with the schema cnn_live_training that contains three following tables: parameters, statistics, activations.

parameters
This table contains only one row with current parameters for the training CNN model. When we change any parameters in our dashboard (file board.py), this data will be updated in the parameters SQL table. The table contains the following columns:

optimizer;
Text data type. Can have two values: ‘Adam’ and ‘SGD+Nesterov’.
learning_rate;
Double data type. The values are between 0 and 1 with the 0.00005 step.
weight_decay;
Double data type. The values are between 0 and 1 with the 0.05 step.
dropout;
Integer data type. The values are between 20 and 80. (It is assumed that the values are in %.)
dt_updates;
Timestamp data type. Indicates date and time when data was modified.
stop_train;
Boolean data type. Indicates if we have to stop training.

statistics
This table contains statistics of the training process. Data is updated every --n-print step. The table contains the following columns:

dt_started;
Timestamp data type. Indicates when current training was started.
model_name;
Text data type. In this case, it can be only ‘MyAlexNet’.
epoch;
Integer data type. Indicates the number of training epochs.
step;
Integer data type. Indicates the number of training steps.
optimizer;
Text data type. Can have two values: ‘Adam’ and ‘SGD+Nesterov’.
learning_rate;
Double data type. The values are between 0 and 1 with the 0.00005 step.
weight_decay;
Double data type. The values are between 0 and 1 with the 0.05 step.
dropout;
Integer data type. The values are between 20 and 80.
dt;
Timestamp data type. Indicates date and time when data was modified.
train_loss;
Double data type. The value of loss function for the training dataset on the last step.
train_accuracy;
Double data type. The value of accuracy for the training dataset on the last step.
validate_loss;
Double data type. The value of loss function for the validation dataset on the last step.
validate_accuracy;
Double data type. The value of accuracy for the validation dataset on the last step.

activations
This table contains the current distribution of weights in activation maps for all convolutional and fully connected layers. The table contains the following columns:

nn_part;
Text data type. Can be either ‘features’ or ‘classifier’.
layer_type;
Text data type. Can be either ‘conv’ or ‘fc’.
number;
Integer data type. Indicates the layer number in a ‘nn’ part.
weights;
Double[] data type. Indicates average values of weights in bins.
num_weights;
Integer[] data type. Indicates numbers of values in bins.

Dashboard/UI

The dashboard consists of three main blocks: control panel, loss function & accuracy, and activation maps (distribution). These blocks are built using dash containers.

Control panel contains filters of parameters and “submit parameters” button that can be used to send chosen parameters to described above table “parameters”.
There are four filters: optimizer, learning rate, weight decay, and dropout.

Below is the script, how to create an optimizer filter (other filters are similar):

After that I create a container that contains all four filters:

How to create other parts of the control panel can be found in the file board.py.

Loss function & Accuracy contains a table with the history of used parameters and two plots with train/test loss function and accuracy values in time. Data is updated every one second (time interval can be changed) automatically.

Below is the script on how to create a table and button to stop training in the dashboard (I replaced real styles with short names for reading convenience):

Script to create plot template can be seen below:

Values are uploaded dynamically from PostgreSQL using callbacks (I provide only template for reading convenience):

I need to use a callback here because I want to update the plot and the table every 1 second. So, I have to use this variable as an input.

Activation maps (distribution) contains plots with distribution of activation map for each layer for the last step. Data is updated every one second (time interval can be changed) automatically.

The activations of the first two layers look similar to a normal distribution with the mean value in 0. The reason for this is for the first two layers we apply normalization. To understand more, I encourage you to watch a lecture from Stanford here.

Below is the script to create a container with the plots. It is similar to the previous container with loss function and accuracy plots:

The callback for the activation maps is similar to the “loss function & accuracy”:

How everything works

It’s time to wrap everything up. To recall back, my goal is to train CNN live and being able to control this process by changing hyperparameters. So how does it happen? I have a dashboard where we can see the progress of the CNN training and where we have some filters that we can choose and activate by pushing the button “Submit parameters”.

What happens after that? All these parameters are sent to the table parameters in my database in PostgreSQL, using callback in the file board.py and function update_params:

At the same time, the script train.py connects to a database at the end of each training step, seeking to update the optimizer if parameters get updated:

Every n_step step data from training is saved to statistics and activations tables in database in PostgreSQL:

And this data simultaneously displayed in the dashboard because the script board.py every 1 sec. connects to the same tables:

All parameters are displayed in the table by extracting this information from the table :

If we want to stop training beforehand, we can push the button “Stop Training” below the table. After pushing the button, the callback will change the variable stop_train from False to True in the parameters table in my database:

At the same time, the script train.py check this parameter every training step and if it is True, training will be interrupted.

Without practical recommendations on what parameters to use to start training, this post will not be complete. If you want to see that everything works, but don’t have time for experiments, you can start from the following parameters:

optimizer: Adam;
learning rate: 0.0003;
weight decay: 0;
dropout: 50%;

If you want to see how the model explodes, just increase the learning rate to 0.01. Good luck with your experiments.

How to create an environment and start training?

Virtual environment setting up

I will give a short description for Ubuntu, using a virtual environment (venv).

Install Python 3.8: sudo apt install python3.8-minimal
Install virtual environment with Python 3.8: sudo apt-get install python3.8-venv
Create virtual environment: run from cnn_live_training folder: python3.8 -m venv venv
Activate environment: source venv/bin/activate
Install required packages in the virtual environment:
pip install -r requirements.txt

Collect dataset

Run from the ../cnn_live_training command python get_dataset.py

Start training

Run from the ../cnn_live_training folder two following commands

python board.py
python train.py

Conclusion

In this story, I wanted to share my idea on how to nurture the feeling of training CNN. From one side, the idea is simple: build a training pipeline, create a dashboard and connect them using a database. But there are many annoying details that not possible to put in one small story. All script and additional details can be found in my git repository.

If this post makes someone interested and give additional knowledge, I will become slightly happier because it means that I reached my goal. I will appreciate any comments, constructive criticism, or questions, feel free to leave your feedback below or you can reach me via LinkedIn.

References

[1] L. Fei-Fei, R. Krishna and D. Xu, CS231n: Convolutional Neural Networks for Visual Recognition (2020), Stanford University

[2] A. Krizhevsky, I. Sutskever and G. E. Hinton, ImageNet Classification with Deep Convolutional Neural Networks (2012), NeurIPS 2012

[3] A. Nagpal, L1 and L2 Regularization Methods (2017), Towards Data Science

<hr><p>Live CNN Training Dashboard: Hyperparameters Tuning was originally published in TDS Archive on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>