Structuring Machine Learning Projects

Updated: Nov 10, 2021

After a few experiments on a deep learning project, several ideas for improving the model's performance may come to mind. Should I:

  • Collect more data

  • Collect a more diverse training set

  • Train the algorithm longer with gradient descent

  • Try Adam instead of gradient descent

  • Try a bigger network

  • Try a smaller network

  • Try dropout

  • Add L2 regularisation

  • Modify the activation functions

  • Change the number of hidden units

This post tries to set up an effective way of working that helps you quickly pin down the problems in an ongoing deep learning project. Orthogonalization is one such process: each adjustment should affect only one aspect of the system's behaviour.


It is easiest to understand with the help of an old TV set. If you have used one, you may remember the few knobs on the side for tuning the picture. Each knob adjusts one thing, so the picture can be moved in each of the four directions without disturbing anything else. The same strategy is followed for deep learning projects: use a separate set of knobs for each problem.

The chain of assumptions is as follows:

  • Fit training set well on the cost function

  • Fit validation set well on the cost function

  • Fit test set well on the cost function

  • Perform well in the real world

How do you move forward when a link in the chain of assumptions does not hold?

  • If the model is not fitting the training set well, try knobs such as training a bigger network or switching to a better optimizer (for example Adam).

  • If the model is not fitting the validation set well, try regularisation or a bigger training set.

  • If the model is not fitting the test set well, try a bigger validation set, because the model has probably been over-tuned to the validation set.

  • If the model is not working well in the real world, try changing the validation dataset or modifying the cost function.
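The checklist above can be sketched as a small helper that walks the chain of assumptions in order and reports the first link that looks broken. This is only an illustration: the function name `next_knob`, the `gap` threshold of 0.02 and the `real_world_ok` flag are assumptions made for this example, not values from any library or from the original post.

```python
def next_knob(train_err, val_err, test_err, target_err,
              real_world_ok=True, gap=0.02):
    """Return (broken link, suggested knobs) for the first failing link.

    All errors are fractions in [0, 1]. `gap` is a hypothetical threshold
    for "not fitting well"; pick one that suits your project.
    """
    # Link 1: does the model fit the training set?
    if train_err - target_err > gap:
        return "training set", ["bigger network", "better optimizer (e.g. Adam)"]
    # Link 2: does it fit the validation set?
    if val_err - train_err > gap:
        return "validation set", ["regularisation", "bigger training set"]
    # Link 3: does it fit the test set?
    if test_err - val_err > gap:
        return "test set", ["bigger validation set"]
    # Link 4: does it work in the real world?
    if not real_world_ok:
        return "real world", ["change validation set", "change cost function"]
    return "nothing", []


link, knobs = next_knob(train_err=0.02, val_err=0.10, test_err=0.10,
                        target_err=0.01)
print(link, knobs)
```

Because each link has its own knobs, fixing one link should not disturb the others, which is the point of orthogonalization.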

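Try to avoid early stopping at the beginning of the hyper-parameter tuning experiments. It is a less orthogonal knob: stopping early changes how well the model fits the training set and, at the same time, how well it does on the validation set, so it is harder to tell which problem you are fixing.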

A single-number evaluation metric makes it easy to compare models. Precision and recall are two important evaluation metrics for a classifier. Precision is the fraction of the examples predicted as positive that are actually positive. Recall is the fraction of the actually positive examples that the model predicted as positive. There is often a trade-off between precision and recall, which makes it hard to pick a winner from two numbers. The F1 score combines them into one number using the formula 2/[1/Precision + 1/Recall], which is the harmonic mean of precision and recall, and helps to find the best performing deep learning model.
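As a minimal sketch of these three metrics for a binary classifier, the snippet below counts true positives, false positives and false negatives by hand. The function name and the toy labels are made up for illustration; in practice a library such as scikit-learn provides equivalent metrics.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Harmonic mean: 2 / (1/Precision + 1/Recall); defined as 0 if either is 0.
    f1 = 2 / (1 / precision + 1 / recall) if precision and recall else 0.0
    return precision, recall, f1


# Toy example: 4 actual positives, the model finds 2 and raises 1 false alarm.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0]
print(precision_recall_f1(y_true, y_pred))
```

The harmonic mean is pulled towards the smaller of the two values, so a model cannot get a high F1 score by doing well on only one of them.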

How should the data be distributed across the train/val/test sets? Choose a validation set and a test set that reflect the data you expect to get in the future and consider important to do well on. Both should come from the same distribution. The validation set, together with the metric, is the target the model is fitted towards, so choosing this distribution well helps to reach the target in less time.

What should be the size of the validation and test sets? In the earlier era of machine learning, with small datasets, the rule of thumb was to split the data in a 70:30 or 60:20:20 ratio. For today's large datasets these rules of thumb no longer fit. With a very large dataset, a split such as 98:1:1 can be reasonable. The test set only needs to be big enough to give high confidence in the overall performance of the system.

If the metric rates a model highly even though it makes mistakes that are unacceptable in the application, give those examples a larger weight, that is a larger penalty, in the evaluation. The loss can be modified in the same way to include these weights, which helps to arrive at a good algorithm.

The model's performance can also degrade in the real world because real-world inputs are often not of the same good quality as the data the model was evaluated on. In that case it is better to change your validation and test datasets to match the real-world input data.
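The split ratio and the weighted evaluation described above can be sketched in a few lines. Both helpers are illustrative: the names `split_indices` and `weighted_error`, the default 1% fractions and the weight of 10 in the example are assumptions chosen for this sketch, not recommendations from a specific library.

```python
import random


def split_indices(n_examples, val_frac=0.01, test_frac=0.01, seed=0):
    """Shuffle example indices and split them into train/val/test lists."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)  # shuffle so all sets share one distribution
    n_val = round(n_examples * val_frac)
    n_test = round(n_examples * test_frac)
    val = indices[:n_val]
    test = indices[n_val:n_val + n_test]
    train = indices[n_val + n_test:]
    return train, val, test


def weighted_error(y_true, y_pred, weights):
    """Misclassification error where each example carries its own weight."""
    wrong = sum(w for t, p, w in zip(y_true, y_pred, weights) if t != p)
    return wrong / sum(weights)


train, val, test = split_indices(1_000_000)
print(len(train), len(val), len(test))

# The second example is an "unacceptable" mistake, so it gets weight 10.
print(weighted_error([1, 0, 1], [1, 1, 1], [1, 10, 1]))
```

With one million examples, 1% still leaves 10,000 examples each for validation and test, which is why the old 60:20:20 habit is unnecessary at that scale.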

A deep learning model can be trained to surpass human-level performance. As long as it is still worse than humans, humans can help to improve it, for example:

  • Get labeled data from humans.

  • Get insight from manual error analysis: Why did a person get this right?

  • Better analysis of bias/variance.

Avoidable bias is the gap between the training error and the human-level error, which is used as a stand-in for the Bayes error, the level beyond which it is difficult to make improvements. When the gap between the training error and the human (Bayes) error is large, focus on the bias: improve the fit on the training set. When the gap between the training error and the validation error is large, focus on the variance: improve how well the model generalizes to the validation set.

This comparison is only valid when the training set and the val/test sets come from the same distribution. What if they come from different distributions, such as image data collected from the web and from phone cameras? In this case, a small set called the training-val set is carved out of the training data. It has the same distribution as the training set but is not used for training. The deep learning model is evaluated on this set as well, and the following comparisons are made:

  • If the training-val error is much higher than the training error, it is a variance problem.

  • If the training and training-val errors are close, but the validation error is much higher than the training-val error, it is a data-mismatch problem.

  • If the training, training-val and validation errors are about the same, and the human error is much smaller, it is an avoidable-bias problem.

  • If the training and training-val errors are about the same, the validation error is much higher, and the human error is much smaller, it is an avoidable-bias problem as well as a data-mismatch problem.

Data-mismatch problems can be addressed by collecting more training data similar to the val/test data. Another solution is artificial data synthesis, in which real data is mixed with noise data, such as car sound, to create new data.

One more number to check is the test error. A large gap between the validation error and the test error shows the degree of over-fitting to the validation set. There can also be cases where the validation and test performance is better than the training performance.
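The comparisons above boil down to four subtractions. The sketch below computes each gap and names the largest one; the function names and the example error values are hypothetical and only serve to illustrate the bookkeeping.

```python
def error_gaps(human, train, train_val, val, test):
    """Break the error chain into the four gaps discussed in the text."""
    return {
        "avoidable bias": train - human,            # human -> training
        "variance": train_val - train,              # training -> training-val
        "data mismatch": val - train_val,           # training-val -> validation
        "overfitting to val set": test - val,       # validation -> test
    }


def biggest_problem(human, train, train_val, val, test):
    """Name the gap that contributes the most error."""
    gaps = error_gaps(human, train, train_val, val, test)
    return max(gaps, key=gaps.get)


# Hypothetical errors: the jump happens between training-val and validation.
print(error_gaps(human=0.01, train=0.02, train_val=0.03, val=0.10, test=0.10))
print(biggest_problem(human=0.01, train=0.02, train_val=0.03, val=0.10, test=0.10))
```

A negative gap is possible, for instance when the validation data happens to be easier than the training data, so the numbers should be read as a guide rather than a strict decomposition.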

Let's perform an error analysis. To further improve the deep learning model, a careful error analysis helps to understand where the errors come from. Look at validation set examples to evaluate ideas: collect about 100 misclassified validation set examples and note why the model got each one wrong. For instance, an image may look very similar to an image of another class. A spreadsheet can be maintained to keep a record of the misclassified images together with the reasons.

Some errors come from incorrect labels. Deep learning algorithms are quite robust to random label errors in the training set, but much less so to systematic errors. When correcting labels, apply the same process to the validation and test sets to make sure they continue to come from the same distribution. Also consider examining the examples the algorithm got right as well as the ones it got wrong.

What about data collected from different distributions? Suppose the training data and the val/test data come from slightly different distributions. The model can be trained on data collected from other sources, such as a professional camera, the internet, or a phone camera. The validation and test sets, however, need to come from the target application where the model will be deployed. For example, the training set can contain data collected from the web while the val/test sets contain data collected from phone cameras, because the target user is on a phone.
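The spreadsheet step can be sketched as a simple tally: each misclassified example gets one or more reasons, and counting them shows which reason is worth working on first. The row format, the reason names and the function name below are invented for this example.

```python
from collections import Counter


def tally_error_causes(rows):
    """Count how often each cause appears among the reviewed examples.

    Returns (cause, count, fraction of reviewed examples), most common first.
    An example may have several causes, so fractions can sum to more than 1.
    """
    counts = Counter(cause for row in rows for cause in row["causes"])
    total = len(rows)
    return [(cause, n, n / total) for cause, n in counts.most_common()]


# Hypothetical review of five misclassified validation images.
rows = [
    {"image": "img_001", "causes": ["blurry"]},
    {"image": "img_002", "causes": ["looks like other class"]},
    {"image": "img_003", "causes": ["looks like other class", "blurry"]},
    {"image": "img_004", "causes": ["incorrect label"]},
    {"image": "img_005", "causes": ["looks like other class"]},
]
for cause, n, share in tally_error_causes(rows):
    print(f"{cause}: {n} ({share:.0%})")
```

The share of each cause is a rough ceiling on how much the error could drop if that cause were fixed completely, which helps to decide where the effort goes.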

Several approaches help to overcome the huge amount of data required to train a state-of-the-art solution. Transfer learning is one very successful approach for achieving good performance in less time and with less data. Features learned by a model on one dataset are transferred to a new model, which then learns the features of a different dataset.

Multi-task learning means learning more than one task at a time. For example, in a self-driving car, one deep learning model is built to learn several tasks such as recognizing pedestrians, cars, signboards, and traffic lights. The loss is calculated over all of them, by combining the losses for pedestrian, car, signboard, and traffic light recognition. Multi-task learning is useful when the tasks can share low-level features and when the amount of data available for each task is quite similar. It also helps when a neural network big enough to do well on all the tasks can be trained.
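A minimal sketch of the multi-task loss for the self-driving example: one binary cross-entropy term per task, summed for each image. Summing the per-task losses and skipping tasks whose label is missing are assumptions of this sketch; the task names and the example probabilities are made up.

```python
import math

TASKS = ("pedestrian", "car", "signboard", "traffic_light")


def binary_cross_entropy(y, p, eps=1e-12):
    """Loss for one yes/no label y (0 or 1) and predicted probability p."""
    p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def multi_task_loss(labels, probs):
    """Sum the per-task losses for one image.

    Tasks whose label is None (not annotated) are skipped, so partially
    labeled images can still be used.
    """
    return sum(
        binary_cross_entropy(labels[task], probs[task])
        for task in TASKS
        if labels.get(task) is not None
    )


labels = {"pedestrian": 1, "car": 0, "signboard": 1, "traffic_light": None}
probs = {"pedestrian": 0.9, "car": 0.2, "signboard": 0.6, "traffic_light": 0.5}
print(multi_task_loss(labels, probs))
```

Unlike a softmax classifier, one image can have several labels switched on at once here, since a single frame may contain both a pedestrian and a traffic light.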

Here is one example of an end-to-end deep learning approach for food recipe suggestions.

Thanks for going through this content. The next section will go into more depth and cover hyper-parameter tuning.
