Clipping, that extra bump for loss criteria

January 22, 2018

For the past couple of days I’ve been writing a transfer-learning workflow for the dog breed identification kaggle competition. This was done solely in pytorch and torchvision, heavily referencing the beginner pytorch tutorials, and you can find them at stites/circus. This code, which doesn’t do anything fancier than a fastai notebook, was able to produce results that placed in the 50th-percentile of the competition. This isn’t enough to qualify for anything like a bronze medal on kaggle (there are ~1k participants), but as a tool to familiarize myself with a CV pipeline from scratch, I would say this is a success.

There are four reasons why this workflow was successful and there are a plethora of improvements to be made. First, reasons how this was successful:

• Transfer learning from Resnet152 was used. Resnet34 alone got the model up to the 75th percentile, Resnet152 trained much faster and finished training (ie, the validation-training loss balance fell out-of-whack) at the 50th epoch. The final architecture consisted of Resnet152, followed by a fully-connected layer (fc layer) of 4096, relu, another fc layer of 120, and finally softmax. Sigmoid was attempted as well, however since the results are looking for a single breed, it is not as performant.

• Data Augmentation was used. This is a given and seems essential to any computer vision pipeline.

• A hacked-together version of Stochastic Gradient Descent with Restarts was used. This basically was just me reinitializing the learning rate scheduler at a regular interval.

• Clipping was used on the final result. Since there were 120 classes, the bounds were $$0.01 / 120$$ and $$0.99 / 120$$. This dropped the softmax-based solution from the 60th to the 50th percentile.

That said, there are a number of places where this could be improved:

• Test-time augmentation could be used.

• I unfroze resnet152 so that it would have a better fit to dogs and not other imagenet images. While this is good theoretically, I think using differential learning rates would have allowed for a better speedup.

• On the note of other fastai improvements – I’m less impressed with the learning rate finder, but it could have been a valuable asset to include.

• I should really work with an ensemble if I want to get better results. Doing a k-folds would be an excellent way to do this. Having each GPU work with its own model would also be adequate.

• I could do that thing where I save the fc-layer, train the full model on all of the validation set, then load the old fc-layer so that I am not overfitting on prediction, but am still able to use validation images to learn better features.

• Using a better architecture would be nice. In particular, I was interested in using Squeeze-and-Excite networks (this year’s ImageNet winner).

• Using Cosine Annealing for the learning rate scheduler would have been nice and would move me from using a hack of SGDR, to real SGDR.

• A lot more data augmentation.

• Actually using Multi Class Log Loss instead of Cross Entropy Loss. This was done out of laziness.

• Using log softmax would make friendlier and faster gradients.

I think the most impressive thing was that, as alluded to in the title, clipping the loss function brought down the kaggle loss from ~1.0 to the current loss (0.5843).