by Adrian Clarke, Brian McMahon, Pranav Menon, Kavan Patel
Over the past few months at Fellowship.AI, our team has worked to identify hyperparameters that achieve near state-of-the-art (SotA) performance on a variety of datasets using the fastai v1.0 framework. We pursued this research to optimize the hyperparameter settings of platform.ai, which seeks to bring deep learning to a broader audience, including non-technical users.
In our analysis, we trained a standardized model (transfer learning from a ResNet34 pre-trained on ImageNet) on a set of seven datasets. We aimed to determine two things:
- Clear dataset groupings based on similarities in performance amongst the different datasets; and
- A better understanding of the optimal hyperparameter ranges for each dataset grouping.
From our initial experiments, learning rate was the most significant hyperparameter. In general, the fastai learning rate finder suggests a near-optimal learning rate. The one exception we noted is datasets far from ImageNet, which perform better with learning rates lower than the one given by the lr_finder (typically within one order of magnitude).
Five hyperparameters were searched independently. Each had a baseline value, and we tried a range of values for one hyperparameter at a time while holding the others constant. We defaulted to the ResNet34 architecture, which is known to be an effective choice in terms of both efficiency and reliability. Batch size was set to 64.
Table 1: Evaluated Hyperparameters
Learning Rate Finder
Our automatic learning rate finder is an implementation of the method originally discussed in Leslie Smith's paper. First, we record the validation loss across a sweep of continuously increasing learning rates. Then we choose as the default the learning rate at which the validation loss has its largest negative slope (sharpest decline). Since the measurements of validation loss are noisy, we smooth them using a digital filter. This prevents high-frequency noise from producing erroneous gradient measurements. The learning rate finder was run on each dataset using the frozen network (with only the head trained for the given task).
The learning rates we tested spanned one order of magnitude on either side of the automatic learning rate finder result.
In future work, the learning rate finder should also be run on the network with all layers unfrozen (all layers trainable), which could expand upon our findings. The unfrozen network may well give a different learning rate finder result.
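The selection step described above can be sketched as follows. This is a minimal illustration under stated assumptions: an exponential moving average stands in for the digital filter (whose exact form is not given here), the slope is taken with respect to log10 of the learning rate, and the names `smooth_ema` and `steepest_descent_lr` as well as the synthetic loss curve are invented for the example.

```python
import numpy as np

def smooth_ema(values, beta=0.9):
    """Bias-corrected exponential moving average (a simple low-pass filter)."""
    smoothed = np.empty(len(values), dtype=float)
    avg = 0.0
    for i, v in enumerate(values):
        avg = beta * avg + (1.0 - beta) * v
        smoothed[i] = avg / (1.0 - beta ** (i + 1))
    return smoothed

def steepest_descent_lr(lrs, losses, beta=0.9):
    """Return the learning rate at which the smoothed loss falls fastest."""
    lrs = np.asarray(lrs, dtype=float)
    smoothed = smooth_ema(np.asarray(losses, dtype=float), beta)
    # Slope of the smoothed loss with respect to log10(lr).
    slope = np.gradient(smoothed, np.log10(lrs))
    return lrs[int(np.argmin(slope))]

# Synthetic sweep (assumed shape): flat, then a sharp drop, then divergence.
lrs = np.logspace(-6, 0, 200)
x = np.log10(lrs)
rng = np.random.default_rng(0)
losses = 2.0 - 1.5 / (1.0 + np.exp(-4.0 * (x + 3.0))) + 5.0 * np.exp(6.0 * x)
losses = losses + rng.normal(0.0, 0.005, size=lrs.shape)

best_lr = steepest_descent_lr(lrs, losses)
print(f"suggested learning rate: {best_lr:.2e}")
```

Smoothing introduces a small lag, so the selected rate sits slightly to the right of the true steepest point; a stronger filter (larger `beta`) trades more lag for less noise.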
The datasets we chose for our analysis fall into five broad categories:
- Medical images (X-ray and skin cancer)
- Geospatial images (DOTA, Amazon)
- Real world images (Pets)
- Line drawing images (Quickdraw)
- Microscopic images (Protein)
Table 2: Dataset Metadata
For each dataset, we run our standard template script, which executes the procedure below on both (1) 20% of the data (partial train) and (2) 100% of the data (full train). Our reasoning for using 20% of the data was that users of platform.ai might upload only a subset of their data rather than the entire dataset. We therefore wanted to determine whether the proportion of data used makes an appreciable difference in the evaluation metric.
We take our baseline set of hyperparameter values as defined in Table 1. We then loop through lists of approximately 3-5 values for each hyperparameter, holding all other hyperparameters at their baseline values. For illustration, the learning rate (lr) list may include three values such as 1e-2, 1e-3, and 1e-4 while all other hyperparameters remain at baseline. In this case, 1e-3 would be the learning rate suggested by the lr_finder. Our result for phase 1 is the best-performing value of each hyperparameter. For example, if a weight decay (wd) of 1e-2 outperforms all other wd values on the chosen performance metric (e.g. accuracy), it is considered the best candidate.
The head of the neural network (i.e., the last layer) is trained for 5 epochs. We then unfreeze the entire neural network and train for another 15 epochs.
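The one-hyperparameter-at-a-time search can be sketched as below. This is only an illustration: the helper names `one_at_a_time_runs` and `pick_best` are hypothetical, the candidate weight decay values and the accuracy scores are made up for the example, and only two of the five hyperparameters are shown.

```python
def one_at_a_time_runs(baseline, candidates):
    """Build run configs that change exactly one hyperparameter from baseline."""
    runs = []
    for name, values in candidates.items():
        for value in values:
            config = dict(baseline)   # copy so the baseline is never mutated
            config[name] = value
            runs.append((name, value, config))
    return runs

def pick_best(scores):
    """Return the hyperparameter value with the highest metric (e.g. accuracy)."""
    return max(scores, key=scores.get)

# Baseline: lr as suggested by the learning rate finder, wd at its default.
baseline = {"lr": 1e-3, "wd": 1e-1}
candidates = {
    "lr": [1e-2, 1e-3, 1e-4],   # one order of magnitude either side
    "wd": [1e-1, 1e-2, 1e-3],   # illustrative values only
}

runs = one_at_a_time_runs(baseline, candidates)
for name, value, config in runs:
    print(name, value, config)

# Made-up accuracies for the wd sweep, just to show the selection step.
best_wd = pick_best({1e-1: 0.90, 1e-2: 0.93, 1e-3: 0.91})
```

Because each run differs from the baseline in at most one value, any change in the metric can be attributed to that hyperparameter, at the cost of not exploring interactions between hyperparameters.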
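A minimal sketch of this two-stage schedule is shown below. It assumes `learn` is a fastai v1 `Learner` (for example one built with `cnn_learner` on a ResNet34) and relies only on its `freeze`, `unfreeze` and `fit_one_cycle` methods. The function name `train_two_stage` and the default `lr_divisors` are hypothetical; the actual fine-tuning divisors are part of the baseline settings and are not reproduced here.

```python
def train_two_stage(learn, lr, head_epochs=5, full_epochs=15,
                    lr_divisors=(100.0, 10.0)):
    """Train the head first, then fine-tune the whole network.

    `learn` is assumed to behave like a fastai v1 Learner.
    `lr_divisors` are placeholder values, not the settings used in this study.
    """
    # Stage 1: pretrained body frozen, only the head is trained.
    learn.freeze()
    learn.fit_one_cycle(head_epochs, max_lr=lr)

    # Stage 2: all layers trainable, with lower learning rates
    # spread across layer groups (earliest layers get the lowest).
    learn.unfreeze()
    low, high = lr / lr_divisors[0], lr / lr_divisors[1]
    learn.fit_one_cycle(full_epochs, max_lr=slice(low, high))
    return learn
```

Passing a `slice` as `max_lr` asks fastai to use discriminative learning rates, so the early pretrained layers change more slowly than the head.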
Measuring distance to ImageNet
In an effort to derive patterns from our results, we analyzed each dataset's visual similarity to ImageNet, the dataset on which the ResNet model is pretrained. We implemented the method described in Cui et al.'s “Large Scale Fine-Grained Categorization and Domain-Specific Transfer Learning” to measure similarity to ImageNet. In section 4.1 of that paper, the Earth Mover’s Distance (EMD) is used to measure the distance between the classes of images in different datasets. Classes with more training examples are given more weight.
However, we modified the method slightly by using cosine similarity instead of Euclidean distance, because we observed inconsistent results with Euclidean distance. We think this happened because Euclidean distance is strongly affected by the overall magnitude of activations (the tendency of a dataset to have high or low activation on average). We therefore decided that switching to cosine distance would improve the consistency of our results, because it measures the angle between vectors without being significantly biased by overall magnitude. Indeed, our results using cosine similarity showed a clear delineation between datasets that were near to and far from ImageNet.
Because we could not use the entire ImageNet dataset due to time and compute constraints, we measured cosine similarity with respect to Tiny-ImageNet, a smaller dataset with smaller images, which we believe has image styles similar to those of the full ImageNet.
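The effect of activation magnitude on the two measures can be illustrated with a small sketch. It does not implement the EMD weighting step; it only compares class-mean feature vectors, using random numbers as a stand-in for network activations. The helper names and array shapes are assumptions made for the example.

```python
import numpy as np

def class_centroids(features, labels):
    """Mean feature vector per class; rows follow the sorted class labels."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    return np.stack([features[labels == c].mean(axis=0)
                     for c in np.unique(labels)])

def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def euclidean_distance_matrix(a, b):
    """Pairwise Euclidean distance between the rows of a and the rows of b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

rng = np.random.default_rng(0)
source_feats = rng.random((40, 8))              # stand-in activations
source_labels = np.repeat(np.arange(4), 10)     # 4 classes, 10 images each
target_feats = rng.random((30, 8))
target_labels = np.repeat(np.arange(3), 10)     # 3 classes, 10 images each

src = class_centroids(source_feats, source_labels)
tgt = class_centroids(target_feats, target_labels)

# Scale the target activations to mimic a dataset with higher average activation.
cos_before = cosine_similarity_matrix(src, tgt)
cos_after = cosine_similarity_matrix(src, 3.0 * tgt)
euc_before = euclidean_distance_matrix(src, tgt)
euc_after = euclidean_distance_matrix(src, 3.0 * tgt)

print("cosine unchanged by scaling:", np.allclose(cos_before, cos_after))
print("euclidean unchanged by scaling:", np.allclose(euc_before, euc_after))
```

Scaling one dataset's activations changes every Euclidean distance but leaves the cosine similarities as they were, which is the property the switch to cosine similarity relies on.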
Our results from testing all of the different hyperparameters can be summarized in a heatmap that compares the performance of each dataset against its default setup. The default hyperparameter settings for each dataset are provided in the table above to give context for the results displayed in the heatmap. Note that the only hyperparameter default that differs among datasets is the baseline learning rate. lr-C is the default learning rate for that dataset; it shows 0% improvement across all datasets because it is being compared to itself. The learning rates are in decreasing order from A to E, with the default C in the middle.
The hyperparameters tested are sorted from best performing on the right-hand side to worst performing on the left. As we can see from this chart, the default learning rate given by the learning rate finder yields good results overall. Possible improvements to the hyperparameter baseline, which can be found to the right of lr-C, would be to slightly decrease the divisors for fine tuning, and perhaps to reduce the regularization a little by lowering the dropout of the custom head to 0.25 in the first layer. Also, since the weight decay defaults to 0.1, lowering it to 0.01 may provide a slight improvement overall.
LR relative to ImageNet distance
Based on our results, we observed two categories of datasets:
- Cosine similarity > 0.72 (“Near” ImageNet)
- Cosine similarity < 0.72 (“Far” from ImageNet)
In all of our experiments, the baseline learning rate was the one given by learn.lr_find(). For datasets with cosine similarity < 0.72, a learning rate of lr_find/10 performed better at earlier epochs than lr_find. For datasets with cosine similarity > 0.72, a learning rate of lr_find performed best at almost every epoch. In conclusion, learning rates lower than lr_find (lr_find/10) led to faster convergence for datasets with cosine similarity < 0.72. Below are the visualizations to support our findings.
In our research, we set out to develop heuristics that achieve near-SotA performance. To date, we have trained a standardized model on seven datasets in order to identify:
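Expressed as code, the heuristic might look like the sketch below. The function name `suggested_lr` is hypothetical, `lr_found` is assumed to be the value selected by the learning rate finder, and treating a similarity of exactly 0.72 as "near" is an arbitrary choice made for the example.

```python
def suggested_lr(lr_found, cosine_similarity, threshold=0.72, far_divisor=10.0):
    """Pick a starting learning rate from the finder result and ImageNet similarity.

    "Far" datasets (similarity below the threshold) get a rate one order of
    magnitude lower; "near" datasets keep the finder's suggestion.
    """
    if cosine_similarity < threshold:
        return lr_found / far_divisor
    return lr_found

near_lr = suggested_lr(1e-3, cosine_similarity=0.80)  # keeps the finder's rate
far_lr = suggested_lr(1e-3, cosine_similarity=0.60)   # finder's rate / 10
print(near_lr, far_lr)
```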
- clear dataset groups, if any; and
- distinct hyperparameter ranges within each group.
Learning rate is clearly the most significant hyperparameter in terms of performance. In most cases, the learning rate finder packaged in fastai v1.0 accurately identifies a near-optimal learning rate. The main exception is datasets with a cosine similarity to ImageNet of less than 0.72, which tend to perform even better with a learning rate up to one order of magnitude lower than that given by the lr_finder.
We would like to thank everyone who provided invaluable guidance and support during this research, including Arshak Navruzyan, Frank Sharp, Jeremy Howard and Antoine Saliou.