New Food-101 SoTA with fastai and a fast augmentation search

by Amit Redkar, Krisztian Kovacs, Jim Bremner, Arshak Navruzyan


Recently, better image classification models have tended to follow a trajectory towards deeper or wider networks [2,3] or extensive test time augmentations [3]. We will share some of the techniques of fastai v1 which allowed us to advance the State of the Art (SoTA) results for the Food-101 dataset, using transfer learning with a simple ResNet-50 architecture and minimal augmentations.

Dataset & Augmentations

Food-101 is a challenging dataset consisting of 101,000 images of 101 different food classes. Taking a look at some of the images, we can see why models may struggle to get good results. For example, all of the images in Figure 1 have been labelled as "bread pudding", yet even as a human, I think I’d struggle to classify them as such.

Figure 1: A sample of images from the Food-101 dataset, all labelled as “bread pudding”.

The creators of the dataset left the training images deliberately uncleaned, so there are a number of mislabelled images, and as we can see, a large range in brightness / colour saturation. More fundamentally, however, it’s clear that no two bread puddings are quite alike (or at all alike it seems). Classifying images with such high intraclass variation is hard.

If our model is to perform well “in the wild”, it needs to be able to handle these issues: real, non-professional photos of food aren’t going to be of uniform size and are unlikely to have perfect lighting or framing.

The fastai library offers a neat solution to this problem: Test Time Augmentation (TTA). This technique applies data augmentations at test time: when predicting the test set labels, we also predict an additional 8 randomly augmented versions of each image, and then combine these 8 augmented predictions with the original prediction to get the final result.

These 8 augmentations cover the image’s four corners, each with and without a horizontal flip (taking the total up to 8), and also apply zoom and other transformations randomly. The final prediction is a weighted average of the prediction on the original image (weighted by β) and the average of the other 8 predictions (weighted by 1 - β).
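The weighted combination can be sketched in a few lines of NumPy. This is an illustrative sketch of the blending arithmetic, not fastai’s actual TTA implementation; the function name and the β = 0.4 default used below are our own assumptions.

```python
import numpy as np

def combine_tta(original_probs, augmented_probs, beta=0.4):
    """Blend the prediction on the original image with the average of
    the augmented predictions, weighted by beta (TTA-style ensemble).

    Hypothetical helper for illustration; fastai exposes the weight as
    the `beta` argument of its TTA method."""
    original_probs = np.asarray(original_probs, dtype=float)
    augmented_probs = np.asarray(augmented_probs, dtype=float)
    # Average the class probabilities over the 8 augmented copies...
    aug_mean = augmented_probs.mean(axis=0)
    # ...then blend with the prediction on the un-augmented image.
    return beta * original_probs + (1 - beta) * aug_mean

# Toy example: 3-class probabilities for one image and its 8 augmentations.
orig = np.array([0.7, 0.2, 0.1])
augs = np.array([[0.6, 0.3, 0.1]] * 8)
final = combine_tta(orig, augs, beta=0.4)  # [0.64, 0.26, 0.10]
```

The final class is then simply the argmax of the blended probabilities.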

The transformations available in the fastai library were used for this experiment:

  • Primary Transformations (probability 0.75): flipping (horizontal for this experiment), warping, rotating, zooming, brightness/contrast alterations.
  • Secondary Transformations (probability 0.5): jitter, skewing and squishing.
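The two probability tiers above amount to a per-transform coin flip. Here is a minimal sketch of that gating logic; the transform callables are placeholders, and fastai’s own pipeline differs in detail (it composes parameterised transforms rather than plain functions).

```python
import random

def apply_transforms(image, primary, secondary,
                     p_primary=0.75, p_secondary=0.5, rng=None):
    """Apply each primary transform with probability 0.75 and each
    secondary transform with probability 0.5, independently.

    `image` can be any object the transform callables accept; this is
    an illustrative sketch, not fastai's implementation."""
    rng = rng or random.Random()
    for t in primary:
        if rng.random() < p_primary:  # 75% chance per primary transform
            image = t(image)
    for t in secondary:
        if rng.random() < p_secondary:  # 50% chance per secondary transform
            image = t(image)
    return image
```

With probabilities forced to 1.0 or 0.0 the behaviour is deterministic, which makes the gating easy to verify.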

Figure 2: fastai’s image transformations.

Figure 3: combined effect of all transformations in Figure 2.

As we can see, the various transformations do not substantially alter the original image; their regularisation effect, however, is powerful.

Hand-tuning the hyperparameters of each transformation is not feasible: doing so is tedious, and transformations that work well in isolation may not perform well in certain combinations. Our solution to this problem was to search for an optimal set of transformations automatically.

A group of fellows from our Fellowship.AI cohort have been working on this for the past few months and have researched and devised methods that can be applied to any custom dataset to find the optimal hyperparameters for each transformation. The resulting data augmentations make the network less prone to overfitting and produce a smoother loss curve, which helps improve metrics.

Comparison with the SoTA

The previous SoTA (90.27% top-1) uses two parallel branches: one with a resnet34-like architecture and another with a single convolutional layer. The latter uses special wide “slice” convolutions: 224x5 kernels aiming to capture the layered structures in some of the food types (e.g. chocolate cake, pancakes, burgers etc.).

Whilst this slice branch clearly leads to better performance, we felt it was more of a tailored approach - playing into the dataset’s specific food classes (specifically, the large number of layered foods). It’s also unclear whether these slices would have generalised to top-down images where the layers aren’t visible, so in the spirit of building a better general food classification model, we opted for a standard resnet50 architecture.

Table 1: A comparison of the previous SoTA classification results for the Food-101 dataset.

| Model | Augmentations | Epochs | Additional Notes | Top-1 Accuracy % | Top-5 Accuracy % |
|---|---|---|---|---|---|
| Inception V3 [1] | Flip, rotation, colour, zoom; 10 crops for validation | | Manually doing transformations and crops during validation | | |
| WISeR (Wide-Slice Residual Network) [2] | Flip, rotation, colour, zoom; 10 crops for validation | Around 32 | Ensemble of Residual and Slice Network | 90.27 | |
| ResNet50 + fastai [Ours] | Optimal transformations; test time augmentations | | Using a size of 512 only for later epochs | | |


Comparing to the previous methods in Table 1, our model offers a faster route to SoTA performance. We started by training the network with an image size of 224, and once the loss started to converge, continued training with an image size of 512 to reach SoTA accuracy.

There was no need for a complex network - we used a vanilla ResNet50 pre-trained on ImageNet, with a custom head (which fastai adds automatically, configured according to our particular dataset).

Training on larger images is also handled easily, since fastai inserts Adaptive Average Pooling and Adaptive Max Pooling layers before the fully connected layers of the head. These produce a fixed-size output regardless of their input dimensionality, which lets us use an arbitrary-sized square image as input to the ResNet architecture.
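The idea behind this pooling can be sketched in NumPy. The sketch below mirrors the concept of fastai’s AdaptiveConcatPool2d (average- and max-pool to a 1×1 grid, then concatenate); the function name and shapes are ours, but the ResNet-50 feature dimensions (2048 channels; stride-32 downsampling giving 7×7 maps at 224 px and 16×16 at 512 px) are standard.

```python
import numpy as np

def adaptive_concat_pool(feature_map):
    """Globally average- and max-pool a (C, H, W) feature map and
    concatenate the results, giving a (2*C,) vector whatever H and W
    are. Illustrative sketch of the AdaptiveConcatPool2d idea."""
    avg = feature_map.mean(axis=(1, 2))  # global average pool -> (C,)
    mx = feature_map.max(axis=(1, 2))    # global max pool -> (C,)
    return np.concatenate([avg, mx])     # fixed-size head input, (2*C,)

# The head sees the same 4096-dim vector at both training resolutions:
small = adaptive_concat_pool(np.random.rand(2048, 7, 7))    # 224px input
large = adaptive_concat_pool(np.random.rand(2048, 16, 16))  # 512px input
```

Because the head’s input size is fixed, switching from 224 to 512 pixel images requires no architectural change at all.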

Projections

Having reached SoTA performance, we trained a similar model to roughly equivalent accuracy and used it to produce projections of the Food Recipes demo collection - a dataset never seen before by the model. Our idea was that if this model has learnt more about food image classification, then perhaps it would also perform better at clustering together conceptually similar food images from other datasets.

Figure 4: selected projections using Food-101 (top) and ImageNet (bottom) models. Magnified windows A and B are included for the purpose of comparison.

Figure 4 shows selected projections using the Food-101 and ImageNet models. We’ve used magnified windows A and B to compare the quality of clustering in each. For the cookies in A and the soups in B, the Food-101 model seems to have clustered the images closer together, with better separation from other classes, than the ImageNet model, where they fall amongst a number of different classes and sit a little further apart.

Of course, this is just one hand-picked example. Whilst we can visually locate improvements in this particular projection, to get a better sense of the overall performance of the model across all of the projections, it's better (and fairer) to carry out a more quantitative analysis. For this, we applied a simple K-Means clustering algorithm to the projections and then calculated the silhouette score for each. This is a simple measure of how well-defined and well-separated the clusters in a given clustering are, and it has been shown to work well across a variety of different datasets.

We tried a range of different K parameters for the clustering, but all showed the Food-101 model’s projections to produce a higher mean silhouette score. In Figure 5, with K=5, the mean silhouette scores were 0.489 ± 0.006 for the ImageNet model and 0.527 ± 0.007 for the Food-101 pre-trained model (mean ± standard error over N = 75 projections; the higher the score, the better the clustering).
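To make the metric concrete, here is a small NumPy implementation of the silhouette score, along with a toy check that well-separated clusters score near 1 while overlapping ones score much lower. This is an illustrative sketch of the metric itself, not our evaluation pipeline (which used a standard library implementation on the K-Means assignments).

```python
import numpy as np

def silhouette_score(X, labels):
    """Mean silhouette over all points: s(i) = (b - a) / max(a, b),
    where a is the mean distance to points in the same cluster and b
    the mean distance to the nearest other cluster."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = len(X)
    dists = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # (n, n)
    scores = []
    for i in range(n):
        same = labels == labels[i]
        # a: mean intra-cluster distance (excluding the point itself).
        a = dists[i, same & (np.arange(n) != i)].mean()
        # b: mean distance to the closest other cluster.
        b = min(dists[i, labels == l].mean() for l in set(labels[~same]))
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

rng = np.random.default_rng(0)
labels = np.array([0] * 20 + [1] * 20)
tight = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
loose = np.vstack([rng.normal(0, 2.0, (20, 2)), rng.normal(1, 2.0, (20, 2))])
# silhouette_score(tight, labels) is close to 1; loose scores far lower.
```

A higher mean silhouette therefore directly reflects tighter, better-separated clusters, which is why it serves as a fair summary statistic across all 75 projections.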

These results suggest that our Food-101 model will, in general, provide more useful projections when presented with a new, unseen dataset of food images.

Figure 5: Silhouette score histograms from K-Means clustering (K=5) for projections produced by ImageNet and Food-101 pre-trained classification models. Higher scores represent better defined/separated clusters.


Conclusion

Working with fastai, we have easily trained a SoTA image classifier using transfer learning. The final result was achieved through a number of useful techniques, but optimal image transformations and test time augmentations played a vital role. A learning-rate scheduler also helped to achieve the desired convergence in the final epochs.

We’ve also shown that, with our new and improved model, we can produce more meaningful projections from previously unseen food image datasets.


Acknowledgements

We would like to especially thank Arshak Navruzyan for acting as architect for this project and helping us implement the experiment. We would also like to thank the Berlin team for their efforts on optimal transformations of training images, and finally Jeremy Howard and his team for the great fastai library and their mentorship during this project.


References

  1. Hassannejad, Hamid, et al. "Food image recognition using very deep convolutional networks." Proceedings of the 2nd International Workshop on Multimedia Assisted Dietary Management. ACM, 2016.
  2. Martinel, Niki, Gian Luca Foresti, and Christian Micheloni. "Wide-slice residual networks for food recognition." Applications of Computer Vision (WACV), 2018 IEEE Winter Conference on. IEEE, 2018.
  3. Keun-dong Lee, DaUn Jeong, Seungjae Lee, Hyung Kwan Son (ETRI VisualBrowsing Team). NVIDIA Deep Learning Contest 2016, Oct. 7, 2016.