Surpassing Human Judgement for Fashion Style

by Baptiste Metge, David Pollack and Krisztian Kovacs

Overview

We analyse how machines distinguish fashion styles. In so doing, we surpass the previous state-of-the-art performance in classifying fashion styles and demonstrate that computer vision algorithms can significantly outperform most human users at sensing style. Our solution is surpassed only by a select group of "savvy" users.

Our specific goal is to classify fashion photos into one of 14 styles. We use a custom ResNet34-based network to do so, and we share some of the techniques and customizations of fastai v1 that allowed us to tackle this problem.

Methodology

In the paper What Makes a Style: Experimental Analysis of Fashion Prediction [1], Takagi et al. attempt to classify 14 different fashion styles in 13,234 images using a new dataset created specifically for this purpose. One can see this as a more comprehensive version of the Hipsterwars dataset [2], which contains only 5 classes and 1,893 images.

Fashion 14 Dataset

Gathering relevant fashion data is challenging. Images are usually scraped from fashion websites like chictopia.com and labelled using the websites' meta-information, resulting in noisy labels. Takagi et al. manually removed images that appeared to be misclassified to obtain the clean-label Fashion 14 dataset. Examples of the images and classes are below:

Figure 1.  Fashion 14 dataset examples.

The paper states the authors split the dataset into training, validation, and test sets (60% / 5% / 35%), and the dataset tarball contains three CSV files named "train.csv", "val.csv", and "test.csv". However, there are many discrepancies between these files and the splits stated in the paper, such as the total number of images, the number of classes in each split, and missing filenames.
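A few lines of pandas make such discrepancies easy to quantify. This is a minimal sketch, assuming hypothetical column names ("style") that may differ from the actual CSV schema in the tarball:

```python
import pandas as pd

# Load the three split files shipped in the tarball.
splits = {name: pd.read_csv(f"{name}.csv") for name in ("train", "val", "test")}

# Compare per-split image and class counts against the paper's figures.
for name, df in splits.items():
    print(name, len(df), "images,", df["style"].nunique(), "classes")

# Total image count, to compare against the 13,234 stated in the paper.
print("total:", sum(len(df) for df in splits.values()))
```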

Training Procedure

Like Takagi et al., we split our data into training, validation, and test sets using the exact ratios from the paper (60% / 5% / 35%). Using fastai, it was very easy to implement a slew of image transformations so our network could generalize across each style. We first fine-tuned the head of our classifier at a learning rate of 5e-2 for two epochs with fastai's implementation of Leslie Smith's 1cycle learning rate policy. We then unfroze the entire network and trained it for 13 more epochs at learning rates between 1e-6 and 5e-4. We achieved a maximum accuracy on the validation set of 77.6%.
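A minimal fastai v1 sketch of this procedure follows; the dataset path, CSV layout, batch size and image size are illustrative assumptions, not our exact code:

```python
from fastai.vision import *

path = Path('fashion14')  # hypothetical dataset root

# Assumed layout: an images/ folder plus a CSV mapping filenames to styles.
data = (ImageList.from_csv(path, 'train.csv', folder='images')
        .split_by_rand_pct(valid_pct=0.05, seed=42)   # illustrative validation split
        .label_from_df()
        .transform(get_transforms(), size=224)        # flips, rotations, zooms, lighting
        .databunch(bs=64)
        .normalize(imagenet_stats))

learn = cnn_learner(data, models.resnet34, metrics=accuracy)

# Fine-tune the head with the 1cycle policy, then unfreeze and train end to end
# with discriminative learning rates across layer groups.
learn.fit_one_cycle(2, 5e-2)
learn.unfreeze()
learn.fit_one_cycle(13, max_lr=slice(1e-6, 5e-4))
```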

Results

Using a shallower network than those used in [1], we were able to achieve state-of-the-art results thanks to the optimizations provided by fastai, and in some cases surpass human-level classification performance. After test-time augmentation (TTA), we achieved an accuracy of 78.49% vs the authors' maximum accuracy of 72.0%. We achieve the best results on 12 of the 14 styles.
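TTA is a one-liner in fastai v1; a sketch of how the TTA accuracy can be computed on the validation set (the evaluation snippet is an assumption, not our exact script):

```python
# Average predictions over several augmented versions of each image,
# then score them against the ground-truth labels.
preds, y = learn.TTA(ds_type=DatasetType.Valid)
print('TTA accuracy:', accuracy(preds, y).item())
```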


| Model | conserv. | dressy | ethnic | fairy | feminine | gal | girlish | casual | lolita | mode | natural | retro | rock | street | average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet34 fastai | 0.72 | 0.90 | 0.79 | 0.91 | 0.80 | 0.73 | 0.61 | 0.69 | 0.95 | 0.73 | 0.80 | 0.72 | 0.78 | 0.86 | 0.78 |
| ResNet50 | 0.66 | 0.91 | 0.74 | 0.88 | 0.64 | 0.74 | 0.47 | 0.66 | 0.92 | 0.72 | 0.70 | 0.62 | 0.68 | 0.69 | 0.72 |
| VGG19 | 0.54 | 0.79 | 0.57 | 0.81 | 0.43 | 0.50 | 0.26 | 0.54 | 0.80 | 0.62 | 0.56 | 0.42 | 0.53 | 0.60 | 0.58 |
| Xception | 0.44 | 0.79 | 0.63 | 0.84 | 0.45 | 0.50 | 0.33 | 0.54 | 0.80 | 0.61 | 0.56 | 0.44 | 0.52 | 0.53 | 0.58 |
| Inception v3 | 0.37 | 0.73 | 0.54 | 0.78 | 0.41 | 0.39 | 0.27 | 0.45 | 0.78 | 0.55 | 0.44 | 0.35 | 0.47 | 0.46 | 0.51 |
| VGG16 | 0.31 | 0.78 | 0.49 | 0.78 | 0.42 | 0.45 | 0.22 | 0.43 | 0.81 | 0.58 | 0.57 | 0.23 | 0.43 | 0.43 | 0.51 |

Table 1. Results of our ResNet34 fastai-based model (top row) compared against results from [1].

The authors also had two groups of actual humans classify these images: a savvy group and a naive group. The savvy group attains an accuracy of 82% and the naive group 62%. Notably, our network performs better than the savvy humans in the categories of conservative (72% vs 59%), fairy (91% vs 89%), mode (73% vs 69%) and rock (78% vs 74%).


| | conserv. | dressy | ethnic | fairy | feminine | gal | girlish | casual | lolita | mode | natural | retro | rock | street | average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Savvy (human) | 0.59 | 0.92 | 0.80 | 0.89 | 0.84 | 0.92 | 0.71 | 0.75 | 0.95 | 0.69 | 0.81 | 0.79 | 0.74 | 0.91 | 0.82 |
| ResNet34 fastai | 0.72 | 0.90 | 0.79 | 0.91 | 0.80 | 0.73 | 0.61 | 0.69 | 0.95 | 0.73 | 0.80 | 0.72 | 0.78 | 0.86 | 0.78 |
| Naïve (human) | 0.35 | 0.87 | 0.64 | 0.83 | 0.60 | 0.62 | 0.51 | 0.50 | 0.83 | 0.29 | 0.57 | 0.50 | 0.58 | 0.74 | 0.62 |

Table 2. Comparing savvy and naïve users' results [1] to our own results using a ResNet34 fastai-based model.

The confusion matrix we obtain clearly shows the model has learned to differentiate styles. One can also see that its main struggle is distinguishing 'kireime-casual' from 'conservative'.

Table 3. Confusion matrix on the test set.
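fastai v1 builds this plot directly from a trained learner; a minimal sketch:

```python
# Build an interpretation object over the validation predictions and
# plot the 14x14 confusion matrix.
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix(figsize=(10, 10), dpi=80)

# most_confused lists the worst class pairs, e.g. kireime-casual vs conservative.
print(interp.most_confused(min_val=5))
```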

We also plotted heatmaps showing that our network looks at a person's outfit to infer their style, and does not rely on the background.

Figure 2.  GradCam heatmaps.
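Recent fastai v1 releases can overlay Grad-CAM style heatmaps on the highest-loss images; a sketch (the heatmap flag is an assumption about the installed version):

```python
# Show the nine highest-loss images with a Grad-CAM activation overlay,
# revealing which regions drive each prediction.
interp.plot_top_losses(9, heatmap=True)
```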

Hipsterwars Dataset

Hipsterwars is a dataset of ~1,900 images classified into 5 styles (Hipster, Bohemian, Goth, Preppy and Pinup), labelled by humans through gamification [2]. We decided to test our trained model on this dataset to see whether it generalizes well.

Semi-supervised classification

Using no additional training, we used nearest neighbours, based on the Euclidean distance between outputs of an intermediate layer of our Fashion 14 network, to achieve results competitive with Simo-Serra's supervised SVM classification on the intermediate outputs of his network [3]. We measure whether any of the top 1, 2 or 3 closest neighbours in this projected space has the same label as the input image. This demonstrates that our network clusters similar images even when the target classes differ from the training classes.
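A sketch of this evaluation, using a forward hook to capture penultimate-layer features and scikit-learn for the neighbour search; the tapped layer (the Flatten after the pooling in the head) and the use of the validation loader are assumptions:

```python
import numpy as np
import torch
from sklearn.neighbors import NearestNeighbors

feats, labels = [], []

# Forward hook: capture the flattened pooled features feeding the head.
def hook(module, inp, out):
    feats.append(out.detach().cpu())

handle = learn.model[1][1].register_forward_hook(hook)  # assumed: Flatten layer

learn.model.eval()
with torch.no_grad():
    for xb, yb in learn.data.valid_dl:  # or a DataBunch built from Hipsterwars
        learn.model(xb)
        labels.append(yb.cpu())
handle.remove()

X = torch.cat(feats).numpy()
y = torch.cat(labels).numpy()

# Top-k check: does any of the k nearest neighbours (index 0 is the image
# itself) share the query image's label?
knn = NearestNeighbors(n_neighbors=4, metric='euclidean').fit(X)
_, idx = knn.kneighbors(X)
for k in (1, 2, 3):
    hits = (y[idx[:, 1:k + 1]] == y[:, None]).any(axis=1)
    print(f"top-{k} accuracy: {hits.mean():.2f}")
```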


| | Top 1 | Top 2 | Top 3 |
|---|---|---|---|
| Stylenet Joint w/ SVM | 0.64 | 0.80 | 0.86 |
| ResNet34 fastai (ours) | 0.53 | 0.69 | 0.78 |
| VGG CNN_M | 0.45 | 0.64 | 0.76 |
| VGG16 Places | 0.40 | 0.61 | 0.72 |

Table 4. Comparison with deep networks using feature distances on the Hipsterwars dataset (see [3]).


Table 5. Confusion matrix of our model's top-3 predictions on the Hipsterwars dataset.

platform.ai visualization

platform.ai seeks to bring deep learning to a broader, including non-technical, audience. Among other functionality, platform.ai offers great visualisation tools for understanding how your data flows through, and is clustered by, your network. We fed a subset of 3 styles from Hipsterwars through our model in platform.ai and obtained the visualization below. Note that the model is insensitive to the background and understands the underlying style of a person (i.e. a black dress won't necessarily be classified as Goth).

Figure 3. Screenshot from platform.ai displaying 3 styles from Hipsterwars.

Conclusion

Working with fastai, we easily trained a fashion classifier using transfer learning and achieved a state-of-the-art result on the Fashion 14 dataset. The visualisation tools of fastai played an important role in understanding the specificities of our dataset and ensuring our model had learned a 'sense of style'.

References

  1. Moeko Takagi, Edgar Simo-Serra, Satoshi Iizuka, and Hiroshi Ishikawa. What Makes a Style: Experimental Analysis of Fashion Prediction. In Proceedings of the International Conference on Computer Vision Workshops (ICCVW), 2017.
  2. M. Hadi Kiapour, Kota Yamaguchi, Alexander C. Berg, and Tamara L. Berg. Hipster Wars: Discovering Elements of Fashion Styles. In European Conference on Computer Vision (ECCV), 2014.
  3. Edgar Simo-Serra and Hiroshi Ishikawa. Fashion Style in 128 Floats: Joint Ranking and Classification using Weak Data for Feature Extraction. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2016.