Using recognition models on face attributes

credit: Large-scale CelebFaces Attributes (CelebA) Dataset

by Sergei Chudov, Mostafa Gazar and Timothy Quill


Face attribute models are used for retrieval tasks like "find all actors with green eyes and curly hair" or for applications that measure emotional states, like smiling / not smiling.

We investigated model architectures, loss functions and training sets to determine the optimal combination for distinguishing facial attributes in a weakly-supervised setting. We then compared the default model against our new model using a variety of projections based on linear and non-linear dimensionality reduction methods, and discovered that the new model can more easily identify attributes like gender, race, age, baldness, hair color and attractiveness.


Celeb-A dataset

Celeb-A is a large-scale face attributes dataset with more than 200K celebrity images covering 10,177 celebrity identities, each image annotated with 40 binary attributes and sized 178 × 218 pixels.

We used 3,392 images of Celeb-A for testing. In addition, we added an attribute for race by manually labelling 500 images from the test set. We selected 100 images for each of the following categories: Caucasian, Black, Asian, Indian and Latino. Only obvious cases were labelled, and in many instances the race of the subject was confirmed through a Google Image search of their photo.
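For reference, CelebA's attribute annotations ship as a plain-text file (list_attr_celeba.txt): an image count on the first line, a header row of the 40 attribute names, then one row of ±1 flags per image. A minimal parser of that format (the helper name is ours):

```python
def parse_celeba_attrs(lines):
    """Parse CelebA's list_attr_celeba.txt format.

    Line 1: number of images, line 2: attribute names,
    remaining lines: "<filename> <+/-1> ... <+/-1>".
    """
    n = int(lines[0])
    names = lines[1].split()
    records = {}
    for line in lines[2:2 + n]:
        parts = line.split()
        # map each attribute name to a boolean (1 -> True, -1 -> False)
        records[parts[0]] = {a: v == "1" for a, v in zip(names, parts[1:])}
    return records
```

Our manually added race labels were kept in the same filename-keyed shape so they could be merged with these records.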

Figure 1. Frequency of attributes in subset of Celeb-A dataset


We noticed that most of the available pre-trained models for face recognition use either a ResNet or an SeNet architecture, and that several of them could be paired into reasonably meaningful comparisons of dataset, architecture and loss function, so the following networks were chosen:

Network architecture      | Loss function | Training dataset
--------------------------|---------------|------------------------------
ResNet-34 [current model] | —             | emore (largely based on MS1M)
ResNet-34 [new model]     | ArcFace       | VGGFace2
SeNet                     | Softmax       | VGGFace2
SeNet                     | ArcFace       | refined MS1M
—                         | —             | ImageNet

We’ve used reflect padding for comparisons, as it gave better separation of features on scatter plots than zero padding: the reflected borders feed the network real image content rather than artificial zeros.
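The difference is easiest to see on a 1-D row of pixels. A toy sketch of the two padding modes (PyTorch-style "reflect", which mirrors the row without repeating the edge sample; the function name is ours):

```python
def pad1d(row, p, mode="zero"):
    """Pad a 1-D list of pixel values by p samples on each side."""
    if mode == "zero":
        # zero padding injects artificial black samples at the borders
        return [0] * p + row + [0] * p
    if mode == "reflect":
        # reflect padding mirrors real samples, excluding the edge itself
        left = [row[i] for i in range(p, 0, -1)]
        right = [row[-2 - i] for i in range(p)]
        return left + row + right
    raise ValueError(mode)
```

For `[1, 2, 3, 4]` with `p=2`, zero padding yields `[0, 0, 1, 2, 3, 4, 0, 0]` while reflect padding yields `[3, 2, 1, 2, 3, 4, 3, 2]`.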

Loss function

Softmax is the most commonly used loss function for facial recognition problems; however, it has some drawbacks. One of them is that it does not explicitly optimize the feature embedding to enforce higher similarity for intra-class samples and diversity for inter-class samples, which results in a performance gap for deep face recognition under large intra-class appearance variations such as pose variations and age gaps.

ArcFace addresses this by adding an additive angular margin to the target logit, directly enforcing a margin between classes in angular space. It is also easy to implement, does not require much extra computational overhead and is able to converge quickly.
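A minimal single-sample sketch of the margin mechanism from [6] (pure Python with helper names of our own; real implementations operate on batched, L2-normalised tensors): the scaled cosine logits go through ordinary softmax cross-entropy, but the target class's angle θ is penalised by an additive margin m before taking the cosine.

```python
import math

def arcface_logits(cosines, target, s=64.0, m=0.5):
    """Scale cosine logits and add the angular margin m to the target class.

    cosines: cos(theta_j) between the L2-normalised embedding and each
    class centre; s and m are the scale and margin from the ArcFace paper.
    """
    logits = []
    for j, cos_t in enumerate(cosines):
        if j == target:
            theta = math.acos(max(-1.0, min(1.0, cos_t)))
            logits.append(s * math.cos(theta + m))  # harder target logit
        else:
            logits.append(s * cos_t)
    return logits

def cross_entropy(logits, target):
    """Plain softmax cross-entropy for a single sample."""
    mx = max(logits)
    exps = [math.exp(z - mx) for z in logits]
    return -math.log(exps[target] / sum(exps))
```

Because the margin shrinks the target logit, the loss for a correctly classified sample stays higher than with plain scaled softmax, so gradients keep pulling same-class embeddings together even after the classes are separable.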

Experiment Conditions

We identified 4 key factors that would likely influence a CNN’s ability to classify facial attributes accurately:

  • Architecture
  • Loss Function
  • Training set
  • Projection method

We selected a list of attribute comparisons, mostly between complementary attributes, to evaluate the usefulness of the networks for the proposed use case.

  • Caucasian, Black, Asian, Indian, Latino
  • Old, Young
  • Bald, Not Bald
  • Male, Female
  • Chubby, Thin
  • Wearing Earrings, No Earrings
  • Wearing Hat, No Hat
  • Attractive, Not Attractive
  • Smiling, Not Smiling
  • Grey Hair, Black Hair
  • Wavy Hair, Straight Hair
  • Beard, No Beard
  • Black Hair, Brown Hair, Blond Hair
  • Heavy Makeup, Wearing Lipstick, Attractive
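Given per-image attribute flags, each pairing above reduces to selecting the complementary subsets of images. A sketch (the helper name is ours, assuming a dict of filename → {attribute: bool} as produced by an annotation parser):

```python
def split_by_attribute(records, attr):
    """Split image filenames into positive/negative sets for one attribute."""
    pos = [f for f, flags in records.items() if flags[attr]]
    neg = [f for f, flags in records.items() if not flags[attr]]
    return pos, neg
```

Multi-way comparisons such as Black/Brown/Blond hair were built the same way, one positive set per attribute.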

We designed a series of comparisons between the models to identify which combination would achieve the best results:

  • ResNet vs SeNet (both trained on VGGFace2): comparing model architecture
  • ImageNet vs VGGFace2: comparing training dataset
  • SeNet trained on VGGFace2 (with Softmax) vs SeNet trained on refined MS1M (with ArcFace loss): comparing dataset as well as loss function


We’ve employed a variety of projection methods to gain deeper insight into how the models organise the data, and represented the projections in two formats:

  1. Scatterplots, which clearly show the separation achieved on the labelled attributes we’re interested in. Dimensionality reduction techniques were taken primarily from scikit-learn, though a few other libraries were also used.
  2. All projections plotted at once, which is the ultimate purpose of our research but makes it difficult to observe all of the clusters.
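In practice we relied on scikit-learn (e.g. sklearn.decomposition.PCA) and libraries such as umap-learn, but the linear case is compact enough to sketch in pure Python: the first principal axis can be found by power iteration on the (implicit) covariance matrix, and embeddings are projected onto it to form one scatterplot axis. The function name is ours.

```python
def pca_first_axis(X, iters=200):
    """Project rows of X onto the first principal axis via power iteration.

    Pure-Python sketch of what sklearn.decomposition.PCA(n_components=1)
    computes; returns one scalar coordinate per row.
    """
    n, d = len(X), len(X[0])
    mean = [sum(col) / n for col in zip(*X)]
    Xc = [[x - m for x, m in zip(row, mean)] for row in X]
    v = [1.0] * d
    for _ in range(iters):
        # apply the covariance operator as X^T (X v), never forming X^T X
        Xv = [sum(a * b for a, b in zip(row, v)) for row in Xc]
        w = [sum(Xc[i][j] * Xv[i] for i in range(n)) for j in range(d)]
        norm = sum(c * c for c in w) ** 0.5
        v = [c / norm for c in w]
    return [sum(a * b for a, b in zip(row, v)) for row in Xc]
```

Plotting this coordinate against a second axis (another PCA component, or a UMAP/t-SNE dimension) and colouring points by a labelled attribute gives the scatterplots shown below.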


We performed pair-based comparisons on each set of models, datasets and loss functions to identify which of these improved the ability to distinguish facial attributes.

The intuition was that VGGFace2 would perform best because of its high variation in subject pose, age, etc. compared to "emore", which is less diverse, and to the general-purpose ImageNet dataset. The experiments confirmed this intuition.

VGGFace2 pre-trained models pay more attention to color, while ImageNet models seem to pay more attention to texture. Moreover, all of the ImageNet models’ attribute projections have a large spread and don’t separate well from each other.

We also experimented with two loss functions, ArcFace and Softmax. ArcFace performed better, which can be credited to its tendency to enforce higher inter-class disparity and intra-class compactness of attributes, unlike the Softmax loss function, which does not explicitly optimize the feature embedding.

Source: ArcFace: Additive Angular Margin Loss for Deep Face Recognition [6]

When it came to the choice of model architecture, we knew it had to be ResNet-34 to match the current model, but we still wanted to experiment with different architectures. An interesting finding was that SE modules seem to decrease the quality of manifold projections, despite an increase in classification accuracy. This may be because SE modules make it easier to pass unmodified information from previous residual blocks further into the network, which allows for more complex features.

We used scatter plots as a quick visual tool for demonstrating the separation achieved on the labelled attributes of interest.

After experimenting with different combinations, we concluded that a ResNet-34 model trained on the VGGFace2 dataset with the ArcFace loss function should deliver the best results.

Scatter plot projections

We’ve discovered that UMAP provides the best projections for all evaluated models, while t-SNE also manages to cluster the evaluated attributes reasonably well. In most cases some PCA axes are able to provide gradients of attributes even when non-linear methods are not effective. ICA projections are also good, and often better than PCA, though they fail when the manifold methods fail.

Gender seems to be the easiest feature to classify for both models. Most PCA axes for both networks also allow clear separation.

Figure 3. UMAP plotting of gender (Red - Male, Green - Female)

Figure 4. UMAP plotting of age (Red - Old, Green - Young)

Figure 5. UMAP plotting of attractiveness (Red - Attractive, Green - Unattractive)

Figure 6. UMAP plotting of beard (Red - Beard, Green - No beard)

Figure 7. UMAP plotting of hats (Red - Wearing hat, Green - Not wearing hat)

Figure 8. UMAP plotting of baldness (Red - Bald, Green - Not Bald)

Figure 9. UMAP plotting of race (Red - White, Green - Black, Yellow - Asian, Blue - Latino, Cyan - Indian)

Figure 10. UMAP plotting of hair color (Red - Black hair, Green - Brown hair, Yellow - Blonde hair) projections

We ‘cherry picked’ plots containing clusters with a high inter-class disparity and intra-class compactness. Both models have many projections where no significant clustering is exhibited; these results have been excluded.

Default Model

The default model was only able to significantly cluster three attributes: gender, hats and background colour:

  • Gender - clustered well, though there’s some overlap and a large spread for men
  • Hats - clustered fairly well, though there is a lot of overlap with ‘No Hats’
  • Background colour - found to be the most significant influence on clustering amongst the projections. The model appears unable to classify based on any other colour in the images: hair colour, skin colour, etc.

Figure 11. The current model clustering a subset of Celeb-A: Layer 4 Axis 1

Figure 12. The current model clustering a subset of Celeb-A: Layer 1 Axis 1

ResNet-34 / VGGFace2 / ArcFace (new model)

The new model was able to cluster all of the same attributes as the current model, except for Hats/No Hats (which we expect is because the wearing of hats is of no relevance to face recognition systems). In addition to these attributes, the new model can classify race, age, hair colour and hair length:

  • Gender - clustered well, though there’s some overlap and a large spread for women
  • Skin colour - there is a clear gradient of skin tone across the projection
  • Age - seems to be most easily classified in white women. Both older women and young, fair-haired women cluster nicely together.
  • Hair colour - a gradient of colour across all the data, though there don’t appear to be any clusters of other attributes in the projection.

Figure 13. New model’s clustering subset of Celeb-A: Layer 4 Axis 1

Figure 14. New model’s clustering subset of Celeb-A: Layer 3 Axis 5

Figure 15. New model’s clustering subset of Celeb-A: Layer 4 Axis 4

The clusters in Figure 6 aren’t perfect; there are some Asians and old men outside of the clusters, though the poor clustering is suspected to be due to the projections rather than the model.


The new model improves on the default one in distinguishing between more facial attributes, producing a much higher inter-class disparity and intra-class compactness amongst attributes. As we suspected, the model architecture, training set and loss function all played an important role in achieving this increased performance.

The superior results of ArcFace’s loss function can be credited to its tendency to enforce higher inter-class disparity and intra-class compactness of attributes.

The superior results of VGGFace2 can be credited to the high variations in terms of subject poses, age, etc. compared to ImageNet general purpose dataset. In future, preprocessing the dataset with AI Fairness 360 [7] to reduce the dataset bias could further improve the quality of features.


References

  1. Deng, Jia, et al. ImageNet: A Large-Scale Hierarchical Image Database. 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009.
  2. Deep Insight. Report Your Verification Accuracy of New Training Dataset 'insightv2_emore', GitHub, Deep Insight, 2018.
  3. Cao, Qiong, et al. VGGFace2: A Dataset for Recognising Faces across Pose and Age. 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), 2018.
  4. He, Kaiming, et al. Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  5. Hu, Jie, et al. Squeeze-and-Excitation Networks. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
  6. Deng, Jiankang, et al. ArcFace: Additive Angular Margin Loss for Deep Face Recognition. [online], 2018.
  7. Bellamy, R. K. E., et al. AI Fairness 360: An Extensible Toolkit for Detecting, Understanding, and Mitigating Unwanted Algorithmic Bias. [online], 2018.