by Niko Laskaris and Aradhya Chouhan
Logo detection in images has many applications, such as brand recognition for marketing analytics and intellectual property protection. As a computer vision problem, logo detection is challenging for two primary reasons: there is no fixed number of company logos to classify, and in wild images the logo itself is often a very small feature of the input image. State-of-the-art (SOTA) results have been achieved using object localization methods. Using the Logos32Plus dataset and a YOLOv3-Darknet object localization pipeline, we achieve new SOTA recall (97.71%), F1 (97.76%) and accuracy (98.12%) compared to benchmark models.
Use Cases for Logo Detection
Detecting brand logos in images (and video) has important applications in domains ranging from marketing analytics (allowing a company to track how frequently and where brand images appear in social media content) to intellectual property protection. At the time of posting this article, the European Union has announced new laws that hold technology companies responsible for material posted without proper copyright permission. In order for companies to comply with stricter copyright laws and protections, the ability to identify logos in posted content is becoming increasingly important.
Figure 1: Sample images in which a brand logo is present. Left: Apple; right: Corona.
Prior to the development of deep learning models to tackle the problem of logo detection, state-of-the-art results came from techniques such as keypoint-based detectors and descriptors [1, 2], and methods to localize logos in images using homographic class graphs [3, 4]. In the last few years, pre-trained ConvNets have achieved SOTA results. Most public SOTA results are obtained by training on the Flickr32 dataset with object localization. Flickr32 contains 8240 images across 32 classes. Recently, Bianco et al. [note] have developed the Logos32Plus dataset, an expansion of Flickr32. It contains the same 32 logo classes with a much larger cardinality (12,312 images).
Figure 2: Visual summary of Logos32Plus dataset.
Localization (focussing on certain regions of input images) can dramatically improve the performance of logo detection classifiers. There are many object localization methods and algorithms, including Faster R-CNN, R-FCN, RetinaNet, SSD, Selective Search and YOLO.
Figure 3: Visualization of Selective Search method used in Bianco et al.
When a model is trained on datasets in which objects of interest occupy small areas of input images, location information about those objects can dramatically improve classification performance. For many localization methods, training requires ‘ground truth’ object location labels. These usually take the form of bounding box annotations, that is, the coordinates of a bounding box within which the entirety of the object of interest is contained.
Figure 4: Training image with bounding box annotation around object of interest.
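As an illustration of what such an annotation can look like, the sketch below converts a corner-style bounding box into the normalized one-line text format commonly read by Darknet-based YOLO training (class index, box center, width and height, all relative to the image size). The function name and the example numbers are hypothetical and are not taken from the pipeline described in this article.

```python
def to_darknet_label(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel-space corner box to a normalized 'class xc yc w h' line."""
    x_center = (x_min + x_max) / 2 / img_w
    y_center = (y_min + y_max) / 2 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# Hypothetical 400x200 image with a class-0 logo spanning (100, 50) to (200, 150).
print(to_darknet_label(0, 100, 50, 200, 150, 400, 200))
```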
Training Without Object Localization
To compare training with and without object annotations, we trained baseline models without bounding box annotations.
Figure 5: Model architecture and training stats.
In each case we were 15-20% below top benchmark performance, which can be seen by comparing our accuracy scores in Figure 6 to the benchmark results in Figure 11.
Figure 6: Model performance on benchmark datasets.
Training With Object Localization: YOLOv3 and Darknet
For training with annotations we used the YOLOv3 object detection algorithm and the Darknet architecture. YOLO (You Only Look Once) is an object detection algorithm, trained on images with ground-truth object labels, that is notably faster than other object detection algorithms. Previous methods, like R-CNN and its variants, use a pipeline of separate networks that perform localization and classification in multiple steps. Because each of these components must be trained separately, these methods can be slow to train and hard to optimize. YOLO, conversely, does it all with a single neural network.
YOLO Output Vector
- pc = probability of an object in the image
- bx = bounding box center x-coordinate
- by = bounding box center y-coordinate
- bh = bounding box height
- bw = bounding box width
- c1 = predicted probability that the object in the bounding box belongs to class 1
- c2 = predicted probability that the object in the bounding box belongs to class 2
- c3 = predicted probability that the object in the bounding box belongs to class 3
- cn = … predicted probability that the object in the bounding box belongs to class n
The input image is divided into an S x S grid of cells. For each object that is present in the image, one grid cell — the cell where the center of the object falls — is said to be “responsible” for predicting it.
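A minimal sketch of how the "responsible" cell can be found from an object's center, assuming pixel coordinates with the origin at the top-left corner of the image. The function name is hypothetical.

```python
def responsible_cell(center_x, center_y, img_w, img_h, S):
    """Return (row, col) of the S x S grid cell containing the object's center."""
    col = min(int(center_x / img_w * S), S - 1)  # clamp centers on the right edge
    row = min(int(center_y / img_h * S), S - 1)  # clamp centers on the bottom edge
    return row, col


# Hypothetical 300x300 image and a 3x3 grid: a centered object falls in the middle cell.
print(responsible_cell(150, 150, 300, 300, 3))
```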
Each grid cell predicts B bounding boxes as well as C class probabilities. Each bounding box prediction has 5 components: (x, y, w, h, confidence). The (x, y) coordinates represent the center of the box, relative to the grid cell location. These coordinates are normalized to fall between 0 and 1. The (w, h) box dimensions are also normalized to [0, 1], relative to the image size.
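To make the two normalizations concrete, the following sketch decodes one predicted box back into pixel corners, assuming (x, y) are offsets within the cell at (row, col) and (w, h) are fractions of the full image. The function name and example values are illustrative only.

```python
def decode_box(x, y, w, h, row, col, S, img_w, img_h):
    """Map a cell-relative (x, y) and image-relative (w, h) to pixel corners."""
    center_x = (col + x) / S * img_w  # x is an offset inside the cell's column
    center_y = (row + y) / S * img_h  # y is an offset inside the cell's row
    box_w = w * img_w
    box_h = h * img_h
    return (center_x - box_w / 2, center_y - box_h / 2,
            center_x + box_w / 2, center_y + box_h / 2)


# Hypothetical prediction from the middle cell of a 3x3 grid on a 300x300 image.
print(decode_box(0.5, 0.5, 0.25, 0.5, row=1, col=1, S=3, img_w=300, img_h=300))
```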
Figure 7: Example of image grid and bounding box prediction for YOLO.
The confidence score is defined as Pr(Object) * IOU(pred, truth). IOU is the Intersection-Over-Union and reports how much our predicted bounding box overlaps with the ground truth bounding box (a score close to 1 is good and means the prediction largely overlaps the ground truth box). If no object exists in the cell, the confidence score should be zero. Otherwise, we want the confidence to equal the IOU.
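The IOU and the resulting confidence target can be sketched as follows, assuming boxes are given as (x_min, y_min, x_max, y_max) tuples. The helper names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def confidence_target(object_present, pred_box, truth_box):
    """Target confidence: the IOU when an object is present, otherwise zero."""
    return iou(pred_box, truth_box) if object_present else 0.0


# Two 2x2 boxes offset by one unit share a 1x1 intersection and a union of 7.
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```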
It is also necessary to predict the class probabilities for each cell, defined as the conditional probability Pr(Class[i] | Object). Adding the class predictions to the output vector, we get an S x S x (B * 5 + C) tensor as output, where:
S: the number of grid cells on each axis. (In the above image, S = 3)
B: number of bounding box predictions we will make per grid cell
C: number of classes we are predicting in our classifier
Figure 8: Darknet architecture.
The network for YOLO is straightforward, with a custom head of two fully-connected layers. The final layer will reflect the tensor dimensions of the output vector. For example, in the case of our dataset (Logos32Plus), if we want to divide our images into a 5x5 search grid, and we want to predict 5 bounding boxes per grid cell, then our output tensor will have dimensions S x S x (B * 5 + C), or 5 x 5 x (5 * 5 + 32) = 5 x 5 x 57, for a total of 1425 output values.
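The arithmetic above can be checked with a few lines of Python. The helper names are hypothetical and simply restate the S x S x (B * 5 + C) formula.

```python
def yolo_output_shape(S, B, C):
    """Shape of the output tensor: S x S cells, each with B boxes of 5 values plus C classes."""
    return (S, S, B * 5 + C)


def yolo_output_size(S, B, C):
    """Total number of values the final layer must produce."""
    rows, cols, depth = yolo_output_shape(S, B, C)
    return rows * cols * depth


# 5x5 grid, 5 boxes per cell, 32 logo classes (Logos32Plus).
print(yolo_output_shape(5, 5, 32), yolo_output_size(5, 5, 32))
```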
YOLO Loss Function
Here we explain the YOLO loss function in each of its parts.
Figure 9: The YOLO loss function.
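For readers without access to the figure, the loss can be written out as in the original YOLO paper by Redmon et al. (listed in the references). This is a transcription under that assumption, not a formula specific to our training setup. The first line corresponds to Part I below, the second to Part II, the third and fourth to Part III, and the last to Part IV. The symbol p_i(c), the class probability for class c in cell i, follows the paper's notation, and the paper sets λ_coord = 5 and λ_noobj = 0.5.

```latex
\begin{aligned}
\mathcal{L} ={} & \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
      \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
& + \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
      \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
& + \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
& + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
& + \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```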
Part I: X and Y
Loss from predicting the bounding box center position (x, y). The function computes a sum over each bounding box prediction (j = 0...B) of each grid cell (i = 0...S^2). Iobj is defined as 1 if an object is present in grid cell i and the jth bounding box predictor is ‘responsible’ for it, and 0 otherwise. Here x and y are the coordinates of the center of the predicted bounding box.
Part II: Width and Height
Loss from predicting bounding box height and width. Because small deviations in large boxes should matter less than in small boxes, the square root of the height and width is predicted, rather than the height and width directly.
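A small numeric sketch of why the square root helps, using made-up normalized widths: the same absolute error of 0.05 is penalized more on a small box than on a large one once square roots are taken, whereas the plain squared error treats both cases the same. The function name is hypothetical.

```python
import math


def size_error(true_size, pred_size, use_sqrt=True):
    """Squared error on a box dimension, optionally on its square root."""
    if use_sqrt:
        return (math.sqrt(true_size) - math.sqrt(pred_size)) ** 2
    return (true_size - pred_size) ** 2


small_box = size_error(0.10, 0.15)  # small box, off by 0.05
large_box = size_error(0.80, 0.85)  # large box, off by the same 0.05
print(small_box > large_box)
```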
Part III: Object Confidence
Loss associated with the confidence score for each bounding box predictor. C is the confidence score and Ĉ is the IOU of the predicted bounding box with the ground truth. Iobj is equal to one when there is an object in the cell, and 0 otherwise. Inoobj is the opposite.
Part IV: Classification Loss
This is essentially normal sum-squared error, except for the Iobj term. This term is used so classification error is not penalized when no object is present in the grid cell. The λ parameters weight different parts of the loss function, which is necessary to increase model stability.
Figure 10: YOLO predictions at 1000, 1300, and 9000 iterations.
After training our model for about 160 epochs (1,024,256 images in total), we achieved a testing accuracy of 91.41%, which is close to SOTA and well within the range of world class results. This was without any data augmentations or other transformations, or any fast.ai best practices (like lr_find()). To our knowledge, this is the first time the Logos32Plus dataset has been trained with a Darknet architecture using YOLO for object detection.
We initially hypothesized that domain-specific training (not using pre-trained weights) would improve model performance, but we also conducted an experiment training our Darknet model starting from weights pre-trained on ImageNet. After training the model with ImageNet weights, again for about 160 epochs, we achieved a validation accuracy of 98.12%, a new SOTA classification accuracy on the Logos32Plus dataset.
Figure 11: Some state-of-the-art results. FL32: Flickr32; L32+: Logos32Plus datasets. Model performance compared to benchmarks. 
Platform.ai provides useful tools for visualizing class distribution of image datasets. Here we upload a 10-class subset of our Logos32Plus dataset for visualization. Without bounding box annotations, and training on a ResNet, the model clearly learns to distinguish logos based on ImageNet-related objectness: shapes (bottle, car) or color (red, yellow).
Figure 12: Visual clustering projections from platform.ai.
We trained ResNet pretrained models on a number of benchmark datasets for logo detection (Flickr27, Flickr32 and Logos32Plus) using fast.ai best practices, and trained our Darknet/YOLO architecture on the Logos32Plus dataset. The results verified our hypothesis that object localization would improve model performance: our Darknet/YOLO model achieved a new SOTA.
[1] A. D. Bagdanov, L. Ballan, M. Bertini, A. Del Bimbo. “Trademark matching and retrieval in sports video databases.” Proceedings of the international workshop on Workshop on multimedia information retrieval, ACM, 2007. https://www.researchgate.net/publication/210113141_Trademark_matching_and_retrieval_in_sports_video_databases
[2] J. Kleban, X. Xie, W.-Y. Ma. “Spatial pyramid mining for logo detection in natural scenes.” IEEE International Conference, 2008. https://ieeexplore.ieee.org/document/4607625
[3] R. Boia, C. Florea, L. Florea, R. Dogaru. “Logo localization and recognition in natural images using homographic class graphs.” Machine Vision and Applications 27 (2), 2016. https://link.springer.com/article/10.1007/s00138-015-0741-7
[4] R. Boia, C. Florea, L. Florea. “Elliptical asift agglomeration in class prototype for logo detection.” BMVC, 2015. http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=5C87F52DE38AB0C90F8340DFEBB841F7?doi=10.1.1.707.9371&rep=rep1&type=pdf
[5] S. Bianco, M. Buzzelli, D. Mazzini, R. Schettini. “Logo recognition using cnn features.” Image Analysis and Processing ICIAP, 2015. https://link.springer.com/chapter/10.1007%2F978-3-319-23234-8_41
[6] C. Eggert, A. Winschel, R. Lienhart. “On the benefit of synthetic data for company logo detection.” ACM, 2015. http://www.multimedia-computing.de/mediawiki/images/c/cf/ACMMM2015.pdf
[7] S. Romberg, L. Garcia Pueyo, R. Lienhart, R. van Zwol. “Scalable Logo Recognition in Real-World Images.” ICMR11, 2011. http://www.multimedia-computing.de/flickrlogos/
[8] J. R. R. Uijlings, et al. “Selective search for object recognition.” International Journal of Computer Vision 104 (2), 2013. http://www.huppelen.nl/publications/selectiveSearchDraft.pdf
[9] J. Revaud, M. Douze, C. Schmid. “Correlation-based burstiness for logo retrieval.” ACM, 2012. https://hal.inria.fr/hal-00728502/document
[10] G. Oliveira, X. Frazão, A. Pimentel, B. Ribeiro. “Automatic graphic logo detection via fast region-based convolutional networks.” IEEE, 2016. https://arxiv.org/abs/1604.06083
[11] F. N. Iandola, A. Shen, P. Gao, K. Keutzer. “Deeplogo: Hitting logo recognition with the deep neural network hammer.” 2015. https://arxiv.org/abs/1510.02131
[12] S. Bianco, M. Buzzelli, D. Mazzini, R. Schettini. “Deep Learning for Logo Recognition.” Neurocomputing 245, 2017. http://dx.doi.org/10.1016/j.neucom.2017.03.051
[13] M. Menegaz. “Understanding YOLO.” 2018. https://hackernoon.com/understanding-yolo-f5a74bbc7967
[14] S. Ren, K. He, R.B. Girshick, J. Sun. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks.” IEEE Transactions on Pattern Analysis and Machine Intelligence, 39, 2015. https://arxiv.org/abs/1506.01497
[15] J. Dai, Y. Li, K. He, J. Sun. “R-FCN: Object Detection via Region-based Fully Convolutional Networks.” NIPS, 2016. https://arxiv.org/abs/1605.06409
[16] T. Lin, P. Goyal, R.B. Girshick, K. He, P. Dollár. “Focal Loss for Dense Object Detection.” IEEE, 2017. https://arxiv.org/abs/1708.02002
[17] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S.E. Reed, C. Fu, A.C. Berg. “SSD: Single Shot MultiBox Detector.” ECCV, 2016. https://arxiv.org/abs/1512.02325
[18] J. Redmon, S.K. Divvala, R.B. Girshick, A. Farhadi. “You Only Look Once: Unified, Real-Time Object Detection.” IEEE, 2016. https://arxiv.org/abs/1506.02640
[19] J. Howard, et al. “Fast.ai.” 2019. https://github.com/fastai/fastai
[20] Z. Kleinman. “EU backs controversial copyright law.” BBC News, 2019. https://www.bbc.com/news/amp/technology-47708144