Attention Cropping in

by Krisztian Kovacs, Constantin Baumgartner, Niko Laskaris, Richard Lipkin PhD


One way to make categorical image labelling easier on is to zoom in on the objects of interest, showing the user images with irrelevant background cropped away. This could generate a large improvement in ease of use: in some datasets the relevant objects only occupy a small area, and recognizing them is often tricky. Further, showing users a combination of regular and zoomed images could allow them to recognize a wider range of patterns.

To zoom in on the objects, we need to draw a bounding box around the relevant region so that we can crop away the background. Figure 1 shows example images from a brand logo dataset in which the logos occupy only a small part of the image (and thus are hard to recognize). Cropping these images using the bounding boxes shown would enhance the user’s classification ability.

Figure 1. In logo classification, the object of interest might comprise a very small part of the input image. Bounding boxes are shown as red rectangles.

Bounding boxes can be predicted using a wide range of object detection methods, such as Region-based Convolutional Neural Networks (R-CNN) (1), You Only Look Once (YOLO) (2), Single Shot Detector (SSD) (3), and RetinaNet (4), but these methods require bounding box-annotated training data. Annotating bounding boxes is time-consuming (and therefore expensive), as it currently has to be done image-by-image. It cannot be done as efficiently as assigning whole-image labels in, where hundreds of images can be labeled in a few seconds.

Therefore, we aim to find a method that predicts bounding boxes without such annotations, using only class-level labels. The predicted bounding boxes will then allow us to create a dataset with a mix of cropped and uncropped images for a smoother labelling experience.

Saliency Maps

We will use various Saliency Maps (SMs) to generate bounding boxes and compare their performance. In computer vision, SMs are topographical, scalar representations of unique image features that can be anything from pixel values to resolution or contours (5). A class activation map (CAM) is a type of SM that can be used to generate bounding boxes. Informally, CAMs use activations in a neural network layer to highlight locations that contribute to a given model prediction. Once we have such a map, obtaining the bounding box is relatively straightforward: we select a box that contains the highly activated region, as shown in Figure 2.

Figure 2. An example Saliency Map and respective bounding box.

CAMs were first introduced by Zhou et al. (6). They created a custom architecture consisting of a VGG network followed by a global average pooling (GAP) layer. Instead of feeding the pooled activations through a series of linear and ReLU layers, the authors used a single linear layer to calculate the output activations. This architectural change had one major benefit: the weights of the final linear layer reveal the ‘importance’ of each channel in the last convolutional layer (before GAP). Accordingly, changing the activations in a channel with a high weight affects the final score more than doing so for a low-weight channel.

Using these weights, Zhou et al. (6) calculated the weighted average of the last convolutional layer’s activations across the channel dimension. For example, for a 224✕224 image, the VGG network would have 14✕14✕512 activations in the last convolutional layer; taking the weighted average over the 512 channels gives a 14✕14 map. We can then use bilinear interpolation (or any other up-sampling technique) to convert this 14✕14 map back to the original 224✕224 shape. Voilà, that’s our class activation map!
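As a rough illustration of the weighted channel combination and up-sampling described above, here is a minimal NumPy sketch. The function names `class_activation_map` and `upsample_bilinear`, the channel-first array layout, the corner-aligned interpolation, and the random toy inputs are all assumptions made for illustration; this is not the code of Zhou et al.

```python
import numpy as np

def class_activation_map(activations, class_weights):
    """Combine channel maps using the final linear layer's weights for one class.

    activations:   (C, H, W) output of the last convolutional layer
    class_weights: (C,) weights of the chosen class in the final linear layer
    """
    # Weighted sum over the channel dimension, result has shape (H, W).
    return np.tensordot(class_weights, activations, axes=1)

def upsample_bilinear(cam, out_h, out_w):
    """Bilinear up-sampling of a 2D map (corners of input and output aligned)."""
    in_h, in_w = cam.shape
    ys = np.linspace(0, in_h - 1, out_h)   # output rows in input coordinates
    xs = np.linspace(0, in_w - 1, out_w)   # output columns in input coordinates
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]                # vertical interpolation weights
    wx = (xs - x0)[None, :]                # horizontal interpolation weights
    top = cam[y0][:, x0] * (1 - wx) + cam[y0][:, x1] * wx
    bottom = cam[y1][:, x0] * (1 - wx) + cam[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

# Toy input with the shapes from the text: 512 channels of 14x14 activations.
rng = np.random.default_rng(0)
activations = rng.random((512, 14, 14))
class_weights = rng.random(512)

small_cam = class_activation_map(activations, class_weights)
cam = upsample_bilinear(small_cam, 224, 224)
print(small_cam.shape, cam.shape)  # (14, 14) (224, 224)
```

In practice a deep learning framework's built-in interpolation would normally replace the hand-written up-sampling; it is spelled out here only to keep the sketch self-contained.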

One major problem with this approach is that it requires a custom architecture. Ideally, we would use the same architecture for the CAM that we use for obtaining class predictions. In’s case, that’s the ResNet34 architecture. We still want to use the last convolutional layer of our ResNet34; we just need to find suitable weights to average the channels. The natural solution is to use gradients, as presented by Selvaraju et al. (7). We average the gradients of the output score in each channel of the last convolutional layer, giving us a natural measure of channel ‘importance’. If the gradients of the final output with respect to a channel’s activations are high, changing those activations will have a large effect on the final score. Channels with this property are more likely to be active at the object’s location than channels with small gradients (where changing the activations is immaterial).

This method is called GradCAM, and it can be implemented in any standard architecture. To obtain an SM using the GradCAM method, we go through the following steps:

  1. We feed an image through the model, obtaining class probabilities for all of our labels, and choose the label with the highest probability.
  2. We do a backward pass, calculating the gradients of the chosen label’s score with respect to the activations of the last convolutional layer in our model.
  3. We calculate the average gradient for each channel—that’s our weight.
  4. We calculate the weighted average of the activations in the last convolutional layer across channels, using the weights from the previous step.
  5. We apply a ReLU—after all, we’re not interested in negative values; we only care about locations that positively contribute to our chosen output.
  6. We resize the SM to have the same dimensions as the input image (using bilinear interpolation).
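Steps 3 to 5 of the list above can be sketched in a few lines of NumPy. This sketch assumes that the forward and backward passes (steps 1 and 2) have already been run in a deep learning framework and that the resulting activations and gradients are available as arrays; the function name `gradcam_map` and the toy inputs are hypothetical, and the final resize (step 6) is left out.

```python
import numpy as np

def gradcam_map(activations, gradients):
    """Steps 3-5 of the list above for a single image.

    activations: (C, H, W) activations of the last convolutional layer
    gradients:   (C, H, W) gradients of the chosen label's score with
                 respect to those activations (from the backward pass)
    """
    weights = gradients.mean(axis=(1, 2))             # step 3: one weight per channel
    cam = np.tensordot(weights, activations, axes=1)  # step 4: weighted combination, (H, W)
    return np.maximum(cam, 0.0)                       # step 5: ReLU keeps positive evidence

# Toy example with two channels on a 2x2 grid (values chosen for illustration).
activations = np.array([[[1.0, 0.0], [0.0, 0.0]],
                        [[0.0, 2.0], [0.0, 0.0]]])
gradients = np.stack([np.full((2, 2), 1.0),    # channel 0 raises the score
                      np.full((2, 2), -1.0)])  # channel 1 lowers the score
print(gradcam_map(activations, gradients))
```

In the toy example, the location where only the negatively weighted channel fires is zeroed out by the ReLU, so only the top-left cell stays active.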

We also test a more complex adaptation of GradCAM called GradCAM++, introduced by Chattopadhyay et al. (8). The authors still rely on gradients, choosing channel weights proportional to the (positive) channel gradients, and calculate channel contributions to (a smoothed function of) the final output score.

Calculating bounding boxes from the SM is relatively uncomplicated. One option is to take the smallest and largest coordinates of the nonzero pixels along both the x and y axes. The disadvantage of this method is that a tiny activation in a distant corner of the image forces the bounding box to cover that corner. We are more interested in a ‘central mass’ of activations and want to ignore these potential outliers. A simple solution is to switch from the minimum/maximum to a percentile range. If we cut off ε percent from each tail of the cumulative distribution of activations along the x and y axes, the center is unaffected as long as ε is small enough, while small outlying activations in a corner of the image are discarded. We set ε to 2.5%, so that our bounding boxes represent the middle 95% of nonzero activations along both x and y. Note that ε is an adjustable hyperparameter that is governed by recall/precision trade-offs.
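The percentile-based box can be sketched as follows. This is one possible reading of the description above, assuming the cumulative distribution is weighted by activation mass along each axis; the function name `percentile_bbox` and the toy saliency map are hypothetical.

```python
import numpy as np

def percentile_bbox(saliency, eps=0.025):
    """Box (x_min, y_min, x_max, y_max), inclusive pixel indices, that drops
    a fraction eps of the activation mass from each side along x and y."""
    total = saliency.sum()
    height, width = saliency.shape
    if total <= 0:
        return 0, 0, width - 1, height - 1  # nothing activated: keep the full image

    def bounds(profile):
        cdf = np.cumsum(profile) / total
        lo = int(np.searchsorted(cdf, eps, side="left"))        # first index with cdf >= eps
        hi = int(np.searchsorted(cdf, 1.0 - eps, side="left"))  # first index with cdf >= 1 - eps
        return lo, min(hi, len(profile) - 1)

    x_min, x_max = bounds(saliency.sum(axis=0))  # activation mass per column
    y_min, y_max = bounds(saliency.sum(axis=1))  # activation mass per row
    return x_min, y_min, x_max, y_max

# Toy map: a central blob plus one tiny outlier in the top-right corner.
saliency = np.zeros((100, 100))
saliency[40:60, 30:70] = 1.0
saliency[0, 99] = 1.0
print(percentile_bbox(saliency))
```

With the plain minimum/maximum rule, the single outlier pixel would stretch the box to the top-right corner; with ε = 2.5% the box stays on the central blob.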


Data & Method

To gauge the performance of our semi-supervised bounding box approach, we benchmarked it on PASCAL VOC2012 (9). This dataset is well known in the object detection field, and classification benchmarks have been obtained on it using a wide range of models. Instead of trying to beat those benchmarks, we want to make sure that our proposed method produces bounding boxes that are more useful than the full training images.

We have two objectives in trying to create useful bounding boxes:

  1. Making sure that the object of interest is in the bounding box, and
  2. Reducing the area of the original image.

If we achieve both objectives, we effectively create a higher-quality dataset for both human annotators and the models trained on the dataset. We can judge our success on (1.) by looking at average recall, which measures the fraction of the object area that appears in our bounding box. We can judge our success on (2.) by looking at the area of the bounding box relative to that of the original image. In general, the larger the reduction, the more useful our bounding box is.
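The two measures can be written down compactly. The sketch below assumes that the object area is approximated by its ground-truth bounding box and that boxes are given as `(x_min, y_min, x_max, y_max)` in pixel coordinates; the function names and the example numbers are hypothetical.

```python
def box_area(box):
    """Area of a box (x_min, y_min, x_max, y_max); zero if the box is empty."""
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)

def box_recall(pred, truth):
    """Fraction of the ground-truth box that lies inside the predicted box."""
    inter = (max(pred[0], truth[0]), max(pred[1], truth[1]),
             min(pred[2], truth[2]), min(pred[3], truth[3]))
    return box_area(inter) / box_area(truth)

def area_reduction(pred, image_width, image_height):
    """Fraction of the original image that is cropped away."""
    return 1.0 - box_area(pred) / (image_width * image_height)

# Example: a 200x200 image, a 100x100 object, and a slightly shifted prediction.
truth = (50, 50, 150, 150)
pred = (40, 60, 160, 160)
print(box_recall(pred, truth))         # 90% of the object is retained
print(area_reduction(pred, 200, 200))  # about 70% of the image is cropped away
```

Averaging `box_recall` over a test set gives the average recall used for objective (1.), and averaging `area_reduction` gives the measure for objective (2.).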

While we report the mean average precision (mAP) measure used in the literature as a benchmark, our approach is not optimized for it. In particular, we prioritize recall over precision. Not losing the relevant object is much more important than excluding as much of the background as possible.  

We show the results for both GradCAM and GradCAM++. In addition to standard test-set calibrations, we display the results of two additional experiments:

  1. Creating SMs based on a ‘small model’ trained with only 10 classes and fewer than 100 examples per class.
  2. Instead of using the original image resolution, obtaining SMs for resized 224✕224 square images, then converting the bounding boxes back to the original image dimensions.

The motivation behind (1.) is to see how the SMs perform in an environment similar to At the beginning of the labelling stage, users usually label ~100 images per class before fitting the model. While the class predictions for (1.) for unseen classes will necessarily be wrong, we want to make sure that the SMs generalize. In other words, while we won’t be able to correctly classify unseen classes, we want to make sure that we recognize ‘general object shapes’. The reason for (2.) is execution speed. Resizing images to 224✕224 will presumably decrease performance somewhat, but it allows us to assemble images into mini-batches and utilize standard PyTorch dataloaders and fastai databunches for a 10✕ total speedup.
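Converting a box predicted on the resized image back to the original dimensions is a per-axis rescaling. The sketch below assumes that the image was stretched to 224✕224 without padding or cropping, so each axis has its own scale factor; the function name `rescale_box` and the example sizes are hypothetical.

```python
def rescale_box(box, resized_size, original_size):
    """Map a box from resized-image coordinates to original-image coordinates.

    box:           (x_min, y_min, x_max, y_max) on the resized image
    resized_size:  (width, height) of the resized image, e.g. (224, 224)
    original_size: (width, height) of the original image
    """
    resized_w, resized_h = resized_size
    original_w, original_h = original_size
    x0, y0, x1, y1 = box
    # Each axis is scaled independently because the resize changed the aspect ratio.
    return (x0 * original_w / resized_w, y0 * original_h / resized_h,
            x1 * original_w / resized_w, y1 * original_h / resized_h)

# A box found on the 224x224 version of a 500x375 image.
print(rescale_box((56, 112, 168, 224), (224, 224), (500, 375)))
# (125.0, 187.5, 375.0, 375.0)
```

If the resize had preserved the aspect ratio by padding or center-cropping instead, the padding or crop offset would also have to be undone before scaling.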


Figure 3 and Table 1 show sample results for the PASCAL dataset. Note that we only evaluated images containing a single object, but the method can also be transferred to multi-label settings.

Figure 3. Unbatched bounding box predictions for various Saliency Map methods and model sizes.



| Method | Mean Intersection over Union (IOU) | Mean Average Precision (mAP) | Image Area Reduction |
| --- | --- | --- | --- |
| GradCAM Small Model | | | |
| GradCAM Full Model | | | |
| GradCAM++ Small Model | | | |
| GradCAM++ Full Model | | | |
| GradCAM Small Model Batched | | | |
| GradCAM Full Model Batched | | | |
| GradCAM++ Small Model Batched | | | |
| GradCAM++ Full Model Batched | | | |

Table 1. Results for various Saliency Map methods and model sizes.

The results are in line with our objectives. Our recall is always above 70%, and for our best models, it is above 90%. This means that our bounding box almost always contains a large portion of the target object. Even though we are being conservative in this way, the method allows us to perform substantial zooming, reducing the original image area by 30%–70%.

The difference between GradCAM and GradCAM++ is that the former has better recall (it produces larger bounding boxes), while the latter has better precision. The batched version of GradCAM has worse recall than the original, which is unsurprising because it has a much smaller image to work with. In return, it produces larger image reductions.

The biggest surprise is the performance of the small model. Although trained on few instances per class, its performance is not too far behind that of the full model. This is good news for iterative development on it means that right after an initial model is fit, we can apply GradCAM to zoom in on relevant objects.

While the results are promising, it’s important to investigate how the method transfers to other datasets. Particularly challenging are datasets in which the object of interest occupies only a small fraction of the image, such as the Logos 32+ dataset (10). Partial results for the Logos 32+ test set are shown in Table 2.



| Method | Mean Intersection over Union (IOU) | Mean Average Precision (mAP) | Image Area Reduction |
| --- | --- | --- | --- |
| GradCAM Full Model | | | |
| GradCAM++ Full Model | | | |

Table 2. Results for the Logos 32+ dataset.

In the Logos 32+ dataset, the target objects occupy only 7% of the image area, resulting in low precision and hence low intersection over union (IOU) values. However, we still have very high recall while reducing the image area substantially. GradCAM can therefore still be used to create a better training set or highlight objects for improved efficiency of human inspection. As before, GradCAM gives better recall at the expense of slightly lower precision relative to GradCAM++.
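A small example shows why a small object leads to low precision and low IOU even when recall is perfect. The sketch below assumes area-based definitions (precision as the fraction of the predicted box covered by the object, recall as the fraction of the object covered by the predicted box); the function names and the box coordinates are hypothetical, chosen so that the object covers 7% of a 100✕100 image.

```python
def box_area(box):
    """Area of a box (x_min, y_min, x_max, y_max); zero if the box is empty."""
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)

def overlap_metrics(pred, truth):
    """Area-based precision, recall and IOU of a predicted box against a ground-truth box."""
    inter = box_area((max(pred[0], truth[0]), max(pred[1], truth[1]),
                      min(pred[2], truth[2]), min(pred[3], truth[3])))
    union = box_area(pred) + box_area(truth) - inter
    return {
        "precision": inter / box_area(pred),  # share of the crop that is object
        "recall": inter / box_area(truth),    # share of the object that is kept
        "iou": inter / union,
    }

# The object covers 7% of a 100x100 image; the predicted box is conservative.
truth = (0, 0, 35, 20)  # area 700
pred = (0, 0, 70, 40)   # area 2800, fully contains the object
print(overlap_metrics(pred, truth))
# {'precision': 0.25, 'recall': 1.0, 'iou': 0.25}
```

Because the union is never smaller than the predicted box, IOU can never exceed precision, so a conservative box around a small object yields a low IOU while still removing 72% of the image in this example.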


Figure 4 shows an example projection using both original (left) and cropped (right) images. In this example, cropping helped make motorcycles more visible and improved the quality of their separation from normal bicycles in the clustering. However, the second effect (better clustering) does not generalize: sometimes, uncropped images produce better clusters, but sometimes the cropped ones do.

For this reason, plans to incorporate a mix of cropped and uncropped images. Evaluating how well this feature aids labelling is generally a qualitative judgment. However, one quantitative way to evaluate this would be by asking users to label datasets both with and without crops and seeing whether adding cropped images improves the labelling process or results.

Figure 4. Uncropped (left) and cropped (right) projections from

Additionally, the amount of cropping can also be adjusted. Figure 5 shows sample images that have been cropped with an increasingly narrow activation retention margin (the cumulative amount of activated pixels retained in the x and y directions). Smaller cropping margins work better for smaller objects while larger cropping margins work better for larger objects. could incorporate a mix of cropping rates among the cropped images. The rates could be determined at random or based on some image characteristic, such as the size of the object of interest based on the ratio of activated pixels to total pixels in the image.

Figure 5. Sample images cropped with an increasingly narrow activation retention margin.


We used GradCAM saliency maps to crop images with the goal of eliminating irrelevant background and focusing on the object of interest. We also analyzed several different performance metrics to assess the quality of the automatically generated outputs, finding that our method’s bounding boxes successfully reduce image area while retaining the vast majority of the object of interest. Our implementation is compatible with the fastai library. We will soon implement this technique as a native feature on to visually assist users with assigning image labels. In the future, we will also examine how several other factors influence the outputs, such as changing model architecture, using multiple layers to calculate the CAMs, or cropping all vs. only a subset of the images in the dataset.


  1. Girshick, Ross, et al. “Rich feature hierarchies for accurate object detection and semantic segmentation.” Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 580-587.
  2. Redmon, Joseph, et al. “You Only Look Once: Unified, Real-Time Object Detection.” CVPR, 2016.
  3. Liu, Wei, et al. “SSD: Single Shot MultiBox Detector.” ECCV, 2016.
  4. Lin, Tsung-Yi, et al. “Focal Loss for Dense Object Detection.” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
  5. Simonyan, Karen, et al. “Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps.” Visual Geometry Group, University of Oxford, 2014.
  6. Zhou, Bolei, et al. “Learning Deep Features for Discriminative Localization.” CVPR, 2016.
  7. Selvaraju, Ramprasaath, et al. “Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization.” Virginia Tech University, 2016.
  8. Chattopadhyay, Aditya, et al. “Grad-CAM++: Improved Visual Explanations for Deep Convolutional Networks.” Indian Institute of Technology Hyderabad, 2017.
  9. Everingham, Mark. “Visual Object Classes Challenge 2012 (VOC2012).” 2012.
  10. Bianco, Simone, et al. “Deep learning for logo recognition.” Neurocomputing, vol. 245, 2017, pp. 23-30.