Predicting Rich Attributes in Real Estate Images

credit: Redfin

by David Samuel, Naveen Kumar, Atin Mathur, Arshak Navruzyan


Visual attribute search can greatly improve the user experience and SEO for home listing and travel websites. Although Zillow, Redfin, Airbnb, and TripAdvisor already have some metadata about the amenities of a property, they can expand searchable attributes by analyzing the property images using vision models.

In this post, we share our initial approach towards a few-shot model for predicting property attributes like view, kitchen island, pool, high ceilings, hardwood floors, fireplace, etc. Since these attributes are often room and context-dependent, we start with an accurate classification model to group our images into interior and exterior settings of a property.

In the process of training our initial room-type model, we noticed that some of these rich attributes are already easily separable in the model's learned feature space.


Previous work has focused on using images to improve price estimation [1]. However, the incremental gain of adding image features to pricing models has been minimal, with only about a 2.3% improvement over using a handful of conventional attributes like location and property size. While pricing data for building these models has been readily available, there is a scarcity of datasets for predicting rich attributes like view, kitchen island, swimming pool, high ceilings, hardwood floors, fireplace, etc.

Our initial dataset, previously used in price estimation [1], consists of 146,791 images and seven classes: living room, dining room, bedroom, bathroom, kitchen, interior and exterior.


Fig 1. Class distribution of real estate images

Bathroom is the most underrepresented class, with roughly half as many images as any other class. We addressed this class imbalance using fastai's vision.transform method [4] to oversample the data with the default image augmentations.
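The post uses fastai's default augmentations for the oversampling itself; the counting logic behind a simple "balance every class up to the largest" plan can be sketched in plain Python (the class counts below are toy numbers, not the real dataset):

```python
from collections import Counter

def oversample_plan(labels):
    """Return how many extra augmented copies each class needs
    so that every class matches the largest class's image count."""
    counts = Counter(labels)
    target = max(counts.values())
    return {cls: target - n for cls, n in counts.items()}

# Toy example: bathroom has about half the images of the other classes.
labels = ["kitchen"] * 20 + ["bedroom"] * 20 + ["bathroom"] * 10
plan = oversample_plan(labels)
print(plan)  # bathroom needs 10 augmented copies, the others 0
```

Each needed copy would then be produced by applying a random augmentation (flip, small rotation, lighting jitter) to an existing image of that class.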

Fig 2. Example image augmentation of the classes: bathroom, dining room, kitchen, living room, bedroom, interior, and exterior

The images were pre-processed using fastai's built-in transforms. Data was split randomly into 60% train, 20% validation, and 20% test.
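A random 60/20/20 split over the 146,791 images can be done with a seeded shuffle; this is a minimal stdlib sketch, not the fastai data-block call the post's pipeline would actually use:

```python
import random

def split_indices(n, train=0.6, valid=0.2, seed=42):
    """Randomly split n item indices into train/validation/test index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # seeded for reproducibility
    n_train = int(n * train)
    n_valid = int(n * valid)
    return idx[:n_train], idx[n_train:n_train + n_valid], idx[n_train + n_valid:]

train_idx, valid_idx, test_idx = split_indices(146_791)
print(len(train_idx), len(valid_idx), len(test_idx))  # 88074 29358 29359
```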

The model was initialized with pre-trained ResNet-34 weights. The network's custom head was trained for 3 epochs, followed by unfreezing the entire network and fine-tuning for another 10 epochs using discriminative learning rates. Fine-tuning improved the model fit, achieving an overall test set accuracy of 97%.

Increasing the network capacity to a ResNet-50 raised final accuracy to 98%, a significant improvement over the 91% accuracy of the previous results [1].
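Discriminative learning rates give the earlier, more generic layers of the pre-trained backbone smaller update steps than the freshly trained head. A minimal sketch of the geometric spread (the `factor=2.6` spacing mirrors the convention popularized by fastai, and the values here are illustrative, not the ones used in the post):

```python
def discriminative_lrs(lr_max, n_groups, factor=2.6):
    """Spread a maximum learning rate across layer groups so that
    earlier (more generic) groups get geometrically smaller rates."""
    return [lr_max / factor ** (n_groups - 1 - i) for i in range(n_groups)]

lrs = discriminative_lrs(1e-3, 3)
print(lrs)  # earliest layer group gets the smallest rate, the head gets lr_max
```

In fastai this corresponds to passing a `slice` of learning rates to the fit call after unfreezing, so each layer group receives its own rate.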

Building a Rich Attribute Dataset

We constructed a rich attribute dataset by crawling property listing websites. The crawler captured both images and the attributes of interest. In total, 18,790 listings were obtained, along with 350,000 images.

Feature Class Distribution

Our web scraper captured unstructured HTML and extracted the rich attributes contained in the listings’ details table.


Fig 3. Rich attribute distribution in the crawled data

The final dataset consists of 18,790 individual listings, each holding an average of 21 images. We have identified several features visible in the photos, like pools, patios, kitchen islands, and fireplaces. Nearly half of the listings in our dataset have a pool or a patio, while only about 25 listings have wine cellars. Furthermore, the same attribute can appear in visually different settings; modern wine cellars, for example, tend to be above ground.


Fig 4. Example feature from listings dataset: wine cellar


We uploaded our model and a sample of 20,000 images from our dataset to a projection tool in order to compare its performance against a pre-built ImageNet model. Our model forms neat clusters, easily separable by eye, around attributes of interest like fireplaces, pools, and kitchen islands. In comparison, ImageNet tends to form wider clusters that mix dissimilar attributes.

Fig 5. Our model's projection (above) vs. ImageNet projection (below)

Using the projections as visual aids, clusters of interest were highlighted and selectively filtered. The zoomed-in views of our model's projection show three rich features identified through our model: fireplace, kitchen island, and pool. Compared with ImageNet, our model's clusters bind more closely to rich attributes rather than to the labeled room classes.


Fig 6. Zoomed in projections show a fireplace cluster


Fig 7. Zoomed in projections show a kitchen island cluster


Fig 8. Zoomed in Projections and selected images from our model show an outdoor swimming pool cluster

Cluster Analysis

After downloading our projections, we evaluated a clustering solution by comparing our model's silhouette score against ImageNet's. A t-test on silhouette scores of k=5 K-means clusters shows that our model's score is significantly greater. Thus, our model produces coherent clusters more consistently than the ImageNet-pretrained ResNet.
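The silhouette score measures how tightly each point sits in its own cluster versus the nearest other cluster: s(i) = (b − a) / max(a, b), where a is the mean distance to points in the same cluster and b is the smallest mean distance to any other cluster. A self-contained sketch on toy 2-D points (in practice one would use scikit-learn's `silhouette_score` on the projected embeddings):

```python
def mean_silhouette(points, labels):
    """Mean silhouette score over all points, computed by brute force."""
    def dist(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5

    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)

    scores = []
    for p, l in zip(points, labels):
        same = [q for q in clusters[l] if q is not p]
        if not same:          # singleton cluster: silhouette defined as 0
            scores.append(0.0)
            continue
        a = sum(dist(p, q) for q in same) / len(same)
        b = min(sum(dist(p, q) for q in qs) / len(qs)
                for k, qs in clusters.items() if k != l)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two well-separated toy clusters score close to 1.
pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
score = mean_silhouette(pts, [0, 0, 1, 1])
print(round(score, 3))
```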


Fig 9. Similarity “Silhouette” scores for k=5 K-Means clusters


Table I. Silhouette score summary statistics

Finding rich features using Approximate Nearest Neighbors

As previously demonstrated, we can find images with similar rich attributes using nearest neighbors. Instead of exact nearest-neighbor search, we use approximate nearest-neighbor search (annoy), as it gives good results in much less time. Rather than pixel space, we use the activations of the second-to-last layer of our trained ResNet-34 model as the search space for annoy. In fastai, the activations of any layer can be accessed through its hook callbacks. We then create an annoy index from these activations for all images in the training data, which is a one-time process.
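The post builds the index with annoy for speed; the underlying idea, looking up neighbors by similarity between activation vectors rather than pixels, can be illustrated with a brute-force exact search (the two-dimensional "activations" below are toy stand-ins for the real ResNet-34 features):

```python
def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def nearest(query, index, k=5):
    """Return ids of the k vectors in `index` (id -> activation vector)
    most similar to `query`. Annoy replaces this O(n) scan with an
    approximate tree-based lookup."""
    ranked = sorted(index, key=lambda i: cosine(query, index[i]), reverse=True)
    return ranked[:k]

# Toy activation index: ids 0 and 1 resemble each other, id 2 does not.
index = {0: [1.0, 0.1], 1: [0.9, 0.2], 2: [0.1, 1.0]}
print(nearest([1.0, 0.0], index, k=2))  # → [0, 1]
```

With annoy, the equivalent steps are adding each activation vector to an `AnnoyIndex` and calling `build(n_trees)` once, after which queries are served from the saved index.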

To confirm that our ResNet model has learned the rich attributes from the images and can classify rich labels in any given image, we perform an experiment that predicts the rich attributes with an annoy search over the activations of our model.

In this experiment, we consider three rich attributes: kitchen island, hot tub, and fireplace. Using the metadata crawled from the listing site along with the images, we create new training data with rich-label annotations. The next step is to create an annoy index (n_trees=500). We then use this index to find the nearest neighbors of a test image and predict one of the three rich labels from the majority class of those neighbors. In doing so, we get an overall accuracy of 97.48% for K=5. This clearly shows that our ResNet model has learned rich attributes and will be able to predict them in any given room-type image.
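The majority-vote step over the K=5 retrieved neighbors is simple enough to sketch directly (the label strings are illustrative):

```python
from collections import Counter

def predict_label(neighbor_labels):
    """Predict the rich attribute as the majority class among the
    K nearest-neighbor labels (ties broken by first occurrence)."""
    return Counter(neighbor_labels).most_common(1)[0][0]

# Labels of the K=5 neighbors retrieved from the activation index:
print(predict_label(["fireplace", "fireplace", "hot tub",
                     "fireplace", "kitchen island"]))  # → fireplace
```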


Table II. Comparison of accuracy of different rich attributes


Applying modern machine learning practices, we have developed a computer vision model that predicts not only room classes but also the deeper attributes present in the homes we live in. It performed better than ImageNet by clustering rich attributes more tightly, allowing visually separable groups to be extracted and labeled. An accurate attribute-search model could become an essential tool for finding the right home or rental.


We would like to thank Arshak Navruzyan for his mentor support and guidance during this project. We would also like to thank the fastai team for a convenient deep learning library.


  1. Poursaeed, Omid, et al. "Vision-based real estate price estimation." Machine Vision and Applications 29 (2018): 667–676.
  2. Santoro, Adam, et al. "A simple neural network module for relational reasoning." NIPS (2017).
  3. He, Kaiming, et al. "Deep Residual Learning for Image Recognition." 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016): 770–778.
  4. Howard, Jeremy, et al. fastai library. 2019.
  5. Clarke, Adrian, et al. "Optimizing hyperparams for image datasets in fastai." 2019.