by David Samuel, Naveen Kumar, Atin Mathur, Arshak Navruzyan
Visual attribute search can greatly improve the user experience and SEO for home listing and travel websites. Although Zillow, Redfin, Airbnb, and TripAdvisor already have some metadata about the amenities of a property, they can expand searchable attributes by analyzing the property images using vision models.
In this post, we share our initial approach towards a few-shot model for predicting property attributes like view, kitchen island, pool, high ceilings, hardwood floors, fireplace, etc. Since these attributes are often room and context-dependent, we start with an accurate classification model to group our images into interior and exterior settings of a property.
While training our initial room-type model, we noticed that some of these rich attributes are easily separable in platform.ai.
Previous work has focused on using images to improve price estimation. However, the incremental gain from adding image features to pricing models has been minimal: only about a 2.3% improvement over using a handful of conventional attributes like location and property size. While pricing data for building these models has been readily available, there is a scarcity of datasets for predicting rich attributes like view, kitchen island, swimming pool, high ceilings, hardwood floors, fireplace, etc.
Our initial dataset, previously used in price estimation, consists of 146,791 images and seven classes: living room, dining room, bedroom, bathroom, kitchen, interior, and exterior.
Fig 1. Class distribution of real estate images
Bathroom is the most underrepresented class, with about half as many images as any other class. We addressed this class imbalance by oversampling the data, using the default image augmentations from fastai’s vision.transform module.
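The post does not spell out the oversampling mechanics, so the following is only a minimal sketch of one common approach: duplicating minority-class samples until every class matches the largest one. The `oversample` helper and the file names are hypothetical. In a real pipeline, each duplicate would typically receive a different random augmentation at load time, so the copies are not pixel-identical.

```python
import random
from collections import Counter

def oversample(items, seed=0):
    """Duplicate minority-class items until every class matches the largest one.

    `items` is a list of (path, label) pairs (hypothetical names).
    """
    rng = random.Random(seed)
    by_label = {}
    for path, label in items:
        by_label.setdefault(label, []).append(path)

    # Every class is grown to the size of the largest class.
    target = max(len(paths) for paths in by_label.values())

    balanced = []
    for label, paths in by_label.items():
        # Sample with replacement to fill the gap for smaller classes.
        extra = [rng.choice(paths) for _ in range(target - len(paths))]
        balanced.extend((p, label) for p in paths + extra)

    rng.shuffle(balanced)
    return balanced

# Toy example: bathroom has half as many images as kitchen.
items = [(f"kitchen_{i}.jpg", "kitchen") for i in range(6)] + \
        [(f"bathroom_{i}.jpg", "bathroom") for i in range(3)]
balanced = oversample(items)
counts = Counter(label for _, label in balanced)
```

After balancing, both toy classes contain the same number of samples.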
Fig 2. Example image augmentation of the classes: bathroom, dining room, kitchen, living room, bedroom, interior, and exterior
The images were pre-processed using fastai’s built-in transforms. The data were split randomly into 60% train, 20% validation, and 20% test sets.
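A random 60/20/20 split can be sketched as below. This is an illustration under assumptions, not the authors' code: the `split_60_20_20` helper, the seed, and the use of integer ids in place of image paths are all hypothetical.

```python
import random

def split_60_20_20(items, seed=42):
    """Shuffle `items` and cut them into 60% train, 20% validation, 20% test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)

    n_train = int(0.6 * len(shuffled))
    n_valid = int(0.2 * len(shuffled))

    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # the remainder, roughly 20%
    return train, valid, test

# Integer ids stand in for the 146,791 image paths.
train, valid, test = split_60_20_20(range(146791))
```

Because the test set takes whatever remains after the first two cuts, no image is dropped or assigned twice.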
The model was initialized with pre-trained ResNet-34 weights. The network’s custom head was trained for 3 epochs, after which the entire network was unfrozen and fine-tuned for another 10 epochs using discriminative learning rates. Fine-tuning improved the model fit, achieving an overall test set accuracy of 97%.
Increasing the network capacity to a ResNet-50 raised the final accuracy to 98%, a significant improvement over the 91% accuracy of the previous results.
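Discriminative learning rates assign smaller learning rates to early layers and larger ones to the head. The sketch below shows one common way to do this, spacing the rates geometrically across layer groups. The function name, the three-group layout, and the bounds `1e-6` and `1e-4` are illustrative assumptions; the post does not state the values that were used.

```python
def discriminative_lrs(lowest, highest, n_groups):
    """Spread learning rates geometrically across layer groups.

    The first group (earliest layers) gets `lowest`, the last group
    (the head) gets `highest`.
    """
    if n_groups == 1:
        return [highest]
    # Constant multiplicative step between consecutive groups.
    ratio = (highest / lowest) ** (1 / (n_groups - 1))
    return [lowest * ratio ** i for i in range(n_groups)]

# Hypothetical bounds for a network split into three layer groups.
lrs = discriminative_lrs(1e-6, 1e-4, 3)
```

With these bounds, the middle group ends up with a rate about ten times that of the earliest layers and about a tenth of the head's rate.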
Building a Rich Attribute Dataset
We constructed a rich attribute dataset by crawling property listing websites. The crawler captured both images and the attributes of interest. In total 18,790 listings were obtained along with 350,000 images.
Feature Class Distribution
Our web scraper captured unstructured HTML and extracted the rich attributes contained in the listings’ details table.
Fig 3. Rich attribute distribution in the crawled data
The final dataset consists of 18,790 individual listings, with an average of 21 images each. We identified several features that are visible in the photos, such as pools, patios, kitchen islands, and fireplaces. Nearly half of the listings in our dataset have a pool or a patio, while only about 25 listings have wine cellars. Furthermore, the same attribute can appear in different kinds of spaces; modern wine cellars, for example, tend to be above ground.
Fig 4. Example feature from listings dataset: wine cellar
We uploaded our model and a sample of 20,000 images from our dataset to platform.ai in order to compare its performance against the pre-built ImageNet model. Our model forms neat clusters of similar attributes of interest, such as fireplaces, pools, and kitchen islands, that are easily separable by eye. In comparison, the ImageNet model tends to form wider clusters that mix dissimilar attributes.
Fig 5. Our model’s projection (above) vs. ImageNet projection (below)
Using the projections as visual aids, clusters of interest were highlighted and selectively filtered using platform.ai. The zoomed-in views of our model's projection show three rich features that we identified through our model: fireplace, kitchen island, and pool. Compared with the ImageNet projection, ours shows more clusters that are bound closely to rich attributes rather than to the labeled room classes.
Fig 6. Zoomed-in projections show a fireplace cluster
Fig 7. Zoomed-in projections show a kitchen island cluster
Fig 8. Zoomed-in projections and selected images from our model show an outdoor swimming pool cluster
After downloading the projections, we evaluated a clustering solution by comparing our model’s silhouette scores against those of the ImageNet model. With K-means clustering at k=5, a t-test shows that our model’s silhouette score is significantly higher. Our model therefore groups similar images into clusters more consistently than the ImageNet-pretrained ResNet.
Fig 9. Similarity “Silhouette” scores for k=5 K-Means clusters
Table I. Silhouette score summary statistics
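For readers unfamiliar with the silhouette score used in this comparison, here is a minimal pure-Python sketch of how it is computed for a given cluster assignment. The toy 2-D points and labels are made up for illustration; the real evaluation would run on the downloaded projections with K-means labels. For each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to any other cluster, and the score is (b - a) / max(a, b).

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_dist(i, members):
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        a = mean_dist(i, own)  # cohesion: distance within own cluster
        b = min(mean_dist(i, members)  # separation: nearest other cluster
                for other, members in clusters.items() if other != label)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

# Two well-separated pairs of points (made-up data).
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
tight = silhouette_score(points, [0, 0, 1, 1])  # labels match the geometry
mixed = silhouette_score(points, [0, 1, 0, 1])  # labels cut across it
```

Scores close to 1 indicate compact, well-separated clusters; scores below 0 indicate points that sit closer to another cluster than to their own.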
Finding rich features using Approximate Nearest Neighbors
As previously demonstrated, we can find similar rich attributes in images using nearest neighbors. Instead of exact nearest neighbor search, we use approximate nearest neighbor search (annoy), as it gives good results in much less time. Instead of the pixel space, we use the activations of the second-to-last layer of our trained ResNet-34 model as the search space for annoy. In fastai, the activations of any layer of a model can be accessed using its hooks callbacks. We then create an annoy index from these activations for all images in the training data, which is a one-time process.
To confirm that our ResNet model has learned the rich attributes from the images and can classify rich labels in a given image, we perform an experiment that uses annoy search over the model's activations to predict the rich attributes.
In this experiment, we consider 3 rich attributes: Kitchen Island, Hot Tub, and Fireplace. Using the metadata crawled from the listing site along with the images, we create new training data with rich label annotations. The next step is to create an annoy index (n_trees=500). We then use this index to find the nearest neighbors of a test image and predict one of the three rich labels based on the majority class among those neighbors. In doing so, we get an overall accuracy of 97.48% for K=5. This indicates that our ResNet model has learned the rich attributes and can predict them across room types.
Table II. Comparison of accuracy of different rich attributes
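The majority-vote prediction in this experiment can be sketched as follows. To keep the example self-contained, it uses exact brute-force search as a stand-in for annoy's approximate search, and made-up 3-dimensional vectors in place of the real layer activations. The `knn_predict` helper and all data are hypothetical.

```python
import math
from collections import Counter

def knn_predict(index, query, k=5):
    """Predict a label by majority vote over the k nearest indexed vectors.

    `index` is a list of (vector, label) pairs. Exact search is used here
    as a stand-in for the approximate search that annoy would perform.
    """
    neighbours = sorted(index, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy "activations": each attribute occupies its own corner of the space.
index = [
    ((1.0, 0.0, 0.0), "fireplace"),
    ((0.9, 0.1, 0.0), "fireplace"),
    ((0.9, 0.0, 0.1), "fireplace"),
    ((0.0, 1.0, 0.0), "kitchen island"),
    ((0.1, 0.9, 0.0), "kitchen island"),
    ((0.0, 0.9, 0.1), "kitchen island"),
    ((0.0, 0.0, 1.0), "hot tub"),
    ((0.1, 0.0, 0.9), "hot tub"),
    ((0.0, 0.1, 0.9), "hot tub"),
]

pred_fireplace = knn_predict(index, (0.85, 0.1, 0.05), k=5)
pred_hot_tub = knn_predict(index, (0.05, 0.1, 0.85), k=3)
```

With an approximate index, the neighbor list may occasionally differ from the exact one, but the voting step stays the same.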
Applying modern machine learning practices, we have developed a computer vision model that predicts not only room classes, but also the deeper attributes present in the homes that we live in. It performed better than the ImageNet model by clustering our nested attributes closer together, allowing visually separable groups to be extracted and labeled. An accurate attribute search model could be an essential tool for finding the right home or rental.
We would like to thank Arshak Navruzyan for his mentor support and guidance during this project. We would also like to thank the fastai team for a convenient deep learning library.
- Poursaeed, Omid et al. Vision-based real estate price estimation. Machine Vision and Applications 29 (2018): 667-676.
- Santoro, Adam et al. A simple neural network module for relational reasoning. NIPS (2017).
- He, Kaiming et al. Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016): 770-778.
- Howard, Jeremy, et al. fastai library. 2019.
- Clarke, Adrian, et al. Optimizing hyperparams for image datasets in fastai. 2019.