Classifying Burn Depth

by Constantin Baumgartner, Richard Lipkin, and Dr. Peter Grossman, MD


The American Burn Association reports that roughly 485,000 patients receive hospital and emergency room treatment for burns each year [1]. Proper burn treatment depends primarily on the depth of tissue damage. Although accurate burn-depth assessment is the first and arguably most important step in a patient’s recovery, burn depth is still misdiagnosed 20%–40% of the time, even among experienced practitioners [2].

We investigate the feasibility of supplementing the burn diagnosis process with a computer vision system that classifies burn depth from standard 3-channel color images of the burns. We analyze the performance of various architectures and data transforms using a web-scraped dataset of burn images. Our final model employs a pre-trained Convolutional Neural Network (CNN) architecture and the fastai library for image classification, resulting in 88.6% accuracy on the validation portion of our web-scraped dataset.

Related Work

Although medical imaging is a popular domain for computer vision and machine learning applications, burn classification remains significantly under-researched. A 2017 paper analyzing and summarizing over 300 contributions to the field of deep learning for medical imaging did not mention a single contribution focused on burn diagnosis [3]. Figure 1 shows a summary of the paper’s findings.

Figure 1. A breakdown of the papers included in the survey by year of publication, task addressed, imaging modality, and application area. The number of papers for 2017 was extrapolated from the papers published in January 2017 [3].

Current deep learning applications in medical imaging heavily use specialized and standardized image modalities, such as MRI, microscopy, or CT scans. Of the 306 papers included in the analysis, fewer than 10 used standard photographic data, and those that did use photographic data often did so as a supplement to images in a more specialized modality, such as dermoscopic images. The lack of standardized burn image repositories is a primary reason why burn depth classification is so under-researched despite its similarity to other deep learning medical imaging applications.

Outside the scope of the aforementioned deep learning analysis, a handful of studies have used machine learning to classify burn images. However, all such examples either utilized feature engineering and non-convolutional models [4] [5] [6] [7] [8], focused on segmentation rather than classification [9], or reported results without a clear description of the primary evaluation metric [10], leaving us without a 1-to-1 comparison for our research. Further, all burn classification and segmentation research we’ve found was conducted on datasets of fewer than 1,000 images. Table 1 shows a summary of the machine learning burn literature we analyzed.

| Model | Classification Task | Dataset Size | Performance |
|---|---|---|---|
| Template Matching | 3-class classification (superficial dermal, partial thickness, and full thickness burns) | 120 images | 66% accuracy |
|  | 3-class classification (superficial dermal, partial thickness, and full thickness burns) | 120 images | 75% accuracy |
|  | One-vs-all 3-class classification (superficial dermal, partial thickness, and full thickness burns) | 120 images | 90% accuracy |
| SVM (Gaussian Kernel) | 3-class classification (second, third, and fourth degree burns) | 396 images | 72% accuracy |
| SVM (Polynomial Kernel) | 3-class classification (second, third, and fourth degree burns) | 396 images | 74% accuracy |
| One-Class SVM | One-vs-all 3-class classification (second, third, and fourth degree burns) | 396 images | 78% accuracy |
|  | 4-class classification (first, second, third, and fourth degree burns) | Not reported | 3% error |
|  | Binary classification (burns that require grafts vs burns that don’t require grafts) | 94 segmented images | 80% accuracy |
| Fuzzy-ARTMAP Neural Network | 3-class classification (superficial dermal, deep dermal, and full thickness burns) | 312 segmented images | 82% accuracy |
|  | 3-class classification (burn depth) | 94 images | 66% accuracy |
|  | Binary classification (burns that require grafts vs burns that don’t require grafts) | 94 images | 84% accuracy |
|  | Binary segmentation (burn vs no-burn) | 929 segmented images | 0.85 pixel accuracy, 0.67 IoU |
|  | 4-class segmentation (superficial, partial thickness, full thickness, and unburnt) | 929 segmented images | 0.60 pixel accuracy, 0.37 IoU |

Table 1. Summary of burn literature performance [4] [5] [6] [7] [8] [9] [10].
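
The segmentation rows in Table 1 report pixel accuracy and IoU (intersection over union). As a minimal sketch of how these two metrics are commonly computed for a binary burn/no-burn mask, assuming masks are given as nested lists of 0s and 1s (the function names and toy masks are illustrative and do not come from the cited work):

```python
def pixel_accuracy(pred, target):
    """Fraction of pixels whose predicted label matches the target label."""
    pairs = [(p, t) for prow, trow in zip(pred, target) for p, t in zip(prow, trow)]
    return sum(p == t for p, t in pairs) / len(pairs)


def iou(pred, target, positive=1):
    """Intersection over union for the `positive` class."""
    pairs = [(p, t) for prow, trow in zip(pred, target) for p, t in zip(prow, trow)]
    intersection = sum(p == positive and t == positive for p, t in pairs)
    union = sum(p == positive or t == positive for p, t in pairs)
    # If neither mask contains the class, treat the masks as in perfect agreement.
    return intersection / union if union else 1.0


# Toy 3x3 masks (hypothetical): 1 = burn pixel, 0 = background.
target = [[0, 1, 1],
          [0, 1, 1],
          [0, 0, 0]]
pred = [[0, 1, 1],
        [0, 0, 1],
        [0, 0, 1]]

print(pixel_accuracy(pred, target))  # 7 of 9 pixels agree
print(iou(pred, target))             # 3 shared burn pixels out of 5 in the union
```

In this toy case pixel accuracy is higher than IoU because correctly labeled background pixels count toward the former but not the latter, the same direction as the gap between the two numbers reported in the table.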

Web-scraped Dataset

To analyze the feasibility and limitations of burn classification, we developed a web-scraped dataset of first-, second-, and third-degree burn images. To keep our model focused on the burn region and ignore the underlying physical characteristics of the patient, such as body part, skin tone, and background, we included a collection of random body parts without burns as a ‘no-burn’ class in our dataset. The addition of the ‘no-burn’ images has a beneficial regularization effect on the learning process. A summary of the dataset and some example images are shown in Table 2 and Figure 2, respectively.

| Class (alternate label) | Training Data | Validation Data |
|---|---|---|
| No burn (none) |  |  |
| First Degree (1) |  |  |
| Second Degree (2) |  |  |
| Third Degree (3) |  |  |

Table 2. Summary of web-scraped dataset.

Figure 2. Web-scraped dataset examples.

Although some of the images had reliable labels, we had to manually label a significant portion of the data. Because of the ambiguity among different burn-depth guidelines and our limited experience with burn data, there is a high likelihood that there are some mislabeled images among the first-, second-, and third-degree burn images, especially considering that experienced burn surgeons still misdiagnose burn depth 20%–40% of the time. However, this analysis is only a starting point for our work with burn classification: this work serves as a proof-of-concept to demonstrate the ability of a well-trained CNN model to accurately classify burn depth, even with limited data. The final system for deployment in commercial, medical settings will be trained on significantly more images, which will be labeled by individuals with medical expertise.

Projections

We used a projection platform to visualize our data and analyze some of the dataset’s characteristics according to the ResNet34 2-dimensional projections. Although some of the projections focus on lower-level features, like color or shape, the platform produces some projections that show clear delineations between groups of similar images without much training and with only a limited number of labels. Distinct groupings of first-degree burns, second-degree burns, third-degree burns, and limbs can be seen in Figure 3. This bodes well for our feasibility question: Can we classify burn depth using simple, 3-channel color images of burns?

Figure 3. Projection demonstrating clustering ability.

Training Procedure

We split our data into training and validation partitions using a 70/30 ratio and tested three different pre-trained architectures (ResNet10, ResNet34, and ResNet101) as well as several data transforms and parameters. We followed the standard transfer learning process: we first fine-tuned only the head of the model, then unfroze the remaining layers and trained the weights of the entire model with a lower learning rate. Ultimately, it was very difficult to find transform settings that produced higher validation accuracy or faster training time than the default fastai transform values, with the exception of adding vertical flipping because of the top-down orientation of the medical images.
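
The 70/30 split described above can be sketched as follows. This is a minimal illustration, not the code used for the experiments: it assumes a stratified split (each class is divided 70/30 separately), which the text does not specify, and the file names, the `stratified_split` helper, and the seed are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(samples, train_frac=0.7, seed=0):
    """Split (path, label) pairs into train/validation lists, class by class."""
    by_label = defaultdict(list)
    for path, label in samples:
        by_label[label].append((path, label))

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, valid = [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        cut = round(len(items) * train_frac)
        train.extend(items[:cut])
        valid.extend(items[cut:])
    return train, valid


# Hypothetical file list: 10 images for each of the four classes.
samples = [(f"{label}/img_{i}.jpg", label)
           for label in ("none", "1", "2", "3")
           for i in range(10)]

train, valid = stratified_split(samples)
print(len(train), len(valid))  # 28 12
```

Splitting per class keeps the class proportions the same in both partitions, which matters when a dataset is small and some classes have few images.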


We achieved a maximum accuracy of 88.6% on the validation data. Our best results were achieved with the ResNet152 architecture pre-trained on ImageNet. DenseNet201 was a close second with 86.4% accuracy. We will revisit different architectures as we continue to develop our dataset and obtain access to a larger dataset of burn images.

We evaluated the model’s accuracy only on first-, second-, and third-degree burn images. Including the accuracy of the ‘no-burn’ class positively skews the results because the model easily classifies non-burn images. The accuracy of the model when including the ‘no-burn’ class is 89.9%, while the accuracy of the model when only including first-, second-, and third-degree burns is 88.6% (confusion matrix: Figure 4).
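
To illustrate why including the ‘no-burn’ class inflates the headline number, the sketch below computes accuracy twice from the same predictions, once over all images and once over burn images only. The labels and predictions are invented toy data, and `accuracy` is a hypothetical helper, not part of fastai.

```python
def accuracy(pairs, exclude=()):
    """Accuracy over (true, predicted) pairs, skipping true labels in `exclude`."""
    kept = [(t, p) for t, p in pairs if t not in exclude]
    return sum(t == p for t, p in kept) / len(kept)


# Toy (true label, predicted label) pairs; values are invented for illustration.
pairs = [
    ("none", "none"), ("none", "none"), ("none", "none"), ("none", "none"),
    ("1", "1"), ("1", "3"),
    ("2", "2"), ("2", "2"),
    ("3", "3"), ("3", "1"),
]

print(accuracy(pairs))                     # 0.8
print(accuracy(pairs, exclude=("none",)))  # 0.6666666666666666
```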

Figure 4. Confusion matrix for the validation data excluding the none class.

The highest misclassification rates are between first- and third-degree burns. This may be due to the high prevalence of blisters among second-degree burns. The blisters act as a consistent physical characteristic, making it easier for the model to distinguish second-degree burns from first- and third-degree burns. Additionally, many first-degree burns are sunburns, which cause simple, minor irritation of the epidermis. In contrast, third-degree burns involve deeper tissue penetration and are often characterized by skin discoloration. We expect that the consistent physical characteristics of first- and second-degree burns make them easier to classify than third-degree burns. This assumption is reflected in our results, as indicated by the confusion matrix.
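
A confusion matrix like the one in Figure 4 can be tallied directly from (true, predicted) label pairs. The sketch below uses invented toy predictions and hypothetical helper names; it only shows how the most frequently confused pair of classes would be read off such a matrix.

```python
from collections import Counter
from itertools import combinations


def confusion_matrix(pairs, labels):
    """matrix[i][j] = number of images with true labels[i] predicted as labels[j]."""
    counts = Counter(pairs)
    return [[counts[(t, p)] for p in labels] for t in labels]


def most_confused(matrix, labels):
    """Return the unordered class pair with the most mutual misclassifications."""
    def errors(i, j):
        return matrix[i][j] + matrix[j][i]

    i, j = max(combinations(range(len(labels)), 2), key=lambda ij: errors(*ij))
    return labels[i], labels[j], errors(i, j)


labels = ["1", "2", "3"]
# Toy (true, predicted) pairs; the counts are invented for illustration.
pairs = ([("1", "1")] * 8 + [("1", "3")] * 2 +
         [("2", "2")] * 9 + [("2", "1")] * 1 +
         [("3", "3")] * 7 + [("3", "1")] * 2 + [("3", "2")] * 1)

matrix = confusion_matrix(pairs, labels)
for label, row in zip(labels, matrix):
    print(label, row)
print(most_confused(matrix, labels))  # ('1', '3', 4)
```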

Because of the lack of continuity in terms of datasets, models, and classifications across various works on burn classification, direct comparisons with our results would be difficult to make. However, it is encouraging that in terms of raw percentage accuracy, our model’s performance already rivals that of experienced burn surgeons.

In addition to accurate burn depth classification, we are interested in the model’s ability to identify burnt skin. We have used Class Activation Maps (CAMs) to demonstrate our model’s burn localization ability. Although the model was not able to correctly localize the region of interest in all burn images, especially misclassified images, the initial results look promising for future work. Example CAMs for correctly classified first-, second-, and third-degree burns are shown in Figure 5.
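
For readers unfamiliar with CAMs, the sketch below shows the core computation under one assumption: the original CAM formulation, in which the map for a class is the sum of the last convolutional layer's feature maps, each weighted by that class's weight in the final linear layer. The article does not say which CAM variant was used, and the feature maps and weights here are invented toy values.

```python
def class_activation_map(feature_maps, class_weights):
    """Weighted sum of K feature maps (each H x W) using one class's K weights."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for fmap, weight in zip(feature_maps, class_weights):
        for y in range(height):
            for x in range(width):
                cam[y][x] += weight * fmap[y][x]
    return cam


def normalize(cam):
    """Rescale a map to [0, 1] so it can be overlaid on the image as a heatmap."""
    flat = [v for row in cam for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0] * len(row) for row in cam]
    return [[(v - lo) / (hi - lo) for v in row] for row in cam]


# Toy example: two 2x2 feature maps from the last conv layer (invented values).
feature_maps = [
    [[1.0, 0.0],
     [0.0, 0.0]],
    [[0.0, 0.0],
     [0.0, 2.0]],
]
class_weights = [0.5, 1.0]  # hypothetical weights for one burn class

cam = normalize(class_activation_map(feature_maps, class_weights))
print(cam)  # [[0.25, 0.0], [0.0, 1.0]]
```

In practice the low-resolution map is upsampled to the input image size before being overlaid; that step is omitted here.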

Figure 5. Class Activation Maps for first-, second-, and third-degree burns.

Next Steps

We are currently working with a leading burn surgeon and burn center in the US to better understand the characteristics of various burns and the scope of the burn treatment process. This information will be used to adjust our model’s parameters to bring our classifications more in line with the medical community’s burn treatments and procedures. We are also developing a dataset comprising thousands of burn images that will be used to train a commercial model. Through a series of feedback iterations with the burn center, we are refining the scope of the dataset and curating images of first-, second-, third-, and fourth-degree burns. Finally, we are creating a series of gateway models that will be used to validate images before they are presented to the burn classification model. This will prevent the model from misclassifying non-burn images.
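
The gateway idea can be sketched as a small pipeline in which an image reaches the burn classifier only after every gate accepts it. Everything here is hypothetical: the `predict_with_gates` function, the stand-in gate and classifier, and the dictionary-based "images" are placeholders, since the gateway models themselves are still under development.

```python
from typing import Callable, Optional, Sequence


def predict_with_gates(image,
                       gates: Sequence[Callable[[object], bool]],
                       classifier: Callable[[object], str]) -> Optional[str]:
    """Run the classifier only if every gate accepts the image; otherwise return None."""
    for gate in gates:
        if not gate(image):
            return None  # rejected before reaching the burn classifier
    return classifier(image)


# Stand-ins for trained models (hypothetical). Real gates would be learned models.
def contains_skin(image) -> bool:
    return image.get("has_skin", False)


def burn_classifier(image) -> str:
    return image["depth"]


accepted = predict_with_gates({"has_skin": True, "depth": "2"}, [contains_skin], burn_classifier)
rejected = predict_with_gates({"has_skin": False}, [contains_skin], burn_classifier)
print(accepted)  # 2
print(rejected)  # None
```

Returning `None` for rejected images lets the caller decide how to handle them, for example by asking the user for a new photo rather than reporting a burn depth.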


Our results indicate that it is possible to build a computer vision system to classify burns and augment the burn treatment process. We have used a pre-trained model to build an accurate burn classifier that has the ability to localize the burn region. Our next steps involve curating a larger, more accurate burn dataset, developing models to classify burns appropriately according to medical standards, and building a prediction pipeline that includes gateway models to prevent misclassification of non-burn images.


  1. American Burn Association. “Burn Incident Fact Sheet.”
  2. Pape, Sarah, et al. “An Audit of the Use of Laser Doppler Imaging (LDI) in the Assessment of Burns of Intermediate Depth.” Burns, vol. 27, no. 3, 2001, pp. 233-239. PMID: 11311516.
  3. Litjens, Geert, et al. “A Survey on Deep Learning in Medical Image Analysis.” Medical Image Analysis, vol. 42, December 2017, pp. 60-88.
  4. Suvarna, Malini, et al. “Classification Methods of Skin Burn Images.” International Journal of Computer Science & Information Technology, vol. 5, no. 1, 2013, pp. 109-118.
  5. Hai, Son Tran, et al. “Real Time Burning Image Classification Using Support Vector Machine.” EAI Endorsed Transactions on Context-Aware Systems and Applications, vol. 4, no. 12, 2017, e4.
  6. Serrano, Carmen, et al. “Features Identification for Automatic Burn Classification.” Burns, vol. 41, no. 8, 2015, pp. 1883-1890. PMID: 26188898.
  7. Acha, Begona, et al. “Segmentation and Classification of Burn Images by Color and Texture Information.” Journal of Biomedical Optics, vol. 10, no. 3, 2005, 034014.
  8. Acha, Begona, et al. “Burn Depth Analysis Using Multidimensional Scaling Applied to Psychophysical Experimental Data.” IEEE Transactions on Medical Imaging, vol. 32, no. 6, 2013, pp. 1111-1120.
  9. Despo, Orion, et al. “BURNED: Towards Efficient and Accurate Burn Prognosis Using Deep Learning.” Stanford University, 2017.
  10. Hai, Son Tran, et al. “The Degree of Skin Burns Images Recognition using Convolutional Neural Network.” Indian Journal of Science and Technology, vol. 9, no. 45, 2016, pp. 1-6. DOI: 10.17485/ijst/2016/v9i45/106772.