Building a Diabetic Retinopathy Prediction Application using Azure Machine Learning

This post is co-authored by Anusua Trivedi, Data Scientist, Microsoft; Patrick Buehler, Data Scientist, Microsoft; Dr. Sunil Gupta, Founder, Intelligent Retinal Imaging System (IRIS); and Jocelyn Desbiens, Researcher, IRIS.  


Diabetic Retinopathy (DR) is the most common cause of blindness in the working population of the United States and Europe. The World Health Organization (WHO) predicts that the number of patients with diabetes will increase to 366 million in 2030. For patients with diabetes, early diagnosis and treatment have been shown to prevent visual loss and blindness. Automated grading of DR has potential benefits such as:

  • Increasing the efficiency, reproducibility, and coverage of screening programs.

  • Reducing barriers to access.

  • Improving patient outcomes by providing early detection and treatment.

To maximize the clinical utility of automated grading, an accurate algorithm to detect referable DR is needed.

Machine Learning on DR images

Machine Learning has been used in a variety of medical image classification tasks including automated classification of DR. However, much of the work has focused on feature extraction engineering which involves computing image features specified by experts, resulting in algorithms built to detect specific lesions or predict the presence of many types of DR severity. Deep Learning is a machine learning technique that avoids such feature engineering by learning the most predictive features directly from the images given a large data set of tagged examples. Identifying candidate regions in medical images is of uttermost importance since it provides intuitive visual hints for doctors and patients of how the diagnosis is inferred. Recently, advances in Deep Learning have dramatically improved the performance of DR detection.

Partner Collaboration

In collaboration with our partner, Intelligent Retinal Imaging Systems (IRIS), the first step of the analysis was defined as experimenting from image-level pre-processed annotations of DR which highlights the suspicious regions. This can help a physician, examining an image, by selecting areas showing high probability of being lesions. These annotated images are passed on to a Convolutional Neural Network (CNN) which in turn predicts their respective DR severity. In this approach, a combination of extracted features, targeting specific characteristics like micro-aneurysms, hemorrhages, and exudates for DR detection were used, with the robust potential of Deep Learning systems to yield more precise results.

Clinical Dataset and Image Grading

IRIS provides all the images for this collaboration. The initial image population consists of a collection of fundus images coming from 250,000 adult diabetic patient exams (87,401 unique patients from various ethnic origins and U.S. locations) with 33,428 patients suffering from a sight threatening disease. Eye exams were done during normal routine checkups. Afterward, all images were graded by ophthalmologists for the presence of diabetic retinopathy and diabetic macular edema. DR severity was graded according to the International Clinical Diabetic Retinopathy scale. The severity breakdown per eye for the original population was: 62% Normal, 24% Mild, 8% Moderate, 3% Severe, and 3% Proliferative. To effectively create a valid dataset, images have been randomly chosen in the database based on their quality scores prior to being added to the process to create a stratified random sample. An image quality process which provides a score for each image passed through it was employed to sort images. This process was run on every processed image and a score is provided to show the expected gradeability of the image based on feature detection. Images of excellent, good, and adequate quality was considered gradable. The graders for the training and validation sets were all US-licensed ophthalmologists.

The following model/application is intended for research and development use only and is not intended for use in clinical diagnosis or clinical decision-making or for any other clinical use and the performance of the model/application for clinical use has not been established.

Microsoft AI Platform for Training Models

To build the AI models, the following tools and platform were used:

  • DSVM as the compute environment with an NVIDIA Tesla K80 GPU, CUDA and cuDNN libraries.

  • All the experiments were run on Microsoft Azure Machine Learning (AML). Azure Machine Learning is a cross-platform application, which makes the modelling and model deployment process much faster versus what was possible before. Three deep learning models were created using open-source packages supported in AML. Cognitive Toolkit (CNTK) and Keras with Tensorflow backend were used to build the models. The software Pip installed all the dependencies in the AML environment.

Training the AI Model

Two-Step Paradigm

The two-step (i.e., feature extraction à prediction) automated DR detection paradigm was predominant in the field of DR detection for many years. Given a color fundus image, this type of solutions basically extracted visual features from the image based on anatomical structures like blood vasculature, fovea and optic disc using standard image processing methods such as Wavelet transform, Gabor iterations, Local Binary patterns, etc. With the extracted features, an object classification algorithm like Random Forests or AdaBoost are usually applied to identify and localize DR lesions. This approach forms the first phase of our work, producing feature-annotated images. For feature extraction several steps were followed:

Radial Symmetry Transformation: Radial symmetry maps are generally used to detect interest points in images that correspond to object centers. Many works on radial symmetry maps exist but the most commonly used is the one by Loy et. al. With this approach a symmetry score is calculated from votes of one pixel to surrounding pixels. The transform is calculated in one or more radii n, the value of the transform at radius n indicates the contribution to radial symmetry of the gradients a distance n away from each point (see Figure 2a). Round-shaped objects present in the image will appear as local optima in the result map.

Entropy Transformation: In information theory, entropy is used to quantify the amount of information. The entropy reflects the information content of symbols independent of any probability models. Image analysis takes the concept of entropy in the sense of information theory, where entropy is used to quantify the minimum descriptive complexity of a random variable. Because entropy can provide a good level of information to describe a given image, we used the Tsallis’ entropy of the distribution of the symmetry image and obtained an appropriate partition of the source image giving a probability map of the potential lesions (see Figure 2b).

DR Candidate Lesion Detection: The most common signs of DR are dark lesions (micro-aneurysms, hemorrhages) and bright lesions (exudates, drusens and cotton wool spots). The presence of dark/light lesions is indicative of early stage DR. Progression of DR also causes macular edema, neo-vascularization and in later stages, retinal detachment. Many methods were proposed in previous works but none of them use the radial symmetry map to detect either dark or bright lesions.

Bright and Dark Candidate Localization: Micro-aneurysms are focal dilatation of retinal capillaries and appear as red dots in retinal fundus images. Wherever capillary walls are weak inside the retina, dot hemorrhage lesions are found which are slightly larger than micro-aneurysms. On rupturing it will cause intra-retinal hemorrhages. The following steps are followed to localize micro-aneurysms and hemorrhages on a given color fundus image:

  • Compute the symmetry image with the respective radii for the type of lesion to be detected.

  • From the symmetry image retrieve the set of local minima.

  • Eliminate from the set of local minima the candidates located too far away from the macula, i.e., candidate lesions being at a distance greater than two disk diameters from the fovea.

  • Eliminate from the set of local minima the candidates located too close from each other.

  • Use local entropy image to regroup adjacent candidates (see Figure 2b).

  • Annotate image with findings (see Figure 1b).

Bright lesions or intra-retinal lipid exudates result from the breakdown of blood retinal barrier. The same steps as the ones above are used to detect the bright lesions (see Figure 1b).

Figure 1: Source and annotated images.**a) An image of a proliferative DR retina, and b) Bright and dark lesion top normalized image.

Figure 2: Image transformations.**a) Radial symmetry function, and b) Tsallis entropy map applied to image in Figure 1a.

CNN Model

IRIS** **worked on developing the Two-Step Paradigm approach and creating deeply annotated images. A CNN was trained on these images to classify presence of DR in the images.

In recent works, CNNs have achieved superior performance in many visual tasks, such as object classification and detection. However, the opaque learning strategy makes CNN representations a black box. Except for the network output, it is difficult for people to figure out the rationale behind some CNN predictions. In recent years, researchers have realized that a high model interpretability is of great values in both theory and practice and developed variable success models with human-readable knowledge representations. The Inception-Resnet (see Figure 3) was adopted as the second phase of basic architecture.

Inception is an architectural element developed to allow for some scale invariance in object recognition. If an object in an image is too small, like a dot hemorrhage for instance, a CNN may not be able to detect it because the network iterations were trained to recognize larger versions of hemorrhages. Inception can convolve through the image with various iteration sizes, all at the same computational step in the network. To use the pre-trained model over fundus images, it was decided to use transfer learning and fine-tuning. An ImageNet pre-trained Resnet-50 model was used, and the Inception’s bottom layers were frozen to prevent their weights from getting updated during training process while the remaining top layers were trained with the pre-processed fundus images.

Implementation Details

Three types of images can be produced by our lesion classification process: Original, Enhanced, and Annotated images. A differential experiment was conducted to check out which of these types yield the best performance when given as inputs to a CNN. The experiment was run over three equivalent datasets of 1,949 high-quality images and validated by running the model over a set of 539 images. By far, the annotated image type outperforms original and enhanced image types by 20 points (see Figure 4a). The original pre-processed annotated images were only used for training the network once. Afterwards, real-time data augmentation was used throughout training to improve the localization ability of the network. During every epoch each image was randomly augmented with:

  • Random rotation of -45° to 45°.

  • Random horizontal and vertical flips.

  • Random horizontal and vertical shifts.

Figure 3: Inception-v4 network.

Figure 4: Prediction accuracy and generalization error.a) Prediction accuracy, and b) Scaling generalization error.

Experimental Results

Hestness et. al. presented empirical results showing how increasing training data size results in scaling of generalization error and required model size to the training set for application domains (see Figure 4b) reproduced from the authors’ paper). These relationships hold across various model architectures, optimizers, and loss metrics. From the experiments that were conducted using CNN, this expression was found to stand for the type of fundus images that were being processed. Since accuracy plateaus after 5,000 images, it was decided that a sample size of 7,000 would be more than sufficient to test the CNN with an estimated loss value of around 0.14. The image sample was split in two parts of 5,000 images for the training set and 2,000 for the validation set. Five experiments were run with different training/validation datasets (see Figure 5).

Figure 5: Power law and loss decay.


Figure 6 summarizes the performance of the algorithm in detecting referable DR (DR/No DR – binary classification) on the validation data sets for fully gradable images. The algorithm achieved an average accuracy of 97.1% with an average sensitivity of 96.6% and an average specificity of 98.0% for an AUC value of 99.5%. As predicted, the average loss was equal to 0.16781.

  • For No DR vs. Non-Sight Threatening DR the accuracy was 89.5% with sensitivity of 87.5% and specificity of 91.8%.

  • For No DR vs. Sight Threatening DR achieved an accuracy of 97.9% with sensitivity of 97.9% and specificity of 99.5%.

  • For Non-Sight Threatening DR vs. Sight Threatening DR the accuracy was 79.7% with sensitivity of 77.7% and specificity of 94.7%.

On the other hand, on the publicly available Messidor-2 dataset, our approach achieved a sensitivity of 92.9% and a specificity of 98.9% with an AUC of 99.2%.

Prior Study Performance Comparison

CNN-automated DR evaluation has been previously studied by other groups. Abràmoff MD et. al. reported a sensitivity of 96.8% at a specificity of 87% for detecting referable DR on Messidor-2 dataset. Gargeya R et. al. reported a sensitivity of 93% at a specificity of 87% on the same dataset. Gulshan V et. al. reported a sensitivity of 87% at a specificity of 98.5%. (see comparison table in Figure 7).**

Figure 6: ROC and Precision/Recall curves.

Figure 7: Performance comparison.

Operationalize Models on using Azure Machine Learning

Operationalization is the process of publishing models and code as web services and the consumption of these services to produce business results. The AI models can be deployed to local or Azure Container Service (ACS) cluster. You can scale the service to more containers.

Azure Machine Learning (AML) was used for deploying AI models in local system or on cluster. AML enable data scientists to easily operationalize their models. To operationalize an AI model using AML the command-line interface (CLI) can be leveraged to specify the required configurations using the AML operationalization module. From the AML command line:

Make sure to run az login before the environment setup step.

Make sure to run az ml env set -n [environment name] -g [resource group] to set up deployment environment

AML Web deployment solution architecture diagram is in Figure 8.

Once the deployment environment is setup, run the commands in Figure 9 to operationalize the AI model.

Figure 8. AML operationalization architecture.


Figure 9. AML commands to operationalization AI model.


In this blog post, it was shown how Azure Machine Learning was used to train and test an AI model. In this DR classification work, we took a multi-step paradigm approach, based on Deep Learning, coupled with a pattern matching pre-marking process. This unique combination of approaches has helped us achieve high accuracy for detecting referable DR. Moreover, the pre-processed and pre-digested annotated images help provide the visual diagnostic report we have been lacking from traditional Deep Learning processes.

NOTE: For privacy reasons, we cannot make the blog code and data public. But some sample code on image classification using Azure Machine Learning is available here.

Anusua, Patrick, Sunil & JocelynYou can email Anusua with questions related to this post.