Single-cell conventional Pap smear image classification using pre-trained deep neural network architectures

Abstract


Background
Cervical cancer is a disease specific to women, mainly caused by high-risk human papillomavirus (HPV), a sexually transmitted infectious agent. Worldwide, an estimated 570,000 cases and 311,000 deaths were registered in 2018 alone, and about 85% of these were from developing countries (1).
Considering the prevalence of the disease, international organizations such as the World Health Organization (WHO) have started setting new initiatives to eliminate it as a public health burden. The WHO's new strategy emphasizes the elimination of cervical cancer as a public health problem by 2030, focusing on three pillars (prevention, screening and treatment/management) in a comprehensive approach. The strategy clearly states that, to reach the stage of cervical cancer elimination, every country must achieve 90% HPV vaccination coverage among girls by 15 years of age, perform high-performance cervical cancer screening for 70% of females between the ages of 35 and 45, treat 90% of precancerous lesions and manage 90% of invasive cancer patients (2).
In the past few decades, high-income countries have implemented population-wide screening programs and shown a significant reduction in mortality and morbidity caused by cervical cancer (3,4). This experience could be a good model to extend to low- and middle-income countries. However, the lack of basic resources such as skilled health personnel and screening tools has been posing a major challenge (5,6).
The latest WHO guideline on cervical cancer screening recommends three main techniques: high-risk HPV type testing using polymerase chain reaction (PCR), visual inspection with acetic acid (VIA), and cervical cytology (7). Among the three, cervical cytology is the most common and long-established way of screening. It has been considered the standard technique, given its contribution to the reduction of incidence and mortality rates in many high-income countries worldwide (5). The popular and well-developed techniques for cervical cytology are the conventional Papanicolaou smear (CPS) and liquid-based cytology (LBC). Comparative studies focusing on the quality of CPS and LBC samples concluded that LBC is better than CPS (8,9). However, considering the economic burden, LBC is more common in high-income countries, whereas CPS is preferred in low- and middle-income countries (8).
Even though cytology techniques are effective in reducing morbidity and mortality, they suffer from the following main drawbacks: their sensitivity is suboptimal, and the interpretation of the result depends mainly on the morphological characteristics of the cytoplasm and nucleus of the cell, which requires a highly skilled cytotechnologist. Moreover, analyzing a single sample takes a considerably long time and is labor-intensive.
To bridge the aforementioned gaps of cervical cytology screening, computer vision experts have been proposing semi- or fully-automated computer-aided analysis tools, especially for CPS. They have developed systems that either classify single-cell CPS images or detect abnormal cells in full-slide CPS images. A detailed and extended review can be found in (10).
In general, as illustrated in Figure 1, three single-cell CPS image analysis pipelines have been proposed in the literature. The traditional techniques follow pipeline 1, pipeline 2, or both combined. The main difference between these two pipelines is the requirement of a segmentation stage. For instance, if the required feature vectors are attributes of the morphology or shape of an object, such as area, perimeter, thinness ratio and eccentricity, the cytoplasm or the nucleus of the cell first needs to be segmented from the rest of the image content. On the other hand, if the required features do not depend on segmented objects, such as chromatin and texture descriptors, the segmentation stage is skipped, as depicted in pipeline 2; in other words, the feature vectors are calculated directly from the pre-processed CPS images. Features calculated using these two pipelines are commonly known as hand-crafted features. Hand-crafted features give the computer vision expert the privilege of selecting and supervising the extracted feature vectors, and dimensionality reduction schemes are sometimes used to pick the right subset from a large bucket of feature vectors. Approaches that follow pipeline 1 and pipeline 2 have been presented in (11)(12)(13)(14)(15)(16)(17)(18). The third technique (pipeline 3) takes advantage of deep convolutional neural networks (CNNs) to learn complex features directly from the labelled raw images. The main advantage of these deep CNNs is their ability to extract feature vectors without the intervention of computer vision experts. The works presented in (19)(20)(21)(22)(23)(24) are good examples of pipeline 3. In this study, we followed pipeline 3 and leveraged pre-trained deep learning architectures, considering their recent remarkable performance improvements on image classification tasks.
The creators of the SIPaKMeD dataset (17) reported three deep-learning benchmark experiments. First, they fine-tuned a pre-trained CNN with a softmax classifier. Secondly, they used the pre-activated CNN of the model as a feature extractor and employed a support vector machine (SVM) as a classifier. Thirdly, they extracted the features by activating the CNN of the model to obtain the feature maps before classifying the images into 5 classes using an SVM. These experiments achieved average accuracies of 95.35%, 93.35% and 94%, respectively. Apart from this benchmark, to the authors' knowledge, there are no deep learning based works that build on the SIPaKMeD dataset.
In this study, our main aim was to classify single-cell CPS images from the SIPaKMeD dataset into five classes using fine-tuned, pre-trained image classification deep neural network architectures.

Experiment and Results
To maintain a fair comparison, all training hyperparameters were kept identical across all experimental setups. As illustrated in Figure 2 and Figure 3, the networks were trained over 100 epochs with a categorical cross-entropy loss function, a batch size of 32 and the Adagrad optimizer. All models were trained with an initial learning rate of 0.001, which was reduced by a factor of 0.5 whenever the validation accuracy did not improve over 10 consecutive epochs, down to a minimum of 0.00001. After training, we evaluated the architectures using the test dataset; their evaluation results are summarized in Table 1.
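The learning-rate schedule described above can be sketched in plain Python. This is an illustrative re-implementation of the reduce-on-plateau behaviour (the function name and signature are ours, not from the paper or any framework): start at 0.001, halve the rate after 10 consecutive epochs without a validation-accuracy improvement, and never go below 0.00001.

```python
# Sketch of the reduce-on-plateau learning-rate schedule used in training.
# Hypothetical helper; names are illustrative, not taken from the paper.

def schedule_lr(val_accuracies, initial_lr=1e-3, factor=0.5,
                patience=10, min_lr=1e-5):
    """Return the learning rate in effect after each epoch."""
    lr = initial_lr
    best = float("-inf")  # best validation accuracy seen so far
    wait = 0              # epochs since the last improvement
    lrs = []
    for acc in val_accuracies:
        if acc > best:
            best = acc
            wait = 0
        else:
            wait += 1
            if wait >= patience:          # 10 stagnant epochs in a row
                lr = max(lr * factor, min_lr)
                wait = 0
        lrs.append(lr)
    return lrs
```

For example, one improving epoch followed by ten stagnant ones halves the rate from 0.001 to 0.0005, and a long enough plateau drives it down to the 0.00001 floor.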

Discussion
As can be inferred from Table 1, DenseNet169 outperforms the rest of the architectures in all evaluation metrics. Its normalized average accuracy, precision, recall and F1-score values are 0.990, 0.974, 0.974 and 0.974, respectively. In almost all experiments, koilocytotic cells were the most challenging to classify, i.e. their true positive count is the lowest compared to the other classes. A similar finding is reported in the benchmark manuscript (17). Metaplastic cells were the second most frequently misclassified.
When we further inspected the aforementioned cell types, we found that most of the false negatives of koilocytotic cells were incorrectly classified as metaplastic, and most of the misclassified metaplastic cells were labelled as koilocytotic. This tells us the experiment needs increased data variation between the two classes (see Table 1).
Table 1. Individual and average accuracies, precisions, recalls and F1-scores of the deep neural architectures when evaluated on the test dataset.
Another aspect of the experiment that deserves emphasis is the size of the weight files. DenseNet169 has the smallest weight size (Table 2 shows the size of the original weight file). However, this size is still not suitable for deployment on mobile and edge devices, so experimentation with small-size, mobile image classification architectures is required.
Finally, we compared our findings with the deep convolutional neural network benchmark results reported by the dataset creators. In their work, they presented an average accuracy of 95.35 ± 0.42% using the fine-tuned VGG19 network as a feature extractor and softmax as a classifier. In this research, we achieved a normalized average accuracy of 0.990, which clearly surpasses the benchmark.

Conclusion
In this paper, we experimented with the top ten deep learning architectures from Keras Applications, selected based on their top-1 accuracy on image classification tasks. We used the selected architectures to analyze their performance in classifying single-cell CPS images. All the architectures were re-trained on the SIPaKMeD dataset after changing the output layer from the 1000 classes used in ImageNet to the 5 classes of CPS images. Of the selected 10 pre-trained architectures, DenseNet169 outperformed the others and achieved state-of-the-art performance compared to the benchmark result reported by the SIPaKMeD dataset creators. Using DenseNet169, a normalized average accuracy of 0.990 was achieved, which exceeds the benchmark by approximately 3.7%. In the future, further experimentation with small-size, mobile architectures is required to make the model weights suitable for mobile and edge devices.

Dataset
In this study, a recently introduced, publicly available dataset named SIPaKMeD was used (17). The dataset contains a total of 4049 single-cell images that were manually cropped out from 966 full-slide Pap smear images. The cells were grouped by abnormality and benignity as superficial-intermediate cells (SIC), parabasal cells (PC), koilocytotic cells (KC), dyskeratotic cells (DC) and metaplastic cells (MC). The first two are normal, the next two are abnormal and the last is benign. The distribution of images across the single-cell image classes is approximately uniform: 831, 787, 825, 793 and 813 images, respectively. Figure 5 shows representative images of the five classes.

Data Preparation
The dataset was partitioned into train, validation and test sets. The test set was prepared by randomly taking 100 single-cell images from each cell category, which accounts for nearly 12% of the total dataset. The remaining data was partitioned into 80% for training and 20% for validation.
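The split described above can be sketched as follows. This is a minimal illustration of the partitioning scheme (the helper name and the use of a fixed seed are our assumptions, not details from the paper): hold out 100 random images per class for testing, then divide the remainder 80/20 into training and validation.

```python
import random

# Illustrative sketch of the train/validation/test partitioning described
# above. Hypothetical helper, not code from the paper.

def split_dataset(images_by_class, test_per_class=100, val_frac=0.2, seed=0):
    """images_by_class maps a class label to a list of image identifiers."""
    rng = random.Random(seed)
    train, val, test = [], [], []
    for label, items in images_by_class.items():
        items = list(items)
        rng.shuffle(items)
        # 100 images per class go to the test set (~12% of 4049 overall).
        test += [(x, label) for x in items[:test_per_class]]
        rest = items[test_per_class:]
        # The remainder is split 80% training / 20% validation.
        n_val = int(len(rest) * val_frac)
        val += [(x, label) for x in rest[:n_val]]
        train += [(x, label) for x in rest[n_val:]]
    return train, val, test
```

With the SIPaKMeD class sizes, a class of 831 images yields 100 test, 146 validation and 585 training images.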
As depicted in the pipeline of this study (see Figure 6), during the data pre-processing stage all images were resized to 128×128 before being fed to the classification models, to satisfy the input requirements of the pre-trained architectures. Image normalization was applied to keep the dynamic range of pixel intensities between 0 and 1. During model training, we applied affine transformations, horizontal and vertical flipping and rotation in the range of -45° to +45°, to increase intra-class variation. Even though the cross-class distribution is considerably uniform (the ratio between the classes with the smallest and largest number of images is approximately 0.95), at the end of the pre-processing stage class weight balancing was applied to the training and validation partitions of the dataset using Equation 1. At training time, the resulting class weights for individual batches turned out to be 0.97, 1.03, 0.98, 1.02 and 1.00 for SIC, PC, KC, DC and MC, respectively.
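The class-weight balancing of Equation 1 can be sketched directly, assuming the standard balanced-weight formula w_j = N / (K · N_j), where N is the total number of samples, K the number of classes and N_j the number of samples in class j (consistent with the symbol descriptions given with Equation 1). The class counts below are the SIPaKMeD sizes from the Dataset section.

```python
# Hedged sketch of Equation 1, class weight balancing:
#   w_j = N / (K * N_j)
# N  = total number of samples, K = number of classes,
# N_j = number of samples in class j.

def class_weights(counts):
    n = sum(counts.values())   # N: total number of samples
    k = len(counts)            # K: number of classes
    return {label: n / (k * n_j) for label, n_j in counts.items()}

# SIPaKMeD class sizes reported in the Dataset section (total 4049).
counts = {"SIC": 831, "PC": 787, "KC": 825, "DC": 793, "MC": 813}
weights = class_weights(counts)
```

Rounding these weights to two decimals reproduces the 0.97, 1.03, 0.98, 1.02 and 1.00 values reported above.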

Proposed Approach
In this study, as listed in Table 2, we selected 10 popular pre-trained deep learning architectures for image classification from Keras Applications (25) based on their top-1 accuracy. Top-1 accuracy refers to the normalized performance of a model in predicting exactly the expected answer; for example, the probability of NASNetLarge predicting exactly the correct answer is 0.823 on a unit scale. The selected models were trained on ImageNet (26), a dataset of 1000 classes of natural images. Considering the dataset size (sufficient for deep learning applications) and the image type (microscopic rather than natural images), the selected learning approach was to re-train the entire architectures after fine-tuning the classifier layer. In other words, the feature-extraction base of each architecture was re-trained using the CPS dataset to populate it with new weights, and the output layer was fine-tuned from 1000 classes down to 5 classes. To collapse the output of the feature-extraction base from a 4D tensor to a 2D tensor, an average pooling layer was introduced. Finally, fully connected links were created between the pooling layer and the output dense layer, as indicated in Figure 6. All the experiments were performed on Google's free cloud platform Kaggle with an Nvidia Tx1008 GPU and 12 GB of RAM.
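The pooling step described above can be illustrated without any framework. This plain-Python sketch (our own illustration, not the paper's code) shows how global average pooling collapses a 4D feature tensor of shape (batch, height, width, channels) into a 2D (batch, channels) tensor by averaging each channel's spatial map, which is then connected to the 5-class dense output layer.

```python
# Minimal framework-free sketch of global average pooling: each channel's
# h x w feature map is reduced to its mean, turning a 4D tensor into a 2D one.

def global_average_pool(batch):
    """batch: nested lists with shape (batch, height, width, channels)."""
    pooled = []
    for feature_map in batch:  # one (h, w, c) feature map per image
        h = len(feature_map)
        w = len(feature_map[0])
        c = len(feature_map[0][0])
        pooled.append([
            sum(feature_map[i][j][ch] for i in range(h) for j in range(w)) / (h * w)
            for ch in range(c)
        ])
    return pooled  # shape (batch, channels)
```

In the actual models this role is played by the framework's average pooling layer; the sketch only makes the 4D-to-2D conversion concrete.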

Evaluation Metrics
We evaluated the performance of the classification models using four objective evaluation metrics: accuracy, precision, recall and F1-score. These metrics base their mathematical foundation on the true positive (TP), true negative (TN), false negative (FN) and false positive (FP) counts of the models' predictions. A comprehensive summary of the metrics can be found in (27).
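The four metrics can be written out from their usual TP/TN/FP/FN definitions (the explicit formulas were lost from this copy of the text, so the standard forms are assumed here):

```python
# Standard definitions of the four evaluation metrics (assumed to match the
# formulation the paper references):
#   accuracy  = (TP + TN) / (TP + TN + FP + FN)
#   precision = TP / (TP + FP)
#   recall    = TP / (TP + FN)
#   f1        = 2 * precision * recall / (precision + recall)

def metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```

For a single class, TP/TN/FP/FN are read off the confusion matrix (Figure 4), and the per-class values are averaged to give the figures reported in Table 1.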

Figure 1. Common pipelines to classify CPS images: classification using hand-crafted features

Figure 2. Training accuracy (left) and training loss (right) of the architectures over 100 epochs.

Figure 3. Validation accuracy (left) and validation loss (right) of the architectures over 100 epochs.

Figure 4. Confusion matrix for classification of the test dataset using the fine-tuned model.

Figure 5. Sample images from the SIPaKMeD dataset: superficial-intermediate (a), parabasal (b), koilocytotic (c), dyskeratotic (d) and metaplastic (e) cells.

Equation 1. w_j = N / (K × N_j)

where w_j stands for the weight of class j, N for the total number of samples, K for the number of classes and N_j for the number of samples in class j.

Table 2. Weight sizes of the proposed pre-trained classification models and their top-1 accuracy on the ImageNet validation dataset.