### Materials

The dataset contains 50 corneal endothelium images from the central cornea of 50 glaucomatous eyes, imaged with a non-contact specular microscope (SP-1P, Topcon Co, Japan). They are part of an ongoing study in The Rotterdam Eye Hospital regarding the implantation of a Baerveldt glaucoma drainage device.

Glaucoma is a condition related to the buildup of pressure inside the eye, which can eventually damage the optic nerve. In primary open-angle glaucoma (POAG), the eye cannot properly drain the aqueous humor through its drainage system, whereas in primary angle-closure glaucoma (PACG) the iris blocks the entrance of the drainage system. In PACG, surgical intervention is usually required to remove the blockage. In POAG, eye drops are the first treatment option in mild cases, either to reduce the formation of fluid in the eye or increase the outflow, but surgical intervention is usually considered when these treatment modalities have proven ineffective. Trabeculectomy is a common procedure, which consists of a small hole in the sclera, covered by a thin trap-door, which makes it possible to drain the aqueous humor out of the eye. However, scarring may lead to failure of the trabeculectomy. Therefore, but also because of other possible complications with trabeculectomies, glaucoma drainage devices are often preferred over trabeculectomy. Indeed, in refractory cases, the success rates five years postoperatively of Baerveldt implants are higher than those of trabeculectomies [42].

A common postoperative complication after implantation of a Baerveldt (or similar glaucoma drainage) device is a change in the CE, in both cell count and cell shape [43, 44], due to the proximity of the device’s tube. In the study currently ongoing in The Rotterdam Eye Hospital, eyes were imaged before and after the implantation of the device. Here, we focused on solving the cases prior to the implantation, which let us assume that the CE was only affected by the natural aging process. Indeed, it has not been observed that glaucoma has any direct effect in the morphology of the CE cells. In our dataset, the average age is 64.8±9.2 (mean ± SD). Our dataset showed a large variability in cell size and morphology, with a range of 1100-2800 cells/mm^{2} in ECD, and 18-36% in CV, and 44-74% in HEX.

Each image covers an area of 0.25 mm × 0.55 mm and was saved as 8-bits grayscale images of 240 ×528 pixels. According to the manufacturer, pixels have a lateral size of 1.038 *μ*m. On average, there are 240 cells per image. One expert created the gold standard by performing manual segmentation of the cell edges using an open-source image manipulation program (GIMP).

### U-net architecture

The U-net follows a standard fully convolutional architecture, with a contraction and an expansion path, each composed of four resolution steps (Fig. 8). In the contraction path, each step consists of two 4 ×4 padded convolutions with a rectified linear unit (ReLU), a dropout layer with a drop rate of 50% between the two convolutions, and a 2 ×2 max pooling with stride 2 at the end for downsampling. In the expansion path, each step contains a 4 ×4 transposed convolution with stride 2 for upsampling, a concatenation with the corresponding feature map from the contraction path, two 4 ×4 padded convolutions with ReLU, and a dropout layer with a drop rate of 50% between the convolutions. The convolutional layers in the first resolution step have 32 feature channels, doubling it at each downsampling step, and halving it at each upsampling step. In the last layer, a 1 ×1 convolution reduces the channels to the number of classes, which is set to two (cell body and cell edges). A cross-entropy loss function with a pixel-wise soft-max activation is used over the final feature map. No class weighting is employed. The optimizer of our choice is Adam [45] with an initial learning rate (*l**r*_{i=0}) of 0.001 and a decay of 0.001, such that *l**r*_{i}=*l**r*_{i−1}·(1/(1+decay·iteration)), where *i* denotes iteration. The network accepts the whole image as input. A batch size of 4 images is used.

Compared to the original U-net architecture, several modifications were introduced. First, we used a kernel size of 4 ×4 instead of 3 ×3, and the network was downscaled in width and depth, halving the number of feature channels and removing one resolution step.

Second, dropout layers were added in between the two consecutive convolutions per resolution step. Dropout is a regularization method used to avoid over-fitting, originally described for neural networks [46, 47]. It stochastically sets to zero a certain number of activations of hidden units at each training iteration. This prevents the co-adaptation of feature detectors by forcing neurons to rely on population behavior. In CNNs, it simply sets input values (of the feature maps) to zero.

Third, we used transposed convolutions for upsampling in the expansion path. The transposed convolution is described as the operation that forms the same connectivity as the normal convolution but in the opposite direction [48]. Since the weights in the transposed convolution are learnt, this avoids to predefine an interpolation method for upsampling. Unfortunately, transposed convolutions can also produce a checkerboard effect due to the uneven overlapping of the filter range in the output pixels [49]. Specifically, the uneven overlapping occurs when the kernel size is not divisible by the stride. While the CNN could, in principle, learn weights to avoid this problem, in practice this effect is often observed, especially in images with strong colors. One practical solution is to use a 4 ×4 kernel size with a stride of 2 [50]. Nonetheless, we did not observe in our work the checkerboard effect when using filters of 3 ×3 with a stride of 2.

### SW-net architecture

The SW-net architecture follows the same contraction path as the aforementioned U-net (Fig. 8). However, instead of the entire image, a patch of size 64 × 64 is provided as input, and the filter size of the convolutional layers is 3 ×3. At the end of the contraction path it adds a global averaging pooling layer, where each channel is reduced to its average value, and a fully connected layer of two neurons, which provides the outcome for the two classes regarding the central pixel of the patch. A batch size of 128 patches is used here, which holds a similar amount of data as the batch in our U-net. Moreover, the same loss function and optimizer is employed.

The original Cireşan et al.’s architecture [3] consisted of four stages of one convolutional layer followed by max-pooling. All convolutional layers had 48 feature maps and filters of size 4 ×4 (one of 5 ×5). The network ended with two fully connected layers: one of 200 neurons followed by another with 2 neurons to obtain the class labels. In comparison, our network has doubled the number of convolutional layers, albeit with a smaller kernel size, increasing the receptive field (61 pixels instead of 48 pixels). Moreover, we substituted the large fully connected layer with a global averaging pooling. This idea was originally suggested by Lin et al. [51], where he argued that fully connected layers at the end of a CNN are prone to over-fitting, whereas global averaging layers are more native to the convolution structure, over-fitting is avoided, and the feature maps can be interpreted as categories confidence maps.

### Prediction

For U-net, the segmentation was retrieved directly from the network output. In the SW-net, one patch per each pixel was retrieved, building up the segmentation image with the classification value of each patch. Images were mirrored in order to extract the patches that reached beyond the image borders.

### Postprocessing

To obtain the final segmentation, we smoothed the CNN output and applied the classic watershed algorithm [52]. Specifically, we first estimated the average cell size in the image by Fourier analysis in order to built a Gaussian smoothing filter whose standard deviation was related to that size. It is well known how the 2D Fourier Transform (FT) of a CE image shows a distinctive concentric ring due to the fairly regular pattern of the cells [27], and for the output of the CNN that ring is clearly noticeable (Fig. 9a). Selig et al. described in [27] how the radius of the ring, called *characteristic frequency* (*f*^{∗}), is related to the most common cell size in the image, *l*=1/*f*^{∗}. We estimated the radius by first applying a method called ‘reconstruction by dilation’ to remove the low frequencies (defined by Selig et al. in [27]) and later computing the 1D radial magnitude, defined as the angular averaging of the magnitude of the 2D FT of the images,

$$ \mathcal{F}_{RM}(f)=\frac{1}{2\pi}{\int\nolimits}_{0}^{2\pi}{|\mathcal{F}(f,\theta)|d\theta}, $$

(4)

where \(\mathcal {F}(f,\theta)\) is the FT of the image in polar coordinates (Fig. 9b). In our previous work [39], we described a fitting function to estimate the peak position (*f*^{∗}) and also derived a parameter, *k*_{σ}=0.20, used to adapt the filter *σ* to each image, *σ*=*k*_{σ}/*f*^{∗}. Once images were smoothed, the watershed algorithm was applied, and the clinical parameters were estimated from the resulting images. The classic watershed does not require any parameter tuning, but it is expected that each object (cell) to detect has a single local minimum, otherwise cells will be oversegmented.

### Labels

The gold standard, a binary image where value 1 indicates a cell edge and value 0 represents a cell body, was defined such that cell edges are 8-connected-pixel lines of 1 pixel-width (Fig. 10b). In the intensity image, the cell edges might appear thicker, with a steep but clear transition in intensity from the peak of the edge towards the inner cell. However, this thickness might vary considerably even in the same image (Fig. 10a). Hence, instead of using the gold standard images as labels, we proposed to use probabilistic labels where edges appear thicker and in which the aforementioned intensity transition between edges and cells is preserved. There are three reasons for doing so: (1) it is counterproductive to teach the network that the pixels adjacent to the annotated 1-pixel-width edge are cell pixels as they usually have the same characteristics as the annotated edge; (2) mimicking the intensity transition in the labels is a more natural approach and helps the network in its classification task; (3) as the network will learn to replicate this pattern (gradual intensity transition between edges and cell bodies), this will be beneficial when applying the watershed algorithm in the postprocessing step.

To create the probabilistic labels, we convolved the gold standard images with a 7 ×7 isotropic unnormalized Gaussian filter of standard deviation 1 pixel. This allowed all pixels with label 1 (edges) in the gold standard to keep a value equal to 1 in the probabilistic label image, with increasingly smaller probabilities for pixels further away from the annotated cell edge (Fig. 10c). Hence, the pixels in the label image can be regarded as the probability of being part of an edge. This is used as the target output of the networks to be trained. During evaluation, the edge class was considered any pixel with *p*>0.5. In practice this means that we accept a 1 pixel error in the location of the edge. For comparative purposes, we also evaluated the outcome segmentation when the ‘hard’, binary gold standard labels are used as target output.

### Preprocessing of the intensity images

Specular microscopy images usually have a non-uniform luminosity across the image and low contrast (Fig. 11a). Here, we want to evaluate whether the CNN can benefit from some kind of image enhancement. Furthermore, it is common practice in neural networks to standardize the input images,

$$ {image}_{stand} = \frac{\text{image} - mean(\text{image})}{std(\text{image})}, $$

(5)

or normalize them,

$$ {image}_{norm} = \frac{\text{image} - min(\text{image})}{max(\text{image})-min(\text{image})}. $$

(6)

To enhance local image contrast, we proposed to use contrast limited adaptive histogram equalization (CLAHE) [36] with a kernel of 24 ×24 (Fig. 11b). This kernel size matches approximately the area of the average cell. A kernel with a size less than half of a cell would overamplify noise, whereas a kernel too large would reduce the benefits of local contrast enhancement. In earlier work on aneurysm detection in fundus images, we achieved a much better performance with intensity normalization than without it [53].

In summary, we tested the influence of preprocessing by analyzing five possible scenarios: feeding the raw images, normalizing them, standardizing them, and enhancing them by CLAHE (with and without standardization, since the output of CLAHE is already normalized).

### Data augmentation

Given the nature of the images, flipping them horizontally and/or vertically was a natural way of augmenting the training data by a factor of four. We avoided other transformations, such as rotation or elastic deformations [54], for two reasons: (1) the images show a small degree of distortions only in horizontal and vertical lines, hence rotating or deforming the images would create new noise patterns that do not exist in the original images; (2) when rotating, the image corners need to be filled, either by mirroring the image or setting that area in black; either way, we are introducing new patterns to be solved by the network.

### Implementation details, and computational cost

The data set was divided in 5 folds of 10 images each. To obtain the optimal network parameters, we used 4 folds for training and 1 for validation/test. For the evaluation of the CNN segmentation and the clinical parameters, a 5-fold cross-validation approach was employed in order to test all the remaining folds, using the parameters determined in the first test set.

Regarding class weighting in U-net, we evaluated whether adding weights in the loss function was advantageous. Here, the edge class has 4 times less pixels than the cell class. For the SW-net, we sampled the same amount of patches per class in each batch.

Other loss functions were tested, specifically mean-squared and mean-absolute loss, but with very similar performance as using cross-entropy. Batch normalization layers [55] were also tested by including them after every ReLU, but this created slightly more over-fitting and degraded the performance. Similarly to what Springenberg et al. reported in [56], no differences were observed in SW-net if max-pooling layers were substituted with a stride of 2 in the previous convolutional layer.

CNN filter weights were initialized from an uniform distribution of mean=0 and width≈1 (glorot uniform initializer in Keras). Networks were coded in Python 3.6 programming language, using the Keras library and Tensorflow as backend. Experiments were run in the free research tool Google Colaboratory, which includes GPU support (Tesla K80), taking roughly 0.8 s per training iteration in U-net and 0.5 s for the SW-net. The testing took less than 1 s per image for U-net. However, for the SW-net, evaluating all patches in an image took around 1 min. The postprocessing and parameter estimation took barely 1-2 s per image.

### Metrics and statistical analysis

In the evaluation of the CNNs performance, accuracy and AUC were provided. However, due to the probabilistic nature of the labels, pixels with label values *p* close to 0.5 are not relevant for our ultimate goal. Indeed, the most important pixels are either at the crest of the cell edge (*p*=1) or at the inner cell body (*p*=0). Furthermore, the class imbalance makes it important to evaluate each class performance independently. Hence, we also reported the precision (PRE), sensitivity (SEN), and specificity (SPE) for the final designs, but only considering the pixels with values 0 and 1 in the label images. For clarification purposes, we placed an asterisk (*) in the metrics that followed this rule.

In the evaluation of the postprocessed segmentation, only the cells within the area of the gold standard were kept, discarding all cells in contact with the image borders. We used the modified Hausdorff distance (MHD) [38] to measure the distance between the gold standard and the proposed segmentation. MHD is defined as

$$ \text{MHD}(\mathcal{U,V})=max(hd(\mathcal{U,V}), hd(\mathcal{V,U})), $$

(7)

where

$$ hd(\mathcal{U,V})=\frac{1}{|\mathcal{U}|} \sum\limits_{a \in \mathcal{U}} \min_{b \in \mathcal{V}} ||a-b||_{2}, $$

(8)

\(\mathcal {U}\) is the gold standard segmentation, and \(\mathcal {V}\) the proposed segmentation.

DICE [31] was used to assess the segmentation at the cell level. We computed the DICE for each cell independently, reporting the average DICE. Specifically, for each cell (*C*_{i}) in the gold standard images, we select the superpixel (*S*_{j}) in the proposed segmentation with the largest overlap to *C*_{i}, such that TP =*C*_{i}∩*S*_{j} (True Positive), FN =*C*_{i}∖*S*_{j} (False Negative), FP =*S*_{j}∖*C*_{i} (False Positive),

$$ \text{DICE}_{ith \ cell}=\frac{2 \cdot TP}{2 \cdot TP + FP + FN}, $$

(9)

$$ \text{DICE}_{image}=\frac{1}{n} \sum\limits_{i=1}^{n} \text{DICE}_{ith \ cell}, $$

(10)

where *n* is the number of cells in the image.

We also evaluated the number of cells correctly segmented, reporting the number of cells that were oversegmented (divided in more than one superpixel) and undersegmented (within a superpixel that covers more than one cell). We considered a cell was correctly segmented if its *T**P*_{i}>0.80·*m**a**x*(*C*_{i},*S*_{j}). That margin was added to allow small deviations in the cell boundary locations and was selected after visual analysis.

For the three previous metrics, either the parametric paired t-test or the non-parametric Wilcoxon signed-rank test was performed to determine which method, U-net or SW-net, was more accurate. We used the non-parametric test when the distributions did not fulfill the normality assumption (Shapiro–Wilk normality test). A *p*-value of *p*<0.05 was considered statistically significant.

In the evaluation of the clinical parameters, a statistical analysis based on linear mixed-effects models [57] was performed to determine, for each parameter, whether there was a statistically significant difference in accuracy (smaller absolute mean) and in precision (smaller variance) between the two estimation errors. To determine whether the variances were different, we used a likelihood test to compare a model that assumes equal variances between both estimation errors with a model that assumes different variances. From the fixed effects test of the models we evaluated whether the absolute mean values in both estimations were different. No correction for multiple testing was applied, and a *p*-value of *p*<0.05 was considered statistically significant.