Green roofs: automatic detection of green roofs and vegetation type from aerial imagery¶
Clotilde Marmy (ExoLabs) - Swann Destouches (Uzufly) - Ueli Mauch (Canton of Zürich) - Alessandro Cerioni (Canton of Geneva) - Roxane Pott (swisstopo)
Proposed by the Canton of Zürich and Canton of Geneva - PROJ-VEGROOFS
Project start in November 2023 - Intermediate publication on November 7, 2024 - Complementary publication on March 3, 2025
All scripts are available on GitHub: the traditional machine learning approach and the deep learning approach.
This work by STDL is licensed under CC BY-SA 4.0
Abstract: With rising temperatures and increased rainfall, mapping green roofs is becoming important for urban planning in dense areas like Geneva, Zürich and the surrounding areas. Green roofs, whether engineered or spontaneous, provide cooling, rain capture, and habitats, supporting biodiversity. Using national aerial imagery and land survey data, the study focuses on identifying green roofs and distinguishing among various vegetation types, including extensive, intensive, spontaneous, lawn, and terrace categories. Machine learning and deep learning approaches have been developed to detect and classify green roofs in two study areas in the cantons of Geneva and Zürich. Regarding the machine learning setup, statistical descriptors of the roof occupancy were derived from airborne images to train a random forest and a logistic regression predicting whether a roof was green or not. Metrics on the test dataset showed that the best performance was achieved by combining the two models, trained with pixel statistics from potential vegetated areas defined by NDVI and luminosity thresholds on the original images. This combination yielded a recall of 0.87 for the green class and an F1-score of 0.85 on the entire test set. Secondly, in the approach leveraging deep learning techniques, a custom algorithm has been implemented. The model is an adaptation of the DeepLabV3 model. It reuses its ASPP (Atrous Spatial Pyramid Pooling) module and processes its output signal in order to achieve image classification. The model consists of three blocks: the backbone, the ASPP and a residual Multi-Layer Perceptron (MLP). The backbone itself is a convolutional encoder with adjustable width and depth. The model has been trained to perform multiclass classification among the following categories: bare, terrace, spontaneous, extensive, lawn and intensive, on which it achieved recall scores of 0.91, 0.77, 0.67, 0.79, 0.68 and 0.50 respectively. To challenge the machine learning approach, it has also been trained for the binary classes, resulting in an F1-score of 0.92 on its validation set.
1 Introduction¶
With the rise of temperatures, the intensification of rain events and the care for biodiversity, mapping green roofs for urban planning is gaining importance. In cantons with dense urban regions, like Canton of Geneva and Canton of Zürich, the presence of green roofs is an aspect to be taken into account in urban planning given the role they play in creating cool islands, capturing rainfall and hosting biodiversity.
Vegetation is generally found on flat roofs, flat parts of roofs or slightly tilted roofs. Formally, green roofs are engineered systems for growing plants on rooftops, like extensive and intensive green roofs. The former host mosses, grasses and small vegetation; the latter host lawn, bushes and even trees. The green roof concept can be extended to spontaneous green roofs and terraces. The former develop spontaneously. Both are considered green roofs for biodiversity reasons.
The detection of green roofs can be addressed by different methods applied to aerial imagery: thresholding on the NDVI, classification by machine learning based on engineered features, and object detection by deep learning. In the literature 1, 2, the use of thresholds on the NDVI gave variable performance and required considerable manual work. This is due to the variability of the NDVI, caused either by meteorological events preceding image acquisition, by the vegetation period during image acquisition, or by other site-specific reflectance factors that have an impact on image rendering. On the other hand, the classification of entire roofs with traditional machine learning showed good results 3. For the detection of green patches on roofs, object detection by deep learning has been explored 4, 5, 6, but such methods require a significant ground truth (GT) labeling effort.
Although the binary problem, i.e. the distinction between bare and green roofs, is treated in the literature, no work on the multiclass problem was found. However, image classification techniques have been applied to classify roofs according to their geometries 7, 8 or according to their materials 9, 10. This encourages trying a similar approach to classify green rooftops into several classes.
The aim of this project is first to detect green roofs using national aerial imagery and land survey. Secondly, the project will explore the classification of green roofs into different existing types.
2 Study areas¶
The study areas on both cantons of Geneva and Zürich have been defined to contain a variety of green roof types and bare roofs. Figures 1 and 2 show the study areas.

Figure 1: Study area over the city of Geneva and surrounding area.

Figure 2: Study area over the city of Zürich and surrounding area.
3 Data¶
The main input data of the project are aerial images and a vector layer of labeled building footprints as a ground truth.
3.1 Aerial imagery¶
Aerial images acquired in early summer every six years by the national aerial imagery survey, SWISSIMAGE RS, have been used. This corresponds to acquisitions in 2022 and 2023, for Zürich and Geneva respectively. The project makes use of the 10 cm resolution and the red, green, blue and near infrared channels of the product. The original product is acquired in the form of raw image captures encoded in 16-bit, but for lighter processing and normalization, the imagery was converted and delivered as regular-sized 8-bit images, hereafter referred to as SWISSIMAGE RS 8-bit.
In addition, an in-development product, derived from SWISSIMAGE RS, is also available for testing. It consists of SWISSIMAGE RS orthorectified on the building footprints of the large-scale topographic landscape model of Switzerland swissTLM3D (layer TLM TLM_GEBAEUDE_FOOTPRINT). The advantage of this innovative data is that the tilt of the building is mostly corrected, so the image and the land survey vector layer are aligned as illustrated in Figure 3.

Figure 3: Tilted building on the orthophoto and orthorectified orthophoto for the rooftop.
3.2 Ground truth¶
For training and testing of machine learning techniques, a ground truth is necessary. The building footprints documented in the land survey, established and maintained by the cantons, have been used as geometry for the ground truth. Then, the beneficiaries have visualized the footprints on top of the aerial imagery to attribute a vegetation tag (vegetated or not) and a class: bare, terrace, spontaneous, extensive, lawn and intensive as depicted in Figure 4.

Figure 4: Alongside bare roofs, five classes of green roofs are present in the ground truth: green terraces and lawns, as well as spontaneous, extensive and intensive roofs.
Here are the characteristics of each class:
- bare: In this project, roofs with less than 10% of vegetation cover are considered bare. They are made of roof tiles, concrete, metal, glass or solar panels.
- extensive: They show a marble effect, due to height variation in the substrate and to the different species used: moss, sedum and grasses.
- spontaneous: Roofs that have been spontaneously colonized by plants. Vegetation is likely to develop in depressions of the roofs and is more dependent on external factors. Patches can be observed. Spontaneous green roofs are heterogeneous (color, height of vegetation, texture). At a young stage, they do not cover all the available space and, above all, their evolution is observable when going back a few years in time.
- lawn: Lawn can be found on roof tops or on top of underground car parks (fake soil). This is kind of a sub-class of intensive green roofs.
- intensive: Intensive roofs are made out of lawn, shrubs, bushes and trees. They grow on a thicker substrate than extensive roofs.
- terraces: These are roofs with movable vegetation. Unlike green roofs, terraces are often designed for recreational use, although some can show quite developed vegetation.
Table 1 summarizes the class diversity in the ground truth.
Table 1: Summary of the ground truth data, showing the number of roofs and their attribution into specific categories.
Class | GE | ZH | Total | Percentage |
---|---|---|---|---|
Bare | 2102 | 875 | 2977 | 78.6 |
Extensive | 47 | 398 | 445 | 11.8 |
Spontaneous | 48 | 78 | 126 | 3.3 |
Lawn | 64 | 23 | 87 | 2.3 |
Intensive | 68 | 14 | 82 | 2.2 |
Terrace | 50 | 17 | 67 | 1.8 |
3.2.1 Challenging patterns in sample labeling¶
Some roofs were difficult to label, although the time travel function for SWISSIMAGE and the construction year of the buildings were used to best assess the condition of the roof without going on site. An error-free ground truth is therefore not guaranteed. The choice of the corresponding class for a sample is not always a trivial task and part of the decision was made subjectively.
Since multiple experts were involved in this project, they were provided with the same subset of samples to label, in order to highlight difficult patterns. The chosen subset consists of the samples that were misclassified the most during the training of the DL model. It was made of 57 samples, of which 14 were not classified the same by the two experts. While this experiment did help to point out some patterns, its small size and the nature of its selection give it no statistical value, and its results should not be extrapolated to the full dataset.
However, the following tendencies have been highlighted:
- The classes terrace and bare tend to be difficult to tell apart. The verdict was that terraces often represent a very small portion of the sample, so experts have to set a threshold based on their own judgment, which can differ considerably from one person to another. Examples are given in Annexes 7.14.1.
- The classes spontaneous and extensive tend to be difficult to distinguish. Indeed, the difference can come down to a few lighter green parts on the roof and be quite subtle. Looking at the temporal evolution of the roof can help to evaluate the situation, but not always. Examples are given in Annexes 7.14.2.
- More globally, the first tendency applies between almost every green class and the bare one. Instead of visually estimating the portion of greenery on a roof, it could be interesting to have a method to quantify it. The method developed in the machine learning solution and documented in Sections 4.3.4 and 5.1.4 goes in that direction, but shadows, dry vegetation and other terrain conditions prevent it from focusing only on the correct area.
3.3 Other data¶
In addition, the canopy height models (CHM) derived from LiDAR acquisitions have been used to mask pixels corresponding to the vegetation of overhanging trees as illustrated in Figure 5.

Figure 5: Illustration of an overhanging tree above a garage.
For the study area in Zürich, the available WMS of the CHM produced by the City of Zürich with a LiDAR acquisition of 2022 has been converted to a binary raster and vectorized. For the canton of Geneva, the already vectorized layer of the CHM from LiDAR acquisition of 2019 has been used.
4 Method¶
The Method chapter consists of two main parts: classification by machine learning and by deep learning.
4.1 Evaluation metrics¶
To evaluate the performance of the machine learning algorithms, traditional metrics have been chosen:
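- Overall accuracy (OA): the proportion of correctly predicted samples over the entire ground truth, \(OA = \frac{TP + TN}{P + N}\).
- Recall of the green class: measures how sensitive the model is to the green roofs, \(recall = \frac{TP}{TP + FN} = \frac{TP}{P}\).
- Precision: the proportion of roofs predicted as green that are truly green, \(precision = \frac{TP}{TP + FP}\). It is useful in cases where false positives are more critical than false negatives.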
- Balanced accuracy (BA): deals with imbalanced datasets as it corresponds to the average of recall obtained on each class.
- F1-score: Harmonic mean of precision and recall. The F1-score overcomes the limitations of overall accuracy in cases of dataset imbalance.
- F2-score: F1-score with emphasis on the recall.
- F0.5-score: F1-score with emphasis on the precision.
In the three aforementioned equations, the variables used are:
- TP are true positives, green roofs correctly predicted as such
- TN are true negatives, bare roofs correctly predicted as such
- FN are false negatives, green roofs not detected
- FP are false positives, bare roofs predicted as green
- P are the green roofs in the ground truth
- N are the bare roofs in the ground truth
That works in the binary case. In the multiclass case, however, each class (P) is evaluated against all others (N). Each metric ranges between 0 and 1, respectively the lowest and the highest values to be measured.
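As an illustration of the metrics listed above, the following minimal sketch computes them for a toy binary prediction with scikit-learn. The function name `binary_metrics` and the toy labels are hypothetical and only serve the example; green roofs are assumed to be encoded as 1 and bare roofs as 0.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    fbeta_score,
    precision_score,
    recall_score,
)


def binary_metrics(y_true, y_pred):
    """Return the evaluation metrics of Section 4.1 for the green class (label 1)."""
    return {
        "OA": accuracy_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred, pos_label=1),
        "precision": precision_score(y_true, y_pred, pos_label=1),
        "BA": balanced_accuracy_score(y_true, y_pred),
        "F1": fbeta_score(y_true, y_pred, beta=1.0),
        "F2": fbeta_score(y_true, y_pred, beta=2.0),      # emphasis on recall
        "F0.5": fbeta_score(y_true, y_pred, beta=0.5),    # emphasis on precision
    }


# Toy example (hypothetical): 4 green roofs, 6 bare roofs.
# TP = 3, FN = 1, FP = 1, TN = 5.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])
metrics = binary_metrics(y_true, y_pred)
```

In this toy case the overall accuracy (0.8) is higher than the balanced accuracy, because the over-represented bare class is predicted better than the green one.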
4.2 Uncertainty and calibration metrics¶
In combination with evaluation metrics, uncertainty and calibration metrics make it possible to account for overconfidence in predictions. Neural networks in particular are known to be prone to overconfidence.
In order to monitor the behaviour of the models during the training, several uncertainty and calibration metrics have been implemented:
- The Brier score: measures the mean squared difference between predicted probabilities and actual outcomes. It is common in classification tasks for evaluating probabilistic predictions.
\[ \text{Brier} = \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} \left( p_{i,k} - y_{i,k} \right)^2 \]
\(p_{i,k}\) = predicted probability for class \(k\) for instance \(i\) ; \(y_{i,k}\) = ground truth (1 if class \(k\) is correct, else 0) ; \(N\) = number of samples ; \(K\) = number of classes.
- The Expected Calibration Error (ECE): measures how well predicted probabilities align with observed frequencies and indicates how well the predicted confidence matches the true likelihood of correctness.
\[ \text{ECE} = \sum_{b} \frac{|S_b|}{N} \left| acc(S_b) - conf(S_b) \right| \]
\(S_b\) = set of samples in bin \(b\) ; \(acc(S_b)\) = accuracy in \(b\) ; \(conf(S_b)\) = mean confidence in \(b\) ; \(N\) = total number of samples.
- The Average Prediction Entropy (APE): measures the average uncertainty of predictions based on the entropy of the predicted probability distribution. Higher entropy means greater uncertainty in predictions.
\[ \text{APE} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} p_{i,k} \log p_{i,k} \]
\(p_{i, k}\) = predicted probability for class \(k\) for instance \(i\) ; \(N\) = number of samples ; \(K\) = number of classes.
- The Negative Log-Likelihood (NLL): measures the log-loss of predicted probabilities relative to the true labels. It quantifies how well the probabilistic predictions match the ground truth and is a standard metric in probabilistic classification models, particularly in deep learning. Lower values indicate the model is both accurate and confident in its predictions.
\[ \text{NLL} = -\frac{1}{N} \sum_{i=1}^{N} \log p_{i,y_i} \]
\(p_{i,y_i}\) = predicted probability for correct class \(y_i\) for sample \(i\) ; \(N\) = number of samples.
- The Uncertainty-Aware Accuracy (UAA): is an extension of accuracy that takes into account the model’s uncertainty by considering confident and correct predictions.
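To make the first four metrics concrete, here is a minimal NumPy sketch based on their usual definitions. It is not the project's implementation: the function names, the ten equal-width confidence bins used for the ECE and the small `eps` added inside the logarithms are assumptions made for the example.

```python
import numpy as np


def brier_score(probs, labels):
    """Mean over samples of the squared distance to the one-hot ground truth."""
    onehot = np.eye(probs.shape[1])[labels]
    return float(np.mean(np.sum((probs - onehot) ** 2, axis=1)))


def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted mean gap between accuracy and confidence over confidence bins.

    Assumption: n_bins equal-width bins on the confidence of the top class.
    """
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for low, high in zip(edges[:-1], edges[1:]):
        in_bin = (conf > low) & (conf <= high)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
    return float(ece)


def average_prediction_entropy(probs, eps=1e-12):
    """Mean entropy of the predicted distributions (eps avoids log(0))."""
    return float(-np.mean(np.sum(probs * np.log(probs + eps), axis=1)))


def negative_log_likelihood(probs, labels, eps=1e-12):
    """Mean negative log-probability assigned to the correct class."""
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked + eps)))


# Toy predictions (hypothetical) for three samples and two classes.
probs = np.array([[0.95, 0.05], [0.65, 0.35], [0.15, 0.85]])
labels = np.array([0, 1, 1])
scores = {
    "brier": brier_score(probs, labels),
    "ece": expected_calibration_error(probs, labels),
    "ape": average_prediction_entropy(probs),
    "nll": negative_log_likelihood(probs, labels),
}
```

The second sample is predicted wrongly with a confidence of 0.65, which is what dominates both the Brier score and the ECE in this toy case.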
4.3 Classification by machine learning¶
Machine learning algorithms make use of descriptors to learn characteristics of the classes to predict. In this project, descriptors were derived from the NRGB images.
4.3.1 Raster preparation¶
A first step was to compute the normalized difference vegetation index (NDVI) and luminosity rasters corresponding to the images following these equations:
\[ NDVI = \frac{NIR - R}{NIR + R} \]
\[ luminosity = R + G + B \]
where R, G, B and NIR stand respectively for the pixels of the red, green, blue and near infrared (NIR) bands. The resulting NDVI index is between -1 and 1, whereas the luminosity range depends on the image format: 0 to 765 in 8-bit, 0 to 196605 in 16-bit.
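The raster preparation can be sketched with NumPy as follows. This is a minimal sketch, assuming the four bands are already loaded as arrays of the same shape; the function name `ndvi_and_luminosity` is hypothetical, and setting the NDVI to 0 where NIR + R equals 0 is a choice made for the example.

```python
import numpy as np


def ndvi_and_luminosity(red, green, blue, nir):
    """Compute the NDVI and luminosity rasters from the four image bands."""
    # Cast to float first: 8-bit or 16-bit integers would overflow or truncate.
    red, green, blue, nir = (
        np.asarray(band, dtype=np.float64) for band in (red, green, blue, nir)
    )
    denom = nir + red
    # Assumption: NDVI is set to 0 where NIR + R == 0 to avoid a division by zero.
    ndvi = np.divide(nir - red, denom, out=np.zeros_like(denom), where=denom != 0)
    luminosity = red + green + blue
    return ndvi, luminosity


# Toy 1 x 2 raster in 8-bit (hypothetical values).
red = np.array([[50, 0]], dtype=np.uint8)
green = np.array([[100, 255]], dtype=np.uint8)
blue = np.array([[60, 255]], dtype=np.uint8)
nir = np.array([[150, 0]], dtype=np.uint8)
ndvi, luminosity = ndvi_and_luminosity(red, green, blue, nir)
```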
4.3.2 Overhanging trees¶
As mentioned in Section 3.3, it was observed on the images that some big trees beside buildings may cover the roofs and erroneously lead to detection of green roofs. The mask derived from the CHM was buffered by 1 m to exclude misleading pixels.
4.3.3 Statistics per roof¶
After filtering the bands for overhanging vegetation, the following pixel statistics were computed per roof on the red, green, blue, NIR, luminosity and NDVI bands:
- mean
- median
- minimum
- maximum
- standard deviation
This leads to 30 descriptors. For instance, for the roof in Figure 6, the statistics (min, max, mean, median, standard deviation) of the pixels in the green band (image on the right) are:
- min = 6
- max = 255
- mean = 122.272
- median = 123
- standard deviation = 43.22
Furthermore, Figure 3 shows the leaning of the building in the image, leading to mismatch between the building in the image and in the land survey. To overcome that, an inner buffer of 1 m was applied to the geometry prior to the statistic computation.

Figure 6: The statistics (min, max, mean, median, standard deviation) of the pixels within the roof perimeter are computed for each building. Here, for the green band (image on the right), the statistic values obtained are: min = 6, max = 255, mean = 122.272, median = 123 and standard deviation = 43.22.
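The computation of the 30 descriptors can be sketched as below. It is a minimal sketch, assuming that the roof geometry (after inner buffering and masking) has already been rasterized into a boolean mask aligned with the bands; the names `roof_descriptors` and `STATS` are hypothetical, and the population standard deviation (NumPy default) is an assumption.

```python
import numpy as np

# The five statistics computed on each band.
STATS = {
    "mean": np.mean,
    "median": np.median,
    "min": np.min,
    "max": np.max,
    "std": np.std,  # assumption: population standard deviation (ddof=0)
}


def roof_descriptors(bands, roof_mask):
    """Return one descriptor per (band, statistic) pair for the pixels of a roof.

    bands: dict mapping a band name to a 2D array.
    roof_mask: boolean 2D array, True for the pixels inside the roof.
    """
    descriptors = {}
    for band_name, raster in bands.items():
        pixels = np.asarray(raster, dtype=np.float64)[roof_mask]
        for stat_name, func in STATS.items():
            descriptors[f"{band_name}_{stat_name}"] = float(func(pixels))
    return descriptors


# Toy example (hypothetical): random 8-bit bands and a square roof.
rng = np.random.default_rng(0)
bands = {
    name: rng.integers(0, 256, size=(20, 20))
    for name in ("red", "green", "blue", "nir")
}
bands["luminosity"] = bands["red"] + bands["green"] + bands["blue"]
bands["ndvi"] = (bands["nir"] - bands["red"]) / np.maximum(bands["nir"] + bands["red"], 1)
roof_mask = np.zeros((20, 20), dtype=bool)
roof_mask[5:15, 5:15] = True
descriptors = roof_descriptors(bands, roof_mask)
```

With six bands and five statistics, the resulting dictionary holds the 30 descriptors mentioned above.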
4.3.4 Potential greenery¶
In Figure 6, one can see that the building footprint encompasses not only the roof but also a courtyard and that, on the roof, infrastructure like solar panels is also included in the statistics. Therefore, the extensive roof in Figure 6 is likely to show different statistics than an extensive roof without courtyard and/or without solar panels. To overcome that and focus primarily on the vegetated area, the potential greenery area on each roof was extracted based on NDVI and luminosity threshold values, and then vectorized. The term "potential greenery" is chosen because pixels corresponding to bare materials may still be found in the extracted areas.
To choose the threshold values to apply on the NDVI and luminosity rasters, one can load the rasters in a visualizer and evaluate the effect of thresholds via the styling of the layer.
This potential greenery vector layer offers an alternative to the building footprint layer from the land survey. For instance, Figure 7 shows the potential greenery extracted from the roof. The statistics per roof can be recomputed for this layer.

Figure 7: Extracted potential greenery for NDVI values greater than 0 and for luminosity values lower than 500.
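The thresholding step can be sketched as a simple raster mask. This minimal sketch uses the threshold values of Figure 7 as defaults (NDVI greater than 0, luminosity lower than 500); the function name is hypothetical and the vectorization of the mask is left out.

```python
import numpy as np


def potential_greenery_mask(ndvi, luminosity, ndvi_min=0.0, lum_max=500.0):
    """Return True for pixels that are potentially vegetated.

    A pixel is kept when its NDVI is above ndvi_min and its luminosity
    is below lum_max (default values taken from the caption of Figure 7).
    """
    return (np.asarray(ndvi) > ndvi_min) & (np.asarray(luminosity) < lum_max)


# Toy rasters (hypothetical): vegetation, bright bare material, dark bare material.
ndvi = np.array([[0.4, 0.2, -0.1]])
luminosity = np.array([[300.0, 700.0, 200.0]])
mask = potential_greenery_mask(ndvi, luminosity)
```

Only the first pixel is kept: the second is too bright and the third has a negative NDVI.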
4.3.5 Training and testing¶
The roofs in the ground truth were randomly split into a training set (70%) and a test set (30%) following the original multiclass distribution. Two machine learning algorithms were trained with the scikit-learn 11 library in Python, a random forest (RF) and a logistic regression (LR). The hyperparameters were optimized by means of a grid search strategy during training:
- random forest:
    - number of trees to grow: 200, 500, 800.
    - number of features to test at split: square root of the number of descriptors plus or minus one. This leads to three values to test.
- logistic regression:
    - solver: liblinear, newton-cg
    - regularization technique: l2
    - inverse of regularization strength: 1, 0.5, 0.1.
    - number of iterations: 200, 500, 800.
The random state is fixed before training the algorithms. Classes are weighted in inverse proportion to their frequency in the input data.
When optimizing the training with the GridSearchCV function from the Python library scikit-learn, 5-fold cross-validation is performed and evaluated using balanced accuracy.
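The training setup described above can be sketched with scikit-learn as follows. This is a sketch under stated assumptions, not the project's script: the helper names `split_ground_truth` and `make_searches`, the seed value and the rounding of the square root to the nearest integer are hypothetical choices. The l2 penalty is not listed in the grid because it is the default of `LogisticRegression`.

```python
import math

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split


def split_ground_truth(X, y, seed=42):
    """70 % / 30 % split that preserves the class distribution."""
    return train_test_split(X, y, test_size=0.3, stratify=y, random_state=seed)


def make_searches(n_descriptors, n_trees=(200, 500, 800),
                  lr_iters=(200, 500, 800), seed=42):
    """Build the two grid searches (5-fold CV, scored with balanced accuracy)."""
    # Assumption: the square root is rounded to the nearest integer.
    m = round(math.sqrt(n_descriptors))
    rf_search = GridSearchCV(
        RandomForestClassifier(class_weight="balanced", random_state=seed),
        param_grid={
            "n_estimators": list(n_trees),
            "max_features": [m - 1, m, m + 1],
        },
        scoring="balanced_accuracy",
        cv=5,
    )
    lr_search = GridSearchCV(
        # The l2 regularization is the default penalty of LogisticRegression.
        LogisticRegression(class_weight="balanced", random_state=seed),
        param_grid={
            "solver": ["liblinear", "newton-cg"],
            "C": [1, 0.5, 0.1],
            "max_iter": list(lr_iters),
        },
        scoring="balanced_accuracy",
        cv=5,
    )
    return rf_search, lr_search
```

Calling `fit` on each search with the training set then gives access to `best_params_` and `best_estimator_`; `class_weight="balanced"` corresponds to weighting the classes in inverse proportion to their frequency.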
The trained models are evaluated on the test set and compared using the balanced accuracy, recall and F1-score. For instance, the beneficiaries can opt for the model that misses fewer green rooftops (high recall) or the model that makes fewer errors (high F1-score).
The importance of the descriptors has been evaluated with the permutation_importance function of the scikit-learn Python library. It shuffles the values of each descriptor in turn and measures the change in model performance for a given scorer. In the present case, the balanced accuracy was used. Afterwards, an ablation study of the descriptors is carried out to observe the effective contribution of different sets of descriptors to the model.
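A minimal sketch of this importance ranking is given below. The helper name `rank_descriptors`, the number of repetitions and the seed are assumptions made for the example; only `permutation_importance` and the balanced accuracy scorer come from the text.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression


def rank_descriptors(model, X, y, names, n_repeats=10, seed=42):
    """Rank descriptors by the mean drop in balanced accuracy when shuffled."""
    result = permutation_importance(
        model, X, y,
        scoring="balanced_accuracy",
        n_repeats=n_repeats,
        random_state=seed,
    )
    order = np.argsort(result.importances_mean)[::-1]
    return [(names[i], float(result.importances_mean[i])) for i in order]


# Toy data (hypothetical): only the first descriptor carries information.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)
model = LogisticRegression().fit(X, y)
ranking = rank_descriptors(model, X, y, ["f0", "f1", "f2"])
```

Descriptors whose shuffling barely changes the score are natural candidates for removal in the ablation study.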
The models were optimized once for the binary problem: green or not; once for the multiclass problem: is the roof bare, a green terrace, a spontaneous green roof, an extensive green roof, a lawn or an intensive green roof.
Finally, the best set of descriptors is used to train a model on the SWISSIMAGE RS orthorectified on the building footprints of the TLM, and the metrics are compared with those obtained on the original images.
4.4 Classification by deep learning¶
Since the traditional machine learning approach gave unsatisfactory results for the multiclass classification, as documented in Section 5.1.6, a deep learning (DL) approach was explored, which could also benefit the binary results.
In the following, the implemented data preprocessing and DL model are introduced.
4.4.1 Preprocessing¶
The preprocessing aims at transforming the raw data into a well-structured dataset, ready for use by the model.
The sources used to create the dataset are:
- rasters from the aerial acquisitions in the R, G, B, NIR bands.
- building footprints as vectors.
To obtain a ready-to-use dataset, the rasters are clipped to the bounding box of each polygon in order to get a single image sample per roof, which is then saved in a separate folder for each vegetation class.
Afterwards, multiple processes are applied to the samples to prepare them for the model:
- Cropping: The sample is cropped on the geometry of the building's footprint.
- NDVI layer: The NDVI is computed and added as a fifth layer.
- Size normalization: The sample sizes are homogenized to a predefined standard size (e.g. 256x256, 512x512, 1024x1024, ...) using one of the two methods described in Section 4.4.4.
Moreover, for the image dataset SWISSIMAGE RS orthorectified on TLM, an additional preprocessing step is performed. As shown in Figure 8, the distribution of measurements in the 16-bit range is highly unbalanced. Indeed, the graphs on the left part of the figure zoom in on the first tenth of the range, where most of the information lies. Most of the 16-bit range is used very little, if at all. This could make it difficult for the model to catch small differences when the values are projected from the range 0-\(2^{16}\) to the range 0-1.

Figure 8: Distribution of values on SWISSIMAGE RS orthorectified on TLM. In the graphs on the left, one can see that most of the information is on the first tenth of the 16-bit range.
Multiple methods are compared to narrow down the range of values:
- clip: clipping all values below 0 and above an arbitrary threshold, 10'000.
- norm: normalizing the values between 0 and an arbitrary threshold, 10'000.
- lognorm: normalizing the log of the values between 0 and the log of an arbitrary threshold, 10'000.
- selfmaxnorm: normalizing each image with respect to its max value.
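The four methods can be sketched with NumPy as below. The exact formulas are not given in the text, so this sketch relies on assumptions: values are clipped to the threshold before `norm` and `lognorm`, `log1p` is used instead of a plain logarithm so that a value of 0 stays defined, and an all-zero image is returned unchanged by `selfmaxnorm`.

```python
import numpy as np

THRESHOLD = 10_000  # arbitrary threshold mentioned in the text


def clip(image, threshold=THRESHOLD):
    """Clip the values to the range [0, threshold]."""
    return np.clip(image, 0, threshold)


def norm(image, threshold=THRESHOLD):
    """Scale the values linearly to [0, 1] (assumption: clipped first)."""
    return np.clip(image, 0, threshold) / threshold


def lognorm(image, threshold=THRESHOLD):
    """Scale the log of the values to [0, 1] (assumption: log1p, clipped first)."""
    return np.log1p(np.clip(image, 0, threshold)) / np.log1p(threshold)


def selfmaxnorm(image):
    """Divide each image by its own maximum value."""
    image = np.asarray(image, dtype=np.float64)
    peak = image.max()
    return image / peak if peak > 0 else image


# Toy 16-bit values (hypothetical).
image = np.array([0, 100, 5_000, 10_000, 40_000], dtype=np.uint16)
results = {
    "clip": clip(image),
    "norm": norm(image),
    "lognorm": lognorm(image),
    "selfmaxnorm": selfmaxnorm(image),
}
```

Compared with `norm`, `lognorm` spreads the low values over a larger part of the output range, which is where most of the information lies according to Figure 8.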
4.4.2 Data augmentation¶
As seen in Table 1 from Section 3.2, the representation of the different classes is extremely unbalanced. This is a common difficulty in machine learning since this may cause a model to pay extra attention to an over-represented class. To avoid such a problem, different techniques exist and one of them is to artificially tweak copies of the samples of under-represented classes. Here are the ones used in this project:
- Samples rotation: The most common way of doing data augmentation on images is to rotate them. Since the samples are resized to be squared images, the least represented classes (all but bare) are rotated 3 times by angles of 90°, 180° and 270°. This allows the multiplication of those class sizes by 4.
- Samples flipping: Another geometric transformation is flipping along the vertical and horizontal axes. By doing so, and combining it with the rotation, it allows to end up with up to 16 times the number of initial samples.
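The combination of rotations and flips can be sketched as follows, assuming square samples stored as arrays of shape (height, width, bands); the function name `augment` is hypothetical. Note that some of the 16 combinations coincide: rotating by 180° gives the same image as flipping along both axes, so a sample without symmetry yields 8 distinct images among the 16 variants.

```python
import numpy as np


def augment(sample):
    """Return the 16 rotation/flip combinations of a square sample.

    sample: array of shape (height, width, bands) with height == width.
    """
    variants = []
    for k in range(4):  # rotations by 0, 90, 180 and 270 degrees
        rotated = np.rot90(sample, k=k, axes=(0, 1))
        variants.append(rotated)                         # rotation only
        variants.append(np.flip(rotated, axis=1))        # flip left-right
        variants.append(np.flip(rotated, axis=0))        # flip up-down
        variants.append(np.flip(rotated, axis=(0, 1)))   # both flips
    return variants


# Toy sample (hypothetical): 4 x 4 pixels, 5 bands, all values different.
sample = np.arange(4 * 4 * 5).reshape(4, 4, 5)
variants = augment(sample)
n_distinct = len({v.tobytes() for v in variants})
```

Keeping only the four rotations and one flip of each would therefore produce the same set of distinct images with half the storage.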
4.4.3 Model¶
The choice for the model was based on the well-known DeepLabV3 12, reusing its ASPP (Atrous Spatial Pyramid Pooling) module and processing its output signal in order to achieve classification instead of segmentation. The pipeline, shown in Figure 9, is made of 3 blocks: the backbone, the ASPP and the RMLP (called Residual block in the diagram).

Figure 9: Architecture of the multiclass deep learning model implemented to process images, made of 3 blocks: the backbone, the ASPP and the RMLP.
4.4.3.1 The backbone¶
The backbone is made of a series of levels, each of them consisting of a series of blocks. Each of those blocks is made of a convolutional layer and a batch normalization layer. After the last block of a level, a max-pool layer is applied in order to reduce the signal's spatial size and increase its depth. In this way, each level specializes in the recognition of patterns of increasing complexity.
To allow fine-tuning, the backbone has been designed so that the number of levels and the number of layers within each level are customizable.
In order to find the best configuration, the different combinations of the following parameters were tested:
- number of levels: 2, 3
- number of layers: 1, 2, 3
4.4.3.2 The ASPP block¶
The ASPP block applies dilated convolution layers in parallel before concatenating their outputs into a single signal. This method enlarges the receptive field of the model while keeping it light.
The dilating factor of each convolution layer is scalable in order to adapt the ASPP block to the input signal's size.
In order to find the best configuration, the following sets of dilating factors were compared: [4, 8, 12], [8, 16, 32], [12, 24, 36], [4, 16, 36] and [16, 32, 48].
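To give an idea of what these dilating factors mean, the short sketch below computes the extent covered by a single dilated convolution, using the standard relation k_eff = k + (k - 1)(d - 1). A 3x3 kernel is assumed here, as in the original DeepLabV3 ASPP; the kernel size used in this project is not stated in the text.

```python
def effective_kernel_size(kernel_size, dilation):
    """Side length (in pixels) covered by a dilated convolution kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


# Sets of dilating factors compared in the project.
DILATION_SETS = [[4, 8, 12], [8, 16, 32], [12, 24, 36], [4, 16, 36], [16, 32, 48]]

# Assumption: 3x3 kernels, as in the original DeepLabV3 ASPP module.
extents = {
    tuple(rates): [effective_kernel_size(3, d) for d in rates]
    for rates in DILATION_SETS
}
for rates, sizes in extents.items():
    print(rates, "->", sizes)
```

Whatever the dilating factor, a 3x3 kernel keeps 9 weights per input/output channel pair, which is why the receptive field grows without making the model heavier.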
4.4.3.3 The RMLP¶
In order to transform the output signal of the ASPP block into a prediction among the different classes, a Residual Multi-Layer Perceptron (RMLP) is appended.
This block's aim is to transform the multi-dimensional signal into a vector-like signal that is reduced layer after layer until a vector whose size equals the number of classes is reached. Finally, this vector goes through a softmax function, which produces a probability for each class.
4.4.3.4 Confidence analysis¶
In order to better understand the behavior of the model regarding its predictions, an analysis of the confidence is performed. Indeed, an overconfident model tends to be highly sensitive to noise and generalizes poorly on slightly different data. Moreover, it makes it difficult to use confidence-based techniques like threshold tuning, where one adjusts the cutoff value of the classification to optimize performance metrics like precision, recall, or F1-score.
For all those reasons, it might be interesting to implement mechanisms, like label smoothing, that help the model make less drastic predictions.
4.4.3.5 Label smoothing¶
This method is implemented in the model to reduce the confidence of its predictions by slightly modifying the labels during the training phase. Instead of providing a one-hot vector with 1.0 for the right class and 0.0 for all the others, it slightly alters this vector by removing a small value from the 1.0 and adding a small value to all the others.
Typically, the label vector will be replaced by the following:
where \(C\) is the number of classes and \(y\) is the sample label.
The model is trained with a label smoothing of \(\epsilon=10\%\).
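As an illustration, the sketch below applies the commonly used formulation of label smoothing, \(y_{smooth} = (1 - \epsilon)\, y_{onehot} + \epsilon / C\). Whether the project uses exactly this variant is an assumption (another common one spreads \(\epsilon\) over the \(C - 1\) wrong classes only); the function name `smooth_labels` is hypothetical.

```python
import numpy as np


def smooth_labels(labels, n_classes, epsilon=0.1):
    """Turn integer labels into smoothed one-hot vectors.

    Assumed formulation: (1 - epsilon) * one_hot + epsilon / n_classes.
    """
    onehot = np.eye(n_classes)[np.asarray(labels)]
    return (1.0 - epsilon) * onehot + epsilon / n_classes


# Six classes as in the project, epsilon = 10 %; the sample belongs to class 2.
smoothed = smooth_labels([2], n_classes=6, epsilon=0.1)
```

With these values, the target of the correct class becomes 0.9 + 0.1/6 and each other class receives 0.1/6, so the targets still sum to 1.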
4.4.4 Image resizing type¶
A restriction of the model described is that the input samples need to have a normalized shape. Therefore, the images need to be resized to a common size. In order to do so, two techniques were investigated:
- Reshaping: The reshaping mode uses the function transform.resize of the library scikit-image. The roof then fills the whole sample, whatever its original size. However, this method loses the original proportions of the image.
- Padding: The padding mode keeps the original image's size if it is smaller than the targeted one and resizes it if it is bigger. The image is then placed at the center of the final image with a black padding to reach the predefined sample's size. Doing so, the proportions of the original image are kept but, as can be seen in Figure 10, the information of small roofs is contained in a small fraction of the final sample.

Figure 10: Comparison of the two different methods of resizing images into the predefined samples shape.
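The two modes can be sketched with NumPy as follows. To keep the example self-contained, a simple nearest-neighbour resampling (`nearest_resize`, a hypothetical helper) stands in for transform.resize of scikit-image; samples are assumed to be arrays of shape (height, width, bands).

```python
import numpy as np


def nearest_resize(image, out_h, out_w):
    """Nearest-neighbour resampling (stand-in for skimage's transform.resize)."""
    h, w = image.shape[:2]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return image[rows][:, cols]


def reshaping_mode(image, size):
    """Stretch the image to size x size; the original proportions are lost."""
    return nearest_resize(image, size, size)


def padding_mode(image, size):
    """Center the image on a black size x size canvas, shrinking it if needed."""
    h, w = image.shape[:2]
    scale = min(1.0, size / max(h, w))
    if scale < 1.0:
        # Shrink with the same factor on both axes to keep the proportions.
        image = nearest_resize(image, max(1, int(h * scale)), max(1, int(w * scale)))
        h, w = image.shape[:2]
    canvas = np.zeros((size, size) + image.shape[2:], dtype=image.dtype)
    top, left = (size - h) // 2, (size - w) // 2
    canvas[top:top + h, left:left + w] = image
    return canvas


# Toy samples (hypothetical): a small roof and a large one, 5 bands each.
small = np.ones((10, 20, 5), dtype=np.float32)
large = np.ones((100, 50, 5), dtype=np.float32)
small_padded = padding_mode(small, 32)
large_padded = padding_mode(large, 32)
small_reshaped = reshaping_mode(small, 32)
```

In the padded version of the small roof, only 200 of the 1024 pixels carry information, which illustrates the drawback mentioned above.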
4.4.5 Threshold on sample size¶
Given that the smallest samples hold less information and look the most pixelated when resized, the following hypothesis was tested:
In order to do so, a threshold on the area (in pixels) of the roofs was set. This means that, in the preprocessing, all images with an area smaller than the threshold were dismissed and no sample was made from them.
4.4.6 Binary classification using deep learning solution¶
In addition to the multiclass classification, the architecture was tested on the binary classification task to be compared to the ML method. In order to do so, small modifications were made to the structure of the model itself, as well as to the preprocessing pipeline. Indeed, the final prediction is made by a softmax function, which is not ideal for binary classification. Hence a flag was added to the model in order to make this final prediction using a sigmoid function if the model is in binary mode.
Regarding the preprocessing, only the mapping of the input labels to the correct categories has to be changed.
4.4.7 Multi-modal model¶
The current model focuses solely on spatial information and could benefit from other types of data. An attempt was made to modify the pipeline so that it could be fed with two different types of features: the multi-band images, plus the global statistics for each band, in the form of a vector of size 30 (same features used with the machine learning solution, see Section 4.3.3).
Such a model is called multi-modal and its architecture is shown in Figure 11.
However, the project encountered issues that could not be resolved during the allotted time. Hence, the decision was made to keep it on a side branch on GitHub for anyone interested in completing it.

Figure 11: Architecture of the multi-modal model implemented to process images and global statistics.
4.4.8 Experiments¶
Finally, the sequence of planned experiments should make it possible to find the best model for the multiclass classification of green roofs, which can then be reused to train a binary classification.
The best data source between SWISSIMAGE RS 8-bit and SWISSIMAGE RS orthorectified on the TLM is chosen after training the basic model. Similarly, a study on the sample size is conducted to fix it before fine-tuning.
The strategy of label smoothing is tested on top, before fine-tuning the model parameters: number of levels, number of layers, and dilating factors of the ASPP module.
The best model is evaluated according to the confusion matrix, the metrics, the training evolution, and the uncertainty and calibration measures.
Furthermore, the best configuration is also trained for binary classification.
In order to make possible the inference of new data with a trained model, a small pipeline was created and the preprocessing and dataset scripts were modified.
5 Results and discussion¶
Results regarding the binary classification by machine learning and the classifications by deep learning are presented and discussed directly.
5.1 Binary classification by machine learning¶
Before presenting the results obtained by machine learning, the intermediate results from the data preprocessing steps are shown.
5.1.1 Data preprocessing¶
Table 2 summarizes the composition of the GT after the preprocessing steps: inner buffering of 1 m and masking with the CHM. The ratios between the classes remain in the same order of magnitude as in the original dataset. It is also worth noting that the inner buffer of 1 m leads to exclusion of 65 roofs narrower than 2 m, which are mainly small bare surfaces of tiny built parts attached to buildings or garden sheds. 95 more roofs are excluded by the mask for the overhanging vegetation.
Table 2: Composition of the ground truth after preprocessing steps.
Class | GT original | GT after inner buffering of 1 m | GT after inner buffering of 1 m and after masking with the CHM |
---|---|---|---|
Bare | 2977 | 2915 | 2830 |
Extensive | 445 | 445 | 445 |
Spontaneous | 126 | 124 | 122 |
Lawn | 87 | 86 | 82 |
Intensive | 82 | 82 | 78 |
Terrace | 67 | 67 | 67 |
Total | 3784 | 3719 | 3624 |
Furthermore, it appears that the mask derived from the CHM did not exclude all the pixels corresponding to overhanging vegetation. Therefore, an additional subset of bare roofs with a mean NDVI value greater than 0.05 has been excluded from the dataset. This affected 43 bare roofs.
With this new version of the ground truth, the statistics on the red, green, blue, NIR, NDVI and luminosity bands were computed per roof. The results for the NDVI are given in Figure 12. One can observe, on the boxplot of the NDVI means, that the interquartile ranges of the classes bare (b) and terraces (t) largely overlap. That is also the case between the terraces and spontaneous (s) classes, and between spontaneous and extensive (e) roofs. Furthermore, the intensive (i) class shows a wide interquartile range which overlaps with those of the three aforementioned classes and lawn (l). Similar observations can be made about the boxplots of the medians. The distribution of the minimum and maximum pixel values per roof per class also shows that similar values are found between classes, though the distributions for the classes bare, spontaneous and extensive have a lower interquartile range than the others. Finally, from the distributions of the standard deviation, two groups are distinguishable: high standard deviations for the terraces, lawn and intensive classes; low ones for the bare, spontaneous and extensive classes. The former roofs often have a mix of bare materials and healthy vegetation, whereas the latter are often homogeneously covered and the spontaneous and extensive vegetation may be weak.

Figure 12: Boxplots of the statistics for the NDVI pixels per roof in the study area per class.
In Appendices 7.1, 7.2, 7.3, 7.4 and 7.5, the interested reader can find boxplots similar to those in Figure 12 for the luminosity, near infrared, red, green and blue bands respectively. A general conclusion is that the descriptors contain information even if overlap is observable between classes. ML algorithms therefore have the potential to learn patterns from these data.
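As an illustration of the per-roof descriptors discussed above, the following minimal sketch computes the NDVI from red and NIR pixel values and summarizes it by mean, median, minimum, maximum and standard deviation. The function names and the toy pixel values are hypothetical, and the use of the population standard deviation is an assumption; this is not the project's actual code.

```python
from statistics import mean, median, pstdev


def ndvi(nir, red):
    """Normalized difference vegetation index for one pixel."""
    total = nir + red
    # Guard against division by zero on fully dark pixels.
    return (nir - red) / total if total else 0.0


def roof_descriptors(nir_pixels, red_pixels):
    """Summary statistics of the NDVI over the pixels of one roof."""
    values = [ndvi(n, r) for n, r in zip(nir_pixels, red_pixels)]
    return {
        "mean": mean(values),
        "median": median(values),
        "min": min(values),
        "max": max(values),
        "std": pstdev(values),  # assumption: population standard deviation
    }


# Toy roof with four pixels (hypothetical digital numbers).
stats = roof_descriptors(nir_pixels=[120, 200, 90, 150], red_pixels=[100, 100, 90, 50])
print(stats)
```

The same summary can be repeated for each band (luminosity, NIR, red, green, blue) to build the full descriptor vector of a roof.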
5.1.2 Parameter optimization and ablation of the descriptors¶
The best sets of hyperparameters after optimization of the random forest and logistic regression are given in Tables 3 and 4 for the sets of descriptors tested.
In Table 3, the best results of the runs made with the random forest were achieved with all the descriptors, 800 trees grown and 6 descriptors tested at each split. During the optimization phase, the best model for each tested configuration has been kept. The results are given in Appendix 7.6. The evaluation of the models by means of k-fold cross-validation indicated that all sets of parameters performed similarly: 0.01 of difference in the k-fold mean balanced accuracy. This indicates that the range of parameters to test was suitable to extract information from the data.
Table 3: Metrics for the test set trained with random forest.
Descriptors | # of trees | # of descriptors | Balanced accuracy | Recall | F1-score |
---|---|---|---|---|---|
ndvi+lum+nrgb | 800 | 6 | 0.83 | 0.69 | 0.78 |
lum+nrgb | 200 | 5 | 0.82 | 0.65 | 0.77 |
nrgb | 800 | 5 | 0.83 | 0.67 | 0.78 |
rgb | 200 | 5 | 0.80 | 0.60 | 0.73 |
Table 4 shows that the best model for the runs made with the logistic regression is obtained with 200 iterations, a penalty coefficient of 1 and the newton-cg solver. In Appendix 7.7, the rest of the optimized models can be found. Again, one can notice the similarity of performances (0.01 of difference in the mean balanced accuracy).
Table 4: Metrics for the test set trained with logistic regression.
Descriptors | Iterations | C | Solver | Balanced accuracy | Recall | F1-score |
---|---|---|---|---|---|---|
ndvi+lum+nrgb | 200 | 1.00 | newton-cg | 0.89 | 0.86 | 0.80 |
lum+nrgb | 200 | 1.00 | newton-cg | 0.89 | 0.87 | 0.81 |
nrgb | 200 | 1.00 | liblinear | 0.89 | 0.86 | 0.80 |
rgb | 200 | 0.50 | liblinear | 0.87 | 0.85 | 0.77 |
The permutation importance results indicated that the important descriptors are different for the random forest and the logistic regression. In the random forest, the six most important descriptors are:
- NDVI standard deviation
- NDVI mean
- standard deviation of the blue pixels
- NDVI median
- NDVI maximum
- NIR mean
The rest of the important descriptors are given in Appendix 7.8. Indeed, in the ablation study shown in Table 3, the recall goes from 0.69 to 0.65 after removing the descriptors derived from the NDVI pixels (ndvi+lum+nrgb to lum+nrgb). It decreases further from 0.67 to 0.60 when removing the descriptors derived from the NIR band (nrgb to rgb). A lot of information is in the NIR band and by extension in the NDVI.
Regarding the logistic regression, the six most important descriptors are:
- luminosity median
- standard deviation of the blue pixels
- mean of the red pixels
- mean of the green pixels
- luminosity standard deviation
- median of the green pixels
The rest of the important descriptors are given in Appendix 7.9. Those results highlight the fact that the logistic regression learns differently from the data than the random forest. Moreover, in the ablation study in Table 4, the recall decreases by only 0.01 and the F1-score by 0.03 between the full model (ndvi+lum+nrgb) and the rgb model. These results indicate that the NIR band, while not providing much information for green roof detection, helps slightly to avoid false positives.
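The idea behind permutation importance can be sketched as follows: shuffle one descriptor column at a time and measure how much the score drops. The code below is a minimal, self-contained illustration with a stand-in classifier and toy data; all names and values are hypothetical, and plain accuracy is used as the score by assumption. It is not the implementation used in the project.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the right label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, n_repeats=10, seed=0):
    """Mean drop in accuracy when one descriptor column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)
            # Rebuild the rows with only the current column permuted.
            permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
            drops.append(baseline - accuracy(model, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical toy data: column 0 mimics an NDVI mean, column 1 is noise.
rows = [[0.1 * i - 1.0, float(i % 3)] for i in range(20)]
labels = [int(r[0] > 0) for r in rows]


def model(row):
    # Stand-in classifier that only looks at column 0.
    return int(row[0] > 0)


importances = permutation_importance(model, rows, labels)
print(importances)
```

A descriptor that the model ignores gets an importance of zero, whereas shuffling a descriptor the model relies on degrades the score. This is why two models trained on the same descriptors can rank them differently.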
Furthermore, when comparing the recall in Tables 3 and 4, one observes that the LR is more sensitive to the green class than the RF (0.86 vs 0.69). However, the balanced accuracy, which is the mean of the recall for the bare class and for the green class, indicates that the recall for the bare class is higher in the RF than in the LR (about 0.97 vs 0.92, obtained as twice the balanced accuracy minus the green recall).
Therefore, the mean of the probability estimates by LR and of the predicted class probabilities by RF for the green class has been computed to take advantage of both ways of learning from the data. For values higher than 0.5, the corresponding roofs have been considered as green; otherwise, they were assigned to the bare class. The metrics for the RF, LR and combination of both are summarized in Table 5. One can appreciate the stability of the balanced accuracy and the increase in F1-score; although more green roofs are wrongly classified than by the LR alone, the overall classification improves. Given the class imbalance in reality, this leads to far fewer errors in the outputs.
Table 5: Metrics obtained for the test set after training with all the statistics per roofs.
Model | Balanced accuracy | Recall | F1-score |
---|---|---|---|
RF | 0.83 | 0.69 | 0.78 |
LR | 0.89 | 0.86 | 0.80 |
RF+LR | 0.89 | 0.81 | 0.83 |
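The combination rule described above (mean of the two green-class probabilities, thresholded at 0.5) can be sketched as follows. The function name and the example probabilities are hypothetical; in practice the inputs would be the per-roof green-class probabilities returned by each trained model.

```python
def combine_predictions(rf_proba_green, lr_proba_green, threshold=0.5):
    """Average the green-class probabilities of both models and threshold them."""
    labels = []
    for p_rf, p_lr in zip(rf_proba_green, lr_proba_green):
        mean_proba = (p_rf + p_lr) / 2
        # Strictly higher than the threshold -> green, otherwise bare.
        labels.append("green" if mean_proba > threshold else "bare")
    return labels


# Hypothetical probabilities for three roofs.
rf = [0.30, 0.80, 0.40]
lr = [0.90, 0.90, 0.50]
combined = combine_predictions(rf, lr)
print(combined)
```

In the first roof, the RF alone would predict bare while the LR is confident in green; averaging lets the more confident model decide.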
5.1.3 Performance on SWISSIMAGE RS and on SWISSIMAGE RS orthorectified on TLM¶
Finally, Table 6 shows the metrics of the test set with models trained on the descriptors derived from the projected orthophotos on rooftops. One can observe that a similar range of performances is reached. Therefore, it seems that the tilt of the buildings in the orthoimage and its implication on the calculation of the descriptors is negligible, or that the application of a negative buffer on the land survey footprint geometries has made it possible to focus on the roofs and not include too much of the inclined facade in the descriptors.
Table 6: Metrics for the test set on SWISSIMAGE RS and on SWISSIMAGE RS orthorectified on TLM.
Model | Images | Balanced accuracy | Recall | F1-score |
---|---|---|---|---|
RF | SWISSIMAGE RS 8-bit | 0.87 | 0.77 | 0.84 |
RF | SWISSIMAGE RS orthorectified on TLM | 0.89 | 0.81 | 0.87 |
LR | SWISSIMAGE RS 8-bit | 0.91 | 0.89 | 0.85 |
LR | SWISSIMAGE RS orthorectified on TLM | 0.89 | 0.85 | 0.83 |
5.1.4 Results on potential greenery areas¶
In a second step, the focus was put on potential greenery areas. The threshold values to apply on the NDVI and luminosity bands have been set to 0 and 500 respectively, after visualizing in QGIS the rasters masked with these values. An illustration is given in Figure 13. Pixels with an NDVI value smaller than 0 are overlaid with transparent blue and pixels with a luminosity value greater than 500 are overlaid with transparent red. The bright green pixels correspond to the potential vegetation identified.

Figure 13: Visualization of the threshold effect on the NDVI and luminosity rasters. Pixels with an NDVI value smaller than 0 are overlaid with transparent blue and pixels with a luminosity value greater than 500 are overlaid with transparent red. The vibrant green pixels correspond to the potential vegetation.
When referring to Figure 12 displaying the boxplots corresponding to the statistics of the NDVI band computed per entire roof, one can observe that a large majority of roofs have at least one pixel with an NDVI value greater than 0. When filtering the surface of the roofs according to NDVI and luminosity to focus on potential vegetated areas, 3189 out of 3624 roofs were indeed still included in the analysis as shown in Table 7.
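The thresholding illustrated in Figure 13 can be sketched as a simple per-pixel mask. The snippet below is a minimal illustration on toy rasters stored as nested lists; the function name and values are hypothetical, and the handling of pixels exactly equal to a threshold is an assumption.

```python
NDVI_MIN = 0     # pixels below this NDVI are discarded
LUM_MAX = 500    # pixels above this luminosity are discarded


def potential_greenery_mask(ndvi, lum):
    """Return a boolean grid: True where a pixel may hold vegetation."""
    return [
        [n >= NDVI_MIN and v <= LUM_MAX for n, v in zip(ndvi_row, lum_row)]
        for ndvi_row, lum_row in zip(ndvi, lum)
    ]


# Toy 2x3 rasters (hypothetical values).
ndvi = [[-0.2, 0.1, 0.3], [0.05, -0.1, 0.4]]
lum = [[300, 450, 620], [200, 100, 500]]
mask = potential_greenery_mask(ndvi, lum)
print(mask)
```

The roof descriptors are then computed only on the pixels where the mask is true.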
Table 7: Comparison of the composition of the ground truth before and after filtering for the potential greenery area.
Class | GT after inner buffer of 1 m and after masking with the CHM | GT after filtering on NDVI and luminosity | Difference |
---|---|---|---|
Bare | 2830 | 2397 | 433 |
Extensive | 445 | 444 | 1 |
Spontaneous | 122 | 121 | 1 |
Lawn | 82 | 82 | 0 |
Intensive | 78 | 78 | 0 |
Terrace | 67 | 67 | 0 |
Total | 3624 | 3189 | 435 |
Once again, the statistics of the NDVI, luminosity and NRGB pixels were computed per class, but this time on the potential greenery. The boxplot of the NDVI mean of the class terraces in Figure 14, with a median around 0.18 instead of -0.18 in the boxplot of the NDVI mean per entire roof (see Figure 12), illustrates that the potential greenery layer makes it possible to focus on the vegetated part of the terraces. An increase is also noted in the other classes, but the increase for the terraces is particularly interesting, as the median of the distribution reaches those of lawns and intensive roofs, whereas it was similar to the median of the bare class before (see Figure 12).
The bare class benefits also from the threshold, with a median for the distribution of the NDVI similar to those of the classes spontaneous and extensive. Moreover, Figure 15 shows the boxplots for statistics on luminosity where it can be seen that the medians of luminosity are generally lower on the bare class than the others. This corresponds to the fact that higher NDVI values are mostly to be found on the part of roofs in shadow as highlighted by Figure 15.

Figure 14: Boxplots of the statistics for the NDVI pixels per potential greenery in the study area per class.

Figure 15: Boxplots of the statistics for the luminosity pixels per potential greenery in the study area per class.
The boxplots of statistics on the NRGB bands are given in Appendices 7.10, 7.11, 7.12 and 7.13.
Since the statistics showed different characteristics on the potential greenery area than on the entire roofs, another optimization was performed on the training based on the potential greenery area. The corresponding optimized parameters and metrics on the test set are shown in Table 8. For the random forest, the best model is reached for 5 features to test at each split and 200 trees to grow. Regarding the logistic regression, 200 iterations are performed with a penalty coefficient of 1 and the newton-cg solver. Again, the combination of predictions from the RF and the LR is computed and the corresponding metrics are also given in Table 8.
In the last row of Table 8, the RF and LR on the entire roofs have been trained and evaluated on the same dataset as the potential greenery one, then combined. It is worth noting that the model trained with the descriptors computed on the potential greenery surfaces performs better at detecting the green roofs than the models trained with the descriptors computed over the entire roofs (recall of 0.87 vs 0.84), but overall leads to more bare roofs predicted as green (F1-score of 0.85 vs 0.86).
Table 8: Metrics for the test set for statistics over the potential greenery area.
Model | Balanced accuracy | Recall | F1-score | Optimized parameters |
---|---|---|---|---|
RF | 0.87 | 0.76 | 0.82 | # of features = 5 # of trees = 200 |
LR | 0.88 | 0.87 | 0.80 | C=1 # of iterations = 200 solver = newton-cg |
RF+LR | 0.91 | 0.87 | 0.85 | |
RF+LR on entire roofs | 0.90 | 0.84 | 0.86 | |
5.1.5 Results and use¶
Results of the best models trained on the entire roofs and on the potential greenery are shown in Figure 16. According to the metrics, the user interested in detecting green roofs should use the combination of RF and LR trained on the potential greenery, since this model is more sensitive to green roofs and produces a limited number of wrong predictions (second best F1-score obtained). The geometry of the potential greenery may help to speed up the control when zooming on the roofs, whereas aggregation of the results on the original geometry of roofs delivers a better overview of the situation.

Figure 16: Results on an inference area.
5.1.6 Multiclass classification insights¶
Results for the multiclass classification with traditional machine learning were not satisfactory. Confusion between classes indicated that scarce vegetation for terraces, spontaneous and extensive roofs leads to confusion with bare roofs. From these tests, it appears that global statistics of the samples are not sufficient for the task. Hence, a strategy including the spatial structure of the rooftop might be needed (e.g. deep-learning approach).
5.2 Multiclass classification by deep learning¶
As with the binary classification by machine learning, the results and discussions regarding the different tests on the deep learning solution are presented in this section.
Moreover, the beneficiaries of this project labeled more samples for this approach so that the model could generalize on a wider span of cases. Table 9 shows the distribution of the final dataset.
Class | Bare | Terrace | Spontaneous | Extensive | Lawn | Intensive |
---|---|---|---|---|---|---|
Count [-] | 2756 | 259 | 344 | 672 | 93 | 100 |
Frac [%] | 65.25 | 6.13 | 8.14 | 15.91 | 2.20 | 2.34 |
By comparing Table 9 with Table 1, it can be seen that the new dataset significantly increased the number of roofs in under-represented classes - in particular in the terrace, spontaneous and extensive classes - and manages to balance the representation of each class a bit better. While this dataset is still very small for such a task, these new samples will greatly help the model to capture more of the distinguishing signature of every class.
5.2.1 SWISSIMAGE RS 8-bit vs SWISSIMAGE RS orthorectified on TLM¶
As in the machine learning part, both available image datasets - SWISSIMAGE RS 8-bit and SWISSIMAGE RS orthorectified on TLM - have been tested in turn as input to the DL model (3 levels of 3 layers and ASPP dilation rates of [4, 8, 12]). The chosen sample size was 512 px and the resizing type was reshaping.
After initial tests to narrow down the range of values in the SWISSIMAGE RS orthorectified on TLM, the clipping method quickly proved to be the worst method by far and has hence been dismissed from the trainings. Figure 17 shows the training results for the different normalization techniques, plus a training without any range limitation technique. It can be seen that the best results were reached, regarding almost every metric, using no range limitation method (none). This shows that the small differences of pixel values in 16-bit were large enough to be caught by the model without generating numerical artefacts. Hence, the best method was the trivial one: not altering the values at all before mapping them to the range [0, 1].

Figure 17: Results on multiple metrics for different range limitation methods. The threshold for each method was set to 10'000. The best results were reached, regarding almost every metric, using no range limitation method (none).
5.2.1.2 Comparison¶
Once the best configuration for the SWISSIMAGE RS orthorectified on TLM specifications was found, a comparison was made between this dataset and the SWISSIMAGE 8-bit dataset in order to define which data structure best fits the task at hand. As seen in Table 10, the results of both trainings were quite similar, highlighting the fact that both structures carry enough relevant information for the model to classify accurately.
Dataset | OA | Recall | F2 | F1 | F0.5 | Precision |
---|---|---|---|---|---|---|
SWISSIMAGE 8-bit | 0.81 | 0.69 | 0.62 | 0.56 | 0.53 | 0.53 |
SWISSIMAGE RS orthorectified on TLM | 0.82 | 0.64 | 0.56 | 0.53 | 0.52 | 0.54 |
However, the SWISSIMAGE 8-bit dataset showed slightly better results, notably regarding the F2-score, which is the most relevant metric regarding the needs of this project. For this reason, and because it is lighter (encoded in 8-bit) and a more universal data type, the decision was made to continue this project with only the SWISSIMAGE 8-bit dataset.
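For readers unfamiliar with the F2, F1 and F0.5 columns, the general F-beta score can be sketched as below. The precision and recall values are hypothetical single-class values chosen only for illustration. The scores in Table 10 are presumably averaged over classes, so they cannot be recomputed from the averaged precision and recall shown there.

```python
def f_beta(precision, recall, beta):
    """F-beta score: recall is weighted beta times as much as precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical single-class values, chosen only for illustration.
p, r = 0.5, 0.8
for beta in (0.5, 1, 2):
    print(f"F{beta}: {f_beta(p, r, beta):.3f}")
```

With a recall higher than the precision, the F2-score is the largest of the three, which is why it suits a project where missing a green roof costs more than a false alarm.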
5.2.2 Sample size¶
The sample size is a simple, but very important parameter.
Figure 18 shows metrics from trainings done with the same configuration and different sample sizes. One can observe that all metrics are better for the size 1024, which suggests that the bigger the samples, the more accurate the model. However, 1024 seems to be the largest reasonable size: the size 2048 was also tested during this project, but it generated feature maps that were too heavy during training and saturated the 24 GB of VRAM of the graphics card even with batches of 2 samples.

Figure 18: Results on multiple metrics for different sample sizes. The bigger the samples, the more accurate the model is.
5.2.2.1 Threshold on sample size¶
Multiple trainings were done with different thresholds on the sample size and the results were then compared, as shown in Figure 19.

Figure 19: Results on multiple metrics for different sample size thresholds.
Given that applying the threshold giving the best results, 2000 px, was more or less as efficient as applying no threshold at all, and since such a threshold would have dismissed all roofs smaller than 20 m² (1 px = 10 cm, i.e. 0.01 m² per pixel), the logical choice was made to continue without this filter.
5.2.3 Confidence analysis¶
An interesting output to look at, in addition to the prediction, is the confidence in the prediction. Figure 20 shows the partition of ranges of confidence per class. It can be seen that the model has almost 100% confidence (99%-100%) for more than half of the bare samples. For the other classes, this range is less represented; it remains more frequent in the terrace, spontaneous and extensive classes than in the lawn and intensive classes.

Figure 20: Partition of the confidence in predictions for the different classes. The model has almost 100% confidence (99%-100%) for more than half of the bare samples. Regarding the other classes, this range is less represented, in particular for the lawn and intensive classes.
5.2.3.1 Label smoothing¶
After adding label smoothing with \(\epsilon=10\%\), the model was retrained and the resulting confidence in prediction is shown in Figure 21.

Figure 21: Comparison of confidence with and without label smoothing
Adding a small smoothing factor \(\epsilon\) already reduced the number of predictions in the 99%-100% range by 16%. The choice was then made to continue with this parameter.
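To make the effect of \(\epsilon\) concrete, the sketch below builds a smoothed target for one sample. It assumes the common convention in which a fraction \(\epsilon\) of the probability mass is spread uniformly over all classes; the exact variant used in the project is not specified here, and the function name is hypothetical.

```python
def smooth_labels(class_index, n_classes, epsilon=0.1):
    """Turn a hard label into a smoothed target distribution."""
    # Assumed convention: (1 - epsilon) * one-hot + epsilon / n_classes.
    off_value = epsilon / n_classes
    target = [off_value] * n_classes
    target[class_index] = 1.0 - epsilon + off_value
    return target


# Six roof classes: bare, terrace, spontaneous, extensive, lawn, intensive.
target = smooth_labels(class_index=3, n_classes=6, epsilon=0.1)
print([round(t, 4) for t in target])
```

Because the target for the true class is no longer exactly 1, the model is not rewarded for pushing its confidence towards 100%, which is consistent with the drop in 99%-100% predictions reported above.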
5.2.4 Model finetuning¶
Three parameters of the model have been finetuned: the number of levels and the number of layers in the backbone, and the dilation rates in the ASPP module.
5.2.4.1 Backbone module¶
In order to find the best width and depth of the backbone, 6 different trainings were done, varying the number of levels and the number of layers. Corresponding F2-scores are shown in Figure 22. On the x-axis are the different tested values for the number of layers (width) and on the y-axis are the different tested values for the number of levels (depth). However, there is no clear winning configuration. The most plausible explanation is that the model already performs as well as it can with this dataset; training on a bigger dataset would help to distinguish these configurations from each other.
Regarding the current project, the configuration with 3 levels and 1 layer was chosen since it allows catching complex patterns and holds, in our opinion, a higher potential.

Figure 22: Finetuning of the depth and width of the backbone, evaluated on the F2-score. On the x-axis are the different tested values for the number of layers (width) and on the y-axis are the different tested values for the number of levels (depth). All configurations gave relatively similar results.
5.2.4.2 ASPP module¶
Furthermore, Figure 23 shows the results of 5 trainings done with different sets of dilation rates. Even though a wide range of values has been tested, the results stay quite close. The training that provided the best recall, F1-score and F2-score is the one with the smallest set of dilation rates ([4, 8, 12]). Therefore, this is the one kept for the trainings. However, regarding the metrics on the precision-side, other configurations become competitive.

Figure 23: Results on multiple metrics for different sets of dilation rates on the ASPP module of the model. The training with the best recall, F1-score and F2-score is the one with the smallest set of dilation rates ([4, 8, 12]). However, considering the precision, other configurations become competitive.
5.2.5 Best configuration¶
After finetuning all those degrees of freedom, a final model was trained choosing a configuration that would produce the best possible results while still producing a model not too demanding in terms of setup.
5.2.5.1 Configuration¶
The configuration used for the final model is the following:
- Sample size: 1024
- Split Training set / Validation set: 70% / 30%
- Training length: 100 epochs
- Label smoothing factor: 0.1
- ASPP atrous rates: [4, 8, 12]
- Backbone - number of levels: 3
- Backbone - number of layers: 1
The reason for the choice of 100 epochs is that, while the performances of the model on the validation set tend to plateau after 20-30 epochs, some training runs showed slight improvement after 50 epochs.
5.2.5.2 Confusion matrix and metrics¶
Figure 24 shows:
- on the left, the confusion matrix: it gives the producer's accuracy (row-normalized) of the validation set for each class, which corresponds to the recall.
- on the right, the scores regarding different metrics for each class and at a global scale. The global score is the average of the scores of all the classes, so that the data imbalance does not affect these final scores.

Figure 24: Confusion matrix and metrics of the final training with optimal configuration.
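The row normalization and the class-averaged global score can be sketched as follows. The counts are hypothetical and the function names are illustrative; the sketch assumes every class has at least one sample.

```python
def row_normalize(matrix):
    """Divide each row (true class) by its total number of samples."""
    return [[n / sum(row) for n in row] for row in matrix]


def per_class_recall(matrix):
    """Recall of each class = diagonal of the row-normalized matrix."""
    normalized = row_normalize(matrix)
    return [normalized[i][i] for i in range(len(matrix))]


# Hypothetical counts: rows are true classes, columns are predictions.
counts = [
    [90, 5, 5],  # majority class
    [2, 6, 2],   # minority class
    [1, 1, 8],   # minority class
]
recalls = per_class_recall(counts)
macro_recall = sum(recalls) / len(recalls)
print(recalls, round(macro_recall, 3))
```

Each class contributes equally to the average, whatever its number of samples, so a weak score on a small class is not hidden by a strong score on the majority class.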
The confusion matrix shows results that we considered satisfying given the relatively small size of the dataset with respect to the task at hand.
However, by keeping the same configuration and just changing the random seed that controls how the data are shuffled, substantial differences in scores can be seen, as illustrated by Table 11. This happens, in particular, in the extensive, lawn and intensive classes. This underlines the fact that the model still has room for improvement in terms of generalization and would benefit from being trained on a bigger labeled dataset.
Table 11: Recall score of each class on a model with same configuration and different shuffling of the dataset.
Model | Bare | Terrace | Spontaneous | Extensive | Lawn | Intensive |
---|---|---|---|---|---|---|
#1 | 0.91 | 0.77 | 0.63 | 0.78 | 0.91 | 0.55 |
#2 | 0.87 | 0.77 | 0.66 | 0.69 | 0.77 | 0.36 |
5.2.5.3 Training evolution¶
Figure 25 shows the evolution of the training. The evolution of the loss is particularly interesting to focus on. Both training and validation losses start by dropping and then stay more or less constant. This indicates that the model is not able to overfit on the training set. An explanation for this is the use of batch normalization at many places in the model, which is known to help generalization and to limit overfitting.

Figure 25: Evolution of the loss and accuracy during the training of the final model with optimal configuration. The loss, during both training and validation, starts by dropping and then stays more or less constant. This indicates that the model is not able to overfit on the training set.
5.2.5.4 Precision and recall evolution¶
Figure 26 shows the evolution of the precision and recall along the training. Both curves start low with quite erratic jumps, then grow and stabilize in terms of mean and variance.
The recall is higher than the precision, meaning that this model is better at limiting false negatives (roofs of a class that are missed) than at limiting false positives.

Figure 26: Evolution of the precision and recall during the training of the final model with optimal configuration. The recall is higher than the precision, meaning that this model is better at limiting false negatives than at limiting false positives.
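As a reminder of how the two curves relate to the error types, the minimal sketch below computes precision and recall from hypothetical counts of true positives (TP), false positives (FP) and false negatives (FN) for one class. For a given class, recall exceeds precision exactly when there are fewer false negatives than false positives.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)


# Hypothetical counts with more false positives than false negatives.
precision, recall = precision_recall(tp=70, fp=35, fn=30)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```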
5.2.5.5 Uncertainty and calibration evolution¶
Figure 27 shows the evolution of the different uncertainty and calibration metrics. For every metric, lower values indicate a better score, except for the uncertainty-aware accuracy, for which higher values are better. As shown in the figure, the scores of the different metrics keep improving along the training, which is the desired behavior.

Figure 27: Evolution of the different confidence metrics during the training of the final model with optimal configuration. The scores of the different metrics keep improving along the training, which is the desired behavior.
5.2.6 Binary results by deep learning¶
As mentioned in Section 4.4.6, the architecture was built such that it could also be trained for binary classification. The results on the validation set are shown in Figure 28. Since these scores are obtained for another set of roofs than the one used to evaluate the machine learning models, they cannot be strictly compared with the scores in Table 8. However, one notices that a higher range of values is reached.

Figure 28: Confusion matrix (on the left) and metrics (on the right) of the deep learning solution on binary classification.
However, given that the machine-learning-based solution also gave very good results and is lighter (no need for GPU power), the choice was made to enter the production phase with this model. This choice was also driven by the fact that the multiclass classification is done by the deep learning solution, whose predictions can be reduced to a binary classification.
5.2.7 Insights of the results¶
In order to compare results, Figure 29 shows a custom confusion matrix of the predictions on the two classes bare and vegetated made by the 3 different models on the final dataset, together with the ground truth. Regarding the multiclass classification model, the predictions for the different vegetation classes were merged into one.
As shown by the colored arrows, the lower triangular part of the matrices represents the partition of the bare samples into the vegetated category and the upper triangular part represents the partition of the vegetated samples into the bare category. The plot on the left shows the count of samples while the plot on the right shows the fraction of samples.
For this test, the used dataset contains 2293 bare samples and 1404 vegetated samples.

Figure 29: Comparison of the binary predictions by the different models between them and with the ground truth. gt = ground truth, ml_bin = machine learning binary classification, dl_bin = deep learning binary classification and dl_multi = deep learning multiclass classification. As shown by the colored arrows, the lower triangular part of the matrices represents the partition of the bare samples into the _vegetated_ category and the upper triangular part represents the partition of the _vegetated_ samples into the bare category.
By subtracting the numbers of the plot on the right from 100, it can be seen that all the models classified the vegetated samples correctly with an accuracy equal to or higher than 84.15%, with the deep learning binary solution reaching the highest score of 87.96%. The overview of the matrices shows that the two deep-learning-based models got results closer to each other than to the machine-learning-based one, which is to be expected due to the similarity of their architecture. Indeed, the percentages of vegetated samples predicted as bare with the deep learning classifications are 6.34% and 5.06%, against 12.04% for the machine learning one. These numbers are respectively 9.64%, 12.69% and 16.83% for the percentage of bare samples predicted as vegetated. Moreover, subtracting percentages from the first row and column, the multiclass and binary classifications by deep learning models outperform the machine learning model by respectively:
- 4.14% (16.83%-12.69%) and 7.19% (16.83%-9.64%) regarding bare predicted as vegetated (see first column).
- 6.98% (12.04%-5.06%) and 5.70% (12.04%-6.34%) regarding vegetated predicted as bare (see first row).
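The subtractions above can be reproduced with a few lines, using the error rates quoted in the text. The dictionary keys follow the abbreviations of Figure 29; the helper name is hypothetical.

```python
# Error rates in percent, read from the text above.
bare_as_vegetated = {"ml_bin": 16.83, "dl_multi": 12.69, "dl_bin": 9.64}
vegetated_as_bare = {"ml_bin": 12.04, "dl_multi": 5.06, "dl_bin": 6.34}


def gain_over_ml(rates):
    """Percentage-point reduction of each DL model relative to ml_bin."""
    return {
        name: round(rates["ml_bin"] - rate, 2)
        for name, rate in rates.items()
        if name != "ml_bin"
    }


gain_bare = gain_over_ml(bare_as_vegetated)
gain_vegetated = gain_over_ml(vegetated_as_bare)
print(gain_bare)
print(gain_vegetated)
```

These gains are differences in percentage points between error rates, not relative improvements.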
However, one should be careful with these comparisons since the machine learning models were trained on the previous, smaller dataset. The distributions of both states of the dataset before preprocessing have been given in Table 1 and Table 9. Ideally, all the models should be trained on the same dataset to be compared, but a lack of time prevented us from doing so. Hence, for future use, it might be interesting to retrain the logistic regression and the random forest.
6 Conclusions and outlooks¶
This study showed the effectiveness of using aerial imagery and machine learning models in detecting green roofs.
In the machine learning parts, the results demonstrated the ability of a random forest and logistic regression algorithms to detect green roofs among bare roofs, based on vegetation and material reflectance in airborne images. The metrics, with a recall of 0.87 for the green class and an F1-score of 0.85 on the entire test set, reveal that the combination of both models trained on pixels statistics derived from vegetated areas defined by NDVI and luminosity thresholds, achieved the best performances. These metrics highlight the model’s ability to accurately detect green roof coverage, making it a reliable tool for large-scale urban mapping.
The multiclass classification of vegetation type by the deep learning approach shows promising performances with an F1-score of 0.68 and an overall accuracy of 0.85. However, this approach requires more training data than the machine learning one. Moreover, the number of classes in the multiclass classification task is greater than in the binary classification one and thus, more data is required. Hence, there are indications that the DL model would benefit from a larger set of labeled images. In particular, the dataset the model was trained on is heavily imbalanced and, even when using strategies to correct it, more samples of some underrepresented classes (intensive, lawn, spontaneous and terrace) would greatly help the model to generalize better and distinguish between them.
Some further outlooks and insights:
- By training and testing the models on two areas separated by approximately 300 km, with images acquired in two different years and with six roof types represented in the ground truth, the models already have a certain ability to generalize. Moreover, the DL model showed great potential; in the future, with more labeled data, it would be valuable to retrain it on a broader range of the feature space so that it can reach its full potential.
- From the scores in the k-fold cross-validations of the ML algorithms in Appendixes 7.6 and 7.7, the metrics can be expected to vary by approx. 5% depending on how the ground truth is split into train and test sets. Such behavior has also been observed with the deep learning approach.
- Machine learning approaches that rely on engineered features (descriptors) as input always leave room for improvement through additional descriptors.
- The deep learning model showed a higher recall than precision, with overall scores of 0.72 and 0.66 respectively in Figure 24. This can be attributed to the way the model adapted to the imbalanced dataset. For future applications of the model requiring higher precision, a solution would be to adjust the precision-recall trade-off through threshold tuning 13.
- Finally, the experts mentioned that, even using SWISSIMAGE Time Travel WMTS and the construction year of the buildings, they could not ensure an error-free ground truth. This may have had an impact on model training and evaluation, and may affect the manual correction of results in the future.
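The threshold tuning mentioned in the list above can be sketched as follows. This is a simplified binary illustration with made-up labels and scores, not the procedure of the cited tool nor of the project's multiclass model; the function names and the target precision are hypothetical. It searches for the lowest decision threshold that reaches a target precision and reports the recall obtained at that threshold.

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall of the positive class at a given score threshold."""
    pairs = list(zip(y_true, scores))
    tp = sum(1 for y, s in pairs if s >= threshold and y == 1)
    fp = sum(1 for y, s in pairs if s >= threshold and y == 0)
    fn = sum(1 for y, s in pairs if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def lowest_threshold_for_precision(y_true, scores, target):
    """Smallest candidate threshold whose precision reaches the target, or None."""
    for threshold in sorted(set(scores)):
        precision, recall = precision_recall(y_true, scores, threshold)
        if precision >= target:
            return threshold, precision, recall
    return None


# Made-up labels (1 = positive class) and predicted scores.
y_true = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]
scores = [0.95, 0.9, 0.8, 0.75, 0.65, 0.6, 0.55, 0.52, 0.3, 0.1]

default = precision_recall(y_true, scores, 0.5)
tuned = lowest_threshold_for_precision(y_true, scores, 0.8)
print(default, tuned)
```

On this toy data, the default threshold of 0.5 favors recall over precision, while raising the threshold trades some recall for precision. In a multiclass setting, a comparable adjustment would have to be made per class on the class scores.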
7 Appendixes¶
7.1 Boxplots of the statistics for the luminosity pixels per roof in the study area per class¶

Figure 30: Boxplots of the statistics for the luminosity pixels per roof in the study area per class.
7.2 Boxplots of the statistics for the near infrared pixels per roof in the study area per class¶

Figure 31: Boxplots of the statistics for the near infrared pixels per roof in the study area per class.
7.3 Boxplots of the statistics for the red pixels per roof in the study area per class¶

Figure 32: Boxplots of the statistics for the red pixels per roof in the study area per class.
7.4 Boxplots of the statistics for the green pixels per roof in the study area per class¶

Figure 33: Boxplots of the statistics for the green pixels per roof in the study area per class.
7.5 Boxplots of the statistics for the blue pixels per roof in the study area per class¶

Figure 34: Boxplots of the statistics for the blue pixels per roof in the study area per class.
7.6 Results of the parameter optimization of the random forest¶
Table 12: Results of the parameter optimization of the random forest. Optimized parameters: number of trees to grow and number of descriptors to test at each split.
param_max_features | param_n_estimators | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score |
---|---|---|---|---|---|---|---|---|
6 | 800 | 0.877 | 0.834 | 0.857 | 0.875 | 0.855 | 0.860 | 0.016 |
5 | 200 | 0.870 | 0.830 | 0.849 | 0.870 | 0.864 | 0.857 | 0.015 |
6 | 500 | 0.877 | 0.839 | 0.841 | 0.875 | 0.850 | 0.857 | 0.017 |
6 | 200 | 0.877 | 0.825 | 0.846 | 0.870 | 0.862 | 0.856 | 0.019 |
5 | 800 | 0.864 | 0.830 | 0.852 | 0.875 | 0.853 | 0.855 | 0.015 |
5 | 500 | 0.872 | 0.825 | 0.841 | 0.878 | 0.852 | 0.854 | 0.020 |
4 | 800 | 0.868 | 0.820 | 0.853 | 0.870 | 0.849 | 0.852 | 0.018 |
4 | 200 | 0.870 | 0.820 | 0.852 | 0.863 | 0.839 | 0.849 | 0.018 |
4 | 500 | 0.864 | 0.820 | 0.843 | 0.874 | 0.839 | 0.848 | 0.019 |
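To make the link between the per-fold columns and the aggregated columns explicit, the short sketch below recomputes the mean and standard deviation for the first row of Table 12. It assumes that `std_test_score` is a population standard deviation over the five folds, an assumption that is consistent with the values in the table; the variable names are chosen for the example.

```python
import statistics

# Fold scores copied from the first row of Table 12
# (param_max_features = 6, param_n_estimators = 800).
fold_scores = [0.877, 0.834, 0.857, 0.875, 0.855]

mean_score = statistics.fmean(fold_scores)
std_score = statistics.pstdev(fold_scores)  # population standard deviation
spread = max(fold_scores) - min(fold_scores)  # best fold minus worst fold

print(round(mean_score, 3), round(std_score, 3), round(spread, 3))
```

The rounded mean and standard deviation match the 0.860 and 0.016 reported in the table. The difference between the best and the worst fold is about 0.04, which illustrates how much the score can change with the train and test split alone.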
7.7 Results of the parameter optimization of the logistic regression¶
Table 13: Results of the parameter optimization of the logistic regression. Optimized parameters: inverse regularization strength (C), maximum number of iterations and solver.
param_C | param_max_iter | param_solver | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score |
---|---|---|---|---|---|---|---|---|---|
1 | 200 | newton-cg | 0.882 | 0.859 | 0.919 | 0.885 | 0.922 | 0.893 | 0.024 |
1 | 500 | newton-cg | 0.882 | 0.859 | 0.919 | 0.885 | 0.922 | 0.893 | 0.024 |
1 | 800 | newton-cg | 0.882 | 0.859 | 0.919 | 0.885 | 0.922 | 0.893 | 0.024 |
0.5 | 200 | newton-cg | 0.882 | 0.859 | 0.915 | 0.892 | 0.917 | 0.893 | 0.022 |
0.5 | 500 | newton-cg | 0.882 | 0.859 | 0.915 | 0.892 | 0.917 | 0.893 | 0.022 |
0.5 | 800 | newton-cg | 0.882 | 0.859 | 0.915 | 0.892 | 0.917 | 0.893 | 0.022 |
0.1 | 200 | newton-cg | 0.892 | 0.844 | 0.907 | 0.901 | 0.919 | 0.893 | 0.026 |
0.1 | 500 | newton-cg | 0.892 | 0.844 | 0.907 | 0.901 | 0.919 | 0.893 | 0.026 |
0.1 | 800 | newton-cg | 0.892 | 0.844 | 0.907 | 0.901 | 0.919 | 0.893 | 0.026 |
1 | 200 | liblinear | 0.881 | 0.856 | 0.908 | 0.893 | 0.912 | 0.890 | 0.020 |
1 | 500 | liblinear | 0.881 | 0.856 | 0.908 | 0.893 | 0.912 | 0.890 | 0.020 |
1 | 800 | liblinear | 0.881 | 0.856 | 0.908 | 0.893 | 0.912 | 0.890 | 0.020 |
0.5 | 200 | liblinear | 0.881 | 0.847 | 0.907 | 0.898 | 0.910 | 0.889 | 0.023 |
0.5 | 500 | liblinear | 0.881 | 0.847 | 0.907 | 0.898 | 0.910 | 0.889 | 0.023 |
0.5 | 800 | liblinear | 0.881 | 0.847 | 0.907 | 0.898 | 0.910 | 0.889 | 0.023 |
0.1 | 200 | liblinear | 0.889 | 0.840 | 0.896 | 0.898 | 0.900 | 0.885 | 0.022 |
0.1 | 500 | liblinear | 0.889 | 0.840 | 0.896 | 0.898 | 0.900 | 0.885 | 0.022 |
0.1 | 800 | liblinear | 0.889 | 0.840 | 0.896 | 0.898 | 0.900 | 0.885 | 0.022 |
7.8 Permutation importance of the random forest¶
Table 14: Permutation importance of the random forest.
Descriptor set | Statistic | % drop in balanced accuracy |
---|---|---|
NDVI | standard deviation | 0.086 |
NDVI | mean | 0.079 |
Blue | standard deviation | 0.014 |
NDVI | median | 0.013 |
NDVI | maximum | 0.011 |
NIR | mean | 0.011 |
NIR | maximum | 0.010 |
NIR | median | 0.006 |
NIR | standard deviation | 0.003 |
Green | minimum | 0.001 |
Red | standard deviation | 0.001 |
Red | median | 0.001 |
Blue | median | 0.001 |
NIR | minimum | 0.001 |
NDVI | minimum | 0.000 |
Luminosity | minimum | 0.000 |
Luminosity | maximum | 0.000 |
Luminosity | mean | 0.000 |
Luminosity | median | 0.000 |
Luminosity | standard deviation | 0.000 |
Red | minimum | 0.000 |
Red | maximum | 0.000 |
Red | mean | 0.000 |
Blue | minimum | 0.000 |
Blue | maximum | 0.000 |
Blue | mean | 0.000 |
Green | maximum | 0.000 |
Green | mean | 0.000 |
Green | median | 0.000 |
Green | standard deviation | 0.000 |
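For readers unfamiliar with the measure, the following minimal sketch shows how a permutation importance based on balanced accuracy can be computed: one descriptor column is shuffled at a time and the resulting drop in score is averaged over several repetitions. The toy classifier, the data, the number of repetitions and the function names are hypothetical stand-ins and do not reproduce the project's implementation; the drop is reported here as a fraction.

```python
import random


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls."""
    recalls = []
    for label in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == label]
        recalls.append(sum(y_pred[i] == label for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Average drop in balanced accuracy when one column is shuffled."""
    rng = random.Random(seed)
    baseline = balanced_accuracy(y, [predict(row) for row in X])
    drops = []
    for j in range(len(X[0])):
        total = 0.0
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            score = balanced_accuracy(y, [predict(row) for row in shuffled])
            total += baseline - score
        drops.append(total / n_repeats)
    return drops


def toy_predict(row):
    """Stand-in for a trained model: 'green' (1) if the first descriptor exceeds 0.3."""
    return 1 if row[0] > 0.3 else 0


# Made-up descriptors: column 0 is informative, column 1 is ignored by the model.
X = [[0.6, 5.0], [0.5, 1.0], [0.4, 3.0], [0.1, 2.0], [0.2, 4.0], [0.0, 6.0]]
y = [1, 1, 1, 0, 0, 0]
drops = permutation_importance(toy_predict, X, y)
print(drops)
```

A descriptor that the model does not use yields a drop of zero, as for the second column here. With a real model, a shuffle can occasionally improve the score by chance, which is how slightly negative values can arise.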
7.9 Permutation importance of the logistic regression¶
Table 15: Permutation importance of the logistic regression.
Descriptor set | Statistic | % drop in balanced accuracy |
---|---|---|
Luminosity | median | 0.318 |
Blue | standard deviation | 0.277 |
Red | mean | 0.219 |
Green | mean | 0.207 |
Luminosity | standard deviation | 0.199 |
Green | median | 0.168 |
NIR | mean | 0.157 |
Green | standard deviation | 0.144 |
Luminosity | maximum | 0.131 |
Green | maximum | 0.110 |
Red | median | 0.110 |
Blue | mean | 0.108 |
Blue | median | 0.099 |
Luminosity | minimum | 0.056 |
Luminosity | mean | 0.045 |
NIR | median | 0.044 |
Red | maximum | 0.029 |
NDVI | maximum | 0.012 |
Red | standard deviation | 0.011 |
NIR | minimum | 0.010 |
Green | minimum | 0.008 |
Red | minimum | 0.002 |
NIR | standard deviation | 0.001 |
NDVI | median | 0.000 |
NDVI | mean | -0.001 |
Blue | minimum | -0.001 |
NDVI | standard deviation | -0.002 |
Blue | maximum | -0.003 |
NDVI | minimum | -0.003 |
NIR | maximum | -0.004 |
7.10 Boxplots of the statistics for the near infrared pixels per potential greenery per class¶

Figure 35: Boxplots of the statistics for the near infrared pixels per potential greenery in the study area per class.
7.11 Boxplots of the statistics for the red pixels per potential greenery per class¶

Figure 36: Boxplots of the statistics for the red pixels per potential greenery in the study area per class.
7.12 Boxplots of the statistics for the green pixels per potential greenery per class¶

Figure 37: Boxplots of the statistics for the green pixels per potential greenery in the study area per class.
7.13 Boxplots of the statistics for the blue pixels per potential greenery per class¶

Figure 38: Boxplots of the statistics for the blue pixels per potential greenery in the study area per class.
7.14 Examples of difficult samples¶
7.14.1 Terrace vs Bare¶

Figure 39: Samples that illustrate the difficulty of distinguishing terraces from bare rooftops. The red circles highlight the locations of greenery on the roofs.
7.14.2 Spontaneous vs Extensive¶

Figure 40: Samples that illustrate the difficulty of distinguishing extensive from spontaneous rooftops. The red circles highlight the locations where the vegetation can induce errors.
8 Sources and references¶
Indications on software and hardware requirements, as well as the code used to perform the project, are available on GitHub: the traditional machine learning approach and the deep learning approach.
Other sources of information mentioned in this documentation are listed here:
1. Grün Stadt Zürich. Extensive Flachdachbegrünungen in der Stadt Zürich. Technical Report, Grün Stadt Zürich, March 2017. URL: https://www.google.com/url?sa=t&source=web&rct=j&opi=89978449&url=https://www.stadt-zuerich.ch/content/dam/stzh/ted/Deutsch/gsz_2/publikationen/beratung-und-wissen/wohn-und-arbeitsumfeld/dach-vertikalgruen/dachbegr%25C3%25BCnung/ErfolgskontrolleFlachdachbegruenungen170329.pdf&ved=2ahUKEwi0ptz_v8KJAxXogf0HHdVEIZ0QFnoECAwQAQ&usg=AOvVaw0lJtD7ffmgNMzGfse2ns1G.
2. J Massy, P Martin, and N Wyler. Cartographie semi-automatisée des toitures végétalisées de la Ville de Genève. Géomatique Expert, 81(Juillet-Août):26 – 31, 2011.
3. Tanguy Louis-Lucas, Flavie Mayrand, Philippe Clergeau, and Nathalie Machon. Remote sensing for assessing vegetated roofs with a new replicable method in Paris, France. Journal of Applied Remote Sensing, 15(1):014501, January 2021. Publisher: SPIE. URL: https://www.spiedigitallibrary.org/journals/journal-of-applied-remote-sensing/volume-15/issue-1/014501/Remote-sensing-for-assessing-vegetated-roofs-with-a-new-replicable/10.1117/1.JRS.15.014501.full (visited on 2023-06-15), doi:10.1117/1.JRS.15.014501.
4. Annika Pauligk. Green Roofs 2020. March 2023. https://www.berlin.de/umweltatlas/_assets/literatur/ab_gruendach_2020.pdf. URL: https://www.berlin.de/umweltatlas/en/land-use/green-roofs/2020/methodology/ (visited on 2023-12-28).
5. Abraham Noah Wu and Filip Biljecki. Roofpedia: Automatic mapping of green and solar roofs for an open roofscape registry and evaluation of urban sustainability. Landscape and Urban Planning, 214:104167, October 2021. URL: https://linkinghub.elsevier.com/retrieve/pii/S0169204621001304 (visited on 2022-05-23), doi:10.1016/j.landurbplan.2021.104167.
6. Charles H. Simpson, Oscar Brousse, Nahid Mohajeri, Michael Davies, and Clare Heaviside. An Open-Source Automatic Survey of Green Roofs in London using Segmentation of Aerial Imagery. preprint, ESSD – Land/Land Cover and Land Use, August 2022. URL: https://essd.copernicus.org/preprints/essd-2022-259/ (visited on 2023-03-21), doi:10.5194/essd-2022-259.
7. Yanjun Wang, Shaochun Li, Fei Teng, Yunhao Lin, Mengjie Wang, and Hengfan Cai. Improved Mask R-CNN for Rural Building Roof Type Recognition from UAV High-Resolution Images: A Case Study in Hunan Province, China. Remote Sensing, 14(2):265, January 2022. Number: 2 Publisher: Multidisciplinary Digital Publishing Institute. URL: https://www.mdpi.com/2072-4292/14/2/265 (visited on 2024-01-16), doi:10.3390/rs14020265.
8. M. Buyukdemircioglu, R. Can, and S. Kocaman. DEEP LEARNING BASED ROOF TYPE CLASSIFICATION USING VERY HIGH RESOLUTION AERIAL IMAGERY. The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, XLIII-B3-2021:55–60, June 2021. URL: https://isprs-archives.copernicus.org/articles/XLIII-B3-2021/55/2021/ (visited on 2024-01-16), doi:10.5194/isprs-archives-XLIII-B3-2021-55-2021.
9. Małgorzata Krówczyńska, Edwin Raczko, Natalia Staniszewska, and Ewa Wilk. Asbestos—Cement Roofing Identification Using Remote Sensing and Convolutional Neural Networks (CNNs). Remote Sensing, 12(3):408, January 2020. Number: 3 Publisher: Multidisciplinary Digital Publishing Institute. URL: https://www.mdpi.com/2072-4292/12/3/408 (visited on 2024-01-16), doi:10.3390/rs12030408.
10. Jonguk Kim, Hyansu Bae, Hyunwoo Kang, and Suk Gyu Lee. CNN Algorithm for Roof Detection and Material Classification in Satellite Images. Electronics, 10(13):1592, January 2021. Number: 13 Publisher: Multidisciplinary Digital Publishing Institute. URL: https://www.mdpi.com/2079-9292/10/13/1592 (visited on 2024-01-16), doi:10.3390/electronics10131592.
11. Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011. URL: https://scikit-learn.org/stable/index.html.
12. DeepLabV3 Guide: Key to Image Segmentation. URL: https://www.ikomia.ai/blog/understanding-deeplabv3-image-segmentation (visited on 2025-02-07).
13. Brett-Kennedy. Brett-Kennedy/ClassificationThresholdTuner. February 2025. original-date: 2024-06-12T19:58:33Z. URL: https://github.com/Brett-Kennedy/ClassificationThresholdTuner (visited on 2025-02-17).