Automatic identification of anthropogenic soils potentially suitable for rehabilitation¶
Clémence Herny (ExoLabs) - Gwenaëlle Salamin (ExoLabs) - Clotilde Marmy (ExoLabs) - Alessandro Cerioni (État de Genève) - Roxane Pott (swisstopo)
Proposed by the Canton of Ticino and the Canton of Vaud - PROJ-SDA
March 2024 to December 2024 - Published in December 2024
This work by STDL is licensed under CC BY-SA 4.0
Abstract: Each Swiss Canton is required to make an inventory of potentially rehabilitable soils for maintaining the land crop rotation quota. To assist the cantons in this task, the STDL has developed an artificial intelligence-based framework to automatically identify soils degraded by human activities, i.e. "non-agricultural activity" and "land movement". A deep learning model was trained to segment the extent of the detected human activity in a multi-year dataset of aerial imagery. The ground truth was vectorised by the Canton of Ticino and the Canton of Vaud. The trained model achieved an F1 score of 0.53, with better detection performance for the land movement class than for the non-agricultural activity class. The average results of the model can be explained by the limited number of ground truth elements, the complexity of the features to be detected and the diversity of the characteristics of the images used. The trained model was applied to historical imagery from 1946 to the present day for the two cantons. A vector layer showing the distribution of human activities by year was produced in just a few days for each canton. Recall was preferred to precision in order to obtain exhaustive results, but this implies a large number of FP detections. Therefore, a thorough review of the results is necessary before they can be used. Despite the average performance of the model, it allows the identification of new areas that can be added to the inventory and speeds up the process compared to a fully manual approach.
1. Introduction¶
The constant increase in population and economic growth are putting considerable pressure on agricultural land. The Federal law on land management, adopted in 1979, aims to regulate the use of agricultural land to guarantee the food independence of the Swiss population in the event of crises and supply problems. As part of the sectoral plan1, the high-quality arable lands that need to be protected have been secured in the form of land crop rotation areas (LCR or surface d'assolement (SDA) in French), with a minimum area allocated by canton. To be eligible as an LCR area, the soil must comply with specific criteria123 ensuring the quality of the land for agriculture. However, certain construction programs may impinge on these lands. In such cases, the area lost must be compensated by the creation of a new LCR area of the same size. To identify areas that could be converted to LCR areas, Swiss Cantons must provide a register or an indicative map of land that could potentially be rehabilitated to meet the LCR criteria. Among those, soils degraded by past anthropogenic activities are of interest. This includes soils affected by landfills, construction sites, pollution, etc.
For this project, the STDL was solicited by the Canton of Ticino and the Canton of Vaud to develop a method to identify soils degraded by human activities in the past. The goal is to help the Cantons to establish the indicative map of potentially rehabilitated soils for LCR compensation by providing a vector layer with the delimitation of the human activity affecting soils.
Some cantons have already established this inventory, adopting different approaches mainly based on register consultation, field investigation, human memory, visual inspection of aerial images or detection of elevation changes456. The Canton of Ticino commissioned a company to identify potentially rehabilitable soils in six municipalities of the Locarno region and has access to a study performed in the scope of a CFF train project in the Magadino plain. Besides, they also developed an FME workflow based on LCR criteria and applied it to the Bellinzona valley, but it was not suitable for a large-scale study.
Based on our experience with object segmentation78, we proposed to automatically segment human activities in aerial images available in Switzerland over the last 70 years with a deep learning approach.
In this report, we first present the study areas. We then describe the data used, including the images and the ground truth, and introduce the deep learning method. Finally, we present and discuss the results of model training and inference, and draw conclusions.
2. Study areas¶
Two cantons are considered in this study, the Canton of Ticino and the Canton of Vaud. Both cantons have established their LCR map and intend to finalise their indicative map of potentially rehabilitable soils. Despite similar objectives, their geography and climate raise different difficulties. On the one hand, the Canton of Ticino is mainly covered by high mountains, which limits the quota of the LCR area to 3,500 ha, but also the area eligible for rehabilitation. On the other hand, the Canton of Vaud displays larger lowland areas, with a quota of LCR to be maintained of 75,800 ha. Strong population growth and development are causing difficulties in the identification of land eligible for conversion.

Smaller areas of interest (AoI) were defined to test the inference (Fig. 1). For the Canton of Ticino, the AoI comprises the six municipalities of the Locarno region, the Magadino plain and the Bellinzona valley, for which previous studies to find potential LCR were performed and can be used for result comparison. For the Canton of Vaud, an AoI located between the Jura mountains and Lausanne was selected.
3. Data¶
The project makes use of Swiss aerial imagery, a customised ground truth and additional data useful to identify potential LCR.
3.1 Images¶
Aerial orthophotos from 1946 to the present day from the swisstopo product SWISSIMAGE Journey were used (Table 1).
The images were captured in greyscale and colour using different instruments. Despite the rules of acquisition and image post-processing, the photometry and colour of the images can vary from year to year and from sensor to sensor. Some images were taken in winter rather than in summer, resulting in different colours of vegetation such as leafless trees or raw agricultural soils.
Product | Type | Year | Coordinate system | Spatial resolution |
---|---|---|---|---|
SWISSIMAGE 10 cm | RGB, numeric | 2017 - current | CH1903+/MN95 (EPSG:2056) | 0.10 m (\(\sigma\) \(\pm\) 0.15 m) - 0.25 m |
SWISSIMAGE 25 cm | RGB, numeric | 2005 - 2016 | MN03 (2005 - 2007) and MN95 since 2008 | 0.25 m (\(\sigma\) \(\pm\) 0.25 m) - 0.50 m (\(\sigma\) \(\pm\) 3.00 - 5.00 m) |
SWISSIMAGE 50 cm | RGB, photo | 1998 - 2004 | MN03 | 0.50 m (\(\sigma\) \(\pm\) 0.50 m) |
SWISSIMAGE HIST | greyscale, photo | 1946 - 1997 | MN95 | 0.50 m (\(\sigma\) \(\pm\) 1.0-5.0 m) |
The images are accessed via an XYZ connector using swisstopo's Web Map Tile Service (WMTS). Pre-rendered GeoTIFF tiles, with a size of 256 \(\times\) 256 pixels, are served on a grid of Cartesian coordinates (x, y, EPSG:3857 - WGS 84, Pseudo-Mercator). The images at zoom level (z) 16 were chosen, with a resolution of 1.6 m px\(^{-1}\), as this is a good trade-off between model performance and computational cost8. The images are fetched according to a desired year. Tiles with the same coordinates but different years can exist in the dataset.
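As an illustration, mapping WGS 84 coordinates to XYZ tile indices follows the standard Web Mercator tiling scheme. The sketch below (function names are ours, not those of the project code) also checks the roughly 1.6 m px\(^{-1}\) ground resolution quoted above for zoom level 16 at Swiss latitudes:

```python
import math

def deg2tile(lat_deg: float, lon_deg: float, zoom: int) -> tuple:
    """Convert WGS 84 coordinates to XYZ (slippy-map) tile indices
    on the EPSG:3857 Pseudo-Mercator grid at the given zoom level."""
    lat_rad = math.radians(lat_deg)
    n = 2 ** zoom
    x = int((lon_deg + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y

def tile_resolution(lat_deg: float, zoom: int) -> float:
    """Ground resolution (m/px) of a 256 px Web Mercator tile at a latitude."""
    return 156543.03392 * math.cos(math.radians(lat_deg)) / 2 ** zoom

# A tile near the Ticino AoI at zoom 16, the level used in this project
x, y = deg2tile(46.0, 8.96, 16)
res = tile_resolution(46.0, 16)
```

Tiles are then identified by (year, z, x, y), so the same (z, x, y) triplet can appear several times in the dataset with different years.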
3.2 Ground truth¶
The ground truth (GT) was acquired manually by the beneficiaries. Two classes of human activities were defined:
- Non-agricultural activity (Fig. 2, left): illegal activity on agricultural land, e.g. landfill, building, storage area, etc.
- Land movement (Fig. 2, right): transport of material affecting soil, e.g. mineral extraction sites, quarry, backfill, excavation, construction site, etc.
The elements of the two classes are complex and have heterogeneous characteristics, particularly the elements of the non-agricultural activity class.

The labels were vectorised with SWISSIMAGE from various years (Fig. 3) in order to obtain a diverse set of images for training the detection model. Efforts were made to achieve a GT with a good distribution between RGB and greyscale images, but it should be noted that more images and features are available as RGB images. In particular, fewer elements of the non-agricultural activity class were vectorised in the greyscale images.

The two GT classes are balanced, with a total of 215 elements in the non-agricultural activity class and 234 elements in the land movement class (Table 2). We acknowledge that the number of elements is low for training a deep learning model.
Source | Non-agricultural activity | Land movement |
---|---|---|
Ticino | 73 | 93 |
Vaud | 175 | 146 |
Total | 215 | 234 |
3.3 Additional data¶
Potentially rehabilitable soils must meet defined criteria to be converted to LCR2 area. For example, a land parcel can be located in an area that does not meet the criteria due to geographical parameters, or that meets the criteria but is in conflict with another potential use. To properly identify potential LCR, the beneficiaries provided us with vector layers containing geographical and land use information, so that we could cross-reference them with the results. These layers of objects of interest (OBI) include, for instance, LCR areas already mapped, current and future building areas, polluted areas and protected areas. In addition, the altitude is extracted from the Swiss digital elevation model (DEM, approx. 25 m px\(^{-1}\)) provided as a mosaic by Lukas Martinelli on GitHub, and the slope can be downloaded from the webpage of the Federal Office for Agriculture.
4. Method¶
This section presents how semantic segmentation was implemented and evaluated for the automatic identification of anthropogenic soils with a multi-year dataset. Moreover, strategies such as image pre-processing and training with empty tiles were tested. Finally, post-processing steps, such as merging nearby detections and filtering by OBI, were implemented.
4.1 Semantic segmentation¶
To perform the automatic detection of anthropogenic soils, we used a deep learning approach based on the object detector framework7 developed by the STDL, the detailed description of which can be found here. It allows performing instance segmentation on georeferenced data based on the detectron2 framework9.
The model is initially trained with tiles intersecting the labels and randomly split into three datasets: the training dataset (70%), the validation dataset (15%), and the test dataset (15%). The two classes described in Section 3.2 were similarly distributed across datasets and the distribution is fixed. The model hyperparameters were calibrated to obtain the best results. The selected model (Section 5.2) was trained with two images per batch, a learning rate of \(5 \times 10^{-3}\) with 500 iteration steps and 200 warm-up iterations. The training was performed over 7000 iterations and lasted about 2 hours on a machine with 32 GB RAM and an NVIDIA Tesla T4 GPU. The optimal detection model corresponds to the one minimising the validation loss curve, in this case around 3000 iterations. The trained model is evaluated and used to perform detection. Each detection made by the model is given a confidence score from 0 to 1.
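The fixed 70/15/15 random split described above can be sketched as follows; this is an illustrative stand-in, not the object detector's actual implementation:

```python
import random

def split_tiles(tile_ids, seed=42):
    """Randomly split tile identifiers into fixed training (70%),
    validation (15%) and test (15%) datasets before training.
    A fixed seed makes the split reproducible across runs."""
    rng = random.Random(seed)
    tiles = list(tile_ids)
    rng.shuffle(tiles)
    n = len(tiles)
    n_trn = int(0.7 * n)
    n_val = int(0.15 * n)
    return (tiles[:n_trn],                 # training dataset
            tiles[n_trn:n_trn + n_val],    # validation dataset
            tiles[n_trn + n_val:])         # test dataset

# tile identifiers in the (year, z, x, y) form used in this project
all_tiles = [(2020, 16, x, y) for x in range(10) for y in range(10)]
trn, val, tst = split_tiles(all_tiles)
```

In practice, the split would additionally be stratified so that both classes are similarly represented in each dataset, as stated above.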
4.2 Metrics¶
The model performance was assessed globally and by class by comparing the results with the GT and computing the following metrics:
- Precision: number of correct detections among all the detections produced by the model;
\[precision = \frac{\sum_k TP_k}{\sum_k (TP_k + FP_k)}\]
- Recall: number of correct detections predicted by the model among all the GT labels;
\[recall = \frac{\sum_k TP_k}{\sum_k (TP_k + FN_k)}\]
- F1 score: the harmonic average of the precision and the recall;
\[F1 = 2 \times \frac{recall \times precision}{recall + precision}\]
with:
- TP, true positive, i.e. the detection is correct;
- FP, false positive, i.e. the detection is not correct;
- FN, false negative, i.e. the labelled object is not detected by the algorithm;
- k, the label class.
To evaluate our multi-class model, we have chosen to calculate micro-average metrics because the primary objective is to detect as many objects as possible regardless of their class and the classes are balanced. Knowing the object class is a secondary objective.
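Micro-averaging can be illustrated with a short sketch: the TP, FP and FN counts are summed over all classes before the ratios are computed, so every detected object carries the same weight regardless of its class (the counts below are made up for illustration):

```python
def micro_metrics(counts):
    """Micro-averaged precision, recall and F1 score: per-class TP/FP/FN
    counts are summed over all classes k before computing the ratios."""
    tp = sum(c["TP"] for c in counts.values())
    fp = sum(c["FP"] for c in counts.values())
    fn = sum(c["FN"] for c in counts.values())
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative counts only, not the project's actual evaluation numbers
p, r, f1 = micro_metrics({
    "non-agricultural activity": {"TP": 5, "FP": 3, "FN": 2},
    "land movement": {"TP": 10, "FP": 2, "FN": 5},
})
```

With balanced classes, as is the case here, micro- and macro-averages stay close; micro-averaging simply reflects the priority of detecting objects over classifying them.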
4.3 Multi-year dataset¶
The framework can handle training and detection on images from several years in an AoI. The operator assigned to each label the year of the image used to vectorise it. Based on this year, downloaded tiles are assigned a unique identifier in the form (year, z, x, y). The model is evaluated by spatially comparing labels and detections of the same year (Fig. 4).

4.4 Image processing¶
Training a model on images with heterogeneous colours can make the detection task more difficult, particularly with a reduced ground truth dataset. To homogenise the image dataset, we propose to:
- convert the RGB images to greyscale (Fig. 5, top) to match the pre-1998 greyscale images. We define the greyscale brightness \(Y\) as a linear combination of the RGB bands, based on the Rec. 601 standard: \[Y = 0.299\,R + 0.587\,G + 0.114\,B\]
- colourise the greyscale historical images to RGB (Fig. 5, bottom) using a deep learning framework10.
- homogenise the image colours by applying a histogram matching method, such as the ones offered by the rasterio plugin rio-hist or the scikit-image library.
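The Rec. 601 conversion from the first option can be sketched in a few lines; this is a minimal per-pixel illustration, whereas production code would operate on raster arrays with e.g. rasterio or scikit-image:

```python
def rec601_grey(r, g, b):
    """Greyscale brightness of an RGB pixel as the Rec. 601 luma,
    i.e. the weighted combination of the three bands."""
    return 0.299 * r + 0.587 * g + 0.114 * b

def to_greyscale(image):
    """Convert an image given as rows of (R, G, B) tuples to rows of
    greyscale values, matching the pre-1998 greyscale imagery."""
    return [[rec601_grey(*px) for px in row] for row in image]

# 2 x 2 toy image: red, green / blue, white
grey = to_greyscale([[(255, 0, 0), (0, 255, 0)],
                     [(0, 0, 255), (255, 255, 255)]])
```

Note that the three weights sum to 1, so a white pixel (255, 255, 255) keeps brightness 255 and the dynamic range is preserved.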

4.5 Empty tiles¶
The object detector offers the possibility of using tiles without annotations, hereafter referred to as "empty tiles", during training. These tiles can be added to the dataset either randomly from a given AoI or directly as input. In particular, a list of FP labels obtained with a previous model can be provided to select the corresponding empty tiles, hereafter referred to as "FP tiles". This allows us to confront the algorithm with problematic cases and try to improve its performance.
4.6 Detection integrity¶
We assume that nearby detections on different tiles belong to the same object. A buffer of 10 m is then used to merge detections across tiles, ensuring that the detected object is correctly delimited (Fig. 6). The average detection score is calculated for each feature. The class of the feature with the largest area is assigned to the final merged polygon. The processed results can be directly compared with the GT.
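The merging step can be approximated as follows. This is a simplified stand-in using axis-aligned bounding boxes and a distance test instead of the actual polygon buffering and dissolving, which would typically be done with a GIS library such as shapely:

```python
def rect_gap(a, b):
    """Smallest distance between two rectangles (xmin, ymin, xmax, ymax);
    0 if they touch or overlap."""
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)
    return (dx ** 2 + dy ** 2) ** 0.5

def merge_nearby(boxes, buffer_m=10.0):
    """Group detections lying within `buffer_m` of each other, so that
    detections of one object split across tiles end up in one group.
    Groups are merged transitively, like buffered polygons dissolving."""
    groups = []  # list of lists of detection indices
    for i, box in enumerate(boxes):
        hits = [g for g in groups
                if any(rect_gap(box, boxes[j]) <= buffer_m for j in g)]
        merged = [i] + [j for g in hits for j in g]
        groups = [g for g in groups if g not in hits] + [merged]
    return groups
```

Each final group would then receive the average confidence score of its members and the class of the member with the largest area, as described above.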

4.7 Result filtering¶
The results were inferred over the AoI. However, as mentioned in Section 3.3, not all areas of the AoI meet the criteria for the LCR area or the detections may conflict with other usage. To provide the beneficiaries with the most comprehensive information, we have created a detection layer with attributes that can be filtered according to the user's needs (Fig. 7).

First, the detection polygons are intersected with the polygons of other OBI provided by the beneficiaries (Section 3.3). The ratio between the overlap area and the detection area is calculated. A value of 0 indicates that there is no overlap, while a value of 1 indicates that the detection is completely overlapped by a feature of the OBI layer. Based on the needs of the beneficiaries, detections overlapping OBI layers were excluded by deleting the intersecting areas of the detections (e.g. detections overlapping lakes). In addition, we provide information on the proportion of each detection polygon for which the slope is greater than the 18% threshold2, thanks to the ‘sloping terrain’ layer.
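The overlap ratio between a detection and an OBI feature can be illustrated with rectangles; the actual computation intersects arbitrary polygons with a GIS library, but the attribute it produces is the same:

```python
def rect_area(r):
    """Area of a rectangle (xmin, ymin, xmax, ymax)."""
    return max(r[2] - r[0], 0.0) * max(r[3] - r[1], 0.0)

def overlap_ratio(detection, obi):
    """Ratio of a detection rectangle covered by an OBI rectangle:
    0 = no overlap, 1 = detection completely overlapped."""
    inter = (max(detection[0], obi[0]), max(detection[1], obi[1]),
             min(detection[2], obi[2]), min(detection[3], obi[3]))
    if inter[0] >= inter[2] or inter[1] >= inter[3]:
        return 0.0
    return rect_area(inter) / rect_area(detection)
```

The ratio is stored as an attribute per OBI layer, so users can filter out, for instance, every detection fully covered by a lake or a building zone.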
Secondly, according to the LCR criteria, the area must be greater than 500 ha, unless the area is contiguous with another LCR area. We therefore calculate the surface area and minimum distance from an LCR polygon for each detection polygon. A distance of 0 m indicates that the polygons are in contact.
Thirdly, altitude influences the climatic zones favourable for the establishment of an LCR area. The altitude of the centroid is calculated for each detection polygon from the DEM of Switzerland.
Fourthly, the confidence score given by the deep learning algorithm is provided for each detection, which can be used as a filter by the user.
5. Results¶
The results of this proof of concept are the model performance over the GT evaluated through the metrics and some graphs, as well as the analysis of false positive detections among the inference over the whole canton and all years.
5.1 Model performance¶
We note that replicate models trained with the same input parameters lead to metric values differing by up to 15% for the selected final parameters. Deep learning algorithms display some random behaviour, but its influence on the final results should be negligible. The non-deterministic behaviour of detectron2 has been recognised111213, but no suitable solution has been provided yet.
Despite the high variability in the results induced by detectron2, we conducted some tests on the input files and parameters. We concluded that no input adaptation, such as adding tiles containing recurring FP objects or transforming the image colour space, significantly improved the performance of the model, apart from adding elements to the GT.
The metrics over the validation dataset for the best model are shown in Table 3 before and after post-processing.
Raw detections from OD
Score threshold = 0.50 | Precision | Recall | F1 score |
---|---|---|---|
Non-agricultural activity | 0.63 | 0.27 | 0.38 |
Land movement | 0.58 | 0.47 | 0.52 |
Global | 0.60 | 0.37 | 0.46 |
Post-processed detections
Score threshold = 0.05 | Precision | Recall | F1 score |
---|---|---|---|
Non-agricultural activity | 0.34 | 0.36 | 0.35 |
Land movement | 0.35 | 0.52 | 0.42 |
Global | 0.34 | 0.44 | 0.38 |
The evaluation of the post-processed detections was performed with a score threshold of 0.05, which is equal to the lowest possible score. It therefore has no effect and maximises the number of TP, according to the needs of the beneficiaries. Accordingly, the recall is higher than the precision, reaching 0.44. The global F1 score is 0.38 over the validation dataset, which is very low. The algorithm performs poorly.
Elements of the land movement class are better detected and less often misclassified than those of the non-agricultural activity class (Table 4).
Class | TP | FP | FN | misclassified |
---|---|---|---|---|
Non-agricultural activity | 22 | 36 | 27 | 12 |
Land movement | 31 | 46 | 23 | 6 |
Figure 8 shows the tagged detections for each year.

No year performed significantly better or worse than the others (Fig. 8), at least when considering the training, validation, and test datasets together. This result could be biased by the superior performance on the training dataset, but there are not enough detections per year to compare datasets separately.
5.2 Inference¶
The model selected in Section 5.1 was used to make inferences over the cantons of Ticino and Vaud for the SWISSIMAGE years covering the territories. This took about two to three days per canton.
As mentioned in Section 5.1, the beneficiaries preferred higher recall than precision. Consequently, no threshold was applied to the detection score (Table 3). The aim is to maximise the completeness of the detection of objects of interest, but this also implies the presence of a large number of FP detections. The detections need to be reviewed carefully before use.

We recognise that there are a significant number of FP detections in the provided results, some of which even have high detection scores (Fig. 9). The main sources of confusion for the model vary according to the class. For the non-agricultural activity class, confusion is mainly caused by features such as trees or bushes in the middle of a field, isolated houses with outdoor storage, storage buildings, or boats parked in harbours (the latter can easily be eliminated by filtering out polygons that touch bodies of water). For the land movement class, brownish fields in RGB images and lighter-coloured fields in greyscale images are the main source of error. In addition, rock outcrops, river sand beds and variations in the colour/texture of fields are wrongly identified. Open areas surrounded by forest cause issues for both classes. These examples are a source of confusion for the algorithm, but can sometimes also be a source of confusion for the human eye.
The beneficiaries are overall satisfied with the provided layer of human activity affecting soils for each year. Although the model metrics are average, the results provide additional information and have enabled them to identify areas that had not previously been registered, particularly for the oldest years. By retaining all the detections and cross-referencing them with other vector layers of interest, beneficiaries have access to an information tool that can be customised to suit their specific needs.
6. Discussion and perspectives¶
The results obtained with our deep learning approach are promising, but they are accompanied by a large number of FPs and average detection performance. Below we discuss several issues that may be at the root of these average results and propose potential solutions to improve them.
6.1 Detection review¶
Based on the model performance and the fact that no score filtering was applied, the results consist of a large number of FPs. It is therefore necessary to review them manually before using them. Given the number of detections in some years (thousands of detections), this is a tedious task. We are currently working on a detection reviewing tool that will greatly simplify this task.
6.2 Ground truth¶
The quality and quantity of the GT is an essential element while using deep learning models. Testing over various parameters showed that it is the main element influencing the results quality. However, some constraints prevent us to improve the GT and so the trained models.
The objects we are trying to detect are complex. They are made up of several elements and their contours are not always easy to delineate. Each class presents a certain heterogeneity that can blur the characteristics of the key elements (Fig. 2). This is particularly the case for the elements of the class "non-agricultural activity", which also has few examples in the greyscale images (Fig. 3). This may explain the poorer detection performance for this class compared to the class "land movement".
In addition, some GT elements are ambiguous and can be mixed between classes (Fig. 10). However, misclassification, although it reduces the metrics, is not critical in the context of this project, as the priority is to detect human activities regardless of their class.

The GT used to train the deep learning models is limited, with only 200 to 250 elements for each class, spread across different image years. GT vectorisation is a tedious task, but further expansion of the GT would benefit model performance. In addition, defining sub-classes with more homogeneous characteristics could enable better model training.
6.3 Images¶
In addition to the heterogeneity of the ground truth, a heterogeneous image dataset is used. The images have different characteristics depending on the year and the acquisition conditions (Section 3.1). Despite our efforts to homogenise the image dataset, i.e. by converting the RGB images to greyscale or by colourising the greyscale images and homogenising the colour histogram, the impact on model performance is negligible. In the future, we may consider using more sophisticated methods such as that of Nguyen et al. (2024)14, which proposes a deep learning approach to manage irregular time intervals between image acquisitions and to overcome differing image characteristics. This method was successfully applied to the same SWISSIMAGE dataset to monitor the forests of the Swiss Alps using image segmentation.
6.4 Model variability¶
Unfortunately, the variability of the model results is not negligible. It can be difficult to disentangle the influence of the model and the dataset parameters from the non-deterministic behaviour of the algorithm. This also poses problems for the reproducibility of our results. We will try to mitigate this effect in our future projects by following a suggestion from the developer community12.
To mitigate model variability, one solution may be to train several models and to infer results with each of them over the AoI. Intersecting the results can help to identify "strong" and "weak" detections, in addition to the confidence score, which is unfortunately not always reliable (Fig. 9). If a detection is present in several results, there is a good chance that it is a TP. On the other hand, if a detection appears in the results of only one model, there is a good chance that it is a FP. To help the beneficiaries to select detections, we plan to provide a layer with a recurrence index in the future.
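A possible recurrence index could be computed as the fraction of models detecting a given object. The function below is a hypothetical illustration using axis-aligned rectangles, not the planned implementation, which would compare actual detection polygons:

```python
def recurrence_index(detection, model_outputs):
    """Fraction of the models whose output contains a detection overlapping
    the given one ('strong' detections score close to 1, 'weak' ones close
    to 1/n). Rectangles are (xmin, ymin, xmax, ymax)."""
    def overlaps(a, b):
        return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]
    hits = sum(any(overlaps(detection, d) for d in dets)
               for dets in model_outputs)
    return hits / len(model_outputs)

# Detection found by two of three replicate models -> index 2/3
ri = recurrence_index((8, 8, 12, 12),
                      [[(0, 0, 10, 10)], [(5, 5, 15, 15)],
                       [(100, 100, 110, 110)]])
```

Such an index would complement the confidence score when reviewing detections, since it directly encodes the agreement between replicate models.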
6.5. Detection of elevation changes¶
The proposed method allows the detection of human activities in aerial photographs. Images are renewed at best every year, but often only every 3 years or more. Therefore, short- or mid-term activities may be missed and the inventory may be incomplete.
An alternative method is to detect elevation changes456 induced by mass movement, either excavation or filling. This can be achieved by subtracting multi-temporal DEM/DSMs for an AoI.
In Switzerland, high resolution DEM (resolution: 0.5 m to 2 m and vertical accuracy: ± 0.3 m to 0.5 m) is available with the product swissALTI3D. It is derived from airborne LiDAR acquisitions made since 2012 and updated every 6 years.
DSMs for previous years can be calculated by photogrammetry from historical aerial images6. The production of large-scale DSMs by photogrammetry from aerial images is complex because it is subject to distortion. But recent advancements have demonstrated that DSM with an accuracy of 0.3 m to 0.5 m ± 3.9 m (RMSE) could be achieved in Switzerland6.
This method has the advantage over imagery of capturing all elevation changes occurring over the year range separating the DEM/DSMs. However, it is limited by the vertical resolution, which makes it impossible to detect ground movement at shallow depths. The use of this method as a complement to the image-based method presented here would improve the inventory of areas degraded by human activities.
7. Conclusion¶
The Swiss Cantons have to provide an inventory (a register or a map) of potentially rehabilitable soils, e.g. soils degraded by human activities, to be converted into LCR areas, if necessary, to maintain quotas. To achieve this objective, the STDL has developed a framework based on a deep learning approach to automatically detect soils degraded by human activities, classified into two classes, namely "non-agricultural activity" and "land movement". The model obtained promising, albeit modest, results, achieving an F1 score of 0.53, with better performance for land movement features than for those of non-agricultural activity. Only increasing the number of features in the ground truth seemed to significantly improve the performance of the model.
The final product delivered to the cantons consists of vector layers of detections inferred in SWISSIMAGE from 1946 to the present for each canton. Overall, the beneficiaries are satisfied with the results, despite the average performance of the model. Indeed, the results are very useful for identifying new areas of interest in the images within tens of minutes to a few hours per year and per canton. We are aware that the results are accompanied by numerous FP detections that need to be carefully examined before use. This remains a tedious task and we are working on a review tool to speed up this stage.
For a land parcel to be converted to an LCR area, a number of criteria must be met. Additional information is provided on the interaction with other features of interest that may potentially compete with LCR. This information will assist beneficiaries in making their land use decisions. Further investigations should be carried out by the beneficiaries, such as identifying the characteristics of the soil, assessing the necessary rehabilitation measures and contacting the landowner.
The method developed uses the SWISSIMAGE product, which is updated every year for a part of Switzerland and covers the whole country. Therefore, the inventory can be updated with new image acquisition and applied to other cantons.
Code availability¶
The code is stored and available on the STDL's GitHub page:
- proj-sda: framework for detecting human activities affecting agricultural soils
- object-detector: object detector framework
Acknowledgements¶
This project was made possible thanks to a tight collaboration between the STDL team, the Canton of Ticino, and the Canton of Vaud. In particular, the STDL team acknowledges key contribution from Alex Sollero (Canton of Ticino), Marie Zoélie Künzler (Canton of Vaud), Michael Lanini (Canton of Ticino), Gioele Gentilini (Canton of Ticino), and Romane Claustre (Canton of Vaud). This project has been funded by "Stratégie suisse pour la géoinformation".
Addendum¶
June 2025
As the experts on this project favour exhaustiveness of the results over accuracy, the final number of detections for an entire canton is very high. It is possible to limit the total number by filtering on land use and the distance to the nearest LCR, in order to retain only areas to be remediated. This is what has been done for the Canton of Ticino. However, the ultimate purpose of the project was not to identify potential LCR, but rather human activities. In addition, it was clarified with the experts that they did not need the exact correspondence between an object in one year and a detection. It would be enough for them to know that, between 1950 and 2023, a given surface was subject to human activities.
In this addendum, the project paradigm is shifted from detecting a specific human activity in a given year to delineating human activity on a given surface over a time interval. We sought to translate this change and improve the results by relying on two points:
- The multiple visualisations of each surface through the years;
- The variability of the models due to the use of detectron2.
Initially, it was decided to set the score threshold in post-processing as low as possible, to keep as many detections as possible and maximise the recall. However, because every area will be visualised 20 to 40 times across the years, it might be better to have a result more balanced between precision and recall. Even if an object is missed in a certain year, it might be detected in another year, as large objects are present in several generations of images. In addition, it is important to avoid a low precision, as false positives accumulate through the years. Therefore, we changed the score threshold to 0.5 in post-processing, which is the value maximising the F1 score.
Then, as the models produced by the STDL OD with the same parameters are always slightly different, we sought to combine the best models in order to improve precision and recall. If an object is missed by one model, chances are that it is detected by others. On the other hand, an object detected by only one model is probably a false positive.
In addition, we sought to improve the confidence score by integrating the percentage of presence of a detection across the models, as experts had complained that it was not representative enough.
Finally, in order to reduce the workload for the experts during the control, we merged the detections that overlapped each other, even if they did not come from the same year. We are no longer interested in the detections, but in the objects they represent, the aim being to visualise each object only once.
The changes made to the workflow can be summarised as follows:
- Change of the score threshold from 0.05 to the value maximising the F1 score, 0.5 in the case of the chosen model;
- Combination of the detections of the best models;
- Determination of a new score, called the "merged score", based on the confidence score and the percentage of presence across models;
- Merging of the detections that overlap each other, even if they did not come from the same year.
The results were assessed by year over the tiles in the GT dataset. The GT dataset contains objects for multiple years, but the same object is never digitised more than two to three times across the years, and only the tiles for the digitised years are included in the GT dataset. Therefore, the results were also assessed over the same extent after the inference over the full dataset, i.e. over all the years on the entire canton, to assess the impact of the additional neighbours and of the fusion between overlapping detections of different years. For this second step, a modified version of the GT was used: each object is assigned a single label, corresponding to the largest known area. The full workflow is shown on Figure A1.
In addition, a more detailed analysis of the inference results over the whole canton and years was performed.

Figure A1: Workflow for the training over the GT dataset (left) and the inference over the full dataset (right). In green are the steps added in this addendum, in yellow are the output metrics.
A.1 Method¶
To evaluate the results closer to the experts' expectations, additional metrics were defined. Then, we describe how the post-processing was extended to handle five models instead of one and how the filtering was modified to limit absurd false positives.
A.1.1 Metrics¶
The classes "non-agricultural activity" and "land movement" are both "human activities" and were defined only to simplify the OD training. All the detections could therefore be assigned to a single class "human activity". Consequently, the post-processed detections are assessed for this single class.
The precision, recall and F1 score as defined in Section 4.2 are still used to evaluate the results, but some new metrics are introduced to better assess the geometry of the results:
- Geometric precision (\(P_{geom}\)): the ratio between the area of the intersection of detections and labels and the area of the detections;
\[P_{geom} =\frac{A_{detections \;\cap\; labels}}{A_{detections}}\]
- Geometric recall (\(R_{geom}\)): the ratio between the area of the intersection of detections and labels and the area of the labels;
\[R_{geom} =\frac{A_{detections \;\cap\; labels}}{A_{labels}}\]
- Relative error on area (\(REA\)): the ratio between the difference of the detection area and the label area, and the label area;
\[REA =\frac{A_{detections} - A_{labels}}{A_{labels}}\]
These metrics are only used for assessment with a single class, i.e. when all the detections are considered as "human activities".
In the following text, the metrics defined in Section 4.2 are referred to as the "numerical" precision and recall or simply precision and recall, while the metrics defined here are referred to as the "geometric" precision and recall.
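As an illustration of these definitions, the three geometric metrics can be computed directly from the intersection and total areas. This minimal sketch works on plain area values rather than on the actual vector layers used in the project:

```python
def geometric_metrics(a_intersection, a_detections, a_labels):
    """Compute the geometric precision, geometric recall and relative
    error on area (REA) from the area of the intersection between
    detections and labels and the total area of each layer (all in m2)."""
    p_geom = a_intersection / a_detections
    r_geom = a_intersection / a_labels
    rea = (a_detections - a_labels) / a_labels
    return p_geom, r_geom, rea

# Example: 40 m2 of overlap between 50 m2 of detections and 80 m2 of labels
p, r, rea = geometric_metrics(40, 50, 80)
print(p, r, rea)  # 0.8 0.5 -0.375
```

A negative REA, as in this example, indicates that the detections underestimate the labelled area.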
A.1.2 Inference over the GT dataset¶
Initially, the score threshold in post-processing was set as low as possible, to keep as many detections as possible and maximise the recall. However, since each area is visualised 20 to 40 times across the years, a better balance between precision and recall is preferable: an object missed in a given year may still be detected in another one, as large objects are present in several generations of images. Furthermore, a low precision must be avoided, as false positives accumulate over the years. The score threshold in post-processing was therefore raised to 0.5, the value maximising the F1 score.
When combining the results of several models, a larger number of models gives a finer measurement of the presence or absence of a detection, as the percentage of presence is expressed as a multiple of one divided by the total number of models. However, more models also means a longer processing time. We chose to compute the results with five models, so that the possible values for the percentage of presence are 20%, 40%, 60%, 80% and 100%.
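For illustration, the percentage of presence can be computed as below. The exact combination of the confidence score and the percentage of presence into the merged score is not specified here, so the simple average used in this sketch is only an assumption:

```python
def percentage_of_presence(n_models_detecting, n_models=5):
    """Fraction of the models in which a given object was detected,
    expressed as a multiple of 1 / n_models."""
    return n_models_detecting / n_models

def merged_score(confidence_scores, n_models=5):
    """ASSUMPTION: combine the mean confidence score of the matched
    detections with their percentage of presence by simple averaging.
    The report does not give the actual formula."""
    presence = percentage_of_presence(len(confidence_scores), n_models)
    mean_conf = sum(confidence_scores) / len(confidence_scores)
    return (mean_conf + presence) / 2

# An object detected by 3 of the 5 models, with scores 0.6, 0.7 and 0.8
print(percentage_of_presence(3))       # 0.6
print(merged_score([0.6, 0.7, 0.8]))   # (0.7 + 0.6) / 2 = 0.65
```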
The five models with the highest F1 scores at the OD output were selected. All share identical parameters, except for the image batch size, which is 4 for the fourth model and 2 for the others. The metrics for the selected models are shown in Table A1. The results of each model were visualised to check that some variability existed between models.
Model | Score threshold | Precision (global) | Recall (global) | F1 score (global) | Precision (non-agri. activities) | Recall (non-agri. activities) | F1 score (non-agri. activities) | Precision (land movement) | Recall (land movement) | F1 score (land movement)
---|---|---|---|---|---|---|---|---|---|---
1 | 0.5 | 0.60 | 0.37 | 0.46 | 0.63 | 0.27 | 0.38 | 0.58 | 0.46 | 0.52
2 | 0.4 | 0.43 | 0.36 | 0.39 | 0.44 | 0.27 | 0.33 | 0.42 | 0.44 | 0.43
3 | 0.2 | 0.40 | 0.45 | 0.42 | 0.44 | 0.37 | 0.40 | 0.38 | 0.53 | 0.44
4 | 0.25 | 0.45 | 0.37 | 0.41 | 0.44 | 0.32 | 0.37 | 0.46 | 0.41 | 0.43
5 | 0.35 | 0.44 | 0.40 | 0.42 | 0.37 | 0.30 | 0.33 | 0.49 | 0.50 | 0.50
Table A1: Metrics at the output of the STDL OD for the five selected models.
Then the results of each model were merged based on the workflow illustrated on Figure A2. As the domain experts complained that the confidence score was not coherent, a new score, called the merged score, is calculated for each detection based on its confidence score and its percentage of presence.

The workflow can be decomposed as follows:
- Removing redundant detections:
- overlapping detections of the same year with an IoU > 0.5 are considered to be the same and only the best one is kept;
- detections with an insufficient percentage of presence or merged score are removed;
- the result is saved as "grouped dets".
- Merging detections of the same object:
- overlapping detections of the same year with an IoU > 0.01 are considered to represent the same object and are dissolved;
- the result is saved as "merged dets".
- Merging detections across years:
- overlapping detections of different years with an IoU > 0.1 are considered to represent the same object and are dissolved;
- the result is saved as "dets merged across years".
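The first deduplication step above can be sketched as follows, with axis-aligned boxes standing in for the detection polygons; the actual workflow operates on the real geometries:

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (xmin, ymin, xmax, ymax)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def drop_redundant(dets, iou_thr=0.5):
    """Among overlapping detections of the same year (IoU > iou_thr),
    keep only the one with the highest score."""
    result = []
    for d in sorted(dets, key=lambda d: d["score"], reverse=True):
        if all(d["year"] != k["year"] or box_iou(d["box"], k["box"]) <= iou_thr
               for k in result):
            result.append(d)
    return result

dets = [
    {"box": (0, 0, 10, 10), "year": 2020, "score": 0.9},
    {"box": (1, 1, 10, 10), "year": 2020, "score": 0.6},  # redundant, dropped
    {"box": (1, 1, 10, 10), "year": 2021, "score": 0.6},  # other year, kept
]
print(len(drop_redundant(dets)))  # 2
```

The dissolve steps for the same year (IoU > 0.01) and across years (IoU > 0.1) would follow the same matching logic, but merge the geometries instead of discarding one of them.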
We evaluated the results based on the precision, recall and F1 score, and on the visualisation of the reliability diagram, which plots the precision of the detections against their score. Although a detected class is preserved for each object, it is not considered for the evaluation, as both classes are treated as human activities. The evaluation is performed in a binary manner, i.e. detected or not detected. Model 1, used previously for inference, is taken as the baseline.
The GT dataset encompasses objects from multiple years, yet the same object is never digitised on more than two to three occasions across the years. Furthermore, only tiles from the digitised years are included in the GT dataset. Consequently, merging across years over the GT dataset has no significant impact and was not performed for this dataset.
A.1.3 Inference over the full dataset¶
The inference was performed at cantonal scale. We first ran one model on one year and then the five models on all years. It should be noted that in Ticino, only tiles with a centroid below 1000 m altitude are taken into account, as the cantonal experts are only interested in potential LCR, which are surfaces lower than 900 m altitude with a slope of less than 18%. The resulting area of interest for each canton is shown in Figure A3.

Figure A3: Delimitation of all tiles considered during the inference in the canton of Ticino (left) and Vaud (right).
Based on the visualisation of the results at the cantonal scale, filters on the size and the merged score were included to filter out the large amount of FP among very large detections (> 100,000 m2) when merging across models (Fig. A2). In addition, the filtering script was improved with the following steps:
- remove detections with a shape close to a tile, namely with IoU > 0.5 between the detection and the tile, for detections with an area between 87,500 m2 and 200,000 m2 to avoid isolated false positives covering one entire tile;
- remove detections outside of the cantonal boundaries;
- remove artifacts due to spatial difference with objects of interest (OBI), on which no LCR can be created, based on the ratio between the original and final area and on the geometry compactness;
- do not explode multipolygons after spatial difference with the OBI.
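A possible implementation of the tile-shaped filter, together with a compactness measure: the Polsby-Popper ratio used here is an assumption, as the exact compactness definition is not given in the text:

```python
import math

def polsby_popper(area, perimeter):
    """Compactness measure in (0, 1]; equals 1 for a perfect circle.
    ASSUMPTION: the report only mentions 'geometry compactness'."""
    return 4 * math.pi * area / perimeter ** 2

def is_tile_like(det_area, iou_with_tile, min_area=87_500, max_area=200_000):
    """Flag isolated false positives covering one entire tile: detections
    with an area between 87,500 m2 and 200,000 m2 and IoU > 0.5 with the tile."""
    return min_area <= det_area <= max_area and iou_with_tile > 0.5

# A 100 m x 100 m square has a compactness of pi/4, about 0.785
print(round(polsby_popper(10_000, 400), 3))  # 0.785
print(is_tile_like(100_000, 0.8))            # True
print(is_tile_like(100_000, 0.3))            # False
```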
The time per operation was measured and plotted for one year with one model and for the full workflow. When using several models, the times for the AoI and data preparation, tileset generation, inference and merging of detections across tiles were grouped as "per-year/model operations". The time was measured on a machine with a 16-core AMD EPYC-Rome processor and 32 GB of RAM, as well as an NVIDIA Tesla T4 GPU with 16 GB of VRAM. The number and area of detections were plotted for each year and each canton. When a detection is present for several years, the year with the highest merged score is considered as the year of detection (Fig. A2).
The results after merging across years are very different depending on whether inference is performed on the GT dataset or on the dataset comprising all years. Indeed, no tile is visualised more than two or three times in the GT dataset, but the tiles are visualised for all the available years during the full inference. Therefore, the results of the full inference were assessed by clipping them to the extent of the GT tiles (Fig. A1).
A.2 Results¶
In this section, we first present the impact on the metrics of the higher score threshold and of the merges across models and across years over the GT dataset. The impact of these operations on the quality of the confidence score is also discussed. Then, we analyse the results over the full dataset, i.e. over all years for both entire cantons. The processing times and the corresponding metrics are provided, as well as some informative graphs about the distribution of the detections over the years and the elevation in each canton.
A.2.1 Impact of the score threshold¶
Score threshold | Precision | Recall | F1 score |
---|---|---|---|
0.05 | 0.34 | 0.44 | 0.38 |
0.5 | 0.57 | 0.35 | 0.43 |
Table A2: Global metrics after merging detections on adjacent tiles for the model 1 depending on the score threshold.
The global metrics after post-processing (Table A2) are very similar to those before (Table 3). With the score threshold set to 0.5, the gain in precision (+0.23 points) is significantly higher than the loss in recall (-0.09 points) compared to the 0.05 threshold. This is coherent with the increase of the F1 score by 0.05 points and confirms the decision to use the 0.5 threshold in post-processing.
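The choice of the threshold can be reproduced by sweeping candidate values over scored detections and keeping the one maximising the F1 score; the detections and label count below are invented for illustration:

```python
def f1_sweep(scored_tp_flags, n_labels, thresholds):
    """For each threshold, keep the detections with score >= threshold and
    compute precision, recall and F1 against the number of GT labels.
    scored_tp_flags: list of (score, is_true_positive) tuples.
    Returns the (threshold, F1) pair with the highest F1."""
    best = None
    for thr in thresholds:
        kept = [tp for score, tp in scored_tp_flags if score >= thr]
        tp = sum(kept)
        fp = len(kept) - tp
        precision = tp / (tp + fp) if kept else 0.0
        recall = tp / n_labels
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        if best is None or f1 > best[1]:
            best = (thr, f1)
    return best

# Toy example: 6 detections, 4 GT labels
dets = [(0.9, True), (0.8, True), (0.6, False),
        (0.5, True), (0.2, False), (0.1, False)]
print(f1_sweep(dets, n_labels=4, thresholds=[0.05, 0.5]))  # (0.5, 0.75)
```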
A.2.2 Merging detections of 5 models¶
Merging the detections across models was performed first for the model 1 only, to compute the baseline. The global metrics are shown in Table A3. Then, the detections were merged across the five selected models.
Results | Precision (global) | Recall (global) | F1 score (global) | Precision (single class) | Recall (single class) | F1 score (single class) | Geometric precision | Geometric recall | Relative error on area
---|---|---|---|---|---|---|---|---|---
Merged across tiles for the model 1 | 0.57 | 0.35 | 0.43 | 0.69 | 0.42 | 0.52 | 0.78 | 0.51 | -0.34
Merged across models (model 1 only) | 0.57 | 0.34 | 0.43 | 0.69 | 0.42 | 0.52 | 0.78 | 0.51 | -0.34
Merged across models (5 models) | 0.44 | 0.42 | 0.42 | 0.61 | 0.58 | 0.59 | 0.77 | 0.64 | -0.17
Table A3: Metrics for the model 1, used as baseline, before and after merging overlapping detections across models, and metrics when the results are merged across the five models.
We note that "merging across models" using only one model, which basically amounts to merging overlapping detections, leads to no performance improvement. Either no detections of the model 1 overlapped and were merged together, or their fusion had no impact on the metrics.
When merging the detections of the five models together, the recall increases by 0.07 points, but the precision drops by 0.13 points, resulting in an F1 score lower by 0.01. The results are then equivalent in overall quality for the two defined classes.
When considering all the detections as "human activities", the recall increase is higher (+0.16 points) and the precision drop lower (-0.08 points), resulting in an improvement of the overall quality (+0.07 on the F1 score). Merging across models added to the class confusion, but the overall placement of the detections improved. This is confirmed by the stagnation of the geometric precision and the improvement of the geometric recall. The underestimation of the total area is lower, the REA being divided by two.
The experts reported that the confidence scores were too optimistic compared with the actual detection precision. To verify this claim, we plotted the reliability diagram for the detections provided to them (Fig. A4), for which the lower threshold on the confidence score was 0.05. With the higher threshold of 0.5 used in this addendum, the diagram remains the same for the elements above 0.5; the rest is simply removed.

Figure A4: Number of detections per bin of the confidence score as a barplot and precision of detections in each bin as a line plot. The diagram shows the post-processed results produced by the model 1 over the validation dataset with a unique class "anthropic activities". It corresponds to the results of Section 5.1
We note that the actual line approximates the reference line quite well, except for a large drop in precision between 0.6 and 0.8. Otherwise, it is 0.05 to 0.1 points too low between 0.2 and 0.4 and 0.05 points too high elsewhere. If the experts focus on controlling the detections with the highest confidence scores, they might indeed be under the impression that the precision is lower than what the confidence score suggests, because of the precision drop between 0.6 and 0.8.

Figure A5: Number of detections per bin of the confidence score as a barplot and precision of detections in each bin as a line plot. The diagram shows the merged results of the five models for the validation dataset with a unique class "anthropic activities".
After merging the results across the models, the precision in almost all bins of the confidence score is lower than expected (Fig. A5). The precision increases with the confidence score, as expected, between 0.3 and 0.6 and again above 0.6. However, this trend starts lower than expected between 0.3 and 0.4, and there is a surprising drop in the precision just after 0.6.
Detections with a confidence score lower than 0.3 were all filtered out, even though the optimal threshold for two of the five chosen models was lower (Table A1 & Fig. A5). When selecting the detections across models, detections with a score lower than 0.3 were never matched together and were thus never retained in the final results.

Figure A6: Number of detections per bin of the merged score as a barplot and precision of detections in each bin as a line plot. The diagram shows the merged results of the five models for the validation dataset with a unique class "anthropic activities".
Based on Figures A5 and A6, the reference line is better approximated by the merged score than by the confidence score, with the precision of four bins falling almost perfectly on the line. The precision is even higher than the reference, especially above a merged score of 0.7.
Even without confidence scores lower than 0.3, some merged scores are as low as 0.1, the lowest value allowed by the applied filter. In addition, when considering the confidence score, most detections had a value higher than 0.6 after being merged across models (Fig. A5). The merged score spreads the detections more evenly over the whole score range (Fig. A6). The merged score is thus a better discriminator of the detection quality.
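A reliability diagram like those in Figures A4 to A6 can be built by binning the detections by score and computing the precision in each bin. This sketch only computes the bin values, which would then be plotted, for instance with matplotlib:

```python
def reliability_bins(dets, n_bins=10):
    """dets: list of (score, is_true_positive) tuples. Returns, per bin of
    width 1/n_bins, the number of detections and the precision in the bin
    (None for empty bins)."""
    counts = [0] * n_bins
    tps = [0] * n_bins
    for score, tp in dets:
        b = min(int(score * n_bins), n_bins - 1)  # score 1.0 goes to last bin
        counts[b] += 1
        tps[b] += tp
    precisions = [tps[i] / counts[i] if counts[i] else None
                  for i in range(n_bins)]
    return counts, precisions

counts, precisions = reliability_bins(
    [(0.95, True), (0.92, True), (0.55, False), (0.58, True)])
print(counts[9], precisions[9])  # 2 1.0
print(counts[5], precisions[5])  # 2 0.5
```

A well-calibrated score would give a precision close to the bin centre in every bin, i.e. a line close to the diagonal.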
A.2.3 Inference on the whole cantons¶
The processing was run in the cantons of Vaud and Ticino. The processing time is illustrated in Figures A7 and A8.

Figure A7: Time in minutes needed to run each operation with the model 1 on one year. Average on two years among the ones covering the entire canton between 2015 and 2023.
When comparing the time needed for each operation for the model 1 on one year with images covering the whole canton (Fig. A7), the inference with the deep learning model is the most time-consuming operation, with 22 minutes for Ticino and 44 minutes for Vaud, followed by the filtering of the detections based on the OBI, with 12 minutes for Ticino and 37 minutes for Vaud. The processing time is significantly lower for the canton of Ticino than for the canton of Vaud.
The time needed for all the other operations, except for the tileset generation, is negligible when working on one year with one model: less than 1 minute per operation for Ticino and less than 1.5 minutes for Vaud.

Figure A8: Time in minutes needed to run the inference with 5 models on all years. Operations specific to a model and years are grouped as "per-year/model operations".
Most of the processing time is dedicated to the inference with the STDL OD (Fig. A8), i.e. 28 h for the canton of Ticino and 88 h for the canton of Vaud. This is mostly due to the tileset generation and the inference itself (Fig. A7). The final filtering of the detections with the OBI remains significant, with 2 h for the canton of Ticino and 7 h for the canton of Vaud. The other operations are negligible and are performed in less than 20 minutes, even for the canton of Vaud.
A.2.3.1 Metrics over all years¶
As shown on Figure A1, the results as produced by an inference over all tiles and years were assessed by clipping them to the extent of the GT tiles. The results were then assessed by year with the original GT and after the merge across years with an adapted GT. The metrics are shown in Table A4.
Single class "human activities" | Precision | Recall | F1 score | Geometric precision | Geometric recall | Relative error on area |
---|---|---|---|---|---|---|
By year, merged across models - GT dataset | 0.61 | 0.58 | 0.59 | 0.77 | 0.64 | -0.17 |
By year, merged across models - full dataset clipped to GT tiles | 0.58 | 0.58 | 0.58 | 0.69 | 0.79 | 0.14 |
Merged across years - full dataset clipped to GT tiles | 0.17 | 0.71 | 0.27 | 0.30 | 0.92 | 2.02 |
Table A4: Metrics for the results over the GT years, as presented in Table A3, and over all years before and after the merge across years.
The additional years considered have no effect on the merge across models. The variations in the metrics are entirely due to the fact that more neighbouring tiles were considered when merging detections on adjacent tiles, changing the disposition of the detections on the GT tiles.
The numerical metrics were only slightly affected, while the geometric metrics changed more, with a decrease of 0.08 points in geometric precision and an increase of 0.15 points in geometric recall (Table A4). The relative error on the area keeps the same magnitude, but changes from an underestimation to an overestimation.
Merging across years when considering all years considerably decreases the numeric and geometric precision, while increasing the numeric and geometric recall. We note that the numeric precision and recall are lower than the geometric precision and recall respectively. The F1 score decreases, while the relative error on the area increases. Therefore, the results can be considered as worse when considering all years than when assessing the model by year.
A.2.3.2 Ticino¶
In total, there were 19,351 detections in Ticino before the modification of the method and 4,061 after. The latter cover in total 22 km2, or 0.8% of the territory.

Figure A9: Comparison of the detections in Mairano (Ticino) provided to the beneficiaries before (left) and after (right) the modification of the method. The labels for the new results indicate the number and the years of detection, as well as the merged score.
The new method effectively reduces the number of times an expert would see the same area, as there are fewer overlapping detections (Fig. A9). The border of each object is easier to understand, as it is delimited only once for all years. However, several objects can sometimes be found under the same detection. As the results are easier to interpret, it might be simpler to notice where an area was excluded because of the land use, as in the centre of the area shown on Figure A9.

Figure A10: Number of objects that were detected for one or several years over the canton of Ticino.
The majority of objects, i.e. 2,571 or 63%, were detected in only one year (Fig. A10). 634 objects were detected in exactly two years and 60 were detected more than 10 times. No object was detected more than 15 times over the 33 years for which images were available.

Figure A11: Lifespan of the detected objects based on their first and last detection and depending on the total number of occurrences for the canton of Ticino. As several objects can have the same lifespan and number of occurrences, the points are coloured to indicate the total number of objects concerned.
The minimum lifespan increases in proportion to the number of occurrences (Fig. A11), and is typically three times greater than the latter. Conversely, the maximum lifespan remains relatively constant at approximately 70 years, which is the upper limit with SWISSIMAGE Journey. Objects with two to five occurrences generally have a lifespan of 3 to 12 times their number of occurrences. Values outside this range may be considered outliers. For objects with more than five occurrences, it is difficult to define a typical range for their lifespans, due to the limited number of points and the upper limit of 70 years.
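The occurrence and lifespan statistics of Figures A10 and A11 can be derived by grouping the detections by object; a minimal sketch with made-up object identifiers and years:

```python
from collections import defaultdict

def object_stats(detections):
    """detections: list of (object_id, year) pairs, one per detection.
    Returns, per object, its number of occurrences, first and last
    detection years and lifespan (last - first)."""
    years = defaultdict(list)
    for obj, year in detections:
        years[obj].append(year)
    return {
        obj: {
            "occurrences": len(ys),
            "first": min(ys),
            "last": max(ys),
            "lifespan": max(ys) - min(ys),
        }
        for obj, ys in years.items()
    }

# Illustrative object identifiers and detection years
stats = object_stats([("a", 1957), ("a", 1980), ("a", 2005),
                      ("b", 1998), ("c", 1946), ("c", 2016)])
print(stats["a"])  # {'occurrences': 3, 'first': 1957, 'last': 2005, 'lifespan': 48}
```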
For the area considered in Ticino, image coverage is total for the years 2006 to 2021, with the exception of 2019 when coverage is residual at the border with the canton of Graubünden. It is almost complete for 1977, 1983, 1989 and 1995. The number of detections per year is shown in Figure A12.

Figure A12: Number of detections for each year between 1950 and 2022 over the canton of Ticino.
The number of detections is coherent with the image coverage of each year, with more detections in years with full image coverage (Fig. A12). Among those years, the ones with black and white images, i.e. before 1999, contain more detections than the ones with colour images.
The year 2004 has a remarkably large number of detections. Its images cover the area south of Osogna, so most of the area under 900 m altitude is covered, except for the Leventina Valley.

Figure A13: Median area of detections depending on the detection year in the canton of Ticino.
The median area of detections in each year remains below 2,600 m2, with an outlier in 1961 (Fig. A13). Between 1960 and 1980, the years with the fewest detections also had a lower median area (Figs. A12 & A13). The median detection area appears to stabilise at around 1,750 m2 from 1999 onwards.

Figure A14: Surface covered by detections for each year between 1950 and 2022 over the canton of Ticino.
The surface covered by detections (Fig. A14) is coherent with the previous figures. The impact of year 1961 remains limited even with its abnormally large median area, because there are only 25 detections that year.

Figure A15: Median merged score of detections depending on the detection year in the canton of Ticino.
The median merged score varies between 0.23 and 0.60, with a drop after 1993 and a stabilisation at around 0.37 (Fig. A15). The years with a merged score below 0.30 are also those with fewer than 50 detections.

Figure A16: Median confidence score of detections depending on the detection year in the canton of Ticino.
The median confidence score is around 0.55 before 1995 and 0.5 after (Fig. A16). It follows the same trend as the median merged score (Fig. A15).

Figure A17: Altitude of the detection centroid rounded to 25 m in the canton of Ticino.
The largest number of detections is found around 200 m altitude, in the plain of Magadino and around the lake of Locarno, with approximately 250 detections and a share of 6% (Fig. A17). For the rest of the detections, we observe a regular distribution with a peak between 250 and 450 m altitude. The number of detections remains stable between 100 and 130 per cell between 525 and 875 m altitude.
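The altitude histogram of Figure A17 relies on rounding each detection centroid altitude to the nearest 25 m cell; a minimal sketch:

```python
from collections import Counter

def altitude_cells(altitudes, cell=25):
    """Round each centroid altitude (in metres) to the nearest multiple
    of `cell` and count the detections per altitude cell."""
    return Counter(round(a / cell) * cell for a in altitudes)

# Illustrative centroid altitudes in metres
cells = altitude_cells([196.4, 203.0, 210.0, 407.9])
print(cells[200], cells[400])  # 3 1
```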
The experts' visual inspection of the detections showed that the years 1971, 1989 and 1999 in particular contain a large number of false positives. However, these years do not present any specific behaviour in Figures A12 to A17.
A.2.3.3 Vaud¶
In total, there were 94,148 detections in the canton of Vaud before the modification of the method and 33,977 after. They cover in total 313 km2, or 9.7% of the territory.

Figure A18: Comparison of the detections in the north of Bière (Vaud) provided to the beneficiaries before (left) and after (right) the modification of the method. The labels for detections larger than 1,000 m2 for the new results indicate the number and the years of detection, as well as the merged score.
As in the canton of Ticino, the new method effectively reduces the number of times an expert would see the same area, as there are fewer overlapping detections (Fig. A18). The border of each object is easier to understand, as it is delimited only once for all years. However, some parts of the quarry are missing, due to the increase in the score threshold and to the modification of the detections through the merge across models.

Figure A19: Number of objects that were detected for one or several years over the canton of Vaud.
The majority of objects, 23,766 or 69%, were detected in only one year (Fig. A19). 5,100 objects were detected in exactly two years. Five objects were detected more than 20 times. No object was detected more than 22 times over the 42 years for which images were available.

Figure A20: Lifespan of the detected objects based on their first and last detection and depending on the total number of occurrences for the canton of Vaud. As several objects can have the same lifespan and number of occurrences, the points are coloured to indicate the total number of objects concerned.
The minimum lifespan increases in proportion to the number of occurrences (Fig. A20). Conversely, the maximum lifespan remains relatively constant at 71 years, which is the upper limit with SWISSIMAGE Journey. Objects with two to six occurrences have a minimum lifespan approximately equal to their number of occurrences and mostly under 12 times that number. Values outside this range may be considered outliers. For objects with more occurrences, it is difficult to define a typical range for their lifespans, due to the limited number of points and the upper limit of 71 years.
For the canton of Vaud, image coverage is total for the years 1998, 2004, 2017, 2020 and 2023. It is almost complete for the year 1974. The number of detections per year is shown in Figure A21.

Figure A21: Number of detections for each year between 1952 and 2023 over the canton of Vaud.
Even though the image coverage is partial in 1980, that year has by far the highest number of detections (Fig. A21). It contains 6,845 items, while the second highest year is 1998, with around 3,175 detections and complete coverage of the canton. Around 20% of all detections in the canton of Vaud are attributed to 1980.
Most years have partial coverage and fewer than 1,250 detections. The years with complete coverage have at least 1,250 detections.

Figure A22: Median area of detections depending on the detection year in the canton of Vaud.
The median surface area of detections increased linearly between 1960 and 1993, reaching 4,500 m2 (Fig. A22). Detections before 1967 and after 1997 have a median area of less than 3,000 m2. 1993 has the highest median area at almost 5,000 m2, followed by 1980 at around 4,550 m2. These values are particularly high, but they appear to be in line with the trend from 1979 to 1993, making it difficult to flag any anomaly.

Figure A23: Surface covered by detections for each year between 1950 and 2022 over the canton of Vaud.
The area covered per year is indeed higher for the years in which the entire canton was covered (Fig. A23), except for 1980. As could be expected from the previous observations concerning 1980, the combination of the high number of detections produced that year and their high median area makes the total area covered disproportionately large compared with other years. Almost 30% of the total area covered in the canton of Vaud was covered in 1980.
The year 1993 had a particularly high median area (Fig. A22). However, it did not stand out in terms of total area covered.

Figure A24: Median merged score of detections depending on the detection year in the canton of Vaud.
The merged score is constant at around 0.375 (Fig. A24). An outlier can be seen in 1956, a year in which only the Rhone plain was flown over and for which only 15 detections were produced.

Figure A25: Median confidence score of detections depending on the detection year in the canton of Vaud.
We note that the median confidence score is higher than the median merged score by roughly 0.28 (Figs. A24 & A25). Like the merged score, it is more or less constant with some clear outliers. All the values are grouped around 0.68, except for 1967, where it is around 0.78, and 1952 and 1980, where it is at 0.60 and 0.56 respectively. Among the outliers, 1980 is the only year with a large number of detections.

Figure A26: Altitude of the detection centroid rounded to 25 m in the canton of Vaud.
The number of detections per altitude follows a long-tailed bimodal distribution, with a first peak at 425 m and a second at 600 m altitude (Fig. A26). The decrease in the number of detections in the tail of the distribution is not steady: it stabilises between 250 and 300 detections per cell between 1,000 and 1,350 m altitude. The highest detection is at an altitude of 3,075 m, while the highest peak in the canton of Vaud is at 3,210 m.
A.3 Discussion¶
A.3.1 Merging results of 5 models¶
When keeping the two classes, land movement and non-agricultural activities, there is no benefit in merging the results of the five chosen models, as the F1 score is not improved (Table A3). However, when considering all the detections as "human activities", all the metrics except the numerical precision are improved or unchanged. Therefore, the operation has a positive impact on the results.
In addition, merging the results of the five chosen models allowed us to create the merged score, based on the confidence score and the presence across models. Where the initial confidence score is lower than the precision expected in each confidence bin (Fig. A4), the merged score is higher (Fig. A6). Therefore, even if it is not strictly better calibrated, it is more appreciated by the experts for the control of the detections. Besides, the initial confidence score should not have been used after the merge across models, since it was negatively affected by the merge between the five models (Fig. A5).
It is challenging to determine, based on the available metrics, whether the improvement justifies the resources required to run the inference five times. However, the experts expressed satisfaction with the decrease in the sheer number of detections and with the new merged score. Furthermore, they reported that the results were more readily usable once the overlapping detections were merged.
A.3.2 Metrics over all years¶
The fact that only the geometric metrics, and not the numeric ones, changed when considering neighbouring tiles means that the covered area increased without a significant change in the number of detections (Table A4). The additional merge with detections on adjacent tiles allowed more parts of the same objects to be kept.
Although the loss in quality after merging across years is regrettable, it was expected, as the errors of each year accumulate in the final result. Indeed, false positives detected in any given year could not be filtered out of the final result. However, the repetition of the detections across years had a very positive impact on the recall. A final geometric recall of 92% is excellent, even if a geometric precision of 30% is not.
Besides, the geometric precision and recall are higher than their numeric counterparts. This means that the false positive detections and false negative labels are generally smaller than the true positives.
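The distinction between numeric and geometric metrics can be sketched in a few lines. The tuple representation is a simplification for illustration: in reality, detections are polygons whose intersection areas with the labels are used.

```python
def precision_pair(detections):
    """detections: list of (area, is_true_positive) tuples.
    Numeric precision counts objects; geometric precision weights
    them by area, so many small false positives hurt the numeric
    metric more than the geometric one."""
    tp_areas = [a for a, ok in detections if ok]
    fp_areas = [a for a, ok in detections if not ok]
    numeric = len(tp_areas) / len(detections)
    geometric = sum(tp_areas) / (sum(tp_areas) + sum(fp_areas))
    return numeric, geometric

# Two large true positives and two small false positives:
# half the objects are wrong, but almost all the surface is right.
```

This reproduces the observed pattern: with small false positives, the geometric precision stays well above the numeric one.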
A.3.3 Analysis of the inference statistics¶
The total number of detections decreased significantly with the new method, thanks to the fusion of overlapping detections within the same year and across years. Moreover, the decrease is larger in the canton of Ticino than in the canton of Vaud. In the absence of an additional filter, this larger reduction can be attributed to multipolygons now being kept as one detection instead of being exploded into several smaller objects. The impact is more significant in the canton of Ticino, where a greater number of clip operations are performed with OBI to ensure that only potential LCR are retained, frequently splitting parts of an object apart.
In both cantons, no object is detected in more than half of the years in which the canton was flown over (Figs. A10 & A19). It was not until the 2010s that swisstopo began to systematically fly over whole cantons, so it was anticipated that no object would be detected in all considered years. Still, the current limit means that human activities are detected in at most half the years under consideration: in some cases, they are not located within the overflown area, and in others, they are overlooked by the algorithm. It would be interesting to test the impact of these two factors on the final result.
In the canton of Ticino, the detection analysis did not indicate any suspicious behaviour at a particular year or elevation. The only remarkable point is the high number of detections for the year 2004 (Fig. A12), considering that the Leventina Valley north of Osogna was not covered that year. Visualisation of the results did not reveal any exceptional trend in the detections. As most human activities take place in the south of the canton, which is more urbanised, a high number of detections is not abnormal in itself. However, it seems strange that the detections are more numerous than for years where the canton is entirely covered. Either this period was particularly busy, or some bias has been missed.
The analysis of the detection lifespan (Fig. A11) revealed that even with a low number of occurrences, some objects have a long estimated lifespan based on their first and last occurrences. The exact return rate of swisstopo's flights varied through time, but a reliable estimate suggests that an object was captured approximately every six years prior to 2001 and every three years afterwards. Consequently, because an object can occasionally be missed during the detection process, it is improbable that it possesses a lifespan exceeding 12 times its number of occurrences. For example, an object detected in 1952 and 1964 could have been missed in 1958: its lifespan is 12 years and its number of occurrences two. This observation is consistent with the finding that objects with fewer than five occurrences generally have a lifespan between 3 and 12 times their number of occurrences (Fig. A11).
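The plausibility rule described above can be expressed as a small check. The function name and the threshold parameter are illustrative assumptions, not the project's code.

```python
def plausible_lifespan(years_detected, max_ratio=12):
    """Return True if the estimated lifespan (years between the first
    and last occurrence) does not exceed max_ratio times the number of
    occurrences, following the flight-return-rate reasoning above."""
    lifespan = max(years_detected) - min(years_detected)
    return lifespan <= max_ratio * len(years_detected)

# Detected in 1952 and 1964, possibly missed in 1958: plausible.
# Two occurrences 50 years apart: likely distinct objects or a mistake.
```

Such a check could be used to flag the aberrant values visible in Figs. A11 and A20 for manual review.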
The experts' visualisation of the detections revealed that the years 1971, 1989 and 1999 show a particularly high number of false positives. In 1971 and 1989, the images were often blurred. The year 1999 corresponds to the switch to colour with a film camera that was used until 2005 (Table 1) and presents a particular colorimetry that is also found for the year 2001 in Ticino. Although the images for 2004 were produced with the same camera, their colorimetry is already closer to that of the digital images produced later. It is possible that the examples digitised for the GT are not sufficient to process the years 1999 and 2001 in Ticino and that more data is needed.
In the canton of Vaud, there is a clear overproduction of detections in 1980, which are far too numerous compared to the other years (Fig. A21). Visualising the images, we note that they are clearly lighter between 1979 and 1983 than in the other years. In Ticino, 1983 is the year with the highest number of detections and the largest covered area, but it does not stand out as much as the year 1980 in the canton of Vaud. Possibly, the impact of the colorimetry was mitigated there by the exclusion of OBI.
The comments on the detected lifespan of objects in Ticino are also valid for the canton of Vaud. The minimum lifespan at each number of occurrences is coherent, but there are many aberrantly high values (Fig. A20). A lifespan of more than 10 to 12 times the number of occurrences could be considered indicative of some mistake in the detections or of distinct objects being merged.
A.3.4 Further improvements¶
Several suggestions to improve the results were made in the initial discussion at Sections 6.2 to 6.5. They remain valid, except for the exploitation of the model variability which was explored in this addendum.
An additional improvement would be to remove isolated detections on tile corners. These are detections following the border of their tile, including one or more corners, without corresponding detections on the adjacent tiles. They are therefore characterised by at least one right angle and at least two horizontal or vertical straight sides. This type of artefact removal was already performed for the project on soil segmentation.
The difficulty here resides in the fact that polygons are simplified before being output by the object detector (OD), to limit the number of vertices and the data volume. The angles in tile corners are then no longer right angles. To apply such a strategy, it would be necessary to modify the OD to tag the concerned detections before the simplification. They would then be removed before merging across years, unless they were merged with detections on adjacent tiles during the previous operation.
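A simplified proxy for tagging such detections could test whether a detection reaches a tile corner, using bounding boxes instead of the right-angle criterion described above. The axis-aligned tile bounds and the tolerance parameter are assumptions for illustration.

```python
def touches_tile_corner(det_bbox, tile_bbox, tol=0.5):
    """True if the detection's bounding box reaches a corner of its tile.
    Boxes are (xmin, ymin, xmax, ymax); tol is in map units.
    Candidates flagged here would still need to be checked against
    detections on adjacent tiles before being removed."""
    xmin, ymin, xmax, ymax = det_bbox
    txmin, tymin, txmax, tymax = tile_bbox
    near = lambda a, b: abs(a - b) <= tol
    corners = [(txmin, tymin), (txmin, tymax), (txmax, tymin), (txmax, tymax)]
    return any(
        (near(xmin, cx) or near(xmax, cx)) and (near(ymin, cy) or near(ymax, cy))
        for cx, cy in corners
    )
```

A detection hugging only one tile edge, without touching a corner, is not flagged by this test, which keeps legitimate border-crossing objects safe.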
Another possibility would be to deactivate the Ramer-Douglas-Peucker simplification in the OD. However, the number of detections in this project is very high, and a machine with more RAM and more disk space might then be necessary.
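For reference, a minimal pure-Python version of the Ramer-Douglas-Peucker simplification can be written as follows; this is a textbook sketch, not the project's code.

```python
import math

def rdp(points, epsilon):
    """Ramer-Douglas-Peucker: recursively drop vertices lying closer
    than epsilon to the chord joining the endpoints of the segment."""
    if len(points) < 3:
        return list(points)
    (x1, y1), (x2, y2) = points[0], points[-1]
    norm = math.hypot(x2 - x1, y2 - y1)

    def dist(p):
        # perpendicular distance from p to the chord (or to the first
        # endpoint when the chord is degenerate)
        x0, y0 = p
        if norm == 0:
            return math.hypot(x0 - x1, y0 - y1)
        return abs((x2 - x1) * (y1 - y0) - (x1 - x0) * (y2 - y1)) / norm

    dmax, idx = max((dist(p), i) for i, p in enumerate(points[1:-1], start=1))
    if dmax > epsilon:
        # keep the farthest vertex and recurse on both halves
        return rdp(points[:idx + 1], epsilon)[:-1] + rdp(points[idx:], epsilon)
    return [points[0], points[-1]]
```

Deactivating the simplification amounts to keeping nearly every vertex, which preserves the right angles at tile corners at the cost of the RAM and disk space mentioned above.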
Another possibility would be to take advantage of the fact that most objects are present in several generations of images. For those objects, we could take into account the number of times the object was detected and over which time interval, in order to limit the number of false positives.
For objects detected only once, we could try to determine the maximum area for which an object can plausibly appear and disappear between three generations of images.
A.4 Conclusion¶
In this addendum, several steps were added to the post-processing of the results. The threshold on the confidence score was modified to improve the precision and limit the total amount of detections produced through the years. The results of five models were merged together, a new score was created based on the confidence score and the presence across models, and the detections were merged across years.
Each new step improved the results, and the experts expressed their satisfaction with the new detections. However, a thorough control of the results remains necessary, as only 30% of the delimited surface actually corresponds to anthropogenic soils.
The results of this project may be useful as a working basis for specialists starting from scratch, or as a complement to other sources. However, their limited precision means that they cannot substitute an existing method, nor be used without checking the produced detections. In addition, the STDL would only recommend their use if strict filters can be applied, for example according to land use or altitude, in order to limit the time required for the control.
The Geolablizer tool, developed by the STDL in Python, was provided to the experts to facilitate the checking of the results and thus limit the time needed to retain the most interesting cases.
References¶
- Office fédéral du développement territorial ARE. Plan sectoriel des surfaces d'assolement (SDA). 2020. URL: https://www.are.admin.ch/are/fr/home/developpement-et-amenagement-du-territoire/strategie-et-planification/conceptions-et-plans-sectoriels/plans-sectoriels-de-la-confederation/sda.html (visited on 2024-12-11).
- Basler & Hofmann. Carte indicative des sols valorisables et réhabilitables pour des compensations SDA. March 2021. URL: https://www.google.com/url?sa=t&source=web&rct=j&opi=89978449&url=https://www.are.admin.ch/dam/are/it/dokumente/raumplanung/dokumente/bericht/anleitung-hinweiskarte-fff-20210312.pdf.download.pdf/anleitung-hinweiskarte-fff-20210312-fr.pdf&ved=2ahUKEwiqtKH2qZ-KAxV4TKQEHXyvHUsQFnoECBsQAQ&usg=AOvVaw16dmv2iJ4fO7dns6B6Wy57.
- Canton de Vaud. Comment identifier de nouvelles surfaces d'assolement? 2022. URL: https://www.vd.ch/territoire-et-construction/amenagement-du-territoire/proteger-les-surfaces-dassolement-sda/identifier-de-nouvelles-sda.
- Stephan Nebiker, Natalie Lack, and Marianne Deuber. Building Change Detection from Historical Aerial Photographs Using Dense Image Matching and Object-Based Image Analysis. Remote Sensing, 6(9):8310–8336, September 2014. URL: http://www.mdpi.com/2072-4292/6/9/8310 (visited on 2024-03-26), doi:10.3390/rs6098310.
- Giuseppe Esposito, Fabio Matano, and Marco Sacchi. Detection and Geometrical Characterization of a Buried Landfill Site by Integrating Land Use Historical Analysis, Digital Photogrammetry and Airborne Lidar Data. Geosciences, 8(9):348, September 2018. URL: http://www.mdpi.com/2076-3263/8/9/348 (visited on 2024-03-26), doi:10.3390/geosciences8090348.
- Christian Ginzler, Livia Piermattei, Mauro Marty, and Lars T. Waser. Four nationwide Digital Surface Models from airborne historical stereo-images. EGU-2024, March 2024. URL: https://meetingorganizer.copernicus.org/EGU24/EGU24-5142.html (visited on 2024-12-13), doi:10.5194/egusphere-egu24-5142.
- Alessandro Cerioni, Clémence Herny, Adrian Meyer, and Gwenaëlle Salamin. Object detector framework. December 2024. URL: https://tech.stdl.ch/TASK-IDET/.
- Clémence Herny, Shanci Li, Alessandro Cerioni, and Roxane Pott. Automatic detection and observation of mineral extraction sites in Switzerland. January 2024. URL: https://tech.stdl.ch/PROJ-DQRY-TM/.
- Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. January 2018. arXiv:1703.06870 [cs]. URL: http://arxiv.org/abs/1703.06870, doi:10.48550/arXiv.1703.06870.
- Elisa Mariarosaria Farella, Salim Malek, and Fabio Remondino. Colorizing the Past: Deep Learning for the Automatic Colorization of Historical Aerial Images. Journal of Imaging, 8(10):269, October 2022. URL: https://www.mdpi.com/2313-433X/8/10/269 (visited on 2024-12-11), doi:10.3390/jimaging8100269.
- J. Rausch. Repeated training not deterministic despite identical setup and reproducibility flags. https://github.com/facebookresearch/detectron2/issues/4260, May 2022. Issue #4260.
- Collin Mac Carthy. Manual seed does not work as expected. https://github.com/facebookresearch/detectron2/issues/4438, July 2022. Issue #4438.
- ASDen. A simple trick for a fully deterministic RoIAlign, and thus Mask R-CNN training and inference. https://github.com/facebookresearch/detectron2/issues/4723, December 2022. Issue #4723.
- Thiên-Anh Nguyen, Marc Rußwurm, Gaston Lenczner, and Devis Tuia. Multi-temporal forest monitoring in the Swiss Alps with knowledge-guided deep learning. Remote Sensing of Environment, 305:114109, May 2024. URL: https://linkinghub.elsevier.com/retrieve/pii/S0034425724001202 (visited on 2024-12-11), doi:10.1016/j.rse.2024.114109.