Automatic landslide length and width estimation based on the geometric processing of the bounding box and the geomorphometric analysis of DEMs

The morphology of landslides is influenced by the slide/flow of the material downslope. Usually, the distance of the movement of the material is greater than the width of the displaced material (especially for flows, but also the majority of slides); the resulting landslides have a greater length than width. In some specific geomorphologic environments (monoclinic regions, with cuesta landforms type) or as is the case for some types of landslides (translational slides, bank failures, complex landslides), for the majority of landslides, the distance of the movement of the displaced material can be smaller than its width; thus the landslides have a smaller length than width. When working with landslide inventories containing both types of landslides presented above, the analysis of the length and width of the landslides computed using usual geographic information system techniques (like bounding boxes) can be flawed. To overcome this flaw, I present an algorithm which uses both the geometry of the landslide polygon minimum oriented bounding box and a digital elevation model of the landslide topography for identifying the long vs. wide landslides. I tested the proposed algorithm for a landslide inventory which covers 131.1 km2 of the Moldavian Plateau, eastern Romania. This inventory contains 1327 landslides, of which 518 were manually classified as long and 809 as wide. In a first step, the difference in elevation of the length and width of the minimum oriented bounding box is used to separate long landslides from wide landslides (long landslides having the greatest elevation difference along the length of the bounding box). In a second step, the long landslides are checked as to whether their length is greater than the length of flow downslope (estimated with a flow-routing algorithm), in which case the landslide is classified as wide. By using this approach, the area under the Receiver Operating Characteristic curve value for the classification of the long vs. wide landslides is 87.8 %. An intensive review of the misclassified cases and the challenges of the proposed algorithm is made, and discussions are included about the prospects of improving the approach with further steps, to reduce the number of misclassifications.

The sliding/flowing of the material downslope produces a scarp and a displaced mass, which are usually longer than they are wide (Fig. 1d).While generalized for flows (Fig. 1d) (Cruden and Varnes, 1996), this mechanism is also quite frequent for rotational (Fig. 1a) and translational slides, falls, topples and spreads (Cruden and Varnes, 1996).I will call this type of landslide "long" landslides.Long landslides appear either as single events or as complex landslides (Figs. 1a,d,2).
While not characteristic of flows, falls and rotational slides, cases when the length of the displacement does not exceed the width of the displaced mass are relatively common for translational slides and spreads (Fig. 1c, e).This type of landslide appears in almost every landslide inventory, in different proportions (such cases are visible in landslide inventory snapshots from Dewitte and Demoulin, 2005;Schulz, 2007;Galli et al., 2008;Baldo et al., 2009;Hattanji et al., 2009;Guzzetti et al., 2012;Van den Eeckhaut et al., 2012;Mondini et al., 2013).I will call this type of landslide "wide" landslides (Fig. 1b, c, e).
Wide landslides most commonly appear along gully and river banks (Fig. 1b, e), but also on hillslopes (Fig. 1c), especially in regions with monoclinic structure along the scarps of cuesta landforms.Such a situation is found in the Moldavian Plateau (Niculiţǎ, 2015;Mǎrgǎrint and Niculiţǎ, 2016), where the monoclinic structure favors cuesta escarpments, which extend in width for long distances without being fragmented (Figs. 2, 3).
In both scenarios of landslide delineation (manual or automatic), the landslide length and width can be computed automatically in a GIS environment, using the dimensions of the minimum bounding box of the polygons to represent the landslides (Taylor and Malamud, 2012).While for long landslides, the length of the bounding box is a reasonable approximation of the length of the landslide, for wide landslides, the length is the width of the bounding box.
To overcome the flaw mentioned above, which can bias the analysis of landslide length and width (Taylor and Malamud, 2012;Taylor et al., 2015;Niculiţǎ, 2015;Mǎrgǎrint and Niculiţǎ, 2016), an automatic algorithm for the estimation of landslide flow direction is proposed.The algorithm uses the processing of the oriented bounding box, and the elevation and slope length computed from the DEM, for deriving the length along the landslide slide/flow direction and the associated width.The method is validated in a landslide inventory in which the length and width characteristics were assessed during landslide mapping.

Materials
For testing the proposed method, I have used a landslide inventory (Figs. 2, 3), created and described by Niculiţǎ and Mǎrgǎrint (2015), for a small area (131.1 km 2 ) from the Moldavian Plateau (eastern Romania), where single event and complex landslides were delineated.This inventory contains 1327 polygons, representing landslides with areas ranging from 90 to 800 000 m 2 .The delineation of polygon landslides was carried out using aerial imagery, a high-resolution lidar DEM and field validation.For every landslide, the long or wide type was recorded in the attribute table during inven- tory creation.The long and wide types were assessed through expert opinion during landslide delineation, also using threedimensional (3-D) views to better assess the slide/flow direction.Landslide typology in the study area is dominated by translational and rotational slides, with very few flows.A 5 m resolution lidar DEM was available for the study area.

Methods
The proposed method was implemented using R software application (R Core Team, 2015) and several user-contributed packages (Fig. 4).For more details, the script code is available (please see Sect. 6).
Oriented bounding boxes are in fact minimum-area enclosing rectangles for convex polygons (Fig. 5).Besides the rectangular minimum bounding box (whose sides are parallel to the coordinate axes), the oriented minimum bounding box can be defined as the bounding box of which sides are along the orientation of the polygon (Freeman and Shapira, 1975).The rotating calipers algorithm (Toussaint, 1983) is used to create these bounding boxes in the majority of GIS implementations.Its implementation in the R shotGroups package (Wollschläger, 2016) was used in the present approach, through the getMinBBox function.This function uses the coordinates of the polygon vertices, stored in an object of the sp package, called SpatialPolygonsDataFrame (Bivand et al., 2013), which is created from a shapefile containing the landslide polygons and a unique ordered sequence of autoincrement IDs, by import with the rgdal package (Fig. 4).Further, sp package functions are used to manipulate the bounding boxes.
The getMinBBox function creates the x, y coordinates of the corners (CP1 to CP4 -Fig.5) of the minimum oriented bounding box in a counter-clockwise order (CP1, CP2, CP3, CP4), which allows the creation of the rectangle with the help of the sp package functions.While the majority of the resulting rectangles have sides CP1-CP2 and CP3-CP4 (Fig. 4) as the long sides (length) and sides CP1-CP4 and CP2-CP3 as short sides (width), there are some cases where the situation is vice versa.Since the algorithm needs a consistent approach, the coordinates of the four corner points of the minimum oriented bounding box were presented in a data table, and using the length of the sides, as a rule, all the corner points were assigned to correspond to the above notation (Fig. 5

left).
The coordinates of the corner points were used to compute the midpoints of the sides (MP1 to MP4), using the Midpoint Formula, in such a way that MP1 and MP3 are on the long sides, and MP2 and MP4 are on the short sides of the minimum oriented bounding box (Fig. 5 left).Using these four midpoints, the algorithm defines the length of the minimum oriented bounding box (line connecting MP2 and MP4) and the width of the minimum oriented bounding box (line connecting MP1 and MP3) (Fig. 5).Using the elevations of the four corners (MP1 to MP4), taken from the DEM, the square root of the elevation difference between MP1 to MP3 and MP2 to Mp4 raised to the power of 2 (for obtaining only positive values) is computed.These values are used to clas- sify the landslides as long or wide according to the following rules.
-The landslides which have a larger elevation difference along the length of the minimum oriented bounding box (line connecting MP2 and MP4) than along the width (line connecting MP1 and MP3) are classified as long.
-The landslides which have a larger elevation difference along the width of the minimum oriented bounding box (line connecting MP1 and MP3) than along the length (line connecting MP2 and MP4) are classified as wide.
At this step, some wide landslides are classified as long because they are located on short hillslopes but which span a certain relative elevation along the valley (as the sides of small valleys incised on cuesta escarpments -Fig.6c).In this morphologic setting, the difference in elevation along the width of the landslides can be the same as the difference in elevation along the length.This is characteristic of the landslides developed on the short (under 500/800 m in length) and laterally extended (2 to 6 km) hillslopes of cuesta escarpments of the Moldavian Plateau (Mǎrgǎrint and Niculiţǎ, 2016), and along the banks of gullies which dissect these hillslopes (Figs. 2, 3).For these cases, a further step of processing is required.This step involves the cropping of the DEM for every landslide polygon, the hydrological preprocessing to enforce drainage flow routing and the computation of the slope length (Fig. 7).The slope length estimation requires the use of the RSAGA (Brenning and Bangs, 2016) R package, which is linked to SAGA GIS (Conrad et al., 2015).The DEM is preprocessed with the functions Sink Drainage Route Detection and Sink Removal from the Terrain Analysis -Preprocessing module library.The hydrological enforcement is done using the Deepen Drainage Routes method, and the flow path length was computed using the function Flow Path Length from the Terrain Analysis -Hydrology module library.Both flow-routing methods implemented in this tool, Deterministic 8 (O' Callaghan and Mark, 1984) and multiple flow direction (Freeman, 1991;Quinn et al., 1991), are computed (Fig. 6).The maximum slope length for every landslide polygon is computed using the function Grid Statistics for Polygons from Shapes -Grid Tools tool library.Using the following rules, only the long landslides class is reclassified.
-The landslides which have a length greater than the slope length are reclassified as being wide.
-The landslides which have a length smaller than the slope length remain classified as being long.
The confusion matrix and the associated statistics from Table 1 were computed in R stat.

Results and discussions
The results of the proposed algorithm need to be validated against the reality identified in the landslide inventory, to prove its efficiency.The long and wide type was assessed during landslide delineation.Beside field validation, the use  , 8).These two variables will be biased if the wide landslide type appears in a landslide inventory.The length and width obtained using the proposed algorithms are called estimated length and width because the algorithm fails to classify certain types of wide landslides (false positive -type I error) and certain types of long landslides (false negatives -type II error) correctly (Table 1).I argue that although some landslides are misclassified, from a pure morphometric point of view, the impact is minimized by the algorithm.This can be seen in Fig. 8.All four plots from this figure show that estimated length and width are closer to the correct length and width than the bounding box length and width.These plots show the scatter (Fig. 8a, b) of the difference between the correct landslide dimensions and estimated/bounding box dimensions and the frequency (Fig. 8c, d) of landslide dimensions.From the scatter plots it appears that bounding box dimensions are bigger than the correct dimensions since length is switched with width, the biggest impact being registered for elongated/elliptic landslides (flows).In the case of round landslides (rotational and translational slides), switching the length with width has a smaller impact.The histograms show the impact of changing the length with the width better and that the results of the proposed algorithm resemble the shape M. Niculiţǎ: Automatic landslide length and width estimation of the real dimensions' distributions with fidelity.The use of the dimensions of the bounding box instead has a huge impact not essentially on the shape, but on the magnitude of the distribution (since the wide type of landslides is dominant at 61 %; see Table 1).
While the interpretation of the differences of the dimensions is instructive and demonstrates the quality of the proposed algorithm in producing results which can be used in geomorphometric analyses, the assessment of the accuracy to predict the correct class is carried out using the confusion matrix (Stehman, 1997;Powers, 2011) and the Receiver Operating Characteristic curve (ROC) graphs (Fawcett, 2006;Powers, 2011).The confusion matrix is presented in Table 1, and its graphic synthesis is presented in the folded plots from Fig. 9.
The wide landslides wrongly classified as long in the first step (FP -type I error) are mainly developed along the banks of incised rivers and gullies, landslide scarps, landslide toes and cuesta scarps, in situations where there is a greater differ-ence in elevation between the flanks than between the scarp and the toe (Fig. 5c).The long landslides misclassified as wide (FN -type II error) are almost round, rhombic, hexagonal or elongated landslides, which are diagonal to the minimum enclosed bounding box and the downslope direction of the hillslope, which in general is gently sloped (Fig. 5a).This diagonal position generates a difference of elevation along the width, which is greater than the difference along the length.
Overall, the first step has an accuracy of 0.85 (Table 1).The ROC curve is used to visually assess the performance of a classification (Fawcett, 2006).The curve plots the true positive rate (benefits, sensitivity) vs. the false positive rates (costs, specificity) of the classification (Fawcett, 2006;Powers, 2011).The diagonal line is equivalent with randomly guessing the outcome of the classification (Fawcett, 2006;Powers, 2011).In this first step of the classification, the TP and FP rates are higher, showing that the long type is better classified.Overall, the area under the ROC curve (AUC; see Fig. 10) shows that 84.3 % landslides are correctly classified.The 95 % confidence interval was computed using 2000 stratified bootstrap replicates in the pROC sp package (Robin et al., 2011).This interval is in the proximity of the ROC curve, the range of the AUC being ±1.93, which shows the stability of the curve.
In the second step, the best performance was acquired by the multiple flow algorithm for computing slope length.This algorithm computes flow distance which is more similar to hillslope length because the flow is modeled as a diffusion process, while the Deterministic 8 algorithm creates a pattern of interrupted flows, although the DEM was preprocessed hydrologically (Figs. 6,7).Using D8, in the second step, 1309 landslides are classified as wide and only 18 as long.The multiple flow direction algorithm classifies 837 landslides as wide and 478 as long.
The majority of the wide landslides classified as long in the second step (FP -type I error) are almost round, rhombic or hexagonal, and diagonal to the downslope direction (Fig. 6), and only a few are developed along the gully banks or scarps.
The long landslides classified as wide (FN -type II error) are related to areas where the slope length computed by the flow algorithm fails to model the length of the hillslope, and in step two, their length is greater than the slope length (Fig. 6).These landslides have gullies or mounds, which interrupt the diffusion of the flow, and the flow algorithms do not route the flow from the crown to the toe of the landslide (Fig. 7).
Overall, the second step has an accuracy of 0.88 (Table 1).All the statistics associated with the confusion matrix (Table 1) show that in the second step, the classification performs slightly better.In this second step, the ROC curve shows that the TP rate is lower, and the FP rate is higher, meaning that the wide type is better classified.Overall, 87.8 % landslides are correctly classified (AUC value -Fig.10).The confidence interval is lower, ±1.85, showing an increase in the stability of the classification.
Although the proposed algorithm performs well, any inconsistency between the shape of the earth portrayed by the DEM and the delineated landslide morphology will give mis- classification cases.This scenario is to be expected from inventories where the landslides are delineated from highresolution aerial/satellite imagery, but the DEM does not have the same resolution, a frequent scenario in the case of automatically delineated landslides.Normally if all the data have the same resolution, the proposed algorithm can provide good results.We also tested the SRTM 1 elevation data with the proposed method for the landslide inventory, and the AUC values decreased to 86.6 %.
The quality of the landslide inventory is also of great importance since any geomorphologic topological inconsistency can introduce wrong identification cases such as the following: amalgamation of landslides by crossing channels; shifting of landslides and touching channels; landslides spread across several hillslopes.
In specific topographic situations, the midpoints will not be on the edge of the landslide, and not even the same hillslope as the landslide (Fig. 6c, d), but we believe that in general, this situation will not introduce errors since we have not met any such cases.
Inconsistencies of the type described above are to be expected in a substantial proportion for landslide inventories based on topographic maps and metric aerial imagery/DEM, their proportion decreasing in landslide inventories based on centimeter resolution aerial imagery/DEM.The proposed methodology could also be used for checking landslide inventories for geomorphological topological inconsistencies.
The main challenges of the method are as follows: the inability to separate wide from long landslides based on the flow (mainly related to the false positive cases, long landslides classified as wide); the failure to separate long landslides from wide landslides because of their diagonal position (mainly related to the false negative cases, wide landslides classified as long).
Possibilities to resolve the failure to distinguish between long and wide landslides based on the flow length estimation require a new approach in slope length computation.I have tried, in the second step, using the slope length estimated as the slope compensated along the length side of the bounding box, but this approach gave results worse than from the hydrological slope length estimation.
Possibilities to resolve the failure to distinguish between long and wide landslides because of their diagonal position are related to the computation of the bounding box.Using a 3-D approach in computing the bounding box, from a 3-D vector of the landslide boundary (vertex of the polygon with elevation altitude from the DEM), could probably resolve this, but research is needed.For resolving the flow length estimation, responsible for the type II error, better methods for slope length estimation are needed.

Conclusions
The proposed method can discern between long and wide landslides in 87.8 % of cases.The misclassified cases have a specific geometry and geomorphologic topology, and the algorithm fails to identify them in the right class.The majority of these landslides are round, rhombic or hexagonal, and oriented diagonally to the downslope direction of the hillslope.The slide/flow direction of these landslides is diagonal to the bounding box.Although the algorithms fail to recognize them as long or wide, the errors are minimal when their lengths and widths are used because the difference between them is not big.Of course, if the direction of flow is needed (as aspect value), the algorithm fails to identify the right direction.The minority of misclassified landslides is represented by the elongated landslides, which are either long (similar to flows) or wide (developed along gully banks and landslide scarps).For these landslides, the proposed method fails because the slope length estimation does not exceed the oriented bounding box length, a situation which requires further investigations of the possibilities of using other methods to estimate slope length.Overall, the presented approach managed to identify the majority of the long vs. wide landslides from a landslide inventory and represents a starting point in deriving algorithms for automatic geomorphometric analysis of landslides.

Code availability
The R stat code of the script which implements the algorithm is available in a Zenodo repository (http://doi.org/10.5281/zenodo.60885,Mihai, 2016a).

Data availability
Data associated with the present work consist of a DEM and a landslide inventory.The landslide inventory is available in a Zenodo repository (http://doi.org/10.5281/zenodo.60885, Mihai, 2016).While the lidar DEM data cannot be distributed, the elevation extracted from these data is available in the landslide inventory vector attributes (every landslide polygon vertex has a lidar DEM equivalent altitude, being a 3-D vector), and can be used as a reference for reproducing the algorithm results.The bounding boxes and their attributes are also available.

)Figure 1 .
Figure 1.Schematic drawings of long and wide cases of landslides: (a) rotational slide -long type, (b) gully bank slides -wide type, (c) translational slide -wide type, (d) flow -long type and (e) river bank slide -wide type.

Figure 2 .
Figure 2. Geomorphologic cases of long (L) and wide (W) types of landslides from the study area.The top panel is a 3-D view Google Earth image; the bottom panel is a 3-D view of the lidar DEM shading.Both views show the landslide polygons overlaid in red.

Figure 3 .Figure 4 .
Figure 3.The study area and the landslide inventory overlaid on shaded lidar DEM.The upper-left panel is an inset showing the position of the study area in Romania; below the inset, a set of 3-D views of the study area landforms from north and south is displayed.The red boxes from the map on the right indicate the insets from Figs. 6 and 7.

Figure 5 .
Figure 5. Geometry of the minimum oriented bounding box.The left panel shows the geometry of the oriented bounding box; the center panel shows the geometry of a long landslide bounding box; the right panel shows the geometry of a wide landslide box.

Figure 6 .
Figure 6.Three-dimensional views of different types of long vs. wide landslide classifications: (a) false positive cases (long landslides classified as wide landslides) in step 1, (b) false positives in step 2, (c) false negative cases (wide landslides classified as long landslides) in step 1 and (d) false negative cases in step 2.

Figure 7 .
Figure 7. Three-dimensional view of a long landslide (left) and the flow lengths for it (right) (MFD is the flow length computed with the Multiple Flow Direction algorithm, and D8 is the flow length computed with the Deterministic 8 algorithm).

Figure 8 .
Figure 8. Plots of the difference between the real length/width, length/width of the oriented bounding box and length/width estimated with the proposed algorithm: (a) scatter plot of the difference between correct length and the bbox/estimated length, (b) scatter plot of the difference between correct width and the bbox/estimated width, (c) histograms of the three types of lengths and (d) histograms of the three types of widths.

Figure 9 .
Figure 9. Four folded plots of the confusion matrices for the two steps of the algorithm (TP denotes true positive, TN true negative, FP false positive and FN false negative).

Figure 10 .
Figure 10.ROC curves for the two steps of the algorithm, their 95 % confidence interval and the AUC values.

Table 1 .
The confusion matrix for the classification performed in the two steps of the algorithm and the associated statistics.