A satellite-based global landslide model

Abstract. Landslides are devastating phenomena that cause huge damage around the world. This paper presents a quasi-global landslide model derived using satellite precipitation data, land-use land cover maps, and 250 m topography information. The suggested landslide model is based on Support Vector Machines (SVM), a machine learning algorithm. The National Aeronautics and Space Administration (NASA) Goddard Space Flight Center (GSFC) landslide inventory data are used as observations and reference data. In all, 70 % of the data are used for model development and training, whereas 30 % are used for validation and verification. The results of 100 random subsamples of available landslide observations revealed that the suggested landslide model can predict historical landslides reliably. The average error of 100 iterations of landslide prediction is estimated to be approximately 7 %, while approximately 2 % false landslide events are observed.


Introduction
Each year, landslides cause thousands of casualties and billions of dollars in damages across the world. According to the US Geological Survey (USGS), landslides result in tens of deaths and over 1-2 billion USD in property damages (USGS, 2006). For example, the Western United States suffered from several storm-triggered landslides during the El Niño season of 1982-1983, resulting in millions of dollars in losses (Spiker and Gori, 2003; Hong et al., 2006b). In several other landslide events, thousands of people died or disappeared within a few minutes or hours (e.g., 1999 landslide in Vargas, Venezuela; see Larsen et al., 2000). Also, in Southeast Asia, landslides are one of the most widespread disasters, mainly because of the climate conditions, mountainous terrain and socioeconomic conditions (Apip et al., 2010). For instance, in 2006, after a period of heavy rainfall, a series of landslides on Leyte Island, Philippines claimed over 1000 lives.
The factors involved in the occurrence of landslides are divided into two categories: triggering processes and preparatory conditions (Dai et al., 2002). Triggering factors are dynamic processes which trigger a slope failure, such as heavy precipitation events (e.g., 1999 landslide in Vargas, Venezuela) and/or earthquakes (e.g., 2008 Wenchuan earthquake in Sichuan, China). Typically, hurricanes and typhoons lead to extensive rainfall over several days and thus, may trigger landslides. In 1998, Hurricane Mitch alone triggered over 9800 landslides across Guatemala resulting in over 14 000 casualties (Bucknam et al., 2001).
In addition to the presence of a triggering factor, preparatory conditions play important roles in the occurrence of landslides. These include conditions which make a region susceptible to landslides such as soil property, slope, topography, land-use land cover, hillslope saturation and vegetation. For example, the effect of pore water pressure and soil porosity on the occurrence of landslides has been discussed in Iverson et al. (2000).
Rainfall intensity-duration curves and/or thresholds at both regional (Larsen and Simon, 1993; Godt et al., 2006; Martelloni et al., 2012; Mercogliano et al., 2013) and global scales (Caine, 1980; Hong et al., 2006b, 2007a) have been used in developing landslide models. Both ground-based and remote sensing rainfall data have been utilized for landslide monitoring and prediction. In a recent study, Rossi et al. (2013) review several remotely sensed data sets for landslide studies.
In a recent effort, the National Aeronautics and Space Administration (NASA) Goddard Space Flight Center (GSFC) released a valuable inventory of landslide events over the globe (Kirschbaum et al., 2009a). It can potentially be used for more detailed research on the relationship between landslide events, controlling factors and climate conditions. The NASA global landslide inventory has been evaluated in a number of landslide studies (e.g., Kirschbaum et al., 2009b; Hong et al., 2006b).
Most previous landslide studies have been conducted at a local or regional scale (e.g., Lagomarsino et al., 2013). This study introduces a quasi-global (hereafter, global) landslide monitoring model using satellite precipitation data, land-use land cover maps, and 250 m topography information. The suggested landslide model is based on Support Vector Machines (SVM), which can classify landslide and non-landslide events based on their climatological and geographical conditions.

Study area and data resources
The study area extends from −60° to +60° latitude, where real-time satellite precipitation data are available. The data sets used in this study include:
- NASA global landslide inventory (Kirschbaum et al., 2009a): this data set represents landslides, mudslides, rockslides, debris slides and combinations of two or more of them. It includes nominal location information (country, county, city), time of occurrence, triggering factor, type of the event, relative size of the landslide, geographic location (latitude and longitude) with a measure of location accuracy, and impact information such as casualties and economic damage. The relative size classification is based on a scale of 1 (small landslide or mudslide) to 5 (massive landslide). The location accuracy classification is defined based on the radius of confidence on a scale of 1 (>75 km, little confidence in landslide location) to 5 (<5 km, high confidence in landslide location). Currently, the landslide inventory includes events that occurred in 2003, 2007, 2008 and 2009. For more information about this landslide inventory, please refer to Kirschbaum et al. (2009a).
- Precipitation data: satellite precipitation data are obtained from the real-time version of the Precipitation Estimation from Remotely Sensed Information Using Artificial Neural Networks (PERSIANN; Hsu et al., 1997; Sorooshian et al., 2000). This data set is primarily based on long-wave infrared imagery from geosynchronous satellites (GOES-IR) calibrated with satellite microwave data, and it has been validated in numerous studies (e.g., AghaKouchak et al., 2011a).
- Slope: topographical information is derived from a digital elevation model (DEM) from the NASA Shuttle Radar Topography Mission (Jarvis et al., 2012). This data set provides high-resolution elevation information with a spatial resolution of 250 m. Based on this elevation data set, a global slope map is created using Geographical Information System (GIS) techniques.
- Global land cover condition: land-use land cover information is derived from a global database described in Bartholome et al. (2002) and Fritz et al. (2003). This data set includes 1 km land-use land cover information with 23 classes.

Methodology
The model concept is based on the SVM, which is a powerful method for classification. The SVM is a decision support machine that can be used as a two-class or multiple-class classifier. In this study, SVM is used to classify landslide and non-landslide events based on historical observations (here, observed landslide events). SVM classification solves a convex optimization problem, in which any local solution is also a global optimum (Bishop, 1994). Throughout this study, a conventional approach of splitting data into 70 % training and 30 % validation is used, and the linear SVM classifier is used to separate landslide from non-landslide events. Let the training data be

$$(x_1, y_1), \ldots, (x_n, y_n), \quad x_i \in \mathbb{R}^N, \quad y_i \in \{-1, +1\}$$

(Hearst et al., 1998), where $x_i$ represents the $N$-dimensional patterns (here, 5 dimensions including three vectors of precipitation, topographical information and land use) and $y_i$ is the class label (i.e., 1 for landslide and −1 for non-landslide events). Figure 1 schematically presents the SVM model concept for classification. In the figure, blue points correspond to the landslide label, whereas the red points refer to the non-landslide label. The green line in Fig. 1 is the optimal hyperplane classifier, which separates the convex hulls of the two classes (i.e., landslide and non-landslide events) and is equidistant from each of them.
Figure 1. Support Vector Machine (SVM) model concept for classification.

The general form of the optimal SVM classifier (the green line in Fig. 1) is $(w \cdot x) + b = 0$, $w \in \mathbb{R}^N$ and $b \in \mathbb{R}$, with the decision function for classification being (Hearst et al., 1998)

$$f(x) = \mathrm{sign}\left((w \cdot x) + b\right).$$

There exists a $w$ (vector) and a $b$ (scalar) such that for all training patterns (Cortes and Vapnik, 1995):

$$(w \cdot x_i) + b \ge 1 \ \text{ if } y_i = 1, \qquad (w \cdot x_i) + b \le -1 \ \text{ if } y_i = -1. \tag{2}$$

The above inequalities can be written in the form (Cortes and Vapnik, 1995):

$$y_i \times \left((w \cdot x_i) + b\right) \ge 1, \quad i = 1, \ldots, n.$$

Using the above inequality, one can show (see Cortes and Vapnik, 1995 for details and proof):

$$\frac{y_i \left((w \cdot x_i) + b\right)}{|w|} \ge \frac{1}{|w|}, \quad i = 1, \ldots, n,$$

where $|w|$ is defined as $\sqrt{w \cdot w}$. The distance between the convex hulls of the two classes is termed $\rho$ (see Fig. 1) and can be expressed as

$$\rho(w, b) = \min_{\{x_i:\, y_i = 1\}} \frac{x_i \cdot w}{|w|} - \max_{\{x_i:\, y_i = -1\}} \frac{x_i \cdot w}{|w|}.$$

The optimal classifier can be obtained by maximizing the distance $\rho(w, b)$. Let us denote the optimal SVM classifier as $(w_0 \cdot x) + b_0 = 0$ and hence, $\rho(w_0, b_0) = 2/|w_0|$. In other words, for classifying the landslide and non-landslide labels, one needs to solve the optimization problem of maximizing the margin $\rho(w_0, b_0)$. To maximize $\rho(w_0, b_0)$, the term $|w_0|$ should be minimized under the constraint $y_i \times ((w \cdot x_i) + b) \ge 1$, $i = 1, \ldots, n$. This is a quadratic optimization problem that can be solved using the sequential minimal optimization (SMO) algorithm outlined in Platt (1999).
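As a worked illustration of the decision function, the separation constraint and the margin $2/|w|$, the following minimal sketch evaluates them for a hand-picked hyperplane on four toy two-dimensional patterns. The data points, the values of `w` and `b`, and the function names are assumptions made for illustration; they are not the fitted model of this study, and no optimization (SMO or otherwise) is performed here.

```python
import math


def decision(w, b, x):
    """Return +1 (landslide) or -1 (non-landslide) from sign((w.x) + b)."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1


def satisfies_constraints(w, b, X, y):
    """Check y_i * ((w.x_i) + b) >= 1 for every training pattern."""
    return all(
        yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 1.0
        for xi, yi in zip(X, y)
    )


def margin(w):
    """Distance between the two class hulls for a canonical hyperplane: 2/|w|."""
    return 2.0 / math.sqrt(sum(wj * wj for wj in w))


# Toy patterns (hypothetical): two landslide (+1) and two non-landslide (-1) points.
X = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
y = [1, 1, -1, -1]

# Hand-picked hyperplane that bisects the segment between (0, 0) and (2, 2),
# the closest pair of points from opposite classes.
w, b = (0.5, 0.5), -1.0

labels = [decision(w, b, xi) for xi in X]
rho = margin(w)
```

For these toy points the closest opposite-class patterns are (0, 0) and (2, 2), so the margin equals their separation, $2\sqrt{2}$, which is what `margin(w)` returns.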
Figure 2 schematically describes the model structure. As shown, the input data include two types of static information (land-use land cover condition and topographical information) and one dynamic input (precipitation). It should be noted that the coordinates of the observed landslide locations in the NASA landslide inventory are in fact approximate locations of landslides (Kirschbaum et al., 2009a). Therefore, using the slopes at the reported landslide coordinates could lead to misleading results. For this reason, instead of using the slope at the coordinates provided in the inventory, a topography index is used that indicates the 95th percentile of the 250 m slope values in a 0.25° box. Note that 0.25° is the original resolution of the precipitation data. In other words, a topography index is used to distinguish topographically complex regions from relatively flat areas.
The suggested topography index is relatively larger for mountainous regions compared to flat areas. As an example, Table 1 lists the topography index for two areas: The Lut Desert (location 1 in Table 1), which is flat and not susceptible to landslides; and a mountainous region in Indonesia (location 2 in Table 1), which has previously experienced landslides. One can see that the slopes of the two locations are not significantly different (location 1: 0.09; location 2: 0.30 -see column 4 in Table 1). However, the topography index distinguishes the difference between the two regions (location 1 (flat region): 0.14; location 2 (mountainous region): 22.6 -see column 5 in Table 1).
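The topography index can be sketched as a percentile computation over the fine-resolution slope values that fall inside one coarse precipitation grid box. The example below is a minimal illustration under stated assumptions: the slope values are invented, the percentile uses linear interpolation between order statistics (the source does not specify the interpolation rule), and the function names are hypothetical.

```python
import math


def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between order statistics."""
    if not values:
        raise ValueError("percentile of an empty sequence")
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = math.floor(pos)
    hi = math.ceil(pos)
    if lo == hi:
        return s[lo]
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def topography_index(slopes_in_box, q=95.0):
    """95th percentile of the fine-resolution slopes inside one coarse grid box."""
    return percentile(slopes_in_box, q)


# Hypothetical slope samples for two 0.25 degree boxes (values are illustrative only).
flat_box = [0.1] * 95 + [0.2] * 5          # uniformly gentle terrain
mountain_box = [0.3] * 90 + [25.0] * 10    # mostly gentle, with a few steep cells

flat_index = topography_index(flat_box)
mountain_index = topography_index(mountain_box)

# The slope at a single reported coordinate can be small even in rugged terrain.
slope_at_reported_point = mountain_box[0]
```

In this toy case the slope at the reported point of the mountainous box is only 0.3, while its index is 25.0, which mirrors the contrast between the point slopes and the index values reported in Table 1.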
In addition to the topography index, precipitation is used as a dynamic input in the model. There are two key factors associated with a rainfall event that could lead to a landslide: intensity and duration. Landslides may occur due to heavy precipitation rates in a relatively short period of time or even after low-intensity rainfall over a long period of time. AghaKouchak et al. (2012) show that there are high uncertainties and systematic errors associated with satellite-based heavy precipitation rates at short temporal intervals (e.g., 3 h relative to daily estimates). Furthermore, Mehran and AghaKouchak (2013) argue that at higher temporal accumulations, satellite data capture extreme precipitation events more reliably. Their study demonstrates that by accumulating precipitation over time, improvements can be achieved in detecting heavy precipitation events. For this reason, at any given time, rainfall accumulations over the past 24, 48, and 72 h are used as input to the model. That is, as satellite data become available in real time, the model can be run with the past 24, 48, and 72 h accumulations from the time of observation.
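The three precipitation inputs can be sketched as trailing sums over a regularly sampled rainfall series. The snippet below is a minimal illustration; the 3 h time step, the series values and the function name are assumptions and do not describe the actual data handling of the model.

```python
def accumulations(rain_mm, step_hours=3, windows_hours=(24, 48, 72)):
    """Sum the most recent precipitation amounts over each trailing window.

    rain_mm holds precipitation depth (mm) per time step, oldest first,
    ending at the time of observation.
    """
    totals = {}
    for hours in windows_hours:
        if hours % step_hours != 0:
            raise ValueError("window must be a multiple of the time step")
        n_steps = hours // step_hours
        if n_steps > len(rain_mm):
            raise ValueError("series is shorter than the requested window")
        totals[hours] = sum(rain_mm[-n_steps:])
    return totals


# Hypothetical 3-hourly series covering 72 h (24 steps), oldest first:
# a dry first day, light rain on the second day, heavier rain on the last day.
series = [0.0] * 8 + [2.0] * 8 + [5.0] * 8
acc = accumulations(series)
```

Here the last 24 h contribute 40 mm and the last 48 h contribute 56 mm; the 72 h total is also 56 mm because the first day was dry.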
The soil wetness condition is indirectly computed from the past three-day rainfall information. Figure 3 displays the 24 h precipitation accumulation (on the day of landslide occurrence) for all observed landslide events used in the model for both training and validation. One can see that the observations include 581 landslide events with 24 h rainfall accumulations from 5 mm to over 200 mm. Note that the original NASA landslide inventory includes more landslide events. However, many of the events may not have been triggered by rainfall, as no rainfall has been recorded (e.g., earthquake-triggered landslides). Alternatively, satellite data may have missed precipitation for a number of landslide events. Since the presented model is solely designed for rainfall-triggered landslide events, those with a 24 h rainfall accumulation of 5 mm or less were eliminated from the analysis. It should be noted that a few landslide events are recorded in the landslide inventory with slopes and topography index near zero (below 10 %), and these were also eliminated. In other words, the presented model is designed and validated for rainfall-triggered landslides in areas with a topography index >10 %. Figure 4 displays the histogram of the topography index for the 581 landslide events that are used as input to the model. The horizontal axis shows topography index intervals, while the vertical axis displays the number of landslides in each interval.
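The two screening rules described above (24 h rainfall above 5 mm and topography index above 10 %) can be sketched as a simple filter. The record layout, field names and sample values below are hypothetical; only the two thresholds come from the text.

```python
from dataclasses import dataclass


@dataclass
class Event:
    rain_24h_mm: float   # 24 h satellite rainfall accumulation (mm)
    topo_index: float    # topography index of the surrounding 0.25 degree box (%)


def keep_event(ev, min_rain_mm=5.0, min_topo_index=10.0):
    """Keep rainfall-triggered events located in topographically complex boxes."""
    return ev.rain_24h_mm > min_rain_mm and ev.topo_index > min_topo_index


# Hypothetical inventory records (illustrative values only).
events = [
    Event(rain_24h_mm=40.0, topo_index=22.6),   # kept
    Event(rain_24h_mm=3.0, topo_index=30.0),    # dropped: too little rainfall
    Event(rain_24h_mm=60.0, topo_index=0.14),   # dropped: flat terrain
]
kept = [ev for ev in events if keep_event(ev)]
```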
As mentioned earlier, land-use land cover information is used as a static input to the model. Figure 5 shows the histogram of the observed landslides. The horizontal axis represents the 23 land-use land cover categories listed in Table 2, whereas the vertical axis indicates the number of occurrences in each land-use land cover category. The observed landslides are then recategorized into four major groups based on their land-use land cover conditions: tree cover (categories 1 to 10); shrub cover (categories 11 to 15); artificial surfaces (categories 16 to 18 and 22); and bare areas (category 19). Note that water bodies, snow and ice, and no data (categories 20, 21 and 23) are eliminated from the analysis. This recategorization is based on similarities between land-use land cover conditions. Finally, the distribution of landslide occurrences in the recategorized land-use land cover conditions is presented in Fig. 6. Based on the recategorized data, artificial surfaces (46 %) and tree cover (38 %) are considered more susceptible to landslides, as more events have occurred in these groups in the past. The four recategorized groups are scaled between 0 and 1, with 1 (artificial surfaces) being the land use most susceptible to landslides.
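The recategorization can be sketched as a lookup from the 23 land cover categories to the four groups, followed by a scaling step. The category-to-group mapping follows the text above; the scaling rule (group count divided by the largest group count) and the sample categories are assumptions, since the exact scaling used in the study is not given here.

```python
from collections import Counter

# Mapping taken from the text: category numbers refer to Table 2.
GROUPS = {
    "tree cover": set(range(1, 11)),
    "shrub cover": set(range(11, 16)),
    "artificial surfaces": {16, 17, 18, 22},
    "bare areas": {19},
}
EXCLUDED = {20, 21, 23}  # water bodies, snow and ice, no data


def regroup(category):
    """Return the group name for a land cover category, or None if excluded."""
    if category in EXCLUDED:
        return None
    for name, members in GROUPS.items():
        if category in members:
            return name
    raise ValueError(f"unknown land cover category: {category}")


def susceptibility_scores(categories):
    """Scale group frequencies to (0, 1], with 1 for the most frequent group.

    Assumed rule: score = group count / largest group count.
    """
    counts = Counter(g for g in map(regroup, categories) if g is not None)
    largest = max(counts.values())
    return {name: counts[name] / largest for name in counts}


# Hypothetical land cover categories at eight landslide locations.
sample = [16, 16, 22, 2, 5, 12, 19, 20]
scores = susceptibility_scores(sample)
```

With this toy sample, artificial surfaces occur three times and receive a score of 1, while the single water-body record (category 20) is ignored.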

Results and discussion
The SVM is a machine learning algorithm that requires data for training and validation. In this study, 70 % of the 581 landslide observations are used for model training and 30 % for model validation and verification. The model builds a classifier, called the SVM classifier, based on the training data. The SVM classifier is then validated using the validation data set. The target of the SVM classifier is either 0 or 1. Zero represents a non-landslide condition, while one indicates the occurrence of a landslide event. If both model output and target lead to the same value (either 0 or 1), the algorithm has successfully classified landslides from non-landslide events. Otherwise, the model has failed to predict the event. A model output of 1 with a target of 0 indicates a false landslide prediction. On the other hand, a model output of 0 with a target of 1 indicates a missed landslide prediction.

A. Farahmand and A. AghaKouchak: A satellite-based global landslide model
In the following example, a total of 6391 events (581 landslide events and 5810 non-landslide events) are sampled from across the globe. The 5810 non-landslide events are sampled from areas with precipitation and from different land-use land cover conditions and slopes all over the world. Samples are randomly taken from 2003, 2007, 2008 and 2009, for which observations are available. The target values of non-landslide events are set to 0 and those of observed landslide events are set to 1.
In order to ensure stability of the results, the 70 % training data were randomly sampled 100 times. In other words, the results are tested by running the model 100 times with different combinations of training and validation data. Figure 7a presents the overall error of the model landslide prediction in percent. In Fig. 7a, the horizontal axis represents the iterations (i.e., 100), and the vertical axis displays the error (%), which includes the error of both landslide and non-landslide events. As shown, the average error is between 6 and 7 % over 100 iterations. In order to provide more insight, two other error plots are presented: missed landslides (Fig. 7b) and false landslides (Fig. 7c). Here, false and missed events are calculated based on the common approach used for validation of remote sensing data as outlined in Wilks (2006). The missed landslide plot (Fig. 7b) shows the number of missed landslide events divided by the total number of landslide observations used for validation. On the other hand, Fig. 7c displays the number of falsely predicted events divided by the total number of non-landslide samples. Note that this model does not attempt to simulate landslides where the slope is less than 10 degrees or the 24 h precipitation accumulation is less than 5 mm (the same conditions applied to sample from the landslide observations). One can see that the missed and false landslide errors are approximately 7 and 2 %, respectively (see Fig. 7b and c). Figure 7b indicates that the error of missed landslides is very high at a few iterations. This is due to the limited number of observed landslides, which could result in no or few samples of certain types of landslides being available for training. For this reason, one needs to run the model with multiple randomly selected samples of training and validation data to make sure the training data are sufficient for landslide modeling and prediction.
In this example, one can see that many combinations of training and validation lead to a small averaged error in missed landslides (see the results of 100 random combinations of training and validation data in Fig. 7).
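The repeated random 70/30 evaluation and the three error measures (overall, missed and false) can be sketched as follows. This is a minimal illustration under several assumptions: the data are synthetic Gaussian clusters rather than the landslide inventory, labels are coded as +1/−1 as in the Methodology section rather than 1/0, each class is split separately, and the linear SVM is trained with a Pegasos-style stochastic subgradient method on the hinge loss as a simple stand-in for SMO. All function names and parameter values are hypothetical.

```python
import random


def train_linear_svm(X, y, lam=0.01, epochs=5, seed=0):
    """Pegasos-style subgradient descent on the regularized hinge loss.

    A constant 1.0 is appended to each pattern so the last weight acts as b.
    """
    rng = random.Random(seed)
    Xa = [list(x) + [1.0] for x in X]
    w = [0.0] * len(Xa[0])
    order = list(range(len(Xa)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)
            score = sum(wj * xj for wj, xj in zip(w, Xa[i]))
            w = [(1.0 - eta * lam) * wj for wj in w]
            if y[i] * score < 1.0:  # margin violated: step towards y_i * x_i
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, Xa[i])]
    return w


def predict(w, x):
    score = sum(wj * xj for wj, xj in zip(w, list(x) + [1.0]))
    return 1 if score >= 0 else -1


def error_rates(w, X, y):
    """Return (overall, missed, false) errors in percent on a validation set."""
    pred = [predict(w, x) for x in X]
    n_pos = sum(1 for yi in y if yi == 1)
    n_neg = len(y) - n_pos
    missed = sum(1 for p, yi in zip(pred, y) if yi == 1 and p == -1)
    false = sum(1 for p, yi in zip(pred, y) if yi == -1 and p == 1)
    return (100.0 * (missed + false) / len(y),
            100.0 * missed / n_pos if n_pos else 0.0,
            100.0 * false / n_neg if n_neg else 0.0)


def repeated_holdout(X, y, n_iter=100, train_frac=0.7, seed=0):
    """Repeat a random train/validation split, splitting each class separately."""
    rng = random.Random(seed)
    pos = [i for i, yi in enumerate(y) if yi == 1]
    neg = [i for i, yi in enumerate(y) if yi == -1]
    n_p, n_n = round(train_frac * len(pos)), round(train_frac * len(neg))
    results = []
    for k in range(n_iter):
        rng.shuffle(pos)
        rng.shuffle(neg)
        tr, va = pos[:n_p] + neg[:n_n], pos[n_p:] + neg[n_n:]
        w = train_linear_svm([X[i] for i in tr], [y[i] for i in tr], seed=k)
        results.append(error_rates(w, [X[i] for i in va], [y[i] for i in va]))
    return results


def synthetic_events(n_pos=30, n_neg=300, n_features=5, seed=1):
    """Two well-separated Gaussian clusters with a 1:10 class ratio (toy data)."""
    rng = random.Random(seed)
    X, y = [], []
    for label, n in ((1, n_pos), (-1, n_neg)):
        for _ in range(n):
            X.append([rng.gauss(2.0 * label, 1.0) for _ in range(n_features)])
            y.append(label)
    return X, y


X, y = synthetic_events()
results = repeated_holdout(X, y, n_iter=100)
mean_overall = sum(r[0] for r in results) / len(results)
```

Each entry of `results` holds the overall, missed and false error of one iteration, which corresponds to one point in each of the three panels of Fig. 7. Inspecting the spread of the missed error across iterations is what reveals whether some random splits leave too few landslide samples for training.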
For better illustration, Fig. 8 displays the SVM-based model output for one iteration. In Fig. 8, red circles indicate landslides identified correctly, whereas blue circles show non-landslides identified correctly.
It is worth pointing out that there are a number of real-time satellite data sets that can potentially be used as input into the proposed model (e.g., TRMM-RT, Huffman et al., 2007; PERSIANN, Hsu et al., 1997; Sorooshian et al., 2000; PERSIANN-CCS, Hong et al., 2004; and CMORPH, Joyce et al., 2004). Previous studies show that different satellite algorithms have their own advantages and disadvantages (Turk et al., 2008; Tian et al., 2009), and none of the precipitation data sets can be considered ideal, especially for detecting heavy precipitation rates (AghaKouchak et al., 2011b). Despite the uncertainties in satellite observations, the results of this research and previous studies (e.g., Hong et al., 2007b) indicate that a satellite precipitation data set can be utilized for landslide monitoring.
It is stressed that the NASA landslide inventory includes major landslides, and hence the presented model is not calibrated for modeling small scale landslide events. Practically, the model is suitable for the types of landslides it is calibrated for (here, NASA landslide inventory). Furthermore, it is acknowledged that the landslide events in the NASA landslide inventory are subject to errors and uncertainties that could affect the results. However, this data set is currently the only consistent global observational data that can be used for training and validation of large scale landslide models.

Conclusions
Landslides are devastating phenomena that cause huge damage around the world. This paper presents a quasi-global landslide model using an SVM approach. The input data include satellite precipitation data, land-use land cover maps, and 250 m topography information. The model was tested and verified against the NASA GSFC landslide inventory data. Throughout the study, 70 % of the data were used for model development and training, while 30 % were used for validation and verification. The model was used to simulate 100 iterations with random subsamples of 70 % training and 30 % validation. It should be noted that a large number of non-landslide events (10 times more than the observations) were randomly sampled to evaluate the performance of the model in detecting both landslide and non-landslide events. The results showed that the suggested landslide model can predict historical landslides reliably. The average error of 100 iterations of landslide prediction was estimated as approximately 6 to 7 %, while approximately 2 % false landslide and approximately 7 % missed landslide events were observed.
The authors point out that these conclusions are based on exploratory data analysis using observed records of landslide events. We acknowledge that remotely sensed precipitation estimates have uncertainties and biases (Hong et al., 2006a; AghaKouchak et al., 2011b; Hossain and Huffman, 2008) that could affect landslide monitoring and prediction. However, satellite data sets are the only source of real-time and consistent precipitation observations, especially over remote and topographically complex regions (Sorooshian et al., 2011; Nadim et al., 2006). In fact, landslides typically occur in mountainous regions where other sources of information (e.g., radar and gauge measurements) are not available. For this reason, the model has been developed with satellite observations so that it can be applied to remote and topographically complex regions.
This model cannot be considered a general landslide model, as it does not consider earthquake-triggered landslides. Furthermore, the model is not designed and calibrated for small-scale landslides (local-scale landslides not reported in the NASA landslide inventory). In addition to the data used in this model, other data sets such as soil type and/or soil moisture can be utilized. However, high-resolution soil type and soil moisture data sets are not available at a scale relevant to landslides.
The presented model can be coupled with a local physically based model to improve landslide monitoring and prediction by (a) using the presented model to identify landslide hotspots, and (b) using a local physically based model to simulate slope failure over the landslide hotspots identified in the first step. Finally, efforts are underway to further develop this model into a real-time landslide prediction model.