Spatio-temporal smoothing of lightning climatologies

Abstract. This study develops methods for estimating lightning climatologies on the d−1km−2 scale for regions with complex terrain and applies them to summertime observations (2010 – 2015) of the lightning location system ALDIS in the Austrian state of Carinthia in the Eastern Alps. Generalized additive models (GAMs) are used to model both the probability of occurrence and the intensity of lightning. Additive effects are set up for altitude, day of the year (season) and geographical location (longitude/latitude). The performance 5 of the models is verified by 6-fold cross-validation. The altitude effect of the occurrence model suggests higher probabilities of lightning for locations on higher elevations. The seasonal effect peaks in mid July. The spatial effect models several local features, but there is a pronounced minimum in the Northwest and a clear maximum in the Eastern part of Carinthia. The estimated effects of the intensity model reveal similar features, though they are not equal. Main difference is that the spatial effect varies more strongly than the analogous effect of 10 the occurrence model. A major asset of the introduced method is that the resulting climatological information vary smoothly over space and time. Thus, the climatology is capable to serve as a useful tool in quantitative applications, i.e., risk assessment and weather prediction.


Introduction
Severe weather, associated with thunderstorms and lightning, causes fatalities, injuries and financial losses (Curran et al., 2000).Thus, the private and the insurance sector have a strong interest in reliable climatologies for such events, i.e., for risk assessment or as a benchmark forecast of a warning system.For these quantitative purposes, it is crucial to separate signal and noise.Especially when the target variable, i.e., lightning, on the one hand varies strongly in space and time and on the other hand might be explained by other covariates, i.e., altitude.This holds in particular for regions with complex terrain.To this end it is desirable to identify smooth and potentially nonlinear functional dependencies between lightning and variables associated with space and time.This study aims at testing how generalized additive models can be applied in order to smooth lightning climatologies over complex terrain.
A study investigating ALDIS data for the period 1992 to 2001 (Schulz et al., 2005) found that flash densities over the complex topography of Austria are vary between 0.5 and 4 flashes km −2 yr −1 depending on the terrain.Data from the same lightning location system (LLS) was also analyzed to obtain thunderstorm tracks (Bertram and Mayr, 2004).The key finding is that thunderstorms are often initialized at mountains of moderate altitude and propagate towards flat areas afterwards.
Other studies focus on lightning detected in the vicinity of the Alps: A 6-year analysis of lightning detection data over Germany reveals highest activity in the northern foothills of the Alps and during the summer months, where the number of thunderstorm days goes up to 7.5 km −2 yr −1 (Wapler, 2013).Lightning activity is also very high along the southern rim of the Alps.Feudale et al. (2013) found ground flash densities up to 11 flashes km −2 yr −1 in northeast Italy, which is south of our region of interest, Carinthia.
These lightning climatologies provide averages of days with lightning activity and averages of counts of flashes, respectively, on the scale km −2 yr −1 .Since lightning varies strongly in space and time and only short time series of the order of 10 years are available, this procedure might yield spatial fields with strong fluctuations between neighboring cells.This issue is not a big drawback when the purpose of the analysis is to get an overall qualitative picture of the lightning activity, but it can become a problem in applications where a quantitative assessment is required, i.e., risk assessment or when the climatology has to serve as a benchmark weather forecast.A method is therefore needed to obtain climatologies smoothed in space and time.
In this study generalized additive models (GAMs, see, Hastie and Tibshirani, 1990;Wood, 2006) are applied to estimate a climatology of the probability of lightning, which could also be seen as a thunderstorm climatology with lightning as proxy (e.g., Gladich et al., 2011;Poelman, 2014;Mona et al., 2016), and a climatology of the expected numbers of flashes which can be interpreted as the intensity of thunderstorms.
GAMs have been used to compile climatologies of extremes (Chavez-Demoulin and Davison, 2005;Yang et al., 2016) and full precipitation distributions (Rust et al., 2013;Stauffer et al., 2016).An extension of GAMs is that not only expected values, but also other parameters of a distribution can be modeled (Stauffer et al., 2016).
The manuscript is structured as follows: The lightning detection data, the region of interest, Carinthia, and the pre-processing of the data is described in Sect. 2. The methods to estimate the lightning climatologies from the data is based on generalized additive models (Sect.3).The nonlinear effects estimated for occurrence and intensity model and exemplary climatologies are presented afterwards (Sect.4).Some special aspects of interest for end-users are discussed in Sect. 5.The study is summarized and concluded at the end of the manuscript (Sect.6).

Data
In this study 6 years of data (2010 -2015) from the ALDIS detection network (Schulz et al., 2005) are included.The data within this period is processed in real-time by the same lightning location algorithm ensuring stationarity with respect to data processing.The summer months May to August are selected, as this is the dominant lightning season in the Eastern Alps.The original ALDIS data contains single strokes and information which strokes belong to a flash.In order to analyze flashes, solely The aim of the statistical model is to explain the response, i.e., the probability of occurrence or counts of flashes, by appropriate spatio-temporal covariates, i.e., logarithm of the altitude (logalt), day of the year (doy) and geographical location (lon, lat).
Since the response might nonlinearly depend on the covariates we choose generalized additive models (GAMs) as a statistical framework, for which a brief introduction is presented in Sect.3.1.
It is assumed that the number of flashes detected within a cell and day are generated by a random process Y .Realizations of the random process are denoted by y i ∈ {0, 1, 2, . . .}, where i = 1, 2, . . ., n indicates the observation.Two distinct models are set up: first, a model for the probability of the occurrence of lightning Pr(Y > 0) within a cell and a day; second, a truncated count model to assess the expected number of flashes within a cell given lightning activity E[Y |Y > 0].This procedure refers to a hurdle model (Mullahy, 1986;Zeileis et al., 2008) which has the further benefit to be able to handle the large amount of zero valued data points (97.85%, in our case).The occurrence model and the intensity model are specified in Sect.3.2 and Sect.3.3, respectively.

Generalized additive model
The main motivation for using a GAM is the possibility to estimate (potentially) nonlinear relationships between the response and the covariates.In the following, the basic concept of GAMs is introduced for an arbitrary parameter θ of some probability density function d(•; θ) (PDF).A GAM aiming at modeling a spatio-temporal climatology over complex terrain would have the form, where g(•) is a link function that maps the scale of the parameter θ to the real line.The right hand side is called the additive predictor, where β 0 is the intercept term and f j are unspecified (potentially) nonlinear smooth functions that are modeled using regression splines (Wood, 2006;Fahrmeir et al., 2013).For each f j a design matrix X j containing spline basis functions is constructed.Thus the GAM can be written as generalized linear model (GLM), The coefficients β = (β 0 , β 1 , β 2 , β 3 ) are estimated by maximizing the penalized log-likelihood, where the first term on the right-hand side is the unpenalized log-likelihood.The second term is added to prevent overfitting by penalizing too abrupt jumps of the functional forms.λ j are the smoothing parameters corresponding to the functions f j , respectively.For λ j = 0 the log-likelihood is unpenalized with respect to f j .When λ j → ∞ the fitting procedure will select a linear effect for f j .The selection of the smoothing parameters λ j is performed by cross-validation.S j are prespecified penalty matrices, which depend on the choice of spline basis for the single terms.The reader is referred to Wood (2006) for more details.
Estimation of a GAM for such a large dataset, i.e., 7309152 data points, is feasible, e.g., via function bam() (for big additive models) implemented in the mgcv package (Wood et al., 2015;Wood, 2016) of the statistical software R (R Core Team, 2016).

Occurrence model
The first component models the probability of lightning Pr(Y > 0) = π to occur within a 1 km × 1 km cell and a day.The Bernoulli distribution with the parameter π of the PDF, will be fitted.Since the data is binomial y ∈ {0, 1} indicates no lightning and lightning within the cells, respectively.The model for π takes the form of Eq. 1 with π replacing θ.The complementary log-log g(π) = log(− log(1 − π)) is implemented as link function.

Intensity model
The second part is the truncated count component for the expected number of flashes given lightning activity.We will refer to this component as the intensity model.It is assumed that the positive counts of flashes within a spatial cell and day follow a zero truncated Poisson distribution with the PDF, where y ∈ {1, 2, . . .}, and d Pois (•; µ) is the PDF of the Poisson distribution with expectation µ.The conditional expectation is . The GAM for this component has the form of Eq. 1 with µ replacing θ.The logarithm serves as link function g(µ) = log(µ).
The family for modeling the zero truncated Poisson distribution ztpoisson() within a GLM or GAM framework is implemented in the R-package countreg (Zeileis and Kleiber, 2015).For more information on and a formal definition of hurdle models the reader is referred to Zeileis et al. (2008).

Results
This section presenting the results of the statistical models is structured as follows: first, the nonlinear effects of the occurrence model are described in Sect.4.1; second, the effects of the intensity model are presented in Sect.4.2.Finally, exemplary applications illustrate in Sect.4.3 how climatological information can be drawn from the models.

Occurrence model
The estimates of the effect of the occurrence model (Sect.3.2) are depicted in Fig. 3.The additive predictor, takes the value of the intercept term β 0 if the sum of all other effects is equal to zero.Its estimate is β 0 = −3.97(−4.15, −3.80) on the complementary log-log scale, which is 1.87% (1.56%, 2.21%) in terms of probability of lightning.The numbers in parenthesis are 95% confidence intervals computed from 1000 day-wise block-bootstrapping estimates.
The second term f 1 (logalt) of the additive predictor models the effect of the logarithm of the altitude.f 1 (logalt) varies from roughly −0.2 for low altitudes to values greater than 0.5 for altitudes above 2800 m (Fig. 3, top-left).This function takes 7.9 degrees of freedom.Its shape is close to exponential, which suggests that a linear term for the altitude would be sufficient.
For altitudes above 2000 m, however, the nonlinear term leads to larger values than a linear term β 1 alt would do.
The temporal or seasonal effect f 2 (doy), i.e., the dependence of the target on the day of the year (doy), shows a steep increase during May, reaches its maximum in mid July and decreases slowly during August (Fig. 3, top-right).This result indicates that the main lightning season in Carinthia lasts from mid June until end of August.The estimated degrees of freedom are 2.5, which leads to the simple shape of the seasonal effect with one clear maximum.
The spatial effect f 3 (lon, lat), which explains the spatial variations of the linear predictor that cannot be explained by the of Carinthia are located.The minimum with values less than −0.3 on the complementary log-log scale suggests that lightning activity is less pronounced in this region.This finding is inline with former analyses of the lightning activity in Austria (Troger, 1998;Schulz et al., 2005), which stated that the main alpine crest is an area with a minimum in flash density.The maximum zone with values exceeding 0.3 in the mid to eastern part of Carinthia covers the so-called Gurktal Alps.In comparison with the High Tauern, the Gurktal Alps are on lower average elevation and the mountains are not as steep.

Intensity model
The nonlinear effects of the intensity model (Sect.The spatial effect f 3 (lon, lat) varies strongly and requires 166 degrees of freedom which is more than the corresponding effect of the occurrence model.However, there are some features common for both effects.For instance, the prominent maximum visible in Fig. 4 (bottom) in the Gurktal Alps appears also in the spatial effect of the count model (Fig. 3, bottom).
Common is also the strong minimum in the western part of the domain.The most pronounced new feature is the strong local maximum with values exceeding 0.9 in the south of Carinthia.A 165 m radio tower is installed on the peak of the Dobratsch mountain (location C in Fig. 1), which triggers lightning strokes under suitable conditions, i.e., occurrence of a thunderstorm.
Other maxima of this effect could also be attributed to sites of radio towers, which suggests that the number of flashes is more sensitive to local constructions than the probability of lightning.

Applications
In order to illustrate how climatological information can be drawn from the GAMs, two different kinds of applications are presented.First, maps show spatial climatologies (Fig. 5 and Fig. 6).Here, the occurrence model and/or the intensity model are evaluated for one specific day.Second, the seasonal climatology for selected 1 km × 1 km grid cells are discussed (Fig. 7).
Here, the models are evaluated with respect to the geographical location of the point of interest and its altitude.
The spatial distribution of climatological probabilities of lightning to occur in a cell for July 20 (close to the seasonal peak) varies from 1.8% to 6.5% (Fig. 5).In the western part of the domain, local valleys and mountain ridges become visible through the altitude effect (Fig. 3, top-left).However, the highest probabilities do not occur over the highest terrain in the northwest, where the spatial effect counteracts the altitude effect leading to moderate probabilities around 2% to 3%.The spatial effect (Fig. 3, bottom) is responsible for the maximum over the moderate altitude region of the Gurktal Alps.Such a map can also serve as thunderstorm climatology when lightning is taken as a proxy for thunderstorms.
For the same day, July 20, the expected number of flashes is depicted in Fig. 6.This is the product of probabilities of lightning π from the occurrence model and the expected number of flashes given lightning activity, which is derived from the intensity model.Values are ranging from 0.028 to 0.166.The lowest values can be found in the northwestern part of Carinthia where the spatial effects of both models reveal a minimum.Next to the maximum in the Gurktal Alps, where also maxima in the spatial effects of both models can be found, a second peak appears at the Dobratsch mountain (location C in Fig. 1) which is due to the local maximum in the spatial effect of the intensity model (Fig. 4, bottom).
Next to the spatial information one can extract seasonal climatologies for different locations (Fig. 7).These are computed exemplary for five sites (Table 1).The left panel of Fig. 7 shows the climatologies of lightning probability.Differences between the annual cycles of the probabilities are due to the altitude effect and the spatial effect of the occurrence model (Fig. 3).The highest probabilities between 4% and 5% are modeled in July for location B (dashed line), which is located at the southwestern border of Carinthia in vicinity of a local maximum of the spatial effect (Fig. 3

Discussion
This section addresses two points helpful for end-users.The first one is on how to choose the cross-validation score in order to avoid overfitting of the seasonal effect (speaking technically the selection of its smoothing parameter λ).The second point is a discussion on how the introduced model (Eq. 1) can be extended towards a weather prediction tool, i.e., for warning purposes.
For illustration of the first point a subset of the large dataset is selected.We pick all data points in a 5 × 5 neighborhood around the location E. Thus, only 6 years × 123 days × 25 cells = 18450 data points remain.The probability of lightning π is the target variable.Furthermore, altitude and spatial effects are omitted for simplicity for such a small region (cf. the smooth spatial effect of the occurrence model in Fig. 3, bottom).Thus the GAM has the form, Fig. 8 shows the estimates of the two models.The estimate resulting from the cross-validation with day-wise blocks is much smoother (1.9 degrees of freedom) than the estimate resulting from the cross-validation without daily blocks (29.9 degrees of freedom).Thus the latter estimate takes roughly the maximal degree of freedom and is obviously overfitted.
The reason for the distinct estimates lies in the dependence structure of the data.For one cell the probability to detect lightning on one day given lightning was detected on the previous day is 6.7%.Spatial dependence is much stronger.Provided that lightning occurs in one cell, the probability of lightning to occur in the adjacent cell is 41%.This strong spatial dependence comes with a physical meaning.First, the preconditions for thunderstorms and lightning to take place vary much stronger from day-to-day than in the course of a single day.Second, thunderstorm systems, i.e., multi-cell thunderstorms or super-cell thunderstorms, cover a large area or even travel over a larger area (Markowski and Richardson, 2011).
For this reason we recommend to explore the dependence structure of the data first and to define the cross-validation score according to this dependence structure.
Finally, we discuss how the introduced model (Eq. 1) can be extended in order to serve as a weather prediction tool.It is possible to add further predictors from a numerical weather prediction system to the right hand side of Eq. 1.In the case of lightning and thunderstorms suitable predictors could be convective inhibition energy (CIN), convective available potential energy (CAPE), vertical shear of horizontal winds or large scale circulation patterns (e.g., Bertram and Mayr, 2004;Chaudhuri and Middey, 2012).Within the GAM framework nonlinear effects and interactions of these predictors can be modeled.Another major benefit of this procedure is that the climatology is nested within the additive predictor.Thus the performance of the prediction tool would be at least on the quality level of the climatology, but would not fall below.

Conclusions
This study presented how generalized linear models (GAMs) (e.g., Hastie and Tibshirani, 1990;Wood, 2006) provide a useful tool for building a lightning climatology or a climatology for the occurrence of thunderstorms.The main concept is to decompose the signal into different effects: an altitude effect, a seasonal effect and a spatial effect.The most beneficial aspect of this method is that smooth estimates for these effects are obtained, which makes the resulting climatology a valuable tool for quantitative purposes, e.g., risk assessment or benchmarking in weather prediction.
Moreover, a hurdle approach was employed which is also capable to handle the large amount of zeros in the data.Thus two aspects of lightning are captured by the models: the probability of lightning to occur and the number of flashes detected within a grid cell.The effects of the two models are similar though not equal.In particular the spatial effect of the intensity model varies more strongly than the corresponding effect of the occurrence model.Moreover, new features were exhibited like local maxima in vicinity of radio towers.
In sum, the occurrence model and the count model took roughly 150 and 180 degrees of freedom, respectively.This is a relatively small number compared to the degrees of freedom required by other methods, i.e., counting and averaging flashes with respect to a resolution of km −2 yr −1 would lead to 9904 degrees of freedom in the introduced case without capturing the seasonal cycle.Thus the GAM approach leads to a smooth, nonlinear and sparse quantification of the climatologies.0 500 1000 1500 2000 2500 3000 3500 q q q q q q q q q q A B C D id name lon (    q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q May June July Aug 0% 4% 8% 12% 16% q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q May June July Aug altitude term f 1 (logalt), requires 138 degrees of freedom.(Fig.3, bottom).Most prominent features are the minimum near the northwest and the maximum in the mid to eastern part of Carinthia.In the northwest the highest mountains, the High Tauern, Nat.Hazards Earth Syst.Sci.Discuss., doi:10.5194/nhess-2016-198,2016   Manuscript under review for journal Nat.Hazards Earth Syst.Sci.Published: 4 July 2016 c Author(s) 2016.CC-BY 3.0 License.
3.3) are depicted in Fig. 4. The estimate of the intercept term takes the value β 0 = −0.01(−0.19, 0.14) which leads to a expected number of flashes given lightning activity of 1.57 (1.47, 1.68) when the sum of all other effects is equal to zero.The altitude effect f 1 (logalt) (Fig.4, top-left), with 5.4 degrees of freedom, reveals a similar functional form as the altitude effect of the occurrence model (Fig.3, top-left).However, it has a flatter shape for the terrain between 600 m-1200 m and a steeper increase for high altitudes above 2000 m a.m.s.l..The seasonal effect f 2 (doy) is −0.5 in early May, reaches a maximum of 0.3 in early July and decreases to values around −0.3 until the end of August (Fig.4, top-right).Thus the amplitude of this effect is not as strong as the seasonal effect of the occurrence model (Fig.3, top-right) and the location of the maximum is earlier.It has 2.1 degrees of freedom.
, bottom).This climatology exhibits a strong seasonality, as probabilities fall below the 1% level.Though located at a similar altitude, the climatology of location A (solid line) reveals maximum values less than 2%.This difference is due to the spatial effect, which exhibits a clear minimum in northwestern Carinthia.The climatology of location D (dashed dotted line) in the lower plains in the eastern part of Carinthia show moderate chance of lightning with values around 3% during the peak of the season.The climatologies of the expected number of flashes are depicted in the right panel of Fig.7.The order of location has changed.In particular the highest number of flashed are expected for location C which is the Dobratsch mountain.This is caused by the strong local maximum in the spatial effect of the intensity model(Fig.4, bottom).Accumulating the probabilities over the season leads to values between 2.1 for location A and 7.6 for location C.These values are in good agreement with the analysis bySchulz et al. (2005, Fig 5. therein).
6)Nat.Hazards Earth Syst.Sci.Discuss., doi:10.5194/nhess-2016-198,2016   Manuscript under review for journal Nat.Hazards Earth Syst.Sci.Published: 4 July 2016 c Author(s) 2016.CC-BY 3.0 License.The model is fitted twice: first, with the selection of the smoothing parameter λ by six-fold cross-validation where the observations made on a single day are kept together in a block; second, λ is determined by six-fold cross-validation without the day-wise blocks.In both cases the maximal number of degrees of freedom is set to 30.

Figure 3 .Figure 4 .
Figure 3.The effects of the occurrence model on the scale of the additive predictor.Top-Left: The altitude (logalt) effect.Ticks on the x-axis are set in 100 m intervals.The gray lines show 1000 estimates from day-wise block-bootstrapping.The solid red line is the median of the 1000 estimates, the dashed red lines are the 95% confidence intervals.Top-Right: The seasonal (doy) effect.Bottom: The spatial (lon, lat) effect.The plot shows the median of 1000 estimates from day-wise block-bootstrapping.The difference between two contour lines is 0.1.Dashed contour lines indicate negative values.

Figure 8 .
Figure 8. Local fits for the location E. Circles show empirical estimates.For comparison the estimate of the full occurrence model is added (dashed line).Left: Solid line is the GAM evaluated by cross-validation with day-wise blocks.Right: Solid line is the GAM evaluated by cross-validation without day-wise blocks.
Figure 5. Climatological probability of lightning for July 20 in Carinthia on the 1 km × 1 km scale.14 Nat.