The assessment of road infrastructure exposure to extreme weather events is of major importance for scientists and practitioners alike. In this study, we compare the different extreme value approaches and fitting methods with respect to their value for assessing the exposure of transport networks to extreme precipitation and temperature impacts. Based on an Austrian data set from 25 meteorological stations representing diverse meteorological conditions, we assess the added value of partial duration series (PDS) over the standardly used annual maxima series (AMS) in order to give recommendations for performing extreme value statistics of meteorological hazards. Results show the merits of the robust L-moment estimation, which yielded better results than maximum likelihood estimation in 62 % of all cases. At the same time, results question the general assumption of the threshold excess approach (employing PDS) being superior to the block maxima approach (employing AMS) due to information gain. For low return periods (non-extreme events) the PDS approach tends to overestimate return levels as compared to the AMS approach, whereas an opposite behavior was found for high return levels (extreme events). In extreme cases, an inappropriate threshold was shown to lead to considerable biases that may outperform the possible gain of information from including additional extreme events by far. This effect was visible from neither the square-root criterion nor standardly used graphical diagnosis (mean residual life plot) but rather from a direct comparison of AMS and PDS in combined quantile plots. We therefore recommend performing AMS and PDS approaches simultaneously in order to select the best-suited approach. This will make the analyses more robust, not only in cases where threshold selection and dependency introduces biases to the PDS approach but also in cases where the AMS contains non-extreme events that may introduce similar biases. For assessing the performance of extreme events we recommend the use of conditional performance measures that focus on rare events only in addition to standardly used unconditional indicators. The findings of the study directly address road and traffic management but can be transferred to a range of other environmental variables including meteorological and hydrological quantities.

Reliable information about the exposure of road infrastructure networks to extreme weather events is of major concern for road authorities, governmental institutions and safety researchers all over the world (TRB, 2008; Koetse and Rietveld, 2009; Eisenack et al., 2011; Doll et al., 2013; UNECE, 2013; Meyer et al., 2014; Michaelides, 2014; Schweikert et al., 2014a, b; Matulla et al., 2017). In a changing climate (IPCC, 2012) and due to extensive soil sealing (Nestroy, 2006) the impacts of extreme weather events are likely to increase in both frequency and intensity (APCC, 2014). Against this background, the resilience of transport systems with respect to weather hazards has become increasingly important.

A basic requirement for foresightful road infrastructure management is data about both the probability and magnitude of severe weather events. This information can be derived from long-term records of weather quantities such as precipitation and temperature by means of statistical extreme value modeling. While extreme value theory provides a methodological framework that is commonly used in various scientific disciplines, such as hydrology (Katz et al., 2002), finance (Embrechts et al., 2003), engineering (Castillo et al., 2005) and climate sciences (Katz, 2010; Cheng et al., 2014), the application of these tools for road network exposure analysis is a relatively uncharted area. In particular, formal comparative assessments of the various statistical methods that can be applied for estimating return levels of extreme events are rare.

Two basic approaches have been proposed for deriving extreme value series (Coles, 2001), which are both widely applied in studying extreme meteorological events (e.g., Smith, 1989; Davison and Smith, 1990; Parey et al., 2010; Villarini, 2011; Papalexiou and Koutsoyiannis, 2013). On the one hand, the maximum value per year can be used in the block maxima approach, resulting in an annual maxima series (AMS). On the other hand, all values exceeding a certain threshold can be considered extreme, leading to the threshold excess approach based on partial duration series (PDS). Once the extreme value series has been derived, an appropriate distribution function is fitted to these observations by using different parameter estimation methods, such as maximum likelihood estimation, method of moments or Bayesian methods for parameter estimation. Clearly, there are a number of possible combinations of the approaches that may lead to different, often equally plausible results.

Several efforts have been made to compare the performance of block maxima
and threshold excess approaches. While some studies only provide a
qualitative description of resulting parameter estimates and estimated
return levels for both methods (Jarušková and Hanek, 2006), more
formal assessment approaches are based on the asymptotic variance of the

In this study, we compare the different extreme value approaches and fitting methods with respect to their value for assessing the exposure of transport networks to extreme weather impacts. Based on an Austrian data set from 25 meteorological stations representing diverse meteorological conditions, we assess the added value of partial duration series over the standardly used annual maxima series in order to give recommendations for performing extreme value statistics of meteorological hazards.

This study focuses on several meteorological indicators that can be used to assess the exposure of road networks to two main meteorological quantities: precipitation and temperature. These two variables are considered to have the most serious influence on damage to infrastructure (Matulla et al., 2017). They are measured by meteorological services on a regular basis so the data quality is usually high. Nevertheless, the methodology presented in this paper is applicable to various other meteorological quantities (e.g., maximum wind speed) when time series of about 30 years or more are available.

Four meteorological indices are used in this study. Temperature impacts are
considered by daily minimum (

In order to identify suitable meteorological stations that represent the main climate features of the highway network in Austria, all monitoring stations operated by the national weather service Zentralanstalt für Meteorologie und Geodynamik (ZAMG) served as a starting point. The selection of suitable stations was carried out in a stepwise procedure with respect to the following considerations. Firstly, the spatial proximity of available measuring stations to the highway network was considered by excluding stations with a distance greater than 10 km from the data set. Secondly, data availability and data quality were considered. As sufficiently long time series are a prerequisite for reliable return level estimation, only stations with more than 30 years of record (i.e., since 1 January 1985) and with less than 5 % missing values were selected. Finally, topographic conditions and regional peculiarities were taken into account for selecting evenly spread and climatically representative stations. This step was guided by visual inspection of climate maps (Hiebl et al., 2011) and the digital hydrological atlas of Austria (BMLFUW, 2007; Fürst et al., 2009). The data set so obtained consists of 25 hot spots representing climatically homogeneous regions of Austria (Fig. 1).

Location of the selected meteorological stations used for extreme value analysis.

The first approach for deriving extreme value series consists in selecting maximum (or similarly minimum) values of the observations within subsequent time intervals (blocks) of constant length. While the block size is freely selectable, a trade-off has to be made between bias (small blocks) and variance (large blocks). Most commonly, the length of the block is chosen to correspond to a calendar year (Coles, 2001), resulting in an annual extreme value series. This was also the case in our study.

Based on the Fisher–Tippett–Gnedenko theorem, a generalized extreme value
(GEV) distribution is appropriate for modeling the resulting annual maxima
series (Fisher and Tippet, 1928; Gnedenko, 1943). The cumulative
distribution function of the GEV is defined by

The GEV comprises three different types of distributions, which can be distinguished by the sign of their shape parameter: Gumbel, Fréchet and Weibull distribution (Fréchet, 1927; Gumbel, 1958; Coles, 2001; Embrechts et al., 2003; Basrak, 2014). The Gumbel distribution is commonly applied for maxima that are not limited towards an upper bound, whereas the Weibull case is more appropriate for minima that are often limited by a lower bound (Tallaksen and van Lanen, 2004).

In some cases, fitting distributions to block maxima data is a wasteful approach as only one value per block is used for modeling. A threshold excess approach potentially provides more information on extremes (Coles, 2001).

Analogous to the choice of the block size in the block maxima approach, the
selection of the threshold value in the threshold excess method is also
subject to a trade-off between bias (due to selecting non-extreme events if
the threshold is low) and variance (due to a small number of exceedances
when selecting a high threshold). Hence, the choice of a suitable threshold
is important. The basic aim is to select the potentially lowest threshold,
given the prerequisite that the extreme value model must provide a
reasonable approximation to exceedances above this threshold and shall not
contain non-extreme events (Coles, 2001). According to the
Pickands–Balkema–de Haan theorem, a GP distribution
is suited for modeling the resulting threshold excesses (Balkema and de
Haan, 1974; Pickands, 1975): It states that, for some large threshold

A number of approaches have been proposed for selecting an appropriate
threshold. Coles (2001) suggests to let the selection be guided by graphical
diagnostics about bias (i.e., mean excess; see Ghosh and Resnick, 2010, for a
detailed discussion) and stability of the scale and shape parameter. Despite
these criteria being well justified from a theoretical point of view, their
application involves substantial elements of subjectivity, leading to
ambiguous results (Scarrott and MacDonald, 2012; Northrop and Coleman,
2014). To overcome this problem, we employed the deterministic square-root
rule

Extreme value theory assumes that data are independent and identically
distributed (Coles, 2001; Gilleland and Katz, 2011; Katz, 2010, 2013;
Cheng et al., 2014). To test for non-stationarity in the expected value we
perform separate Mann–Kendall trend tests (Mann, 1945; Kendall, 1976;
Gilbert, 1987) at a significance level of

For climate variables independence of data is usually a minor issue for the annual maxima approach as multi-annual dependencies are usually low for most climates (Madsen et al., 1997a; Katz et al., 2002). Regarding the threshold excess method, threshold exceedances on consecutive days will likely violate the assumption of independence. Dependent values in the threshold excess series are eliminated by a declustering procedure that consists in removing threshold exceedances within the autocorrelation length on both sides of the local maxima (Jarušková and Hanek, 2006). Based on sensitivity analysis an autocorrelation window of 5 days was chosen for the three temperature indicators, while a window of 3 days was chosen for accumulated daily precipitation.

Once the extreme value series is available, a theoretical distribution needs to be fitted. Two different methods of parameter estimation are used within the scope of the present analysis.

The first method, maximum-likelihood estimation (MLE), was formally
introduced by Fisher in the early 20th century (Fisher, 1912; Aldrich,
1997; Hald, 1999). Let

The second method, L-moment estimation (LMOM), evolved from modifications
of probability weighted moments of Greenwood et al. (1979). They are linear
combinations of first-order statistics and are hence more robust to
measurement errors or sampling uncertainty than conventional moments
(Hosking, 1990). The

As far as reliability of the fitting results is concerned, confidence
intervals play a major role for assessing uncertainty. The most common way
to derive a

The approach assumes Gaussian distributed parameter estimators, which may be
inappropriate for extreme value distributions. For LMOM estimators
resampling methods have been recommended (Burn, 2003). Thus, nonparametric
bootstrapping with 500 iterations was applied in this study. MLE offers a
more accurate method for deriving confidence intervals based on the profile
likelihood (Coles, 2001). The profile log-likelihood for

There are various performance measures that are regularly employed in model evaluation, including the root-mean-square error (RMSE) and the mean absolute error (MAE). These metrics provide a comprehensible and objective basis regarding the assessment of the fitted functions.

However, most events of the extreme value series are only moderate and these
will have an overly excessive influence on the performance measure. In order
to specifically assess the accuracy of the fitted models for higher
quantiles (i.e., for larger return periods), we propose conditional variants
of the RMSE (

Distribution-fitting tests such as Kolmogorov–Smirnov, Anderson–Darling or
Cramér–von Mises are not used in this study. Such tests are primarily
useful for gaining an appreciation of whether a lack of fit is statistically
significant or an effect of sampling uncertainty, but they have
little discriminative power to identify the “true” or “best”
distribution to use (e.g., Stedinger et al., 1993). Instead, we perform
graphical diagnosis of the extreme value series and the fitted distributions
in quantile plots, which allow a more complete assessment. Plotting of
empirical distributions is straightforward. The return level (i.e.,
magnitude)

Extreme value series were checked for stationarity. Most of the temperature hot spots showed a significant trend in at least one of the temperature indicators, but often maxima and minima series were simultaneously affected. All significant trends were incorporated in the model. As illustrated by Fig. 2, the consequence of incorporating a trend model in the analysis is non-stationary return levels that refer to a specific time. We will give results for the end of the observation period.

Return level plot of temperature maxima at Pörtschach (Carinthia) with linear trend correction. The trend is visible in the lines depicting the 5-year return level (gray dashed line) and the 100-year return level (green solid line). This is an illustrative example of temperature trends that are commonly observed at the selected stations for both temperature maxima (increasing trend) and temperature minima (decreasing trend).

For precipitation, non-stationarity seems less important than for temperature indicators: about 85 % of the hot spots of our study area showed no trend in the annual extremes. This is consistent with the expectation of the Austrian Panel of Climate Change (APCC, 2014) that climate impacts on precipitation will mainly lead to seasonal shifts rather to changes in total annual precipitation.

The two approaches have been tested for the four meteorological indicators. In summary, it becomes apparent that the relative performances of MLE and LMOM are strongly situation dependent. For instance, while the return level plots for temperature maxima at Schwechat in the eastern lowlands show that the function fitted on the basis of LMOM behaves more robust, which appears to be beneficial in this case (Fig. 3), return level plots of daily rainfall at Brenner on the Austrian–Italian border indicate that the less robust MLE offers better fit for higher quantiles (Fig. 4).

Return level plot of temperature maxima at Schwechat. Return level estimation is based on the threshold excess approach with two different parameter estimation methods (MLE and LMOM estimation). Solid lines show the mean estimate, while dashed lines indicate the 95 % confidence intervals for the fitted functions.

Return level plot of daily rainfall events at Brenner. Return level estimation is based on the block maxima approach with two different parameter estimation methods (MLE and LMOM estimation). Solid lines show the mean estimate, while dashed lines indicate the 95 % confidence intervals for the fitted functions.

Table 1 summarizes the overall goodness of fit for the 100 climate records
(25 stations

Comparison of parameter estimation methods for the AMS approach based on goodness-of-fit measures RMSE and MAE. Numbers indicate success cases of MLE and LMOM.

The relative performances turned out to be more balanced with respect to the PDS approach. As indicated by Table 2, MLE performed better in 56 and 53 % of the cases when judged by the RMSE and MAE, respectively. Again, daily minimum temperature deviates from the general picture by showing clear advantages in favor of LMOM estimation in this case.

Comparison of parameter estimation methods for the PDS approach based on goodness-of-fit measures RMSE and MAE. Numbers indicate success cases of MLE and LMOM.

Apart from the overall goodness of fit it is interesting to assess how the
fit depends on the return period of events. This has been done by visual
inspection of the distribution plots, such as the examples shown in Figs. 3
and 4. In most cases there were only minor differences between MLE and
LMOM when considering return levels below 10 years, but often considerable
differences for larger return periods. For the 100-year events, for example,
results of the temperature indicators differed by about 0.5

As the objective of extreme value analysis is usually related to return
periods of 10 years or more, we specifically assessed the performance of the
extreme upper tail of the distribution by the conditional goodness-of-fit
measures CRMSE

Comparison of parameter estimation methods for the AMS approach
based on conditional goodness-of-fit measures CRMSE

Comparison of parameter estimation methods for the PDS approach
based on conditional goodness-of-fit measures CRMSE

In contrast, results for the PDS showed, again, a slight advantage of MLE when assessed with the goodness-of-fit measures for the conditional variants. Both measures indicate a preference towards MLE in 58 % of the cases. The better performance of the MLE method is against the expectation based on robustness and will be examined in more detail in the following section.

Table 5 presents the relative performances of AMS and PDS approaches based on the two parameter estimation methods. Although overall results show advantages for the AMS approach in terms of goodness of fit for the upper tail of the underlying distributions, results largely depend on the underlying meteorological indicators. While precipitation and daily maximum temperature difference offer a better fit when using GEV distributions of AMS, GP distributions of PDS appear better suited for modeling daily temperature maxima and minima.

Comparison of AMS and PDS approach based on conditional
goodness-of-fit measures CRMSE

To perform a direct comparison, Fig. 5 presents the deviations between return levels derived via AMS and PDS approach for the four meteorological indicators. Interesting patterns regarding the magnitude of the estimated return levels can be observed. For precipitation, PDS/GP estimates result in slightly higher return levels for lower return periods (indicated by negative deviations) and this behavior changes to the opposite for higher return periods. Maximum temperature shows the same tendencies as precipitation, but the PDS/GP always yields higher return levels than the AMS/GEV, suggesting that differences mainly occur at higher return periods. Temperature minima, however, show a rather constant overestimation (i.e., underestimation of negative magnitude) of PDS compared to AMS regardless of the frequency of events, with patterns of temperature difference being a combination of the effects of temperature maxima and minima. Overall, the average deviations between methods mostly increase with the return period, and the variability between cases increases as well. This issue will be further explored in the discussion section.

Differences in estimated return levels between GEV and GP models
for six selected return periods. These differences are calculated by
subtracting the GP estimate from the GEV estimate, given the same parameter
estimation method. This results in

Finally, Table 6 summarizes the success rates of all methods based on
CRMSE

Success rates of methods according to CRMSE

We compared the relative merits of the block maxima method and the threshold excess approach. In addition, two different fitting methods have been contrasted. This results in four possible combinations of extreme value model parameter estimation, all of which have certain strengths and weaknesses. Concerning the fit of the distributions to sample, we found a slight advantage of using LMOM instead of MLE, especially in combination with AMS/GEV. For PDS/GP there was a slight advantage of using MLE. However, overall, the differences were not huge.

The conditional assessment of the individual deviation between return levels of AMS/GEV and PDS/GP yielded deeper insight in the relative performances of methods. Most importantly, we found ambiguous systematic deviations between both approaches (Fig. 5), depending on the meteorological indicators under consideration: concerning temperature minima, PDS/GP was found to consistently overestimate return levels compared to the AMS/GEV approach, while results for temperature maxima and – albeit to a lesser extent – temperature differences show just the opposite. Regarding daily rainfall events, the PDS/GP approach tends to slightly overestimate return levels for low return periods (non-extreme events) as compared to the AMS/GEV approach, while an opposite behavior was found for high return levels (extreme events). To assess the reasons for this systematic behavior, we selected four example series that represent extreme cases, where results of approaches differ significantly.

The first two examples are daily precipitation at Sankt Michael (Fig. 6a) and Brenner (Fig. 6b), where extreme value series deviate from the ideal, smooth behavior of a homogeneous extreme value series. These fluctuations point to either measurement errors or process heterogeneity that will introduce uncertainty into extreme value analysis. In the case of Sankt Michael, the most extreme events appear as outliers that deviate from the general behavior of the sample. In general, LMOM will give lower weight to such leverage points but this seems not the case here where the GP fitted by LMOM seems more attracted. A plausible explanation would be that the upper-tail behavior results from the attraction of the distribution at the lower end because of the limited flexibility of the GP. In the case of Brenner, the extreme values seem to follow the same distribution as the remaining sample so one would have more confidence in the validity of these values. However, extreme values are always prone to higher uncertainty than the remaining sample. The MLE estimate gives more weight to these values and shows a better fit at the upper tail in this case, whereas LMOM gives less weight to these values and makes visible that they are not perfectly following the shape of the entire distribution. The choice of the parameter estimation method will finally depend on the weight one tends to give to the extreme values as compared to the remaining sample.

Return level plots of daily rainfall events at the hot spot
in

Return level plots of

It is also interesting to analyze extreme cases where AMS/GEV and PDS/GP
methods yield contrasting results (Fig. 7). When focusing on the empirical
distributions, we observe that only the more extreme events (three in the
case of Bruck an der Mur and two in the case of Graz) have almost identical empirical
probabilities in both extreme value series. At the lower end, we observe
that there are several events in the AMS/GEV below the threshold level of
PDS, which fit well to the distribution of the higher values so we find no
evidence to exclude them from the analysis. The shift in the distribution
can therefore be regarded as an effect of threshold level selection, which
determines the lower end and therefore the shape of the lower part of the
PDS/GP distribution. Between the undisturbed upper part and the disturbed
lower part a breakpoint at

Similar shifts may arise if the extreme value series contains dependent events. Non-extreme events are generally more likely to cluster than extreme events because they are generated by exceptional process combinations, which are unlikely to occur more often during one extreme weather situation. Thus, dependencies may possibly affect all parts (but more likely the lower part) of the distribution apart from the maximum, which remains unchanged. In consequence, the empirical distribution is stretched at the lower tail (shifted to the left), with similar consequences on lower and upper tail as described for the case of data uncertainty and leverage points. Such artifacts are difficult to detect in quantile plots of one extreme value series alone but are often visible from direct comparison of AMS/GEV and PDS/GP approaches. Although both AMS/GEV and PDS/GP may be affected by dependency of events, AMS/GEV behaves more robust since it selects only one event per year.

These findings are against our initial expectation and contradict the spirit of most existing studies that aimed to recommend the best-performing method for a variable or situation. Instead of recommending either block maxima or threshold excess method, we recommend performing both approaches, as their combined assessment by means of diagnostic plots together with overall and conditional goodness-of-fit measures offers a more complete diagnosis of the quality of extreme series and the resulting distributions.

Concerning the parameter estimation method, there are also benefits and disadvantages that have to be balanced against each other. MLE has some merit with respect to calculating reliable confidence intervals via profile likelihood. Confidence intervals for estimation via LMOM were derived with non-parametric bootstrapping, which is arguably less trustworthy for indicating the uncertainty of the estimates. However, LMOM estimation has been shown to yield more robust estimation results for small sample sizes (Hosking et al., 1985; Hosking et al., 1987), which can be especially beneficial when analyzing environmental data like temperature or precipitation indicators, which are derived from raw measurements at meteorological measuring stations. Regarding the overall results, LMOM estimation turned out to offer a better fit than MLE, which is consistent with previous findings (Hosking et al., 1985; Hosking et al., 1987; Bezak et al., 2014).

Mean residual life plots of

Concerning the comparison based on the goodness of fit of the distributions it shall be noted that a formal comparison of the two extreme value selection approaches is not straightforward. Measures of goodness of fit are not fully conclusive, as the underlying extreme value series are derived by different methods and thus are not directly comparable. Our analysis demonstrates that the choice between these approaches has to be based on the statistical properties of the extreme value series, which are related to the indicators under consideration and on data availability. The conditional measures proposed in this paper help to perform a more specific assessment for extreme events, but they are also not a remedy to overcome this problem. They are a way to assess the goodness of fit at the upper tail of the distribution and facilitate the comparison between AMS/GEV and PDS/GP. These metrics can assist, but not substitute, careful analysis of assumptions. We show that contrastive plotting methods can strongly support these analyses.

While the methodology of this study can be easily generalized and extended to cover other environmental variables, four possible limitations have to be discussed. First, the seasonality of temperature and precipitation extremes has not been taken into account. While maximum/minimum temperatures will always occur in the same season, which will factor out any seasonal heterogeneity, this is not genuinely the case for extreme precipitation events, where seasonal occurrence may be associated with diverging processes (Hundecha et al., 2009). In order to account for seasonal effects, a common approach is to split the events into process-homogeneous subsets. This can be based either on seasonality (e.g., Laaha and Blöschl, 2006, for low streamflows) or on a typology of processes (e.g., Merz and Blöschl, 2003, for floods based on rainfall types and catchment preconditions), or a temporal stratification of records is applied (e.g., Méndez et al., 2008, for wave height and Maraun et al., 2009, for heavy precipitation). For each subset extreme value analysis is performed separately, leading to process-specific return levels, such as summer and winter low flows in the case of minimum discharges. These quantities may be combined by a mixed distribution model to yield overall return levels (e.g., Hundecha et al., 2009). For further discussion of modeling dependent and non-stationary time series extremes, the reader is referred to Chavez-Demoulin and Davison (2012).

Secondly, threshold selection in the PDS/GP method is a legitimate subject for debate. In recent years, efforts have been made to overcome the problem of visual threshold selection, e.g., by robust threshold selection (Dupuis, 1999), likelihood-based visual diagnostics (Wadsworth and Tawn, 2012; Wadsworth, 2016), Bayesian approaches (Tancredi et al., 2006; Lee et al., 2014), approaches based on goodness-of-fit tests (Roth et al., 2016) and extreme value mixture models (MacDonald et al., 2011). In addition, attempts were made to develop more automated approaches for extreme value threshold estimation, including the automated threshold selection approach (ATSM) by Thompson et al. (2009), the multiple threshold method (MTM) by Deidda (2010) and the automatic threshold and run parameter selection by Fukutome et al. (2015). While these approaches are appealing from a theoretical perspective, their practical value is often reduced by numerical issues and sampling effects. At least for the time series tested in this study, both the ATSM and MTM yielded inconsistent results: threshold values of similarly distributed time series obtained by ATSM varied considerably, and parameter estimates were depending on range and resolution of the thresholds considered. One could think that automatic threshold selection procedures replace the threshold selection problem with that of selecting an appropriate range and resolution of the thresholds to be tested.

However, the authors are aware that also the semi-supervised method applied in this study may not be optimal in all cases. Rather than performing in-depth analysis of single time series, we have given priority to analyzing a large amount of time series covering a range of environmental conditions. Therefore, the application of the square-root rule in combination with graphical diagnostics is argued to be a feasible approach that led to satisfactory results in the present study.

Thirdly, the conditional performance metrics depend, to some extent, on the chosen plotting position. While the choice of plotting position formula is only of minor importance in many cases, it might be influential in the present case with emphasis on the upper-order statistics. However, a sensitivity analysis based on Beard (i.e., median) plotting positions has shown that effects on results in terms of return level estimates are small in this study, since changes mainly occur in cases where both estimation methods yield very similar parameter estimates.

Finally, it has to be noted that the conditional metrics are of limited
robustness, especially if time series are short and the condition is chosen
inappropriately. Since the variance of the order statistics strongly
increases towards the upper end of the ordered sample, the conditional
metrics may be subject to high uncertainty, particularly if inadequate (i.e.,
too high) return periods are selected. Thus, the authors want to emphasize
that an appropriate base value has to be chosen depending on the length of
the time series under consideration. For small samples, priority should be
given to robust error metrics such as

We compared statistical methods for extreme value analyses based on four climate indicators related to daily precipitation and temperature. While the indicators were selected for studying the exposure of road infrastructure to extreme weather events, the assessments are equally relevant for a range of other environmental variables including meteorological and hydrological quantities. We first analyzed the goodness of fit of distributions to extreme value series consisting of annual maxima (AMS/GEV) and threshold exceedances (PDS/GP) using two parameter estimation methods.

Results for the parameter estimation methods vary considerably between stations and approaches. For the AMS/GEV approach, LMOM yielded, on average, better fitted distributions than MLE. The goodness of fit turned out to be more balanced with respect to the PDS/GP approach, with a slight advantage of MLE. In most cases there were only minor differences between MLE and LMOM when considering return levels below 10 years, but often considerable differences for larger return periods.

Concerning extreme value selection, the relative performance of AMS/GEV and PDS/GP approaches varies between meteorological indicators. For precipitation and temperature difference the AMS/GEV data outperformed the PDS/GP approach. For temperature maxima and minima the PDS/GP approach appeared better suited.

Regarding goodness of fit for extreme events that are typically used as
design values (

We further examined the conditional performances of AMS/GEV and PDS/GP approaches with respect to the return period in more detail. From conditional performance measures and combined plots, we found systematic deviations between AMS/GEV and PDS/GP approaches. For low return periods (non-extreme events) the PDS/GP approach tends to overestimate return levels as compared to the AMS/GEV approach, whereas an opposite behavior was found for high return levels (extreme events). The assessment of extreme cases where approaches differed significantly suggests that this behavior may be related to two factors: sampling uncertainty and threshold selection.

Regarding sampling uncertainty, we found that outliers may not only attract the distribution at the tail where they occur, but they may also bend the curve at the opposite tail as a consequence of limited flexibility of the extreme value distributions. Such leverage effects can be handled by careful inspection of quantile plots. Regarding threshold selection, the analysis of extreme cases within the data set revealed that an inappropriate threshold may lead to considerable biases that may outperform the possible gain of information from including additional extreme events by far. Selecting a high threshold will determine the lower end of the extreme value distribution whereas the upper tail remains unchanged. This may introduce an inflection point in the distribution, which is against its ideal shape according to extreme value theory, resulting in poor estimates of the theoretical distribution. This effect was not visible from either the square-root criterion or the graphical diagnosis (mean residual life plot), which yielded no atypical biases for the analyzed cases. Similar effects may arise when the extreme value series contains dependent events that may stretch the empirical distribution at the part where they occur. These findings were against our expectations that the estimation of the theoretical distribution will greatly profit from the gain of information that is provided by the PDS/GP approach.

We emphasize that reliable extreme value statistics require controlling for
sample effects in order to avoid biased models. In our study, the
differences and relative merits of methods were best visible from a direct
comparison of AMS/GEV and PDS/GP approaches. We therefore recommend
performing both analyses and carefully analyzing the distribution fit relative
to the respective sample and relative to each other by means of combined
quantile plots. This will make the analyses more robust in cases where
threshold selection and dependency introduce biases to the PDS/GP approach as well as in cases where the AMS/GEV contains non-extreme events that may
introduce similar biases. For assessing the performance of extreme events we
recommend conditional performance measures such as CRMSE

All data
used in this study have been obtained from WMO measurement stations in
Austria. These stations are operated by the national meteorological and
geophysical service of Austria, the Central Institution for Meteorology and
Geodynamics (Zentralanstalt für Meteorologie und Geodynamik, ZAMG). Data
for the following measurement stations have been kindly provided by ZAMG:
Brenner, Bischofshofen, Bruckneudorf, Bruck an der Mur, Eisenstadt,
Feldkirch, Fürstenfeld, Gmunden, Graz, Haiming, Hörsching, Innsbruck
Universität, Kufstein, Langenlebarn, Langen am Arlberg,
Mönichkirchen, Pörtschach, Preitenegg, Salzburg, Sankt Michael im
Lungau, Schwechat, Semmering, Spittal an der Drau, Windischgarsten, and Zeltweg.
For access to these data, please contact

The authors declare that they have no conflict of interest.

This paper is a contribution to UNESCO's FRIEND-Water program. The authors would like to thank the Austrian Climate Research Program ACRP for financial support through the project DALF-Pro (GZ B464822). We thank the Central Institution for Meteorology and Geodynamics (ZAMG) for providing meteorological data. We thank Dan Rosbjerg and the anonymous reviewer for providing insightful comments and suggestions on an earlier draft of the manuscript. Edited by: V. Kotroni Reviewed by: D. Rosbjerg and one anonymous referee