Efficient bootstrap estimates for tail statistics

Bootstrap resamples can be used to investigate the tail of empirical distributions as well as return value estimates from the extremal behaviour of the sample. Specifically, the confidence intervals on return value estimates or bounds on in-sample tail statistics can be obtained using bootstrap techniques. However, non-parametric bootstrapping from the entire sample is expensive. It is shown here that it suffices to bootstrap from a small subset consisting of the highest entries in the sequence to make estimates that are essentially identical to bootstraps from the entire sample. Similarly, bootstrap estimates of confidence intervals of threshold return estimates are found to be well approximated by using a subset consisting of the highest entries. This has practical consequences in fields such as meteorology, oceanography and hydrology where return values are calculated from very large gridded model integrations spanning decades at high temporal resolution or from large ensembles of independent and identically distributed model fields. In such cases the computational savings are substantial.


Introduction
Bootstrap resamples of time series are commonly used to obtain non-parametric confidence intervals (CIs) on return values (Naess and Clausen, 2001;Naess and Hungnes, 2002) and to investigate the behaviour of the tail of the empirical distribution (Coles, 2001;Beirlant et al., 2006;Qi, 2008).Although non-parametric CIs tend to be too narrow, see Kyselý (2008), the procedure itself is algorithmically and numerically straightforward to implement and is thus a convenient technique for rapidly assessing the width of CIs without having to assume a certain parametric distribution.However, this approach quickly becomes cumbersome for large data sets as it demands random draws from the entire sample which subsequently must be sorted to get to the upper percentiles.When handling long model integrations in meteorology, hydrology and oceanography with spatially gridded fields of typically 10 6 grid points this brute-force approach becomes impractical.Such quantities are regularly encountered when estimating return levels from atmospheric reanalyses (Kalnay et al., 1996;Saha et al., 2010;Compo et al., 2011;Dee et al., 2011;Poli et al., 2016), wave hindcasts (Swail and Cox, 2000;Caires and Sterl, 2005;Gaslikova and Weisse, 2006;Breivik et al., 2009;Reistad et al., 2011;Aarnes et al., 2012) and long climate integrations that cover decades or even centuries (Hersbach et al., 2015).When even larger data sets are used, such as the ensembles of seasonal integrations (Stockdale et al., 2011;Molteni et al., 2011), as was done by Van den Brink et al. (2005) on a sample amounting to nearly 1000 years, finding ways to reduce the size of the samples becomes essential.That is the subject of this paper.
We will present a simple argument for why it is sufficient to retain only a small subset K 0 consisting of the highest entries in a sample when estimating tail statistics such as return levels and their associated CIs by means of non-parametric bootstrapping.These highest entries will normally only represent a small fraction of the total sample.This reduces the need for sorting and storage by several orders of magnitude.The method also reduces the task of sorting the original sample as only the K 0 highest entries are kept.
This paper is organized as follows.Section 2 presents the binomial argument for why we can bootstrap from a small subset consisting of the highest entries in the original sample.Section 3 presents three examples of bootstrapped CIs of various tail statistics for a data set of significant wave height from the central North Sea.Here we also show how Published by Copernicus Publications on behalf of the European Geosciences Union.
the method laid out in Sect. 2 can be used in practice to determine how many entries must be kept in order to perform an unbiased bootstrap.Section 4 summarizes the results and presents the conclusions.
2 Bootstrapping from the K 0 highest entries in a sample Consider the sample D 0 of independent and identically distributed (iid Diaconis and Efron, 1983;Efron and Gong, 1983).This method can be used to compute the CIs around extreme value estimates (Breivik et al., 2013(Breivik et al., , 2014)).The procedure is computationally intensive and memory consuming, as it involves bootstrapping and storing M × N numbers and performing M sorts, each a process of O(N log 2 N ) operations (Press et al., 2007, pp. 423-427).Since we are only interested in combinations of the k highest entries in the resamples D 1 , D 2 , . .., D M , we will explore the possibility of instead resampling from only the highest This will be referred to as the resample threshold and is sometimes more conveniently written as the percentage of data left out, P 0 = 100(1 − K 0 /N ).
The probability of drawing one of the highest K 0 entries in D 0 is a binomial problem with probability p = K 0 /N .The probability of making exactly k draws (with replacement) from the highest K 0 in N draws is thus given by the binomial probability mass function (Zwillinger, 1996, p. 581) where X is a random variable representing the number of draws.The probability of drawing fewer than the required k entries from the highest K 0 is given by the binomial cumulative distribution function A full bootstrap resample D i of length N from D 0 will contain K i entries from the highest K 0 , and K i ∼ Binom(N, p) where E[K i ] = K 0 since the expected (mean) value of the binomial distribution (Eq. 1) is The variance is Denote a short bootstrap resample from the K 0 highest entries in D 0 as D i .Two conditions must be met for D i to be an unbiased substitute for D i : 1.The number K 0 must be set large enough that the probability that we miss entries smaller than X N,N −K 0 +1 in D 0 is below a chosen threshold p c .
2. The length K i of D i must have the same mean and variance as K i (Eqs.3-4).
To fulfil condition (1) it is sufficient to decide on an acceptable level for p c .This probability can be found by consulting Eq. ( 2).It is important to note that choosing K 0 too small will bias the statistic θ = f ( D i ) since it will be estimated from bootstrap samples that miss entries smaller than X N,N−K 0 +1 .We will for this reason refer to p c as the probability of contamination as it gives the probability that the bootstrap estimate is biased because we have kept too few entries from the original sample D 0 .A very conservative bound on p, and thus on K 0 = Np, can be found quickly by consulting Hoeffding's formula (Hoeffding, 1963), valid when k ≤ Np.A useful quantity is the ratio r = K 0 /k of upper entries retained (K 0 ) and the minimum number k required to form a bootstrap estimate of the statistic in question for a given probability of contamination p c .This can be found from Eq. ( 2), but when N is large the Poisson distribution is a good approximation and more practical to work with, Figure 1 shows the minimum acceptable ratio K 0 /k as a function of k for levels of p c ranging from 10 −5 to 0.05.The probabilities can be computed from Eq. ( 2) (or more conveniently from Eq. 6).As can be seen, for all values of k, the ratio is comfortably below 15, and for values of k larger than 10 a ratio of 3 is sufficient even for a confidence level of 10 −5 .See the Appendix for a more detailed explanation of the ratio curves used throughout.
Condition (2) can be handled by randomly perturbing the size of the resamples, K i , such that it mimics the number of draws, K i ∼ Binom(N, p), that would have been made from the upper K 0 entries of D 0 in a full bootstrap D i .In practice, as we shall see, the statistics are quite insensitive to these perturbations as long as K 0 has been chosen sufficiently large.

Bootstrapping confidence intervals
Here we present worked examples of how the two conditions presented above can reduce the problem of estimating CIs on Ekofisk, all +240-h ensemble forecasts tail statistics for a data set of independent ensemble forecasts at long lead time (N = 330 000).We use archived ensemble forecasts (Molteni et al., 1996) of significant wave height in the central North Sea (near the Ekofisk oil field at 56.5 • N, 003.2 • E; a histogram of the data set used is shown in Fig. 2) at a forecast lead time of 240 h.One-hundred-year return values from these ensembles have previously been reported by Breivik et al. (2013Breivik et al. ( , 2014)).

Example 1: confidence intervals on in-sample return estimates
Consider as an example the problem of how to calculate insample return estimates from the sample of independent forecasts presented above.These forecasts can be considered iid (as they are not from correlated time series).An in-sample return estimate is calculated directly from the tail of the empirical distribution rather than by applying extreme value analysis.As explained by Breivik et al. (2013) the independent forecasts presented in Fig. 2 add up to the equivalent of 229 years under the assumption that each forecast represents a time interval t = 6 h.A 100-year return estimate is then a linear interpolation between X N,N −1 and X N,N−2 (the second and third highest entries in D 0 ), Now, clearly k = 3 since we need the second and third highest entries in our resamples to form a return estimate.Let us now tentatively keep the K 0 = 1000 highest entries and bootstrap from these instead of from the entire sequence to compute the CIs on the linear combination of the second and third highest entries given by Eq. ( 7).The size K i of the resamples, D i , is drawn from the binomial distribution (Eqs.3-4) with µ = K 0 and σ 2 ≈ K 0 .What is the probability p c that one of the three highest entries in a bootstrapped sequence should not have come from the 1000 highest entries that we have retained (i.e. should depend on entries contained in the bulk of the sample that we discarded)?It is clear that the probability of drawing one of the highest 1000 entries is p = 1000/330 000, and from Eq. ( 2) we find that the probability of picking too few (< 3) entries from the K 0 highest is which is indistinguishable from zero to double precision.Reducing the number K 0 to 10 (r ≈ 3) raises the probability of contaminating the resamples by entries from the lower N − K 0 to 0.002.This can also be confirmed by consulting Fig. 1 for the combination k = 3, r = 3.For M = 1000 resamples we may thus expect on average two resamples to be contaminated by values from the lower N − K 0 values in the original sequence.A very safe compromise in this case is K 0 = 100 (r ≈ 33).Consulting Fig. 1 shows that for k = 3, r = 33 we are well below a probability of contamination of 10 −5 .The quantile-quantile (QQ) plot in Fig. 3 shows that resampled return estimates of significant wave height from the full sample D 0 (see Fig. 2) have practically the same distribution as resamples from the upper K 0 = 100 entries.Condition (2) given above states that the size of the reduced resamples D 1 , D 2 , . .., D M should be randomly perturbed around the mean value K 0 .In practice this condition turns out to be rather insignificant as long as K 0 is chosen sufficiently large.This is demonstrated in the QQ plot in Fig. 4 where we see that perturbed-length estimates (abscissa) closely match the distribution of fixed-length estimates (ordinate).However, choosing K 0 too small will bias the statistic in question.This is illustrated in Fig. 5 where we see that bootstrap estimates from too-short subsets of the original sample (K 0 chosen too small) are biased high.As K 0 approaches 30 (r = 10), the mean and standard devia-tion of the return estimates approach their asymptotic values.These findings are in accordance with what we find by consulting Fig. 1 where we see that k = 3, r = 10 has a probability of contamination p c less than 10 −5 .It is also of interest to investigate just how many bootstrap resamples are actually needed to obtain CIs from a non-parametric bootstrap technique.In Fig. 5 we chose M = 10 000.As Fig. 6 shows, this is clearly excessive for reasonable thresholds K 0 .
In fact, Efron and Tibshirani (1993) state that 200 resamples are normally enough.We find this to be on the low side in our case, as Fig. 6 shows.However, 1000 resamples is sufficient in this case, but this should be investigated in each case.Breivik et al. (2014) found (see their Supplementary Fig. 7) that for a similar data set, 500 resamples would be sufficient when employing a generalized Pareto distribution (GPD) on threshold exceedances.

Example 2: confidence intervals on upper percentiles
A similar problem to the estimation of CIs for in-sample return values is how to obtain the CI for the highest percentiles, e.g. the 99th percentile (P 99 ).The upper percentile is frequently used when investigating trends in for example the wind and wave height climate (see e.g.Wang andSwail (2001, 2002)).In order to construct a bootstrap estimate of P 99 brute force it is necessary to resample the entire sample D 0 and sort the bootstrap to get to the N/100th highest entry.However, Fig. 1 tells us that when k = N/100 is large (as it will be when N is large), we can with extremely high certainty say that keeping the K 0 = 2k highest entries is enough to perform a bootstrap resample exercise for the CI on P 99 .In fact, K 0 = 1.2k is sufficient for all significance levels plotted in Fig. 1.This means that in order to obtain a CI for P 99 we need only find the entry X N,N−k that corresponds to P 99 from the original sample D 0 and retain entries higher than X N,N−1.2k .Figure 7 shows how the ratio r decreases as the sample size N increases.It is clear that for all probabilities of contamination investigated, a ratio of K 0 /k = 2 is sufficient when N is larger than 2000.Obviously, samples smaller than O(10 3 ) do not pose computationally demanding problems anyway and are of no interest to us in this context.Figure 8 illustrates for a fixed probability of contamination p c = 0.01 that even as we go to higher percentiles (the uppermost curve shows P 99.9 ), a ratio K 0 /k = 2 is sufficient as the sample size N exceeds 10 4 (see the Appendix for more details on the ratio curves).

Example 3: confidence intervals on return estimates from threshold exceedances
Consider now the problem of estimating CIs for return estimates from threshold exceedances from a data set of independent forecasts.This differs from a peaks-over-threshold approach which is how correlated time series must be han- dled to estimate return levels (Coles, 2001).GPD gives the relevant extreme value distribution for independent exceedances above a threshold u (Coles, 2001, 75-77), Here y = X i − u, y > 0 are exceedances above a threshold u = X N,N−k+1 (remember that X N,N−k+1 is the kth highest entry in the sample D 0 ) and σ is a scale parameter which is a function of the threshold u, and ξ is the shape parameter.Here, the minimum number of entries required is simply the upper 1 % (P 99 ), so k = N/100.Various levels of probability of contamination p c are shown, and for sample sizes larger than approximately 2000, a ratio r = 2 is sufficient.The curve representing 1 % probability of contamination is marked in red (with diamonds) as it represents a reasonable confidence level.
A brute-force approach would be to make N draws from D 0 (with replacement) and repeat this procedure M times.Then, GPD return estimates would be computed for each of the resulting bootstrap sequences D 1 , D 2 , . .., D M .Say we want to try to instead retain only the K 0 entries exceeding a threshold U 0 , where U 0 < u, corresponding to the entry X N,N −K 0 +1 in the original sample D 0 .From these we need to draw at least k entries, from which we will make return estimates.The question is again how many entries (K 0 ) must be kept to arrive at an acceptably low probability p c that the statistic should really be based on entries below the threshold U 0 .
This problem arises when estimating GPD return values from the independent ensemble forecasts (Fig. 2).For such a sample all exceedances above a given threshold can be used to form GPD return value estimates (Eq.9).CIs on the return values can likewise be obtained by bootstrapping from all entries exceeding this threshold.For a large sample this is orders of magnitude faster than bootstrapping from the entire sample.Assume again that we have kept all forecasts exceeding P 99.1 , i.e. the K 0 = 3000 highest entries (see Fig. 2).To form a return estimate we assume that we need at least k = 1000 entries, corresponding to P 99.7 .The probability of drawing (with replacement) k or fewer entries from the highest K 0 in N draws can again be found from Eq. ( 2) and is indistinguishable from zero to double precision with the given choice of N, K 0 and k.This is easy to verify by consulting Fig. 1 where we see that for k = 1000, r = 3 we are well above the 10 −5 level.Figure 9 shows that the confiwww.nat-hazards-earth-syst-sci.net/17/357/2017/Nat.Hazards Earth Syst.Sci., 17, 357-366, 2017 dence interval and the mean return value based on M = 1000 bootstrap resamples for various choices of resample threshold 100(1 − K 0 /N ) (i.e. the percentage of data omitted) are practically identical to the CIs based on D 0 (marked as asterisks).Only when r = K 0 /k comes close to unity do we experience fluctuations and biases (i.e. the resample threshold nearly coincides with the number of tail entries required to form a return estimate, in this case the threshold P 99.7 ).

Conclusions
CIs and other statistics of the extremes and the tail of empirical distributions are commonly found using non-parametric bootstrap techniques.Here we have shown that it is unnecessary to bootstrap from the entire original sample.The actual number K 0 highest entries that must be kept to make unbiased bootstrap estimates for the tail of an empirical distribution depends on K 0 = Np as well as on the number k highest entries that are required for the statistic in question.
The examples in the previous sections calculated p c given a predetermined number K 0 of tail entries that have been kept.This is a realistic approach as in practice we often retain a larger part of the tail of an empirical distribution than what is strictly needed since the same data set is used to compute other statistics.It is then sufficient to consult Eq. ( 2) to determine whether K 0 is sufficiently large.A quick estimate of the probability of contamination can be made by consulting Fig. 1.The advantages of restricting resamples to a small subset K 0 consisting of the highest entries in D 0 can be summarized as follows.First, only the upper K 0 entries need be kept and sorted in the original data set.This offers substantial savings in cases like those described by Breivik et al. (2013Breivik et al. ( , 2014) ) where a very large number of forecasts (> 300 000) are handled, each consisting of more than 60 000 grid points in space.Second, the size of the resamples is also reduced from N to an average size K 0 , where K 0 is usually a very small fraction of N , typically less than 1 %.Third, this reduction in resample size also means that the cost of sorting the resamples to get to the highest entries is greatly reduced, as the problem is now linear in the number of bootstrap resamples M since each sort is O(K 0 log 2 K 0 ), which is now a constant number independent of the size of the original sample (or a small fraction of it, as in the 99th percentile shown in Example 2).
We have investigated the conditions that must be met to form a non-parametric bootstrap for tail statistics such as return levels (which depend on all three parameters of the generalized extreme value distribution or the GPD) from a small subset of the highest entries in the original sample.As mentioned in the Introduction, an important question is whether non-parametric bootstraps yield CIs with sufficient coverage, i.e.CIs that are wide enough.This has been extensively studied by Kyselý (2008) who found that non-parametric bootstraps in particular, but also parametric bootstraps, tend to have too low coverage.This problem is not addressed by our study, and it is clear that alternative methods are often called for.In particular, the test inversion bootstrap (Carpenter, 1999) is a promising method where the test inversion refers to the duality between hypothesis testing and confidence intervals.Schendel andThongwichian (2015, 2017) show how this method, originally developed for estimation of statistics of single parameters in the presence of nuisance parameters, can be extended to handle return levels which depend on three parameters for both the generalized extreme value distribution and GPD by utilizing a maximum likelihood technique.However, non-parametric bootstraps represent a quick and hypothesis-free approach to obtaining CIs, and as the results presented show we can comfortably assume that the results will remain unchanged if we select a small subset of the original sample, provided we follow the procedure outlined in Sect. 2.
Data availability.The data sets presented in this study are archived in the MARS database of the European Centre for Medium-Range Weather Forecasts (ECMWF); see http://apps.ecmwf.int/archive-catalogue/?class=od.

Appendix A: Consulting the ratio curves
The ratio curves presented in Figs 1, 7 and 8 are convenient for quickly establishing how many entries (K 0 ) must be kept in order to form an unbiased resample that depends on the highest k entries.The relationship between Fig. 1 and Fig. 7 can be illustrated as follows.If we assume N large we can use Fig. 1.In practice we can choose N = 2 × 10 3 without violating the assumption that N is large.Now assume that the statistic in question is the 99th percentile, i.e. k = N/100 = 20.Let us choose a probability of contamination p c = 0.01 (this corresponds to the red curve marked with diamonds in Fig. 1).We find the ratio to be 1.6, i.e. we will need to keep 60 % more entries than the entry corresponding to P 99 .The corresponding curve in Fig. 7 is also marked in red.Here, the location on the x axis to read off is N = 2×10 3 , which lies on the y axis, and the ratio is again found to be 1.6.A more realistic example in terms of sample size would be N = 10 5 (and k = N/100 = 10 3 ).Now we find from either Fig. 1 or Fig. 7 that with a probability of contamination p c = 0.01 that the ratio is 1.13, i.e. we need only keep 13 % more entries than the one representing the 99th percentile.Figs. 1, 7 and 8 clearly illustrate that in almost all cases it is sufficient to retain at most twice as many entries K 0 from the tail of the sample distribution D 0 than what is required (k) for the statistic in question.

Figure 1 .
Figure1.The ratio K 0 /k as a function of k, the minimum number of bootstrapped entries needed for the statistic in question, for levels of probability of contamination ranging from 10 −5 (uppermost curve) to 0.05 (lowermost curve).The curve representing 1 % probability of contamination is marked in red (with diamonds) as it is a reasonable confidence level.

Figure 2 .
Figure 2. Histogram of the significant wave height from archived ensemble forecasts in the central North Sea (Ekofisk, 56.5 • N, 003.2 • E) at +240 h lead time.Entries above P 99.1 , corresponding to threshold U 0 , are coloured red whilst entries exceeding P 99.7 , corresponding to the upper threshold, u, are in black.The highest entries are individually marked with asterisks.

Figure 3 .Figure 4 .
Figure3.A quantile-quantile comparison of 10 000 bootstrapped direct 100-year return estimates of significant wave height taken from a forecast ensemble(Breivik et al., 2013) versus a bootstrap from the upper 100 entries in the sample.The 45 • line is shown in red.

Figure 5 .Figure 6 .
Figure5.Mean and standard deviation on 100-year in-sample return estimates based on M = 10 000 bootstrap resamples for various choices of resample threshold K 0 for the sample in Fig.2.A minimum of k = 3 entries are required to form the return estimate (see Eq. 7).For choices of K 0 smaller than 30 (corresponding to a ratio r = K 0 /k = 10) the bootstrap resamples are biased high.

Figure 7 .
Figure7.Bootstrapping the 99th percentile, P 99 .The ratio r = K 0 /k is shown as a function of sample size N .Here, the minimum number of entries required is simply the upper 1 % (P 99 ), so k = N/100.Various levels of probability of contamination p c are shown, and for sample sizes larger than approximately 2000, a ratio r = 2 is sufficient.The curve representing 1 % probability of contamination is marked in red (with diamonds) as it represents a reasonable confidence level.

Figure 8 .
Figure8.Bootstrapping the upper percentiles P = P 90 , P 95 , P 97 , P 99 and P 99.9 .The ratio r = K 0 /k is shown as a function of sample size N. Here, the minimum number of entries required is k = (1 − P )N/100.The probability of contamination is kept fixed at p c = 0.01.At sample sizes larger than approximately 10 4 , a ratio r = 2 is sufficient for all percentiles investigated.The curve representing the 99th percentile is marked in red (with diamonds) and corresponds to the red curve in Fig.7.

Figure 9 .
Figure9.The upper and lower 95 % CIs and the mean 100-year return estimates (dashed) based on M = 1000 bootstrap resamples for various choices of resample threshold K 0 for the sample in Fig.2.Upper panel: a GPD with shape parameter ξ = 0 (exponential distribution).Lower panel: a GPD with freely varying shape parameter.Individual bootstrap estimates are marked in grey.Estimates based on the full sample D 0 are marked as asterisks on the ordinate.