Extreme waves influence coastal engineering activities and have immense geophysical implications. Their study, observation and prediction are therefore decisive for planning mitigation measures against natural coastal hazards, for ship routing and for the design of coastal and offshore structures. In this study, the estimates of design wave heights associated with return periods of 30 and 100 years are dealt with in detail. The design wave height is estimated with four different models in order to identify a general and reliable one. Different locations are considered to perform the analysis: four sites in Indian waters (two each in the Bay of Bengal and the Arabian Sea), one in the Mediterranean Sea and two in North America (one each in the North Pacific Ocean and the Gulf of Maine). For the Indian water domain, European Centre for Medium-Range Weather Forecasts (ECMWF) global atmospheric reanalysis ERA-Interim wave hindcast data covering a period of 36 years have been utilized. For the locations in the Mediterranean Sea and North America, both ERA-Interim wave hindcast and buoy data are considered. The reasons for the variation in return value estimates between the ERA-Interim data and the buoy data using different estimation models are assessed in detail.

The Indian Ocean, with its two arms of the Arabian Sea and the Bay of Bengal, has been playing a significant role in regional economic development. This rapid progress is attributed to a variety of activities in the coastal and offshore sectors that include construction and development of major ports and fishing harbours, establishment of power plants, offshore exploration and exploitation of oil and gas, and tapping of ocean wave and tidal energy. To sustain these developments along the coast, the aforementioned activities require a variety of coastal and offshore structures such as groins, sea walls, breakwaters, offshore platforms, intake and outfall structures, submarine pipelines, etc. to be constructed in the marine environment. It is hence mandatory to design these structures for their life span, which can be achieved by considering their survival conditions. The most dominant environmental forces that dictate the design of a structure are due to the maximum probable wave height at the site of interest (Massel, 1978).

Depending on the importance and lifespan of the structure, the return period of the extreme events could be selected as 30 years or 100 years. The shorter return period is associated with a lower design wave height but a higher risk of exceedance, and vice versa. This demands a better understanding of the hydrodynamic characteristics of the local wave environment, especially the extreme conditions. In the design of any marine structure, the first step is the extreme wave analysis for the determination of design wave heights with certain return periods (Goda, 2000). The design values adopted determine the level of protection and the scale of investment during the construction of the structure.
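The trade-off between return period and risk can be illustrated with the standard encounter probability, i.e. the chance that the T-year event is exceeded at least once during the service life of the structure. The sketch below assumes independent years, each with exceedance probability 1/T; the function name and the 30-year service life are illustrative choices, not values taken from this study.

```python
def encounter_probability(return_period_years, design_life_years):
    """Probability that the T-year event is exceeded at least once in L years.

    Assumes statistically independent years, each with exceedance
    probability 1/T (an assumption of this sketch).
    """
    if return_period_years <= 1.0:
        raise ValueError("return period must be longer than 1 year")
    annual_non_exceedance = 1.0 - 1.0 / return_period_years
    return 1.0 - annual_non_exceedance ** design_life_years


# Illustrative comparison for a hypothetical 30-year service life.
for period in (30, 100):
    risk = encounter_probability(period, 30)
    print(f"T = {period:3d} years: encounter probability = {risk:.2f}")
```

Under these assumptions, designing for the 30-year event carries a markedly higher chance of exceedance over the same service life than designing for the 100-year event.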

Fundamentally, extreme values are scarce and lie outside the range of the available observations, so an extrapolation from the observed sea states to unobserved levels is required. An estimate of the anticipated wave height can be obtained from historical wave hindcast data or field observations with the help of various distribution models, which enable extrapolation within the framework of extreme value theory (Goda, 2000; Coles, 2001; Caires, 2011). Ferreira and Guedes Soares (2000) suggested that the estimation of extreme values should rely on methods based on extreme value theory, which makes use of the largest observations in the sample. Coles (2001) presented detailed statistical results of extreme value prediction using annual maximum (AM) (Castillo, 1988) and peaks over threshold (POT) (Ferreira and Guedes Soares, 1998) sampled observations. Caires (2011) rigorously compared the commonly used extreme value statistical methods (such as the generalized extreme value, or GEV, distribution and the generalized Pareto distribution, or GPD) with different parameter estimation methods for combinations of different data sampling techniques.

Another approach that may be applied starting from a wave data time series
is that of equivalent storm models (Boccotti, 1986, 2000; Fedele and Arena,
2010; Laface and Arena, 2016), which is based on the concept of sea storm.
Specifically, these models consist of substituting the sequence of sea
storms at a given site (actual sea) with a sequence of equivalent storms
(equivalent sea) from a statistical perspective. The equivalent storms have
very simple geometric shapes such as triangular (Boccotti, 1986, 2000; Arena
and Pavone, 2009), power (Fedele and Arena, 2010; Arena et al., 2014) or
exponential (Laface and Arena, 2016). Depending on the shape the related
model gives an analytical or numerical solution for the calculation of the
return period

The accuracy of any methodology for extreme values depends significantly on the length of the recorded time series. Measurements from wave rider buoys are believed to offer the most reliable long historical records. However, the availability of such buoy data is limited to certain specific locations, mainly in the Northern Hemisphere. At a particular location of interest, buoy data are usually scarce and often entirely absent. The oceanographic community has therefore accepted hindcasts from ocean wave models as a complement to the limited buoy observational records.

In recent years, the performance of wave models has improved appreciably, owing to better quality of the wind fields and advances in numerical wave modelling. Meteorological centres such as the European Centre for Medium-Range Weather Forecasts (ECMWF), the Australian Bureau of Meteorology and Meteo France that operate global wave models currently use altimeter wind data for data assimilation purposes. The process combines a numerical wave model and observations of diverse sorts in the best possible way to generate a consistent, global estimate of the various atmospheric, wave and oceanographic parameters. At present, wind and wave simulated data are assimilated on a daily basis in numerous meteorological centres.

The simulated hindcast data have been adopted in numerous studies for the estimation of extreme wave conditions. Teena et al. (2012) applied a GEV distribution and a GPD to 31 years of assimilated wave hindcast data based on MIKE-21, a spectral wave model, for a location in the eastern Arabian Sea and extracted extreme wave heights for several return periods. Li et al. (2016) used a third-generation wave model, WAMC4, and simulated 35 years of wave hindcast data from two sets of reanalysis wind data, NCEP and ECMWF. In their study, the Pearson type III distribution was used to analyse the extreme wave climate in the East China Sea. Polnikov and Gomorev (2015) proposed to use the extrapolation of a polynomial approximation constructed for the shorter part of the tail of the probability function to estimate the return values of wind speed and wind-wave height. The wave field was computed with the wave model WAM-C4M from ECMWF global atmospheric reanalysis ERA-Interim wind field data.

Even though several studies have been carried out, a study identifying the most suitable approach for estimating extreme wave heights from a particular source of assimilated wave hindcast data is still lacking. In the present study, different existing approaches and models are investigated to assess their applicability and reliability for the Indian domain. Increased uncertainty in the model outputs calls the reliability of the estimation model into question, which is an important issue. Thus, the present study introduces a statistical approach to validate the reliability of the design wave height return values resulting from a particular extreme wave estimation method by considering a variability criterion based on the measured maximum value. The variation in the extreme value estimates between the ERA-Interim data and the buoy data for different estimation models is also examined. The objective of the present study is to identify a robust extreme wave height estimation method for the Indian domain using global atmospheric reanalysis ERA-Interim wave hindcast data.

Four offshore locations along the Indian coast (Fig. 1) are considered, two each on the east and west coasts of the Indian peninsula. These locations are selected on the basis of their distance from the nearest coast and the water depth. Both deep and shallow water locations are chosen to examine how the applicability of the estimation model depends on water depth.

ERA-Interim data locations and buoy stations.

Selected locations for ERA-Interim data along the Indian Coast.

The projected estimates using ERA-Interim data are compared with those
obtained from various buoy datasets to validate the performance of
ERA-Interim data in extreme wave analysis. The choice of the locations was
made according to the amount of wave data that were available. Two
locations in North America, National Data Buoy Center Station 44005 in the Gulf
of Maine and National Data Buoy Center Station 46050 west of Newport, and one of
the most energetic sites on the coasts of the central Mediterranean Sea (Liberti
et al., 2013; Vicinanza et al., 2013; Arena et al., 2015), Alghero (west coast
of Sardinia Island) from the Italian buoy network, are
considered. A comprehensive comparison has been carried out by extracting
the ERA-Interim data of resolution 0.125

ERA-Interim, produced by the ECMWF, is a global atmospheric
reanalysis from 1979, continuously updated in real time and among the
most recent reanalysis data available (Berrisford et al., 2009).
ERA-Interim is the first to perform reanalysis using adaptive and fully
automated bias corrections of observations (Dee and Uppala, 2008). The
parameters such as significant wave height (

There have been several studies comparing the values of

The most reliable data for significant wave height are from buoy measurements. The available length of buoy data is usually limited and the data prior to 1978 are scant. The available buoy data further require significant quality control on account of large gaps of missing data and flagged outlier measurements. In this paper, data from two different buoy networks are processed: the Italian network RON (Rete Ondametrica Nazionale) and the US National Oceanic and Atmospheric Administration's National Data Buoy Center (NOAA-NDBC).

The Italian buoy network (RON) started measurements in 1989, with eight directional buoys located off the coasts of Italy. Later it reached 15 buoys moored in deep water. For each record, the significant wave height, peak and mean period and dominant direction are given.

The NOAA manages the NDBC, which consists of many buoys moored along the US coasts, both in the Pacific Ocean and in the Atlantic Ocean. Some buoys were moored in the late 1970s so that more than 35 years of data are available. The historical wave data give hourly significant wave height, peak and mean period. The NOAA buoy observations are readily available and are of proven quality. The measurements have passed through quality control by NOAA. It is, however, always recommended to perform some basic quality checks.
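The basic quality checks recommended above could, for instance, take the following form. This is only a minimal sketch: the sentinel values used for missing data (99.0 and 999.0) and the upper plausibility bound of 30 m are assumptions made for illustration and are not the checks applied in this study.

```python
import math

# Placeholder values assumed to mark missing records (assumption of this sketch).
MISSING_SENTINELS = (99.0, 999.0)


def basic_wave_qc(hs_values, max_plausible_hs=30.0):
    """Return a copy of a wave height series with suspect values set to NaN.

    A value is rejected if it is missing, equals a missing-data sentinel,
    is negative, or exceeds the (hypothetical) plausibility bound in metres.
    """
    cleaned = []
    for hs in hs_values:
        bad = (
            hs is None
            or math.isnan(hs)
            or hs in MISSING_SENTINELS
            or hs < 0.0
            or hs > max_plausible_hs
        )
        cleaned.append(math.nan if bad else float(hs))
    return cleaned


print(basic_wave_qc([1.2, 99.0, -0.5, 3.4, None]))
```

Keeping rejected records as NaN rather than deleting them preserves the time axis, which matters later when storms are separated by a minimum time interval.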

The return value estimates acquired from the ERA-Interim data are compared with those from NDBC stations 44005 and 46050 and from Alghero on the coast of the central Mediterranean Sea. Table 1 provides the coordinates and data details of these buoy stations. ERA-Interim wave hindcast data have been used to assess the estimates in Indian waters.

The estimation models used in this study to obtain extreme wave return
values include the GEV and the GPD, which are standard
practice in mainstream extreme value statistics. Each distribution was
fitted to the data using the maximum likelihood estimate (MLE) method and the
probability weighted moments (PWM) method. Further, a new polynomial approximation (P-app) model
prescribed by Polnikov and Gomorev (2015) and ETS model (Boccotti, 2000) based on the concept of replacing
the sequence of actual storms extrapolated from a given time series of

According to extreme value theory, to form a valid distribution the sampled
observations should be independent which would mean that successive
observations should not be correlated with one another and should be
identically distributed (Goda, 2000). In general, for the sampling of data to
be used for extreme wave analysis, three different approaches are available.
The first approach uses all the recorded data of

In accordance with the theory of the GEV distribution, the sample is selected by means of the AM method.
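As a minimal sketch of AM sampling, the function below reduces a time series of significant wave height to one maximum per calendar year. The input layout (pairs of timestamp and wave height) and the function name are assumptions made for illustration; years with long data gaps would need separate treatment.

```python
from collections import defaultdict
from datetime import datetime


def annual_maxima(records):
    """Return {year: maximum Hs} from an iterable of (datetime, hs) pairs.

    NaN values are skipped, so a year containing only missing data
    does not appear in the result.
    """
    maxima = defaultdict(lambda: float("-inf"))
    for timestamp, hs in records:
        if hs != hs:  # NaN is the only value not equal to itself
            continue
        maxima[timestamp.year] = max(maxima[timestamp.year], hs)
    return dict(sorted(maxima.items()))


# Hypothetical records, for illustration only.
example = [
    (datetime(1979, 1, 1), 2.0),
    (datetime(1979, 6, 1), 3.5),
    (datetime(1980, 2, 1), 1.0),
]
print(annual_maxima(example))
```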

The GEV distribution for a given random variable

The

This approach is based on fitting the GPD to the POT sampled data. Only the peak observations of the clusters above the threshold are considered, and the return values are calculated by taking into account the rate of occurrence of clusters (Davison and Smith, 1990; Coles, 2001).
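The role of the cluster occurrence rate can be sketched with the usual form of the POT return value (e.g. Coles, 2001): for a threshold u, scale and shape parameters of the GPD and an average of λ cluster peaks per year, the N-year level is the value exceeded on average once in λN peaks. The function name and the numbers in the example are hypothetical.

```python
import math


def gpd_return_value(threshold, scale, shape, peaks_per_year, return_period_years):
    """N-year return value for a GPD fitted to POT cluster maxima.

    Uses x_N = u + (scale / shape) * ((rate * N) ** shape - 1), with the
    logarithmic limit u + scale * ln(rate * N) when the shape is zero.
    """
    expected_peaks = peaks_per_year * return_period_years
    if expected_peaks <= 1.0:
        raise ValueError("rate * return period must exceed 1")
    if abs(shape) < 1e-9:
        return threshold + scale * math.log(expected_peaks)
    return threshold + (scale / shape) * (expected_peaks ** shape - 1.0)


# Hypothetical parameters: 2 m threshold, 5 storms per year on average.
print(round(gpd_return_value(2.0, 1.0, -0.1, 5.0, 100.0), 2))
```

A negative shape parameter implies a bounded tail, so the return value grows more slowly with the return period than in the exponential (zero shape) case.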

The cumulative distribution function of the GPD is given as

The

There are several parameter estimation methods for fitting the above candidate distribution functions to the sampled wave data (Goda, 2000). The method of moments (MM), the PWM method and the MLE are the preferred estimation methods since they are more flexible, particularly when the number of parameters is increased. The MM yields a large bias, particularly for small samples, and was therefore not used in the present study. The parameters of the above distributions are derived with the MLE and PWM methods.
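For illustration, a PWM fit of the GEV can be written compactly with the closed-form estimators of Hosking et al. (1985). The sketch below is not the code used in this study; the function name is hypothetical, and the shape parameter is returned with the sign convention in which a positive value corresponds to a heavy (Fréchet-type) tail, i.e. the negative of Hosking's k.

```python
import math


def gev_pwm_fit(sample):
    """Estimate GEV (location, scale, shape) by probability weighted moments.

    Uses the unbiased sample PWMs b0, b1, b2 and the approximation
    k = 7.8590 c + 2.9554 c**2 of Hosking et al. (1985).
    """
    x = sorted(sample)
    n = len(x)
    if n < 3:
        raise ValueError("need at least three observations")
    # Unbiased sample PWMs; j is the 0-based rank of the ordered sample.
    b0 = sum(x) / n
    b1 = sum(j * x[j] for j in range(n)) / (n * (n - 1))
    b2 = sum(j * (j - 1) * x[j] for j in range(n)) / (n * (n - 1) * (n - 2))
    c = (2 * b1 - b0) / (3 * b2 - b0) - math.log(2) / math.log(3)
    k = 7.8590 * c + 2.9554 * c * c  # Hosking's shape parameter
    if abs(k) < 1e-8:
        # Gumbel limit (Euler-Mascheroni constant in the location formula).
        scale = (2 * b1 - b0) / math.log(2)
        location = b0 - 0.5772156649 * scale
    else:
        g = math.gamma(1 + k)
        scale = (2 * b1 - b0) * k / (g * (1 - 2 ** (-k)))
        location = b0 + scale * (g - 1) / k
    return location, scale, -k
```

A simple sanity check is to feed the function evenly spaced quantiles of a GEV with known parameters and confirm that the estimates come back close to the true values.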

The threshold selection in GPD analysis is an important practical problem, which is analogous to choosing the block size in the block maxima approach. The threshold value represents a compromise between bias and variance. Too low a threshold violates the asymptotic basis of the GPD model, leading to bias. Too high a threshold generates too few excesses with which to estimate the model, leading to high variance. Attempts to choose an optimal threshold are discussed by, e.g., Neelamani (2009) and Caires (2011). In this study, the threshold selection is based on the mean residual life plots introduced by Davison and Smith (1990).

The mean residual life plot is based on the theoretical mean of the GPD given as

A mean residual life plot consists in representing points:

Polnikov and Gomorev (2015) proposed to use the extrapolation of polynomial approximation constructed for the shorter part of the tail of probability function to estimate the return values of wind speed and wave height.

This method involves the construction of an analytical approximation

The statistical distribution with the provision function is of the form

Another principal feature of polynomial approximation

The ETS model (Boccotti, 2000; Arena and
Pavone, 2006, 2009) is applied for calculating return values of significant
wave height for given thresholds of return period. The ETS approach is based
on the assumption that given a sequence of actual storms it may be replaced
by an equivalent storm sequence maintaining the same wave risk. The validity
of the above assumption is guaranteed by the statistical equivalence between
the actual storm and the related equivalent triangular one. The ETS
associated with a given storm is achieved by means of two parameters: the
triangle height

Considering all these aspects, it emerges that the actual storm and the ETS
sequences (actual and equivalent triangular seas) have the same number of
storms, each of them characterized by the same maximum significant wave
height and the same probability

Typical representation of actual storm and associated ETS.

The calculation of return values of

the base-height regression function,

the probability

Concerning the distribution of the significant wave height

In this study, return values from ERA-Interim data are compared with the values obtained from buoy data at the same location for different estimation models. Further study of the various uncertainties due to the parameter estimation method, the sample size, sample interval and location conditions involved in this analysis are also examined.

Estimated parameters from PWM and MLE methods for GEV model.

When the GEV distribution is applied to the sampled AM data, the scale, shape and location parameters can be used to make statements about the probability of the annual maximum exceeding a particular level. A change in any of the parameters can affect the long-period return levels.
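The sensitivity of long-period return levels to the fitted parameters can be seen from the standard GEV quantile expression (e.g. Coles, 2001). The sketch below uses the convention in which a positive shape parameter denotes a Fréchet-type tail; the function name and the parameter values in the example are hypothetical.

```python
import math


def gev_return_level(location, scale, shape, return_period_years):
    """Level exceeded by the annual maximum with probability 1/T in any year.

    z_T = location - (scale / shape) * (1 - y ** (-shape)), y = -ln(1 - 1/T),
    with the Gumbel limit location - scale * ln(y) when the shape is zero.
    """
    if return_period_years <= 1.0:
        raise ValueError("return period must be longer than 1 year")
    y = -math.log(1.0 - 1.0 / return_period_years)
    if abs(shape) < 1e-9:
        return location - scale * math.log(y)
    return location - (scale / shape) * (1.0 - y ** (-shape))


# Same location and scale, two hypothetical shape parameters.
for shape in (0.0, 0.1):
    print(shape, round(gev_return_level(3.0, 1.0, shape, 100.0), 2))
```

With identical location and scale, a modest increase in the shape parameter raises the 100-year level noticeably, which is why the choice of parameter estimation method matters for the tail.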

The parameter estimation is done by the MLE and PWM methods (Hosking et al., 1985) and the obtained parameters are shown in Table 2. The shape parameter is positive for the ERA-Interim data, indicating that these data follow the Fréchet distribution and that the tail of the cumulative distribution function decreases gradually.

The influence of estimated parameters in fitting the data to the GEV model is presented in Fig. 3a. It shows the level of fitting of the empirical CDF with the GEV PWM and GEV MLE models. The difference in the normal coordinates in their fitting with empirical CDF is insignificant. Figure 3b shows the variation in tail estimates of the PWM and MLE parameter estimation methods in logarithmic scale. The results show that for buoy and ERA-Interim datasets the PWM method of parameter estimation yields better estimates compared to the MLE method.

The statistical parameter, root mean square error (RMSE) was estimated in order to check the level of fitting of sampled data to the GEV distribution model. The RMSE is a residual between the empirical cumulative distribution obtained from the actual observed data and the theoretical GEV model cumulative distribution. The lower the value of RMSE, i.e. nearer to zero, the better the fit of sampled data to the GEV distribution model. The fitting of GEV to buoy and ERA-Interim data is found to be good for both PWM and MLE methods. The RMSE values of the MLE estimates are usually smaller than those of the PWM estimates for both buoy and ERA-Interim data.
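A goodness-of-fit measure of this kind can be sketched as follows. The empirical cumulative distribution is built here with the plotting position i/(n+1), which is an assumption of this sketch since the plotting position used in the study is not stated; the function names are likewise illustrative.

```python
import math


def gev_cdf(x, location, scale, shape):
    """GEV cumulative distribution (positive shape = Frechet-type tail)."""
    z = (x - location) / scale
    if abs(shape) < 1e-9:
        return math.exp(-math.exp(-z))
    t = 1.0 + shape * z
    if t <= 0.0:
        # Outside the support: below the lower bound or above the upper bound.
        return 0.0 if shape > 0 else 1.0
    return math.exp(-t ** (-1.0 / shape))


def cdf_rmse(sample, model_cdf):
    """RMSE between the empirical CDF of a sample and a model CDF."""
    x = sorted(sample)
    n = len(x)
    squared_error = 0.0
    for rank, value in enumerate(x, start=1):
        empirical = rank / (n + 1)  # assumed plotting position
        squared_error += (empirical - model_cdf(value)) ** 2
    return math.sqrt(squared_error / n)
```

The same `cdf_rmse` function can be reused for the GPD by passing a different model CDF, so the PWM and MLE fits can be compared on an equal footing.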

In the POT method, the selection of a suitable threshold value is the key to achieving a robust sample dataset. The mean residual life plot, which shows the mean excess over the threshold against the threshold itself, helps in determining a proper range from which the threshold can be selected (Coles, 2001). Such a plot with 95 % confidence limits for the dataset ERA IN-1 (Fig. 4) appears to have two slopes with the major transition in the threshold range of 1.5 to 2.5, indicating the range from which the threshold could be selected. However, care is needed because too high a threshold leaves too few exceedances in the sample, which increases the variance of the GPD model.
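The points of a mean residual life plot can be computed directly from the data, as in the sketch below. The 95 % limits are obtained here from a normal approximation to the sample mean of the excesses, which is an assumption of this sketch; the function name is illustrative.

```python
import math


def mean_residual_life(sample, thresholds):
    """Return (u, mean excess, lower, upper) for each candidate threshold u.

    Thresholds with fewer than two exceedances are skipped because the
    spread of the mean excess cannot be estimated from a single value.
    """
    points = []
    for u in thresholds:
        excesses = [x - u for x in sample if x > u]
        n = len(excesses)
        if n < 2:
            continue
        mean_excess = sum(excesses) / n
        variance = sum((e - mean_excess) ** 2 for e in excesses) / (n - 1)
        half_width = 1.96 * math.sqrt(variance / n)  # approximate 95 % limits
        points.append((u, mean_excess, mean_excess - half_width, mean_excess + half_width))
    return points


# Hypothetical wave heights (m), for illustration only.
print(mean_residual_life([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.5]))
```

A threshold is then sought above which the plotted mean excess is approximately linear in u, within the confidence limits.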

The sample used in the peaks over threshold method has to be extracted in such a way that the data can be modelled as independent observations. Declustering retains only the peak of each cluster of successive exceedances of a specified threshold, such that the retained peaks are sufficiently far apart to belong to “independent storms”. Specifically, in the present applications, cluster maxima less than 48 h apart are treated as belonging to the same cluster (Caires, 2011). Table 3 provides the selected threshold and the number of exceedances of that threshold with a 48 h interval. The threshold values are observed to depend on the length, location and sampling interval of the datasets. The major factor is likely the location, since the higher-latitude locations are exposed to more severe wave and wind conditions than those at lower latitudes.
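One simple way to implement such a declustering is sketched below: exceedances are scanned in time order, and an exceedance closer than 48 h to the current cluster peak is merged into that cluster, which keeps the larger of the two values. This is one possible reading of the 48 h criterion, not necessarily the exact procedure used here; the function name and input layout are assumptions.

```python
from datetime import datetime, timedelta


def decluster_peaks(records, threshold, min_separation_hours=48.0):
    """Return one (time, hs) peak per cluster of threshold exceedances.

    `records` is an iterable of (datetime, hs) pairs. The retained peaks
    are always at least `min_separation_hours` apart.
    """
    gap = timedelta(hours=min_separation_hours)
    peaks = []
    for timestamp, hs in sorted(records):
        if hs != hs or hs <= threshold:  # skip NaN and non-exceedances
            continue
        if peaks and timestamp - peaks[-1][0] < gap:
            # Same cluster: keep whichever value is larger.
            if hs > peaks[-1][1]:
                peaks[-1] = (timestamp, hs)
        else:
            peaks.append((timestamp, hs))
    return peaks


# Hypothetical 6-hourly style records, for illustration only.
t0 = datetime(2000, 1, 1)
series = [
    (t0, 3.0),
    (t0 + timedelta(hours=6), 4.0),
    (t0 + timedelta(hours=12), 2.0),
    (t0 + timedelta(hours=30), 3.5),
    (t0 + timedelta(hours=100), 5.0),
]
print(decluster_peaks(series, threshold=2.5))
```

Dividing the number of retained peaks by the record length in years gives the cluster occurrence rate needed for the POT return values.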

For parameter estimation, the PWM and MLE methods are used. The MLE has considerable statistical motivation but can turn out to be a poor estimator, especially when the number of estimated parameters is large. The approach chosen here was therefore to use both PWM and MLE for exploratory fitting of the probability model and to choose the best possible parameters.

To verify the estimated parameters for the GPD model, quantile–quantile (QQ) plots were used. In Fig. 5a, the QQ plots for the dataset NOAA44005 are shown, comparing the estimated GPD with the sample data for PWM parameter estimation method. In order to check the influence of parameters resulting from PWM and MLE parameter estimation models, the RMSE was estimated for GPD model also and presented in Table 3.

Comparing the estimates and the fits, one can conclude that the MLE fits seem less adequate and that the shape parameter estimates are lower than those of the PWM fits. These results support the recommendations of Hosking et al. (1985) to preferably use the PWM method for GPD or GEV estimation from the relatively short duration of data with limited heavy-tailed cumulative distributions. Figure 5b shows the return value GPD plot of PWM fit to the dataset NOAA44005.

P-app method has a distinct advantage of
selecting the optimum choice of the parameters

Estimated parameters from PWM and MLE methods for GPD model.

Mean residual plot for the dataset ERA IN-1 with 95 % confidence limits.

One can see the adaptation of P-app method to the real behaviour of the tails
for provision functions. For the Alghero location buoy data, the optimized
parameters obtained are

The optimum choice of parameters will also depend on the standard deviation

Polynomial approximation for series of wave heights

Selected optimum values of approximation parameters.

The calculation of the 100-year return values via ETS model is done by means
of Eq. (14), the base-height regression function Eq. (15) and the
probability distribution Eq. (16) of

In fact, Arena et al. (2013) have shown that as the time interval between two
successive

To determine the base-height regression function parameters, the actual
storm sequence is identified starting from

Specifically, considering an increase of

Base-height regression parameters

30-year return value estimates (m).

100-year return value estimates (m).

From the results, it is observed that the estimates from buoy observations
are higher compared to the estimates for ERA-Interim datasets. This trend is
being observed from all the estimation models. A variation of 20 to
30 % while comparing maximum observed

The underprediction of ERA-Interim data suggests that high wave events,
mainly due to cyclones, are difficult for the ECMWF numerical
model to capture. It is a familiar challenge that the smoothing
inherent in the numerical model flattens the variability at
relatively high frequencies, resulting in missing peaks. An additional
potential explanation for the underprediction is that the simulated
ERA-Interim data contains 6-hourly intervals of

The final 30- and 100-year extreme wave estimates, obtained by the GEV, GPD, ETS and P-app methods described above, are presented in Tables 6 and 7. The variation of these estimates from the measured maximum wave heights gives a statistical validation of the performance of the estimation models. For this analysis, the percentage variation of the 30- and 100-year return value estimates from the measured 36-year maximum wave height is calculated. The following principal features can be observed from the results of this statistical validation.

The GEV and GPD methods show the 30-year return values smaller than the
measured maximum

The GEV model with AM sample resulted in overestimation of return
values compared to the GPD model with peaks over threshold approach. The
GEV estimation model considers only the highest

The results from the P-app method are remarkably closer to the measured
maximum values than those obtained by the GEV, GPD and ETS method, with
variation ranges between 5 and

This consistency of P-app method estimates is due to the
dependence of return values on the actual kind of the tail of provision
function, which is dependent on the entire sample of the time series. The
only disadvantage of the P-app method (

In this study, we chose the simulated ERA-Interim wave data for the two following reasons. First, they have more regular coverage for the whole World Ocean, and the Indian coast in particular. Second, numerically simulated datasets have long and regular continuous series, which is very important for the extreme value statistical aims.

This study focused on the estimation of the extreme significant wave heights only. The analyses carried out and the results obtained will aid in the development of a 100-year extreme wave map for the Indian water domain, which may serve as a quick guide to identify regions where extremes lie within the design criteria of the coastal and offshore structures to be constructed.

We have considered four different approaches of the return value estimation: the GEV distribution model based on annual maxima sample, the GPD distribution model based on peaks over threshold sample, the ETS model based on storms and the P-app method based on extrapolation of the tail of the provision function. All of them have their own advantages and shortcomings.

The main drawback of the GEV and GPD methods is the large variation, as underestimation or overestimation, of the return values with respect to the measured maximum values in the time series. The shortcoming of the P-app method is related to the ambiguity of the return value estimates obtained from different parts of the full time series. It is also found that the values estimated with the GEV model were slightly higher than those from the GPD. However, the GPD method with the peaks over threshold sample is preferable at locations with multiple storm events in a single year. In turn, the estimates from the P-app method, which depend on the actual shape of the tail of the provision function, were consistent in the 100-year return values for both simulated and buoy wave height datasets, varying between 7 and 13 % from the measured maximum values.

It is observed that the return value estimates from buoy observations are higher than the estimates from the ERA-Interim datasets. The underprediction of ERA-Interim data suggests that high wave events, mainly due to cyclones, are difficult for the ECMWF numerical model to capture. To overcome this, the ECMWF numerical modelling system needs further improvement in the correction or calibration of the ERA-Interim data, especially when this hindcast is used for extreme wave analysis.

ERA-Interim significant wave height data produced by the ECMWF can be accessed
from

The authors declare that they have no conflict of interest.

This paper has been developed by authors from IIT Madras and Mediterranean University during the Marie Curie IRSES project “Large Multi-Purpose Platforms for Exploiting Renewable Energy in Open Seas (PLENOSE)” funded by European Union (grant agreement no. PIRSES-GA-2013-612581).

Edited by: T. Wagener
Reviewed by: two anonymous referees