Assessment of reliability of extreme wave height prediction models

Extreme waves influence coastal engineering activities and have an immense geophysical implication. Therefore, their study, observation and extreme wave prediction are decisive for planning of mitigation measures against natural coastal hazards, ship routing, design of coastal and offshore structures. In this study, the estimates of design wave heights associated with return period of 30 and 100 years are dealt with in detail. The design wave height is estimated based on four different models to obtain a general and reliable model. Different locations are considered to perform the analysis: four sites in Indian waters (two each in Bay of Bengal and the Arabian Sea), one in the Mediterranean Sea and two in North America (one each in North Pacific Ocean and the Gulf of Maine). For the Indian water domain, European Centre for Medium-Range Weather Forecasts (ECMWF) global atmospheric reanalysis ERA-Interim wave hindcast data covering a period of 36 years have been utilized for this purpose. For the locations in Mediterranean Sea and North America, both ERA-Interim wave hindcast and buoy data are considered. The reasons for the variation in return value estimates of the ERA-Interim data and the buoy data using different estimation models are assessed in detail.


Introduction
The Indian Ocean with two horns of the Arabian Sea and the Bay of Bengal has been playing a significant role in the regional economic development.This rapid progress is attributed to a variety of activities in the coastal and offshore sectors that include construction and development of major ports and fishing harbours, establishment of power plants, offshore exploration and exploitation of oil and gas, and tampering of ocean wave and tidal energy.To sustain these developments along the coast, the aforementioned activities require a variety of coastal and offshore structures such as groins, sea walls, breakwaters, offshore platforms, intake and outfall structures, submarine pipelines, etc. to be constructed in the marine environment.It is hence mandatory to design these structures for its life span which could be achieved by considering its survival conditions.The most dominant environmental forces that dictate this design of the structure are due to the maximum probable wave height of a site of interest (Massel, 1978).
Depending on the importance and lifespan of the structure, the return period of the extreme events could be selected as 30 years or 100 years.The lesser would be associated with lesser wave height but more risk and vice versa.It demands a better understanding of hydrodynamic characteristics of local wave environment, especially the extreme conditions.In the design of any marine structures, the first step is the extreme wave analysis for the determination of design wave heights with certain return periods (Goda, 2000).Estimation of appropriate design values indicates the level of protection and the scale of investment during the construction of the structure.
Fundamentally, extreme values are scarce and are necessarily outside the range of the available observations, implying that an extrapolation from the observed sea states to unknown territories is required.An estimate of anticipated wave height can be furnished using historical wave hindcast data or field observed data with the help of various distribution models, which enable extrapolation under the extreme value theory framework (Goda, 2000;Coles, 2001;Caires, Published by Copernicus Publications on behalf of the European Geosciences Union.2011).Ferreira and Guedes Soares (2000) suggested that the estimation of extreme values should rely on methods based on extreme value theory which makes use of the largest of the observations in the sample.Coles (2001) obtained the detailed statistical results of extreme value prediction using the annual maximum (AM) (Castillo, 1988) and Peaks Over Threshold (POT) (Ferreria and Guedes Soares, 1998) sampled observations.Caires, 2011 rigorously compared the commonly used extreme value statistical methods (like generalized extreme value, or GEV, and generalized Pareto distribution, or GPD) with different parameter estimation methods for combination of different data sampling techniques.
Another approach that may be applied starting from a wave data time series is that of equivalent storm models (Boccotti, 1986(Boccotti, , 2000;;Fedele and Arena, 2010;Laface and Arena, 2016), which is based on the concept of sea storm.Specifically, these models consist of substituting the sequence of sea storms at a given site (actual sea) with a sequence of equivalent storms (equivalent sea) from a statistical perspective.The equivalent storms have very simple geometric shapes such as triangular (Boccotti, 1986(Boccotti, , 2000;;Arena and Pavone, 2009), power (Fedele and Arena, 2010;Arena et al., 2014) or exponential (Laface and Arena, 2016).Depending on the shape the related model gives an analytical or numerical solution for the calculation of the return period R(H s > h) of a sea storm whose maximum H s is greater than a given threshold h.Specifically, the triangular and exponential equivalent storm models give a closed form solution for R(H s > h), while the equivalent power storm model requires numerical calculation.In the paper the Equivalent Triangular Storm (ETS) model is utilized.
The accuracy of any methodology for extreme values significantly depends on the length of the recorded time series.It is believed that measurements from wave rider buoy offer the most reliable long historical record.However, the availability of such buoy data is limited to certain specific locations, mainly in the northern hemisphere.At a particular location of interest, the availability of buoy data is usually scarce, and often there will be no data.The oceanographic community has recognized the hindcasts with ocean wave models to complement the limited buoy observational records.
In the recent years, the performance of wave models has appreciably improved, with better quality of the wind fields and enhancement in numerical wave modelling.The meteorological centres like European Centre for Medium-Range Weather Forecasts (ECMWF), Australian Bureau of Meteorology and Meteo France that operate global wave models are currently using altimeter wind data for data assimilation purposes.The process combines numerical wave model and observations of diverse sorts in the best possible ways to generate a consistent, global estimate of the various atmospheric, wave and oceanographic parameters.At present in numerous meteorological centres, wind and wave simulated data are assimilated on a daily basis.
The simulated hindcast data have been adopted in numerous studies for the estimation of extreme wave conditions.Teena et al. (2012) applied a GEV distribution and GPD to the 31 years assimilated wave hindcast data based on MIKE-21, a spectral wave model for a location in the eastern Arabian Sea and extracted extreme wave for several return periods.Li et al. (2016) used a third-generation wave model, WAMC4, and simulated 35 years of wave hindcast data from two sets of reanalysis wind data, NCEP and ECMWF.In their study, Pearson III distribution method is used to analyse the extreme wave climate in the East China Sea.Polnikov and Gomorev (2015) proposed to use the extrapolation of a polynomial approximation constructed for the shorter part of the tail of probability function to estimate the return values of wind speed and wind-wave height.The wave field was computed from the wave model, WAM-C4M, from ECMWF global atmospheric reanalysis ERA-Interim wind field data.
Even though several studies have been carried out, a study on the identification of the most suitable approach for estimating extreme wave heights for a particular source of assimilated wave hindcast data is still lacking.In the present study, the investigation of different existing approaches and models is carried to assess its application and reliability for the Indian domain.Increased uncertainty in the model outputs questions the reliability of the estimation model, which is an important issue.Thus, the present study introduces a statistical approach to validate the reliability of the design wave height return values resulting from a particular extreme wave estimation method by considering variability criterion on the basis of measured maximum value.The variation in the extreme value estimates of the ERA-Interim data and the buoy data for different estimation models is also considered and examined.The objective of the present study is to identify a robust extreme wave height estimation method for the Indian domain using global atmospheric reanalysis ERA-Interim wave hindcast data.

Study locations
Four offshore locations along the Indian coast (Fig. 1) are considered.The selection of these particular locations is based on their distance from the nearest coast and the water depth, two each on east and west coasts of the Indian peninsula.Both deep and shallow water locations are chosen to examine the application of the estimation model based on water depth.
The projected estimates using ERA-Interim data are compared with those obtained from data from various buoy datasets to validate the performance of ERA-Interim data in extreme wave analysis.The choice of the locations was made according to the size of wave data that were available.Further, two locations in North America, National Data Buoy  Center Station 44005 in Gulf of Maine and National Data Buoy Center Station 46050 west of Newport, and one of the most energetic sites in the coasts of central Mediterranean Sea (Liberti et al., 2013;Vicinanza et al., 2013;Arena et al., 2015) from the Italian buoys network locations, Alghero (west coast of Sardinia Island), are considered.A comprehensive comparison has been carried out by extracting the ERA-Interim data of resolution 0.125 • × 0.125 • nearest to the selected buoy locations.The coordinates, period of data availability, interval and number of data points for these locations are presented in Table 1.

ERA-Interim data
ERA-Interim data are produced by the ECMWF, which is a global atmospheric reanalysis from 1979, continuously up-dated in real time and among the most recent reanalysis data available (Berrisford et al., 2009).ERA-Interim is the first to perform reanalysis using adaptive and fully automated bias corrections of observations (Dee and Uppala, 2008).The parameters such as significant wave height (H s ), mean wave direction and mean wave period can be obtained with 6-hourly fields covering the whole globe, with the best space resolution of 0.125 • × 0.125 • .There have been several studies comparing the values of H s between ERA-Interim dataset and buoy data at different locations around the world to evaluate the model performance (Shanas and Kumar, 2014;Kumar and Nassef, 2015).It has been found that at certain locations in the Arabian sea, the maximum H s based on ERA dataset in deep water is about 15 % less than that of buoy measured data, whereas in shallow waters the ERA dataset overpredicts the maximum H s by about 9 %.The underprediction in deep water suggests that extreme events attained mainly during cyclones are difficult to be captured by the model.The results show that H s of the model dataset is reliable in both deep and shallow water locations with a good degree of accuracy.The estimates in this study are based on ERA-Interim wave hindcast data, covering a period of 36 years .For nearest intersection buoy locations, the data period was selected based on buoy data availability.

Buoy data
The most reliable data for significant wave height are from the buoy measurements.The available length of buoy data is usually limited and the data prior to 1978 is scant.The available buoy data further require significant quality control on account of large gaps of missing data and outlier flagship measurements.In this paper data from two different buoys networks are processed: the Italian network RON (Rete Ondametrica Nazionale) and the US National Oceanic and Atmospheric Administration's National Data Buoys Center (NOAA-NDBC).
The Italian buoys network (RON) started measurements in 1989, with eight directional buoys located off the coasts of Italy.Later it reached 15 buoys moored in deep water.For each record, the data of significant wave height, peak and mean period and dominant direction are given.
The NOAA manages the NDBC, which consists of many buoys moored along the US coasts, both in the Pacific Ocean and in the Atlantic Ocean.Some buoys were moored in the late 1970s so that more than 35 years of data are available.The historical wave data give hourly significant wave height, peak and mean period.The NOAA buoy observations are readily available and are of proven quality.The measurements have passed through quality control by NOAA.It is, however, always recommended to perform some basic quality checks.
The return value estimates acquired from the ERA-Interim data are compared with that of NDBC stations 44005 and 46050 and at Alghero along the coast of central Mediterranean Sea.Table 1 provides the coordinates and data details of these buoy stations.ERA-Interim wave hindcast data have been used to assess the estimates in Indian waters.
3 Extreme wave height estimation methods

General
The estimation models used in this study to obtain extreme wave return values include the GEV and the GPD, which are currently being adopted for the standard practice in mainstream extreme statistics.Each distribution was fit to the data using the maximum likelihood estimate (MLE) method and the probability weighted moments (PWM) method.Further, a new polynomial approximation (P-app) model prescribed by Polnikov and Gomorev (2015) and ETS model (Boccotti, 2000) based on the concept of replacing the sequence of actual storms extrapolated from a given time series of H s with a sequence of equivalent triangular storms are used.

Generalized extreme value distribution model
According to extreme value theory, to form a valid distribution the sampled observations should be independent which would mean that successive observations should not be correlated with one another and should be identically distributed (Goda, 2000).In general, for the sampling of data to be used for extreme wave analysis, three different approaches are available.The first approach uses all the recorded data of H s during a number of years and fits a cumulative distribution to these data.This approach is called the initial distribution method (IDM).For the other two approaches, only the peaks of wave heights are engaged.The method of block maxima consists of partitioning recorded data in blocks, wherein the maximum value of each block is considered.Normally a block could be chosen as 1 year (Lionello et al., 1992).The POT method consists of the peaks of clustered data ex-ceeding a given threshold.IDM observations violate the conditions of identity and independence in distribution, which invalidates the application of the common statistical methods as well as the definition of return values (Anderson et al., 2001).The AM method and POT method both satisfy the obligatory of independency.
According to theory of the GEV distribution, the sample has been selected by means of AM method.
The GEV distribution for a given random variable H has the cumulative distribution function (CDF) as where µ, σ and ξ represent the location, scale and shape parameters of the distribution, respectively, and within the range of −∞ < µ < ∞, σ > 0 and −∞ < ξ < ∞.By setting the shape parameter, ξ , one can obtain the most common distributions like Gumbel (ξ = 0, Fréchet (ξ > 0) and Weibull (ξ < 0).The 1/T yr wave height return value, H T , based on the GEV distribution model is given as (2)

Generalized Pareto distribution model
This approach is based on fitting the GPD to the POT sampled data.The observations in a cluster above the threshold are considered and calculating return values has been done by taking into account the rate of occurrence of clusters (Davidson and Smith, 1990;Coles, 2001).
The cumulative distribution function of the GPD is given as where µ, σ and ξ represent the location, scale and shape parameters of the distribution, respectively, and within the range of 0 < x < ∞, σ > 0 and −∞ < ξ < ∞.When ξ = 0 the GPD is said to amount to the exponential distribution with mean σ ; when ξ > 0, it is the Pareto distribution; and when ξ < 0 it is a special case of the beta distribution.
The 1/T yr wave height return value based on the GPD distribution model, H T , is given as Nat. Hazards Earth Syst.Sci., 17, 409-421, 2017 www.nat-hazards-earth-syst-sci.net/17/409/2017/where λ = N u /N , with N u being the total number of exceedances above the selected threshold µ 0 and N the number of years in the record.
There are several parameter estimation methods for fitting the above candidate distribution functions to the sampled wave data (Goda, 2000).The method of moments (MM), PWM method and the MLE are more preferred estimation methods since these are more flexible, particularly when the number of parameters is increased.The MM yields a large bias particularly for small size samples and this method was not used in the present study.The parameters of the above distributions are derived according to the methods of MLE and PWM.
The threshold selection in GPD analysis is an important practical problem, which is analogous to the block size in the block maxima approach.The threshold value represents a compromise between bias and variance.Too low a threshold violates the asymptotic basis of the GPD model, leading to a bias.Too high a threshold will generate fewer values of excess to estimate the model, leading to high variance.There is extensive literature on the attempt to choose an optimal threshold by Neelamani (2009) and Caires (2011).In this study, the threshold selection is based on the mean residual life plots introduced by Davison and Smith (1990).
The mean residual life plot is based on the theoretical mean of the GPD given as The mathematical basis for mean residual life plots method is If H is following GPD distribution, then the mean excess over a threshold µ 0 (for y > µ 0 ) with slope ζ /(1 − ζ ) is a linear function of µ 0 .Thus, we can draw a plot in which the ordinate is the sample mean of all excesses over that threshold and the abscissa is the threshold.A mean residual life plot consists in representing points: where m is the number of observations (H i , i = 1, 2, m) above the threshold µ 0 , and H max is the maximum of the observations.According to the central limit theorem, confidence intervals are added to this mean residual life plot as the empirical mean to be normally distributed.However, this normality does not hold for high threshold as there are less and less excesses.

Polynomial approximation model
Polnikov and Gomorev (2015) proposed to use the extrapolation of polynomial approximation constructed for the shorter part of the tail of probability function to estimate the return values of wind speed and wave height.This method involves the construction of an analytical approximation F ap (H ), aimed for its extrapolation beyond the observed maximum value H M .The approximation should be restricted to a shorter domain lying above the uppermost mode of the histogram considered of the function F (H ).The domain suitable for approximation can be determined by the condition where H l and H h are the lower and the upper edges of the domain of F (H ), used for constructing approximation F ap (H ).
The number of points (N M ) considered in the histogram is H M / H and N S is defined as The number of points (N T ) used for building approximation F ap (H ) is defined as The approximation F ap (H ) should be built in the logarithmic coordinates due to the existence of fewer data at the tail of F (H ), providing importance to the tail values.It allows assessing the strong variability of the tail of function F (H ) near the maximum value of the series, depending on the length of the series.To exclude the application of fixed statistics, the approximation function F ap (H ) in the form of a polynomial of degree, n, is considered, the value of which may vary.The varying n allows obtaining the approximation F ap (H , n) with an accuracy higher than the case of using the fixed statistical distributions.
The statistical distribution with the provision function is of the form Once the approximation function F ap (H ) is obtained from Eq. ( 11), the return value, H R , appearing once for N R years, can be deduced by inverting the formula where t is the interval of discrete of data observations.Another principal feature of polynomial approximation F ap (W ) is the standard deviation δ, defined by the formula Obviously, the lower δ is, the higher accuracy of approximation can be achieved, which is more preferable.The ETS model (Boccotti, 2000;Arena andPavone, 2006, 2009) is applied for calculating return values of significant wave height for given thresholds of return period.The ETS approach is based on the assumption that given a sequence of actual storms it may be replaced by an equivalent storm sequence maintaining the same wave risk.The validity of the above assumption is guaranteed by the statistical equivalence between the actual storm and the related equivalent triangular one.The ETS associated with a given storm is achieved by means of two parameters: the triangle height a and its base b (Fig. 2).The former is an intensity parameter and is equaled to the maximum significant wave height during the actual storm; the latter is a duration parameter and it is determined following an iterative procedure that imposes the equality between the maximum expected wave heights of actual and triangular storms.It has been numerically proven that by imposing this equality not only is the area under the exceedance probability curves of the maximum wave height the same but those curves also tend to coincide (Boccotti, 2000;Arena and Pavone, 2006;Laface and Arena, 2016).Considering all these aspects, it emerges that the actual storm and the ETS sequences (actual and equivalent triangular seas) have the same number of storms, each of them characterized by the same maximum significant wave height and the same probability P (H max > H ) that the maximum wave height is greater than a fixed threshold H .The considerations above enable us to affirm that the return period of a sea storm with given characteristics is the same if calculated starting from the actual storm sequence or the ETS one.Referring to the equivalent triangular sea, an analytical solution for the calculation of the return period R(H s > h) of a sea storm whose maximum significant wave height is greater than a given threshold h has been developed by Boccotti (2000).

R (H
where b(h) is the base-height regression function of ETS, P (H s > h) is the probability of exceedance of the significant wave height H s at the considered site and dh is the probability density function of H s .
The calculation of return values of H s by means of Eq. ( 14) requires the determination of two functions: the base-height regression function, b(h), which gives the average value of the base b of ETS for a given storm height h; the probability P (H s > h).
The function b(h) is determined starting from the ETS sequence diving storm in classes of storm intensity a = h of where k 1 (h) and k 2 (m −1 ) are parameters depending on the characteristics of the storm at the considered site.
Concerning the distribution of the significant wave height P (H s > h), a three-parameter Weibull distribution is considered.

P (H
where u, h l and w are the characteristic parameters at the considered location.In particular, u and h l are the shape parameters and w is the scale parameter of the distribution.

Results
In this study, return values from ERA-Interim data are compared with the values obtained from buoy data at the same location for different estimation models.Further study of the various uncertainties due to the parameter estimation method, the sample size, sample interval and location conditions involved in this analysis are also examined.

GEV analysis
In the application of generalized extreme value distribution to the sampled AM data, the scale, shape and location parameters can be used to make statements about the probability of the annual maximum exceeding a particular level.A change in any of the parameters can affect the long-period return levels.
The parameter estimation is done by the MLE and PWM methods (Hosking et al., 1985) and the obtained parameters are shown in Table 2.It has been observed that the shape parameter is positive for ERA-Interim data, indicating that The influence of estimated parameters in fitting the data to the GEV model is presented in Fig. 3a.It shows the level of fitting of the empirical CDF with the GEV PWM and GEV MLE models.The difference in the normal coordinates in their fitting with empirical CDF is insignificant.Figure 3b shows the variation in tail estimates of the PWM and MLE parameter estimation methods in logarithmic scale.The results show that for buoy and ERA-Interim datasets the PWM method of parameter estimation yields better estimates compared to the MLE method.
The statistical parameter, root mean square error (RMSE) was estimated in order to check the level of fitting of sampled data to the GEV distribution model.The RMSE is a residual between the empirical cumulative distribution obtained from the actual observed data and the theoretical GEV model cumulative distribution.The lower the value of RMSE, i.e. nearer to zero, the better the fit of sampled data to the GEV distribution model.The fitting of GEV to buoy and ERA-Interim data is found to be good for both PWM and MLE methods.The RMSE values of the MLE estimates are usually smaller than those of the PWM estimates for both buoy and ERA-Interim data.

GPD analysis
In POT method, the selection of a suitable threshold value is the key in achieving a robust sample dataset.The mean residual plot, between the mean excess GPD and threshold, helps in determining a proper range of threshold to be selected (Coles, 2001).Such plots with 95 % confidence for the data ERA IN-1 (Fig. 4) appear to have two slopes with the major transition at the threshold range of 1.5 to 2.5, indicating the range of threshold could possibly be selected.However, attention should be paid because too-high thresholds can result in a less sampled dataset which results in a higher variance of the GPD model.
The sample used in the peaks over threshold method has to be extracted in such a way that the data can be modelled as independent observations.A process of declustering helps to collect only the peaks within the clusters of successive exceedances of a specified threshold and are retained in such a way that they are sufficiently apart (so that they belong to "independent storms").Specifically, in the present applications, we have treated cluster maxima at a distance of less than 48 h apart as belonging to the same cluster (Caires, 2011).Table 3 provides the selected threshold and the number of exceedances of that specified threshold with a 48 h interval.It is seen that the threshold values are observed to be dependent on the length, location and interval of the datasets.The major factor has to be the location since the higher latitude locations are exposed to more severe wave and wind conditions than those at the lower latitudes.
For parameter estimation, the PWM and MLE methods are used.The MLE has a considerable statistical motivation but can turn out to be poor estimators, especially in the case where the number of estimated parameters is large.So the approach chosen here was to utilize a variety of techniques like PWM and MLE for exploratory fitting for the probability model and choose the best possible parameters.
To verify the estimated parameters for the GPD model, quantile-quantile (QQ) plots were used.In Fig. 5a, the QQ plots for the dataset NOAA44005 are shown, comparing the estimated GPD with the sample data for PWM parameter estimation method.In order to check the influence of parameters resulting from PWM and MLE parameter estimation models, the RMSE was estimated for GPD model also and presented in Table 3.
Comparing the estimates and the fits, one can conclude that the MLE fits seem less adequate and that the shape parameter estimates are lower than those of the PWM fits.These results support the recommendations of Hosking et al. (1985) to preferably use the PWM method for GPD or GEV estimation from the relatively short duration of data with limited heavy-tailed cumulative distributions.Figure 5b shows the return value GPD plot of PWM fit to the dataset NOAA44005.

Polynomial approximation method analysis
P-app method has a distinct advantage of selecting the optimum choice of the parameters N S , N T and n.The detailed analysis demonstrates that all approximation parameters (n, N T and N S ) are equally important.Figure 6 shows the application of the P-app method for both buoy and ERA-Interim data at Alghero buoy station.In the abovementioned figure, the bottom level (ln(1 − F ) = −12.6)indicates the probability of occurrence once every 100 years and can be deduced by Eq. ( 12) with a 3 h interval of discrete data observations.For 1 and 6 h intervals of data observations, the probability of occurrence once in 100 years is −13.7 and −11.9 respectively.
One can see the adaptation of P-app method to the real behaviour of the tails for provision functions.For the Alghero location buoy data, the optimized parameters obtained are N S = 0, N T 8 (points used for approximation) and n = 2    (degree of approximation function) to arrive at the optimum return value as shown in Fig. 6.The optimum choice of parameters will also depend on the standard deviation δ (Eq.13), which resembles the residual between the actual tail of the provision function and the Papp tail fitted to it.The lower the value of δ, i.e. nearer to zero, the better the fit between the actual tail of the provision function and the P-app with tail fitted.The parameters N S , N T and n for all the datasets including the resulted standard deviation, δ, are provided in Table 4.

Analysis of ETS model
The calculation of the 100-year return values via ETS model is done by means of Eq. ( 14), the base-height regression function Eq. ( 15) and the probability distribution Eq. ( 16) of H s at the examined location.The base-height regression function is determined starting from the storm sequence at the considered site, while the probability distribution is achieved processing the whole dataset of H s .An important aspect to be taken into account in estimating the parameters of both Eqs. ( 15) and ( 16) is the time interval between two successive data of H s .A value of 3 to 6 h should be appropriate for estimating the parameters of the probability distribution in order to guarantee the stochastic independence between successive events, but this could be too high for determining the parameters of the base-height regression function.
In fact, Arena et al. (2013) have shown that as the time interval between two successive H s increases, the peak of the storm may not be well identified, involving flat storm history that led to an increase of the duration b of ETSs with respect to the case with the lower time interval between H s data.Such situations are those of wave model data.In this paper, both wave model and buoy data are considered.
To determine the base-height regression function parameters, the actual storm sequence is identified starting from H s time series, and for each actual storm the parameters a and b of ETSs are calculated (Boccotti, 2000).Then the ETSs are divided into classes of H s of 1 m width and the average value a m and b m of a and b for each class is considered.The sequence a m , b m is plotted in a Cartesian diagram and fitted by an exponential law as the Eq. ( 15).The determination of the base-height regression function, despite being very simple from a computational and mathematical point of view, requires careful attention because of its sensibility to the time interval between the data of H s used in the analysis.In this regard, it is worth noting that ETS duration parameter b is strongly dependent on the actual storm structure close to the storm peak.Specifically, it tends to increase as the storm structure becomes flat and it is quite small for steep storms.When the data sampling interval is more than 1 to 3 h, one may have very flat storms.Thus the calculation may lead to huge values of duration b.This aspect can cause return values of H s to be underestimated (Arena et al., 2013).This aspect strongly affects predictions when wave model data are processed (3 to 6 h between two successive data of H s ).For this reason, a further step is required for the calculation of b m when processing wave model data.A good practice is to do the analysis in conjunction with buoy data close to the location under study.In these cases, the base-height regression function calculated from buoys is utilized to correct the baseheight regression function obtained from model data.Specifically, considering an increase of b due to high time interval between H s data, the regression should be corrected considering a reducing factor r, defined as the ratio between the average values of the base calculated starting from buoy data moored close to the considered site and the average value calculated by means of wave model data.The regression parameters k 1 and k 2 at each considered site are summarized in Table 5 in conjunction with the parameters u, w and h l of the probability distribution Eq. ( 16).

Discussions
From the results, it is observed that the estimates from buoy observations are higher compared to the estimates for ERA- Interim datasets.This trend is being observed from all the estimation models.A variation of 20 to 30 % while comparing maximum observed H s of buoy data and ERA-Interim at NOAA44005, NOAA46050 and Alghero locations is observed.This, in turn, will result in underestimation of the return value of ERA-Interim data.
The underprediction of ERA-Interim data suggests that high wave events mainly due to the cyclones are difficult to capture by ECMWF numerical model.It is a familiar phenomenon and challenge that the smoothing effect implanted in the numerical model will lead to the flattened variability at relatively high frequencies, resulting in the missing peaks.An additional potential explanation for the underprediction is that the simulated ERA-Interim data contains 6-hourly intervals of H s data.It is possible that, because of the lower sampling rate, the maximum wave heights in a storm occur between observations will not be recorded.To overcome this, it is obvious that the ECMWF numerical modelling system needs further improvement in correction or calibration of the ERA-Interim data, especially when this hindcast is used for the extreme wave analysis.
Final results on the 30-and 100-year extreme wave estimates, obtained by the GEV, GPD, ETS and P-app methods described above, are presented in Tables 6 and 7.The variation of these estimates from the measured maximum wave heights will give a statistical validation of the performance of the estimation models.The percentage of variation of 30and 100-year return value estimates from measured 36-year maximum wave height is calculated for this analysis.Here one can observe the following principal peculiarities from the results of abovementioned statistical validation methodology.
The GEV and GPD methods show the 30-year return values smaller than the measured maximum H s for all the locations mostly by an extent of 10 to 25 %.In the cases of simulated data, these models exhibit high deviations from measured maximum H s .This peculiarity is because of neglecting the behaviour of the tails of provision functions, accepted in GEV and GPD methods.As a result, this leads to underestimating the return values.This is a reasonable shortcoming of these methods, as far as one cannot forecast extreme smaller return values one cannot forecast return values much smaller than those observed already.
The GEV model with AM sample resulted in overestimation of return values compared to the GPD model with peaks over threshold approach.The GEV estimation model consid-ers only the highest H s in the year, which might lead to the overestimation of AM in comparison with the other method.For most of the locations, there is not much variation in the results of the PWM and MLE parameter estimated GEV and GPD models.But Hosking et al. (1985) recommended to apply the PWM parameter estimation method for GPD and GEV distribution models from relative short datasets with not too heavy-tailed distributions.Furthermore, PWM works for a wider range of parameter values than MLE method.
The results from the P-app method are remarkably closer to the measured maximum values than those obtained by the GEV, GPD and ETS method, with variation ranges between 5 and −7 %.The P-app method shows consistency in 100-year estimated return values for both simulated and buoy wave height datasets, these varies consistently between 7 and 13 % from the measured maximum values.GEV, GPD and ETS methods fail in the abovementioned criterion as variation is as high as 56 % and as low as −19 %, which is not possible in nature.
This consistency of P-app method estimates is due to the dependence of return values on the actual kind of the tail of provision function, which is dependent on the entire sample of the time series.The only disadvantage of the P-app method (F ap (H s , n)) is the necessity to control reliability of its extrapolation, as far as the extrapolation of a polynomial with the order n > 1 may have twists and extremes.This wellknown fact could be provided by a considerable variability of the "tail" for F (W ).Such an extrapolation is implausible, of course.Therefore, it is necessary to vary the parameters N S , N T and the order of polynomial n in such a manner that the twists of extrapolation could be avoided.

Conclusions
In this study, we chose the simulated ERA-Interim wave data for the two following reasons.First, they have more regular coverage for the whole World Ocean, and the Indian coast in particular.Second, numerically simulated datasets have long and regular continuous series, which is very important for the extreme value statistical aims.
This study focused on the estimation of the extreme significant wave heights only.The analyses carried out and result obtained will aid in the development of a 100-year extreme wave map for the Indian water domain, which may serve as a quick guide to identify regions where extremes lie within the design criteria of the coastal and offshore structures to be constructed.
We have considered four different approaches of the return value estimation: the GEV distribution model based on annual maxima sample, the GPD distribution model based on peaks over threshold sample, the ETS model based on storms and the P-app method based on extrapolation of the tail of the provision function.All of them have their own advantages and shortcomings.
The main drawback of the GEV and GPD methods is the high variation in underestimating or overestimating return values with respect to the measured maximum values in the time series.The shortage of the P-app method is related to the ambiguity of the return values estimations, obtained from different parts of the full time series.It is also found that the values estimated based on GEV model were slightly higher than those from the GPD.However, the GPD method with peaks over threshold sample is preferable in the locations of multiple storm events in a single year.In turn, the estimates through the P-app method, depend on the actual kind of tail of provision function, showed the consistency in 100-year estimated return values for both simulated and buoy wave height datasets, as these vary consistently between 7 and 13 % from the measured maximum values.
It is observed that the return value estimates from buoy observations are higher when compared to the estimates for ERA-Interim datasets.The underprediction of ERA-Interim data suggests that high wave events mainly due to the cyclones are difficult to capture by ECMWF numerical model.To overcome this, it is obvious that the ECMWF numerical modelling system needs further improvement in correction or calibration of the ERA-Interim data, especially when this hindcast is used for the extreme wave analysis.
Competing interests.The authors declare that they have no conflict of interest.

Figure 1 .
Figure 1.Selected locations for ERA-Interim data along the Indian Coast.

Figure 2 .
Figure 2. Typical representation of actual storm and associated ETS.

Figure 3 .
Figure 3. (a) Comparison of GEV model CDF to the empirical CDF for NOAA 44005 and ERA IN-4 and (b) variation of tail GEV model CDF in logarithmic coordinates for NOAA 44005 and ERA IN-4.

Figure 4 .
Figure 4. Mean residual plot for the dataset ERA IN-1 with 95 % confidence limits.

Figure 5 .
Figure 5. (a) Quantile-quantile plot of GPD model for NOAA 44005 data and (b) return level plot of GPD model for NOAA 44005 data.

Figure 6 .
Figure 6.Polynomial approximation for series of wave heights H s at Alghero buoy station for buoy and ERA-Interim datasets.

Table 1 .
ERA-Interim data locations and buoy stations.

Table 2 .
Estimated parameters from PWM and MLE methods for GEV model.

Table 3 .
Estimated parameters from PWM and MLE methods for GPD model.

Table 4 .
Selected optimum values of approximation parameters.

Table 5 .
Base-height regression parameters k 1 and k 2 calculated by considering a storm sample with actual durations greater or equal to 18 h; probability distribution parameters u, w and h l .