\documentclass[10pt]{article}
\usepackage{mjms-style}
\usepackage{graphicx,epstopdf,epsfig}
%\usepackage[numbers]{natbib}
\usepackage{psfrag}
%-----------Title and Authors----------%
\title{\Large \bf Modelling Record Times in Sport with Extreme Value Methods}
\author[1]{author 1 $^*$}
\author[2]{author 2 }
%\author[2]{Author C}
\affil[1]{affil 1}
\affil[2]{affil 2}
\date{\vspace*{-2\baselineskip}}
\newcommand\ttitle{Modelling Record Times in Sport with Extreme Value Methods} % short title
\newcommand\authors{author} % authors
\date{\textit{E-mail: \\$^*$Corresponding author}}
%\date{Received: date \ Accepted: date}
%----------newcommands---------- %
%----------Abstract---------- %
\begin{document}
\maketitle
\thispagestyle{fancy2}
\begin{abstract} \noindent
We exploit connections between extreme value theory and record processes
to develop inference methods for records. Record processes have
trends in them even when the underlying process is stationary. We study
the problem of estimating the underlying trend in times achieved in
sports from record data. We develop new methods for inference, for
simulating record series in non-stationary contexts, and for assessing fit,
all of which account for the censored nature of record data, and we
apply these methods to athletics and swimming data.
\begin{keywords}
Athletics, extreme value theory, Poisson process, records, swimming.
\end{keywords}
\end{abstract} \vspace{\fill} \newpage
%----------Main body----------%
\section{Introduction} \label{introduction}
In many fields records are of particular interest, for example in weather
(highest flood level, hottest temperature, strongest wind speed) and in
finance (largest insurance claim, biggest stock market crash).
In sporting events, such as athletics and swimming, records are the vital
events in characterising the development of the sport: they are the
target for competitors and they receive most publicity.
Records are, however, simply the most extreme values in a series at the
time of their occurrence, so there are natural connections
between the theory for records and the theory for extreme values. These
connections have been exploited in probabilistic results for records by
\cite{bar,Ahsan}, \cite{Glick78, Smith1988, Ahs, Dav, Bai, Sibuya}
and \cite{Benestad2004}. In this paper we apply and adapt
extreme value methods to the analysis of record data
for sporting events, focusing on athletics track events
and swimming competitions.
To illustrate the issues addressed in this paper, first consider the minimum record
process $\{Y_t\}$ which arises from a stationary sequence
$\{X_t\}$ of continuous random variables, i.e.\ $Y_t=\min(X_1, \ldots ,X_t)$.
Let $R_t=I(X_t<Y_{t-1})$ be an indicator
variable for the occurrence of a record at time $t$ and let $N_n$ be
the number of records up to time $n$, with $N_1=1$ by definition.
Then $R_t \sim \mbox{Bernoulli}(t^{-1})$ and
$E(N_n)=\sum^n_{t=1}E(R_t)= \sum^n_{t=1}\frac{1}{t}$, so for large $n$, $E(N_n)
\approx \log n + \gamma$, where $\gamma\approx 0.5772$ is Euler's constant.
For example, only about 12 records are expected in a sequence of 100,000
observations, so, as \citet{bar} point out, when studying records of
stationary processes we need to recognise that records are broken
rarely. Furthermore, there is a decreasing trend in the $\{Y_t\}$
process as $Y_{t+1}\le Y_{t}$ for all $t\ge 1$.
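
As a quick illustration of how slowly records accrue for a stationary series, the
following minimal simulation sketch (in Python; it is not part of the analysis in
this paper) estimates $E(N_n)$ for an iid sequence of length $n=1000$, for which
$\log 1000 + \gamma \approx 7.5$:
\begin{verbatim}
# Sketch: empirical check of E(N_n) ~ log(n) + gamma for records of an
# iid (hence stationary) sequence of continuous random variables.
import numpy as np

rng = np.random.default_rng(1)
n, nrep = 1000, 5000
counts = np.empty(nrep)
for k in range(nrep):
    x = rng.normal(size=n)              # any continuous iid sequence will do
    y = np.minimum.accumulate(x)        # running minima Y_t
    # a record occurs whenever the running minimum strictly decreases
    counts[k] = 1 + np.sum(y[1:] < y[:-1])
print(counts.mean())                    # close to log(1000) + 0.5772 = 7.49
\end{verbatim}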
\begin{figure}[!ht]
\begin{center}
\includegraphics[scale=0.3]{C0Fig1g.pdf}
\end{center}
\caption{
Plot A - women's athletics 3000 m with annual minima (circle) and
annual record data (dot) from 1972 to 1992;
Plot B - women's athletics 1500 m with annual minima (circle) and
annual record data (dot) from 1972 to 1992.}
\label{dat12}
\end{figure}
If an analysis is to be undertaken from a series of record data
$\{Y_t\}$ alone,
then at first sight it may appear that the information in the series
is limited, as many of the values in the series will be identical.
The few values in the series where the record changes, i.e.\
$Y_t<Y_{t-1}$, tell us directly that $X_t=Y_t$.
However, the record remaining unchanged also provides information about
the underlying $\{X_t\}$ process, as $Y_t=Y_{t-1}$ tells us that $X_t>Y_{t-1}$;
this is censored information about the underlying $\{X_t\}$ process.
Despite this censored information, for stationary series
information about the record process accumulates very slowly,
because when the probability of not breaking the record is high there
is very limited information in the censored data.
As sports have become increasingly professional, training techniques
have been refined, dietary knowledge has improved, and more people
are training to a high standard in every sport, so
it is clear that the underlying marginal distribution of
elite performance in sport has changed. This is seen in the
athletics and swimming data we study. Figure~\ref{dat12}
shows the annual best times achieved in recognised
international competitions over the period 1972--1992 for
the women's athletics 1500 m and 3000 m track events, see
\cite{rob}. The record process, derived from these
annual data, is also shown on the figure. The number of records
being broken is greater than would be expected if the annual best
performance were stationary over time.
\begin{figure}[h]
\psfrag{Alpha}{$\alpha$}
\psfrag{male}{men's}
\begin{center}
\includegraphics[width=\linewidth]{C0Record1.pdf}
\end{center}
\caption{
The men's Olympic 400 m freestyle gold medallists' times (open dots) and
the men's actual world record data for the 400 m freestyle (dark dots) from 1912 to 2004.}
\label{Fig7Chap1}
\end{figure}
Similarly, Figure~\ref{Fig7Chap1} shows the FINA world record series for the men's
400 m freestyle swimming event plotted against the actual times in the
years when the records were achieved.
Unlike the athletics data, this
type of record data shows when the record has
been broken more than once in a year.
Again the number of records is greater than would be expected
if the series were stationary, indicating a trend in elite performances.
We call the two types of data
annual record data and actual record data respectively.
As indicated earlier, the record process exhibits a trend even when
the underlying process is stationary. We can tell from the number of
records in the actual record data that there must be non-stationarity
in the underlying process, but separating the
component of the trend due to the record structure from that due to
non-stationarity is not possible from information on the record
process alone. For the annual record data in athletics we also have
annual best data, which allow us to identify the separation as the
records decrease over time faster than the mean of the annual best
data.

Most often data on records take the form of actual record data,
as annual record data exclude records if the record is
broken more than once in a year. Furthermore, annual best data (from
which annual record data are typically derived)
are difficult to find for early periods in the history of an
event, whereas all actual records are well documented.
As annual best data, and their associated annual record data,
provide information on both the underlying process and
the record process, we study these data first;
this helps to identify the loss of
information incurred in analysing annual records rather than annual best
data, and it identifies the need for some independent information on the
trend in the underlying series.
As the data are minimum times, the appropriate distribution for
the annual best is the generalised
extreme value distribution for minima, GEVM($\mu,\sigma,\xi$),
which has distribution function
\begin{align}
H(x) = 1 -\exp\left\{-\left[ 1 -
\xi\left(\frac{x-\mu}{\sigma}\right) \right]^{-1/\xi}_+\right\}
\label{C5eqn00}
\end{align}
with a location parameter $\mu \in \mathbb{R}$, a scale parameter
$\sigma > 0$, a shape parameter $\xi \in \mathbb{R}$ and
$[y]_+=\max(y,0)$, see \citet{Coles}. The negative
Gumbel is given by the GEVM distribution with $\xi \rightarrow 0$, the
negative Fr\'echet by $\xi >0$ and the Weibull by $\xi<0$.
The justification for this
distribution is that it is the only possible non-degenerate
limiting distribution for linearly normalised minima of stationary
sequences, see \citet{Rootzen1983}.
Stationarity over a year may be a reasonable approximation to
justify the choice of the GEVM distribution, but the obvious
non-stationarity in the annual best times
suggests that at least some of the parameters of the GEVM
need to change over time, as in the modelling framework of
\citet{Davison1990} and \citet{Coles}.
Here we assume that the GEVM location parameter $\mu(t)$ varies
smoothly over time $t$, with $\sigma$ and $\xi$ held
constant, i.e.\ the non-stationarity is purely a location shift
in the elite performance in a year. This is justified by
findings in \citet{Adam2007c} and \citet{Adam2007b} for
data of this type more generally; moreover,
given the limited data in our applications
it is unlikely that we could find
evidence that the scale and shape parameters change over time.
For the athletics example we take $\mu(t)$
to be a simple exponential decay, but for the swimming data no simple
parametric model seems appropriate so we use non-parametric methods
for estimating $\mu(t)$.
We study annual records using the fact that they
are simply a partially censored series of annual best data.
Following the study of annual records we
then develop a general limiting Poisson process
characterisation for modelling the actual record data.
Specifically, point process results for extremes, \cite{Pickands1971}
and \cite{Smith1989}, are adapted to record values,
recognising that records are a form of censored extreme value data.
Again the non-stationarity of the underlying data
needs to be addressed.
Evidence from the analysis of annual records shows the need
for information on the trend in elite performance (in the form of
annual best data or something equivalent) to supplement the
information from the annual records in order to obtain any form of useful
inference. In order to make inferences from the actual record data
we therefore need additional data that provide information about the
trend of the underlying series. Annual best data are generally not
available for other record series, so some form of proxy data is required.
For the actual record data for swimming
we use the Olympic gold medallists' times as proxy data for inference
on the trend in elite performances.
The structure of this paper is as follows. In Sections~2 and 3 we
derive models and inference methods for annual record data and
actual record data respectively. As a large proportion of the data are censored,
the censoring needs to be accounted for not just in the fitting but also
in the goodness-of-fit methods for both approaches. Section~\ref{Chap5Sec4}
describes how additional data are used to estimate the trend
non-parametrically. We then apply the theory to athletics data in
Section~\ref{SecAth} and swimming data in Section~\ref{SecSwim}.
\section{Models and Inference for Annual Records}
If $Z_1, \ldots, Z_n$ is the sequence of annual minima from $\{X_t\}$,
with $Z_t$ the annual minimum in year $t$, then we define $Y_t = \min (
Z_1, Z_2, \ldots , Z_t)$ for $t \geq 1$, so that $\{Y_t\}$ is the annual
record process. We assume that the $\{Z_t\}$ are independent with $Z_t \sim
\mbox{GEVM}(\mu(t),\sigma,\xi)$, so that $\mu(t)$, $t=1,\ldots,n,$
$\sigma$ and $\xi$ determine the distribution of the annual record
process. Suppose that we have data for $\{Y_t\}$ but not
$\{Z_t\}$. When $Y_t < Y_{t-1}$, i.e.\ a record occurs, then $Y_t=Z_t$
is an observation from a \mbox{GEVM}$(\mu(t),\sigma,\xi)$
variable. When $Y_t=Y_{t-1}$, i.e.\ no record occurs, then $Z_t>Y_t$, so
$Z_t$ is a \mbox{GEVM}($\mu(t),\sigma,\xi$) variable censored below at
$Y_t$. Consequently, the log-likelihood function for the record sequence
$\pmb{y}=(y_1,\ldots,y_n)$ is
\begin{eqnarray}
\ell (\pmb{y};\pmb{\mu},\sigma,\xi) & = & - \sum^n_{t=1}I_t \log \sigma -
\sum^n_{t=1} I_t \left(1 + \frac{1}{\xi}\right)\log \Psi_t \nonumber \\
& &-\sum^n_{t=1}I_t \Psi_t^{-1/\xi}
-\sum^n_{t=1}(1-I_t)\Psi_t^{-1/\xi}
\label{eq:rec1}
\end{eqnarray}
where $\Psi_t = \left[1-\xi\left(\frac{y_t-\mu(t)}{\sigma}\right)\right]_+ $,
with $I_t = I(Y_t < Y_{t-1})$ and
$\pmb{\mu}=\left(\mu(1),\ldots,\mu(n)\right)$.
If $\mu(t)$ can be specified parametrically then maximum likelihood
estimates are obtained by maximizing (\ref{eq:rec1}) directly. However, if
$\mu(t)$ is specified only as a smooth function, we use a penalized
log-likelihood function for the GEVM distribution of the form
\begin{equation}
\label{eq:c41d1}
\ell(\pmb{y};\pmb{\mu},\sigma,\xi)-\frac{\lambda}{2}\pmb{\mu}^TK\pmb{\mu},
\end{equation}
where $K$ is a symmetric $n
\times n$ matrix of rank $n-2$, defined in \citet{Green1994}, which
depends only on $(1,2,\ldots,n)$,
and $\lambda$ is a smoothing parameter which we select using the
AICc criterion. The second term in the penalized log-likelihood (\ref{eq:c41d1})
imposes a roughness penalty on the $\pmb{\mu}$ values. We maximize
Equation~(\ref{eq:c41d1}) to obtain the estimates $\hat{\pmb{\mu}}$,
$\hat{\sigma}$ and $\hat{\xi}$ using the Fisher scoring method
explained in \citet{Adam2007}.
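
For illustration, a minimal numerical sketch of the likelihood in
Equation~(\ref{eq:rec1}) for a parametric trend is given below (in Python, using the
exponential-decay trend adopted later in Section~\ref{SecAth}; the data layout is
hypothetical and this is not the code used for our analyses, which rely on Fisher
scoring):
\begin{verbatim}
# Sketch: negative log-likelihood of the censored record likelihood above,
# for the parametric trend mu(t) = alpha - beta*(1 - exp(-gamma*t)).
import numpy as np
from scipy.optimize import minimize

def neg_loglik(theta, y, is_record, t):
    alpha, beta, gamma, sigma, xi = theta
    if sigma <= 0:
        return np.inf
    mu = alpha - beta * (1.0 - np.exp(-gamma * t))
    psi = 1.0 - xi * (y - mu) / sigma            # [1 - xi (y - mu)/sigma]
    if np.any(psi <= 0):
        return np.inf
    dens = -np.log(sigma) - (1.0 + 1.0/xi) * np.log(psi) - psi**(-1.0/xi)
    cens = -psi**(-1.0/xi)                       # log Pr(Z_t > y_t)
    return -np.sum(np.where(is_record, dens, cens))

# y: record series; is_record[t] = (y[t] < y[t-1]); t: years since base year
# fit = minimize(neg_loglik, theta0, args=(y, is_record, t), method="Nelder-Mead")
\end{verbatim}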
As the estimated parameters determine the distribution of $\{Y_t\}$
through $\{Z_t\}$, we need to derive the distribution of
$\{Y_t\}$. When the $\{Z_t\}$ are independent and identically distributed
(IID) then
\begin{align}
\Pr( Y_t > y ) & = \Pr(Z_1 > y)\Pr(Z_2 > y) \cdots \Pr(Z_t > y) \nonumber \\
& = \exp\left\{
-\left[1-\xi\left(\frac{y-\mu^*_t}{\sigma^*_t}\right)\right]
^{-1/\xi}_+\right\},
\end{align}
where $\mu^*_t = \mu - \frac{\sigma}{\xi}(t^\xi-1)$ and $\sigma^*_t =
\sigma t^\xi$, so that $Y_t \sim \mbox{GEVM}(\mu^*_t,\sigma^*_t,\xi)$ and
the expected value of $Y_t$ is $E(Y_t) = \mu^*_t + \frac{\sigma^*_t}{\xi}[1 -
\Gamma(1-\xi)].$ For the non-stationary case $\Pr(Y_t>y)$ is
more complex, so we derive the distribution of $\{Y_t\}$ by
simulation from $\{Z_t\}$.
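
A minimal sketch of this simulation approach (assuming fitted values of $\mu(t)$,
$\sigma$ and $\xi$ are available; it simply inverts the GEVM distribution function
and takes running minima) is:
\begin{verbatim}
# Sketch: approximate the distribution of the annual record Y_t in the
# non-stationary case by simulating Z_t and taking running minima.
import numpy as np

def rgevm(mu, sigma, xi, rng):
    """One GEVM(mu, sigma, xi) draw per element of mu, by inverting H."""
    u = rng.uniform(size=np.shape(mu))
    # H(z) = 1 - exp(-[1 - xi (z - mu)/sigma]^(-1/xi)), solved at H(z) = u
    return mu + sigma / xi * (1.0 - (-np.log(1.0 - u)) ** (-xi))

def simulate_records(mu, sigma, xi, nrep=10000, seed=0):
    rng = np.random.default_rng(seed)
    z = np.array([rgevm(mu, sigma, xi, rng) for _ in range(nrep)])
    return np.minimum.accumulate(z, axis=1)   # each row is one record path Y_t

# y_paths = simulate_records(mu_hat, sigma_hat, xi_hat)
# np.mean(y_paths, axis=0) then estimates E(Y_t) for each year t
\end{verbatim}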
We can assess the fit of the model through comparison of $Y_m$ with its
estimated expected value $\hat{E}(Y_m)$. This comparison is
complicated by the strong dependence in $\{Y_t\}$. Instead we prefer
to exploit the property that if $Z_t \sim \mbox{GEVM}(\mu(t),\sigma,\xi)$ then
$Z_t = \mu(t) + E_t$, where $\mu(t)$ is the trend in $Z_t$ and $E_t
\sim \mbox{GEVM}(0,\sigma,\xi)$, so that the residuals
$E_t=Z_t-\mu(t)$, $t=1,\dots,n$, are IID. We use
estimated values of $\{E_t\}$ to construct
P-P and Q-Q plots for assessing the goodness of fit of the record
progression data.

The initial step is to obtain the $E_t$ values.
If $Y_t$ is a record then $Z_t =Y_t$, and we get $E_t = Y_t -
\mu(t)$. Otherwise, if $Y_t$ is not a record,
then $Z_t > Y_{t-1}$ so $E_t > Y_t - \mu(t)$.
Data on $\{E_t\}$ are therefore censored whenever no record is broken. To
estimate the distribution of $E$ we need to account for the censoring,
the standard approach being the Kaplan-Meier estimator. Let
$e_{(1)}<\ldots<e_{(m)}$ denote the residuals of the observed records
and $d_i$ be the number of residuals (observed or censored) with $E_t>e_{(i)}$.
The Kaplan-Meier estimate of the survivor function of $E$ is
$\hat{S}_{KM}\left(z\right)=\prod_{i:e_{(i)}<z} \left(1 -
\frac{1}{d_i}\right)$.
The model-based
estimate of the distribution function of $E$ is $
\hat{H}(z) =1- \exp \left\{ -\left [ 1-\hat{\xi} \left(
\frac{z}{\hat{\sigma}} \right) \right ]^{-1/\hat{\xi}}_+
\right \}.
$
By comparing the model-based and Kaplan-Meier estimates, P-P and Q-Q plots
can be constructed.
Let $e^{*}_{(j)}$ be the
$j$th largest of $e_1,\ldots,e_n$ (i.e.\ observed and censored values
together); then the P-P plot is constructed by plotting
\[\hat{H}(e^{*}_{(i)}) \quad \mbox{against} \quad 1-\hat{S}_{KM}(e^{*}_{(i)}) \quad \mbox{for} \quad i=1,\ldots,n, \]
and the Q-Q plot, on the exponential quantile scale, for the $\{e_i\}$ plots
\[ -\log \left[1-\hat{H}(e^{*}_{(i)})\right] \quad \mbox{ versus}\quad
-\log\left[\hat{S}_{KM}(e^{*}_{(i)})\right] \quad \mbox{for} \quad
i=1,\ldots,n .\]
The exponential quantile transformation maps the lower tail of the minima
into the upper tail of the plot.
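
The following sketch summarises the computations behind these plots (assuming, as in
our applications, that $\hat{\xi}<0$; the variable names are illustrative only):
\begin{verbatim}
# Sketch: censoring-adjusted residuals and the Kaplan-Meier survivor
# estimate used in the P-P and Q-Q plots for annual record data.
import numpy as np

def km_survivor(e, observed):
    """Product-limit estimate of Pr(E > z), just above each sorted residual."""
    order = np.argsort(e)
    e, observed = e[order], observed[order]
    at_risk = len(e) - np.arange(len(e))       # residuals still >= e_(i)
    factors = np.where(observed, 1.0 - 1.0 / at_risk, 1.0)
    return e, np.cumprod(factors)

def pp_qq_points(y, mu_hat, sigma_hat, xi_hat):
    e = y - mu_hat                             # = E_t at a record, else a lower bound
    observed = np.r_[True, y[1:] < y[:-1]]     # censored when no record is broken
    e_s, s_km = km_survivor(e, observed)
    s_mod = np.exp(-np.maximum(1 - xi_hat * e_s / sigma_hat, 0.0) ** (-1.0 / xi_hat))
    # P-P: plot 1 - s_mod against 1 - s_km; Q-Q: -log(s_mod) against -log(s_km)
    return 1 - s_mod, 1 - s_km, -np.log(s_mod), -np.log(s_km)
\end{verbatim}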
\section{Models and Inference for Actual Records}
\subsection{Point process model}
First consider a stationary sequence $\{X_t\}$. We define the point
process $P_n = \{[i/(n+1),(X_i-b_n)/a_n]:i=1,\ldots,n\},$
where the sequences of constants $\{a_n > 0\}$ and $\{b_n\}$ are such that
\[\Pr\left[\frac{\min(X_1,\ldots,X_n)-b_n}{a_n} > x \right] \xrightarrow{n
\rightarrow \infty} 1- H(x)\]
with $H(x)$ a non-degenerate limit distribution. Then
$H(x)$ is the GEVM$(\mu,\sigma,\xi)$ distribution given by
Equation~(\ref{C5eqn00}), see \citet{Rootzen1983}.
Let $P$ be the limit of $P_n$ as $n \rightarrow \infty$. Let $x^F =
\sup\{x:F_X(x)<1\}$, where $F_X$ is the distribution function of
$\{X_t\}$, and let $b_u = \lim_{n \rightarrow \infty} (x^F-b_n)/a_n$. For
$A \subseteq [0,1]\times(-\infty,b_u)$ let $N_n(A)$
be the number of points of $P_n$ that fall in $A$ and $N(A)$ be the
number of points of $P$ in $A$. Following \cite{Pickands1971} and
\cite{Smith1989} it follows that $P$ is a non-homogeneous Poisson process
with intensity
\begin{equation}
\lambda(t,r) = \frac{1}{\sigma}\left[
1-\xi\left(\frac{r-\mu}{\sigma}\right) \right]_+^{-1-1/\xi}.
\label{lambdatr}
\end{equation}
Now, if we let $A=[t_0,t_1] \times (-\infty,r)$, with $0<t_0<t_1<1$ and
$r < b_u$, the limiting distribution of $N_n(A)$ is
Poisson$\big(\Lambda(A)\big)$ with
$\Lambda(A)=(t_1-t_0)\left[1-\xi\left(\frac{r-\mu}{\sigma}\right)\right]_+^{-1/\xi}$.
Now consider the non-stationary case with location parameter
$\mu(t)$. The intensity of the non-homogeneous Poisson
process is then $ \lambda(t,r) =
\frac{1}{\sigma}\left[1-\xi\left(\frac{r-\mu(t)}{\sigma}\right)\right]_+^{-1-1/\xi}$,
where time has been rescaled onto $[0,1]$ in $\mu(t)$.
Suppose that between times $t_0$ and $t_{m+1}$ the times and values of a
sequence of $m$ consecutive actual records are $(t_1,r_1), \ldots,
(t_m,r_m)$, with $r_i$ the record value set at time $t_i$ and
$ r_1 > r_2 > \cdots > r_m$. It is assumed that the current record
at time $t_0$ is $r_0$. We derive the likelihood for this actual
record sequence in stages. Using the Poisson process, the
probability of no new record in the time interval $(t_0,t_1)$ is
\[\exp \left [-\int^{t_1}_{t_0}\int_{-\infty}^{r_0}\lambda(t,r)dr dt
\right] =
\exp \left \{-\int^{t_1}_{t_0}\left [1-\xi\left ( \frac{r_0 -
\mu(t)}{\sigma}\right)\right]^{-1/\xi}_+ dt \right\}. \]
The probability of one record in the value interval $(r_i - \delta r,r_i)$
during the time interval $(t_i,t_i + \delta t)$ is approximately
\[ \lambda(t_i,r_i)\delta r \delta t \exp \left\{-\int^{t_i
+ \delta t}_{t_i} \int^{r_i}_{r_i-\delta
r}\frac{1}{\sigma}\left[1-\xi \left ( \frac{r -
\mu(t)}{\sigma}\right)\right]^{-1-1/\xi}_+ dr dt \right\} \]
for small $\delta r > 0$ and $\delta t > 0$.
The probability of no record in the time interval $(t_i + \delta t,
t_{i+1})$ is
\[ \exp \left\{ -\int^{t_{i + 1}}_{t_i + \delta t}
\int^{r_i}_{-\infty}\frac{1}{\sigma}\left[1-\xi\left ( \frac{r -
\mu(t)}{\sigma}\right)\right]^{-1-1/\xi}_+ dr dt \right\}.\]
So the joint probability of a record $r_i$ at time $t_i$ and then no
record until $t_{i+1}$ is proportional to
\[ \lambda(t_i,r_i)\exp\left\{ - \int^{t_{i
+1}}_{t_i}\int^{r_i}_{-\infty}\frac{1}{\sigma} \left[1-\xi\left ( \frac{r -
\mu(t)}{\sigma}\right)\right]^{-1-1/\xi}_+ dr dt \right\}. \]
The likelihood for the record progression from $t_0$ to $t_{m+1}$ is
proportional to
\begin{equation}
\label{eq:c41c}
\left [\prod^m_{i=1} \lambda(t_i,r_i) \right ]\exp \left \{-\sum^m_{i=0}
\int^{t_{i+1}}_{t_i}
\left [1-\xi\left(\frac{r_i
-\mu(t)}{\sigma}\right) \right ]^{-1/\xi}_+ dt \right\}.
\end{equation}
As $\mu(t)$ changes slowly we expect the integrand in Equation~(\ref{eq:c41c}) to also
change smoothly with $t$, so for numerical simplicity we
approximate each integral in the log-likelihood by a Riemann sum.
We partition the interval from $t_i$ to $t_{i+1}$ into $k_i$ sub-intervals
of equal length $\varDelta$, with grid points
$s_{i,j}=t_i+(j-1)\varDelta$ for $j=1,\ldots, k_i$,
to give
\begin{equation} \sum^m_{i=0}
\sum^{k_i}_{j=1}\varDelta \left [
1-\xi\left ( \frac{r_i
-\mu(s_{i,j})}{\sigma}\right) \right]^{-1/\xi}_+ .
\label{eq:jointfunction}
\end{equation}
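
For illustration, a sketch of the resulting approximate negative log-likelihood is
given below (in Python; the trend $\mu(t)$ is assumed to be available as a vectorised
function, and this is not the implementation used for the analyses):
\begin{verbatim}
# Sketch: negative log-likelihood for the actual record data, with each
# integral replaced by the Riemann sum above.  tt = (t_0, ..., t_{m+1}),
# r = (r_0, ..., r_m), mu = vectorised trend function.
import numpy as np

def neg_loglik_records(params, tt, r, mu, delta=0.01):
    sigma, xi = params
    if sigma <= 0:
        return np.inf
    nll = 0.0
    for ti, ri in zip(tt[1:-1], r[1:]):        # the m records (t_i, r_i)
        psi = 1.0 - xi * (ri - mu(ti)) / sigma
        if psi <= 0:
            return np.inf
        nll += np.log(sigma) + (1.0 + 1.0 / xi) * np.log(psi)   # -log lambda(t_i, r_i)
    for i in range(len(tt) - 1):               # r_i is the threshold on (t_i, t_{i+1})
        s = np.arange(tt[i], tt[i + 1], delta)
        psi = np.maximum(1.0 - xi * (r[i] - mu(s)) / sigma, 0.0)
        nll += delta * np.sum(psi ** (-1.0 / xi))
    return nll
\end{verbatim}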
As in Section~2, non-parametric inference for $\mu(t)$ can be obtained
by penalized likelihood with a penalty term identical to that in Equation~(\ref{eq:c41d1}).
\subsection{Constructing tolerance bands}
In order to develop methods for diagnosing whether the model
is acceptable for the actual record data, we need to deal
with two aspects: the time at which a record occurs, $t$, and the
value of the new record, $r$, given that a record occurs at time $t$.
Our first approach is to generate tolerance intervals for the record
progression under the assumption that the fitted model is correct.
In constructing such tolerance intervals, the fitted model is used
to simulate replicates of the record progression data.
The resulting pointwise tolerance intervals provide
a region of record progression values which
can be used as a criterion for judging whether the fitted model
is acceptable when compared with the observed record progression:
the observed record progression should lie within the region if the
model is adequate.

Critical to this approach is the ability to simulate a
record progression from the fitted model. For efficiency we
only want to simulate record times and values. Given a current
record time $t_j$ and a current record value $r_j$, the record
progression is generated in two stages: the first stage is
to generate the next record time $t_{j+1} \,(t_{j+1}>t_j)$
and the second stage is to generate the next record value
$r_{j+1} \,(r_{j+1}<r_j)$. Furthermore, we need an initialisation stage
given $(t_0,r_0)$ and a rule for terminating the simulation.
We give details of all these stages below.
Fundamental to our approach in each case is the use of the limiting
non-homogeneous Poisson process $P$ with intensity given by
Equation~(\ref{lambdatr}).
First we derive the waiting time distribution for the next record
given that a record value of $r_j$ occurred at time $t_j$. Let this
waiting time random variable be $W_{j+1}$. Then
$
\Pr(W_{j+1}>w)=\Pr( \mbox{ no points of $P$ in }A_j)
$
where $A_j = [t_j,t_j+w] \times (-\infty,r_j)$ and
\begin{eqnarray*}
\Pr(\mbox{ no points of $P$ in }A_j)
& = & \exp\{-\Lambda(A_j)\} = \exp
\left[-\int^{t_{j}+w}_{t_j}\int^{r_j}_{-\infty}\lambda(s,r)dr\,ds
\right].
\end{eqnarray*}
To simulate $T_{j+1}$, the time of the $(j+1)$th record,
we use the probability integral transformation applied to
the survivor function of $W_{j+1}$. Specifically, if $u$ is a
realisation of $U \sim U(0,1)$, then solving $ u=\Pr(W_{j+1}>w)$
gives a realisation $w$ from the waiting time distribution, and the next
record time $t_{j+1}=t_j+w$ is a realisation of $T_{j+1}$.
Hence we solve for $t_{j+1}$ in
$
u = \exp \left [-\int^{t_{j+1}}_{t_j} \int^{r_j}_{-\infty} \lambda(s,r) dr ds\right],
$
that is,
\begin{eqnarray}
\label{eq:twoA}
-\log u & = & \int_{t_{j}}^{t_{j+1}} \int^{r_{j}}_{-\infty} \lambda(s,r) dr ds \approx \sum^{k_{j}+1}_{i=1}\varDelta
\left \{1-\frac{\xi}{\sigma}[r_{j}-\mu(s_i)]\right\}^{-1/\xi}_+,
\end{eqnarray}
where $s_1,\ldots,s_{k_j+1}$ are grid points of spacing $\varDelta$ spanning $(t_j,t_{j+1}]$.
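
A sketch of this first stage (accumulating the Riemann sum in
Equation~(\ref{eq:twoA}) until it reaches $-\log u$; for $\xi<0$, as in our
applications, grid points where the bracket is non-positive contribute nothing) is:
\begin{verbatim}
# Sketch: simulate the next record time t_{j+1} given (t_j, r_j).
import numpy as np

def next_record_time(t_j, r_j, mu, sigma, xi, delta, t_max, rng):
    target = -np.log(rng.uniform())            # -log u with u ~ U(0,1)
    total, t = 0.0, t_j
    while t < t_max:
        psi = 1.0 - xi * (r_j - mu(t)) / sigma
        if psi > 0:                            # [.]_+ convention, xi < 0 assumed
            total += delta * psi ** (-1.0 / xi)
        t += delta
        if total >= target:
            return t                           # first grid time reaching -log u
    return None                                # no further record before t_max
\end{verbatim}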
The second stage is to derive the distribution of the decrease $V_{j+1}$ in the
record value given that the $(j+1)$th record occurs at time $t_{j+1}$.
The previous record value $r_j$ is now a threshold
which any further observation must fall below to become the next record
value $r_{j+1}=r_j-v$, for $v>0$. The probability of a decrease in record
level greater than $v$ at time $t_{j+1}$ is
$
\Pr(V_{j+1}>v |T_{j+1}=t_{j+1})= \Pr(R_{j+1} < r_{j}-v |R_{j+1}< r_j,
T_{j+1}=t_{j+1}),
$
where $R_{j+1}$ is the new record value at $t_{j+1}$. From the Poisson
process intensity,
\[
\Pr(R_{j+1} < r_{j}-v|R_{j+1}< r_j, T_{j+1}=t_{j+1})
= \left\{1+\xi \left[ \frac{v}{\sigma - \xi(r_j-\mu(t_{j+1}))}
\right]\right\}^{-1/\xi}_+,\]
so the decrease $V_{j+1}$ follows a generalised Pareto type distribution
with scale parameter $ \tilde{\sigma} = \sigma -
\xi(r_j - \mu(t_{j+1})).$ To simulate $r_{j+1}$ we use the probability
integral transform applied to the survivor function of $V_{j+1}$: for a
realisation $u$ of $U \sim U(0,1)$, solving $ u= \Pr(V_{j+1} > v\mid T_{j+1}=t_{j+1}) $
gives a realisation $v$ of the decrease, and the next record value is
$r_{j+1} = r_{j} - v$. Hence, solving for $v$ in
$ u = \left\{1+\xi \left[ \frac{v}{\sigma - \xi(r_j-\mu(t_{j+1}))}
\right]\right\}^{-1/\xi}_+,$
we obtain
\begin{equation}
\label{eq:threeB}
r_{j+1} = r_{j} + \left[\frac{\sigma}{\xi}-(r_{j}-\mu(t_{j+1}))\right](1-u^{-\xi}).
\end{equation}
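
The corresponding sketch of the second stage is a direct transcription of
Equation~(\ref{eq:threeB}):
\begin{verbatim}
# Sketch: simulate the next record value r_{j+1} given r_j and t_{j+1}.
def next_record_value(t_next, r_j, mu, sigma, xi, rng):
    u = rng.uniform()
    scale = sigma / xi - (r_j - mu(t_next))    # = sigma_tilde / xi
    return r_j + scale * (1.0 - u ** (-xi))
\end{verbatim}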
The algorithm for constructing the tolerance bands is as follows:
\texttt{\begin{enumerate}
\item Initiate $\sigma = \hat{\sigma}_{rec}$, $\xi = \hat{\xi}_{rec}$
and $\pmb{\mu} = \hat{\pmb{\mu}} =
\{\hat{\mu}_1,\ldots,\hat{\mu}_n\}$.
\item Select $r_0 \,(\geq r_1)$ and $t_0 \,(\leq t_1)$.
\item First stage: simulate $t_{j+1}$ from Equation (\ref{eq:twoA})
using $t_j$ and $r_j$.
\item Second stage: simulate $r_{j+1}$ from Equation
(\ref{eq:threeB}) using $r_j$ and $t_{j+1}$.
\item If $t_{j+1} < T$, where $T$ is the maximum time of interest,
repeat steps 3 and 4; stop if $t_{j+1}\geq T$.
\end{enumerate}}
We replicate the algorithm above $k$ times to obtain $k$ realisations of
the record progression process under the fitted model, and use pointwise
quantiles of these realisations as upper and lower bounds for possible
record values. If the observed record progression falls
reasonably within these pointwise acceptance bounds, this indicates that
the estimated parameter values are appropriate for the selected
model and that the model is a reasonable description of the data.
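
A sketch that assembles the two stages above into $k$ simulated record progressions
and extracts pointwise bounds on a common time grid is given below (it assumes the
next\_record\_time and next\_record\_value functions sketched earlier; here 2.5\% and
97.5\% pointwise quantiles are used, although the simulated minimum and maximum, as
in Section~\ref{SecSwim}, can be substituted):
\begin{verbatim}
# Sketch: k simulated record progressions and pointwise tolerance bands.
import numpy as np

def simulate_progression(t0, r0, mu, sigma, xi, delta, t_max, rng):
    times, values = [t0], [r0]
    while True:
        t_new = next_record_time(times[-1], values[-1], mu, sigma, xi,
                                 delta, t_max, rng)
        if t_new is None:
            break
        values.append(next_record_value(t_new, values[-1], mu, sigma, xi, rng))
        times.append(t_new)
    return np.array(times), np.array(values)

def tolerance_bands(k, t0, r0, mu, sigma, xi, delta, t_max, grid, seed=0):
    rng = np.random.default_rng(seed)
    paths = np.empty((k, len(grid)))           # grid times assumed >= t0
    for j in range(k):
        t, r = simulate_progression(t0, r0, mu, sigma, xi, delta, t_max, rng)
        idx = np.searchsorted(t, grid, side="right") - 1
        paths[j] = r[idx]                      # record level prevailing at each grid time
    return np.quantile(paths, [0.025, 0.975], axis=0)
\end{verbatim}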
As in Section~2, if $Z_i \sim \mbox{GEVM}(\mu_i,\sigma,\xi)$ then $Z_i = \mu_i +
E_i,$ where $\mu_i $ is the trend in $Z_i$ and $E_i \sim
\mbox{GEVM}(0,\sigma,\xi)$, so that separating the trend from the data
gives residuals $\{E_i\}$, $i=1,\dots,n$, which are independent and
identically distributed. The $E_i$ are used to
construct the P-P and Q-Q plots for assessing the goodness of fit of
the record progression data.

The initial step is to obtain the $E_i$ values.
If $Y_i$ is a record then $y_i < y_{i-1}$ and $Z_i =
Y_i$, so we get $E_i = Y_i - \mu_i$. Otherwise, if $Y_j$ is not a record
then $y_j = y_{j-1}$ and $Z_j > Y_{j-1}$, so we only know that $E_j > y_{j-1} -
\mu_{j-1}$.
Data on $E_j$ are therefore censored whenever no record is broken. To
estimate the distribution of $E$ we need to account for the censoring,
with the standard approach being the Kaplan-Meier estimator. We first
compute the Kaplan-Meier estimate of the empirical survivor function of
$E$, $KM_j=\prod_{i \le j} \left(1 -
\frac{d_i}{n_i}\right) $, where $d_i $ is the number of records broken at
time $i$ and $n_i$ is the number of residuals still at risk at time $i$.
The model-based
estimate of the survivor function of $E$ is $
\hat{S}(z) = \exp \left\{ -\left [ 1-\hat{\xi} \left(
\frac{z}{\hat{\sigma}} \right) \right ]^{-1/\hat{\xi}}_+
\right \} = 1 - \hat{H}(z).
$
Plotting the Kaplan-Meier estimate against the model-based survivor
probability, or transformations of these quantities, gives the P-P and
Q-Q plots. Let $e^{*}_j$ be the
$j$th largest of $e_1,\ldots,e_n$ (observed and censored values together).
The P-P plot is constructed by plotting
\[\hat{H}(e^{*}_i)=1-\hat{S}(e^{*}_i) \quad \mbox{versus} \quad 1- KM_i \quad \mbox{for} \quad i=1,\ldots,n. \]
As we are studying minima, the exponential quantile transformation places
the relevant tail in the upper right of the plot. The
Q-Q plot, on the exponential quantile scale, plots
\[ -\log \left[\hat{S}(e^{*}_i)\right] \quad \mbox{ versus}\quad -\log\left[ KM_i\right] \quad
\mbox{for} \quad i=1,\ldots,n .\]
In summary, to overcome the non-IID nature of the record data we separate
the trend from the data to obtain (approximately) IID residuals, and then
construct the P-P and Q-Q plots from these residuals.
\section{Estimating Trend from Additional Data }\label{Chap5Sec4}%Using Non-Parametric Method}
As the record data do not provide enough information to fully
separate the record progression from the trend, we need an additional
set of data from a similar type of process which can give
the extra information needed to achieve this separation. This additional
source of data should be representative of the whole process that
generates the record data.
\citet{Adam2007} showed the difficulties of analysing extreme
value data using parametric methods compared with non-parametric
methods. The non-parametric approach allows the data to ``speak
for themselves'' with fewer assumptions being made, and complex trends
are handled better by a non-parametric approach than by a
parametric one.
We assume that $Z_1,Z_2, \ldots $ are additional extreme value data
which can provide information on the location trend for
the record progression data. In this application we use
annual minima data of this kind.
\subsection{Non-parametric approach for GEVM}
%We let $Z_1,\ldots, Z_n $ be a sequence of iid, random variables of
%annual minima following the GEVM distribution. % of the form
%\[H(z) = 1 - \exp\{-[1-\xi(z-\mu)/\sigma]^{1/\xi}_+\}\]
%where the parameters are $\mu \in (-\infty,\infty)$ a location
%parameter, $\sigma \in (0,\infty)$ a scale parameter and $\xi \in
%(-\infty,\infty)$ a shape parameter, and $\{x\}_+ = \max(x,0)$.
To allow for non-stationarity we replace $\mu$ with a smooth
function $g$, as we estimate the trend in location
non-parametrically in this case. The density function
of the GEVM at time $t$ with smooth function $g(t)$ is then
\[h_t(z) =
\frac{1}{\sigma}\left[1-\xi\left(\frac{z-g(t)}{\sigma}\right)\right]^{-1-1/\xi}_+
\exp\left\{-\left[1-\xi\left(\frac{z-g(t)}{\sigma}\right)\right]^{-1/\xi}_+\right\}.\]
The penalized log-likelihood function for the GEVM distribution is
\begin{equation}
\label{eq:c41d}
\ell(\pmb{z};\pmb{g},\sigma,\xi)-\frac{\lambda_g}{2}\pmb{g}^TK\pmb{g}
\end{equation}
with $\pmb{g}=\big(g(t_1),\ldots,g(t_n)\big)= (g_1,\ldots,g_n)$ and
$\ell(\pmb{z};\pmb{g},\sigma,\xi)$ the log-likelihood for
$\pmb{z}=\{z_1,\ldots,z_n\}$, where
\begin{align}
\ell(\pmb{z};\pmb{g},\sigma,\xi) = & -n\log\sigma
-(1+1/\xi)\sum^n_{i=1}\log\left [1-\xi\left(\frac{z_i-g_i}{\sigma}
\right) \right]_+ \nonumber \\
& - \sum^n_{i=1}\left [1-\xi\left(\frac{z_i-g_i}{\sigma}\right)\right]^{-1/\xi}_+
\label{C4label2}
\end{align}
and $\lambda_g$, the smoothing parameter, is selected using the
AICc criterion. The penalty term in (\ref{eq:c41d}) smooths $\pmb{g}$, and
we maximize (\ref{eq:c41d}) to obtain the estimates $\hat{\pmb{g}}$,
$\hat{\sigma}$ and $\hat{\xi}$ using the
Fisher scoring method explained in \citet{Adam2007}. As our priority
is to model and analyse the record progression data, the only information
carried forward from this inference is the estimate $\hat{\pmb{g}}$.
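
For illustration, a sketch of the penalized fit is given below (in Python; the
roughness matrix $K$ of \citet{Green1994} is approximated here by a simple
second-difference penalty, and a generic optimizer stands in for the Fisher scoring
method actually used):
\begin{verbatim}
# Sketch: penalized negative log-likelihood for the annual-minima GEVM
# with a smooth location g, using a second-difference roughness penalty.
import numpy as np
from scipy.optimize import minimize

def penalised_nll(theta, z, lam):
    n = len(z)
    g, log_sigma, xi = theta[:n], theta[n], theta[n + 1]
    sigma = np.exp(log_sigma)
    psi = 1.0 - xi * (z - g) / sigma
    if np.any(psi <= 0):
        return np.inf
    nll = (n * np.log(sigma) + (1.0 + 1.0 / xi) * np.sum(np.log(psi))
           + np.sum(psi ** (-1.0 / xi)))
    d2 = np.diff(g, n=2)                        # second differences of g
    return nll + 0.5 * lam * np.sum(d2 ** 2)    # (lambda/2) g' K g with K = D'D

# theta0 = np.r_[np.full(len(z), z.mean()), np.log(z.std()), -0.1]
# fit = minimize(penalised_nll, theta0, args=(z, lam), method="L-BFGS-B")
\end{verbatim}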
\subsection{Non-parametric approach to estimate $\mu_t$ in a
Poisson process model of record progression}\label{NonparaPois}
To model $\mu(t)$, \citet{Adam2007} showed that a
non-parametric approach is preferable. When the non-parametric approach is
used, the $\mu(t)$ values in
Equation~(\ref{eq:jointfunction}) are replaced by the
smooth function $g$ evaluated at the grid points $s_{i,j}$ introduced in
Section~3, giving the log-likelihood
\begin{eqnarray}
\ell(\pmb{r};\pmb{g},\sigma,\xi) & = &
\sum^m_{i=1} \log \lambda(t_i,r_i)
- \sum^m_{i=0}\sum^{k_i}_{j=1}\varDelta \left [
1-\xi\left ( \frac{r_i
-g(s_{i,j})}{\sigma}\right) \right]^{-1/\xi}_+,
\label{loglike1}
\end{eqnarray}
where $\lambda(t_i,r_i)$ is evaluated with $\mu(t_i)$ replaced by $g(t_i)$
and where, as before, the interval from $t_{i}$ to $t_{i+1}$ is partitioned
into $k_i$ sub-intervals of equal length $\varDelta$ with grid points
$s_{i,1},\ldots,s_{i,k_i}$.
\section{Application to Athletics Annual Records}\label{SecAth}
The five fastest annual order statistics for the women's athletics 1500 m and
3000 m track events from recognised international events over the period
1972--1992 are used in this study, see \cite{rob}.
For this analysis we assume that the times run by every athlete in all
events are independent. There is a trend in both events, which we describe
by an exponential decay of the annual minima over time:
we use the trend $ \mu_t = \alpha - \beta[ 1- \exp(-\gamma t)]$,
where $\beta > 0$, $\gamma >0$ and $t$ is the year, taking 1971
as the base year. This model was used by \citet{rob}.
We define the record process as $Y_i =
\min(Z_{72},\ldots,Z_i)$ for $i = 72, \ldots,92$ (years 1972 to 1992), as in
\mbox{Figure~\ref{dat12}}, where
$Z_i \sim \mbox{GEVM}(\hat{\mu}_i,\hat{\sigma},\hat{\xi})$,
$i = 72, \ldots , 92,$ are independent but not
identically distributed random variables. The random variable for the
number of records broken over the 21 years is
$N_{21} = \sum^{92}_{i=72} I_i$, where $I_i = 1$ if
$ Z_i < \min(Z_{72},\ldots ,Z_{i-1})$ and is zero otherwise.
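
The Monte Carlo computation behind Figure~\ref{box1} can be sketched as follows
(reusing the simulate\_records function sketched in Section~2, with the fitted values
$\hat{\mu}_i$, $\hat{\sigma}$ and $\hat{\xi}$; a constant location vector gives the
iid case):
\begin{verbatim}
# Sketch: Monte Carlo distribution of N_21, as summarised in Figure 3.
import numpy as np

def number_of_records(y_paths):
    """Count records in each simulated record path (rows of y_paths)."""
    return 1 + np.sum(y_paths[:, 1:] < y_paths[:, :-1], axis=1)

# y_paths = simulate_records(mu_hat, sigma_hat, xi_hat, nrep=100000)
# n_rec = number_of_records(y_paths)     # Monte Carlo sample of N_21
\end{verbatim}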
\begin{figure}[!b]
\begin{center}
\scalebox{0.435}{\includegraphics[angle=360]{pic8.pdf}}
\vspace{- 0.5cm}
\end{center}
\caption{Box-plots of the number of records broken in the annual minima
for the iid case and the non-iid (trend) cases for the women's athletics
3000 m and 1500 m, evaluated by Monte Carlo simulation with 100,000 repetitions.}
\label{box1}
\end{figure}
Figure~\ref{box1} shows the distribution of the number of records in the
annual minima for three situations: first when the annual
minima are iid; second for annual minima with the trend and year-to-year
variation estimated for the women's athletics 1500 m data;
and finally with the trend and year-to-year variation estimated for the
women's athletics 3000 m data. In the iid case
it is possible for the record not to be broken over
long time intervals, and the distribution of the number of records is
skewed to the right. In the non-iid cases, at least
six broken records are predicted for the women's athletics 3000 m (five
for the women's athletics 1500 m).
This is because in the athletics data, which are not identically
distributed as $\hat{\mu}_i$ shows a clear trend early in the
period, the record tends to be broken frequently.
The box-plots in Figure~\ref{box1} summarise the distribution of the
number of broken records for the women's athletics 1500 m and 3000 m.
In the 1500 m track event the record was broken 3 times since 1972
and further broken records are predicted; for the women's
athletics 3000 m the record had been broken 6 times by 1992 and further
records are also predicted.
%Using empirically rescaled plot (ERP) see \citet{Waller1992}
%\[\hat{F}_u(y) \quad \mbox{versus} \quad \hat{F}_u(\hat{F}_0^{-1}[\hat{F}(y)])\]
We now consider the women's athletics 1500 m and 3000 m events of
Figure~\ref{dat12}, focusing on the record data only. Maximising the
log-likelihood in Equation~(\ref{eq:rec1}) gives the
parameter estimates in Table~\ref{tab:twotwo}. The 95\% confidence
intervals for $\xi$ for the
1500 m and 3000 m events are $(-5.585,5.018)$ and $(-2.196,2.156)$
respectively; these are unacceptably wide. When the
log-likelihood is maximized with $\alpha, \beta $ and
$\gamma$ fixed at their estimated values (obtained using all the annual
minima data), the standard error of the shape parameter is much reduced;
this reflects the fact that separating the trend from the record process
is hard if only the records are observed.
\begin{table}[h]
\caption{Parameter estimates (standard errors in parentheses) for the record
data for the women's athletics 1500 m and 3000 m track events from 1972 to 1992.}
\vspace{5mm}
\centering
\begin{tabular}{|r|ccccc|} \hline
Parameter & $\alpha$ & $\beta$ & $\gamma$ & $\sigma$ & $\xi$ \\ \hline
1500 m & 248(29) & 8.9(31.1) & 0.251(0.609) & 4.79(14.03) &
-0.567(2.546)\\ \hline
3000 m & 548(9) & 37.7(13.9) & 0.212(0.212) & 4.12(7.23) &
-0.020(1.110) \\ \hline
\end{tabular}
\label{tab:twotwo}
\end{table}
\begin{figure}[!hbtp]
%\psfrag{Alpha}{$\alpha$}
\begin{center}
\scalebox{0.45}{\includegraphics[scale=0.9]{PlotQQ4bKM.pdf}}
\vspace{-1cm}
\end{center}
\caption{Kaplan-Meier estimates of the survivor function of the residuals
$e_i$ for the women's athletics 1500 m and 3000 m events, with the fitted
survivor function $\hat{S}(e^{*})$ shown as a dotted line. }
\label{KMC2}
\end{figure}
\begin{figure}[!hbtp]
%\psfrag{Alpha}{$\alpha$}
\begin{center}
\includegraphics[width=\linewidth]{PlotPPQQ4b.pdf}
\vspace{-1cm}
\end{center}
\caption{The P-P and Q-Q plots for the women's athletics 1500 m and
3000 m events, with record points marked by solid circles. }
\label{KMPPQQC2}
\end{figure}
Because the record data involve censored observations, we replace the
empirical distribution function with the Kaplan-Meier estimate in the
P-P and Q-Q plots, see Figure~\ref{KMC2}.
The P-P and Q-Q plots in Figure~\ref{KMPPQQC2} show that the model
fit is acceptable for the women's athletics 1500 m and 3000 m events,
as the points for both the censored data and the records (solid circles)
follow an approximately linear pattern. An important point to note is
that the number of records depends strongly on the form of the
trend in the mean.
%Graph from file PlotKMSurv.txt
\section{Application to Actual Records Swimming}\label{SecSwim}
We apply the methods developed in this paper to the record progression of
the men's 400 m freestyle swimming event, using records recognised
internationally by FINA. Additional information is provided by the men's
Olympic 400 m freestyle results.
We implement the theory and fit the models proposed in
Sections~3 and~\ref{Chap5Sec4}, using
the men's annual record progression for the 400 m freestyle swimming
event over the 97 years from 1908 to 2004. In the early years of
competition the freestyle event also included breaststroke
and butterfly swimmers, until 1953, when these two strokes were separated
from the freestyle event.

Working with record data alone loses information about
the trend in the mean of true elite performance in the men's 400 m
freestyle. To compensate for this missing information, we use the gold
medallists' times from the men's
Olympic 400 m freestyle from 1908 to 2004 to estimate the
trend in the mean.
We first maximize the penalized log-likelihood in Equation (\ref{eq:c41d}) for
the Olympic data to obtain the non-parametric trend in the mean, $\hat{g}_t$
(not shown here). The Fisher scoring method copes with complex trends in the
mean that parametric methods handle poorly, see \citet{Adam2007}.
%The solid line in Figure \ref{chap3a} is the estimated value of $\hat{g}_t$
%with the expected value of $\hat{g}_t$ represented by a dashed line, with
%$\hat{\sigma}_{Olympic} = 5.625$ and $\hat{\xi}_{Olympic} = -0.500$.
\begin{figure}[!h]
\psfrag{datarec}{year}
\psfrag{datarec}{second}
\psfrag{2}{}
\psfrag{1}{}
\begin{center}
%\scalebox{0.70}{\includegraphics[angle=-90]{C4400AArea.pdf}
\vspace{- 1.55cm}
\hspace{-32mm}\includegraphics[width=100mm,height=150mm,angle=-270]{Chapter4figAcc.pdf}
\vspace{- 1.85cm}
\end{center}
\caption{
The maximum (upper line) and minimum (lower line) values of $m=33$ simulated
record progressions generated using $\hat{g}_t$,
$\hat{\sigma}_{rec}=5.32$ and $\hat{\xi}_{rec}=-0.202$. }
\label{chap3c}
\end{figure}
Figure \ref{chap3c} shows that the parameter estimates for the record
progression, obtained using information from the men's Olympic event, give a
fairly good fit to the observed records: more than 80\% of
the observed records lie within the acceptance region (between the
maximum and the minimum of the simulated records). In Figure
\ref{chap3d}, the proposed P-P and Q-Q plots show a good fit for the
model.
\begin{figure}[!h]
\begin{center}
\includegraphics[width=\linewidth]{Chapter4plotKMSurv.pdf}
\vspace{- 1.5cm}
\end{center}
\caption{
The P-P (left) and Q-Q (right) plots using the Kaplan-Meier estimate for the
record progression and censored data, as described in Section 3.}
\label{chap3d}
\end{figure}
We treat the 1908 time as $r_0$, which means it is not counted as a record.
There are then 45 records recorded up to 2004 for the recognised men's
400 m freestyle swimming event.
In the $m=33$ simulated record progressions, the number of records broken
ranged from 34 to 50, as tabulated in Table~\ref{C4table1}, with an average
of 43; this further indicates that the simulated model behaves similarly to
the observed record process. Note that we consider only the annual
progression of the best performance by a men's 400 m freestyle swimmer
each year.
\begin{table}[!h]
\caption{Summary of the number of records broken in $m=33$ simulated record
progressions using $\hat{g}_t$, $\hat{\sigma}_{rec}$ and
$\hat{\xi}_{rec}$, compared with the observed number.}
\vspace{0.5cm}
\centering
\begin{tabular}{|cccccc|} \hline
$m$ & observed & minimum & median & mean & maximum \\ \hline
33 & 45 & 34 & 43 & 43.15 & 50 \\ \hline
\end{tabular}
\label{C4table1}
\end{table}
Maximizing the log-likelihood in Equation (\ref{loglike1}) gives the
parameter estimates $\hat{\sigma}_{rec} = 5.32\,(0.79)$ and
$\hat{\xi}_{rec} = -0.202\,(0.089)$. The estimate of $\sigma_{rec}$
is close to $\hat{\sigma}_{Olympic} = 5.625$, which indicates that the
model based on Equation (\ref{eq:c41c}) is adequate.
%The $\lambda_g = 826.5306$ and $AICc = 8.480299$.
%The estimated $\sigma_{olympic} = 5.625$ and $\xi_{olympic} = -0.5$.
%The estimed $\sigma_{rec} = 5.32(0.786)$ and $\xi_{rec} = -0.202(0.089)$
%Checking for the record progression...
\section{Concluding Remarks}\label{Chap5Sec6}
We have introduced a new method for analysing the progression of record
data, using Poisson processes to build the joint density function for the
record data, and we have introduced new goodness-of-fit procedures for
records. To account for trends in the underlying data from which the
records are drawn, we used information from related events; for example,
for the swimming record analysis we used the performances of the Olympic
gold medallists to estimate the non-parametric trend in the mean. We
limited the study to annual data, which means that if the record was
broken several times within a year we retain only the last record of
that year.

We have introduced two goodness-of-fit methods for assessing the fitted
model for record progression data: tolerance regions for the model
constructed by simulation from the estimated parameters, and, treating
the record data as a censored form of extreme value data, P-P and Q-Q
plots as alternative model diagnostics. Both the tolerance regions and
the plots still require information from a related set of events.

The main drawback of the methods proposed here is their computational
cost. Two steps dominate the computation: the Fisher scoring method for
estimating $\hat{g}_t$, and the construction of the tolerance region for
the proposed goodness-of-fit procedure.
%-----------Bibliography---------- %
%\bibliographystyle{mjms-bibstyle}
\bibliographystyle{apa}
%bibliography{Reference}
\bibliography{mjms-template}
\end{document}