Welcome to FindTheTail’s documentation!¶
Find The Tail (FTT) is a procedure that separates a given data set into two subsets, one belonging to the body and one to the tail of an unknown underlying parent distribution. The data belonging to the tail are used to fit a generalized Pareto distribution as a tail model. The tail model is then evaluated to determine high quantiles for risk assessment.
Theory¶
In many disciplines, there is often a need to adapt a statistical model to existing data to be able to make statements regarding uncertain future outcomes. In particular, when assessing risks, an estimate of major losses must be based on events that, despite having a low probability of occurrence, have a high impact. Since the actual distribution of data – the parent distribution – is generally unknown, statisticians can fit a generalized Pareto distribution (GPD) to the data belonging to the tail.
For a very large class of parent distribution functions, the generalized Pareto distribution (GPD) can be used as a model for the tail. A certain threshold divides the parent distribution into two areas: a body and a tail region. Above the threshold, analyses are performed using the GPD as a model for the tail.
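As a minimal sketch of the body/tail split described above, the following fits a GPD to the exceedances over a threshold using `scipy.stats.genpareto`. The synthetic data, the hand-picked threshold (the empirical 90 % quantile) and all variable names are assumptions for illustration only; this is not how FTT itself selects the threshold.

```python
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(seed=1)
# Synthetic losses standing in for real data (assumption for illustration).
losses = rng.lognormal(mean=0.0, sigma=1.0, size=500)

# Hypothetical, hand-picked threshold: the empirical 90 % quantile.
u = np.quantile(losses, 0.90)

body = losses[losses <= u]
excesses = losses[losses > u] - u   # tail data measured from the threshold

# Maximum-likelihood fit of the GPD with the location fixed at zero.
shape, loc, scale = genpareto.fit(excesses, floc=0)
print(f"threshold u = {u:.3f}, tail size = {excesses.size}")
print(f"GPD shape = {shape:.3f}, scale = {scale:.3f}")
```

Choosing the threshold by hand, as done here, is exactly the step that a threshold-detection procedure is meant to replace.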
The FTT (Find-The-Tail) procedure provides a suitable and efficient method for determining the optimal threshold.
FTT uses a specially weighted MSE test statistic \(AU^2\) for determining the optimal value. The weights are chosen such that deviations between the EDF and the fitted GPD are penalized more strongly at high quantiles than at lower quantiles. This upper-tail test statistic is evaluated successively over the sorted time series.
Test statistic:
Here \(n\) is the sample length, \(x_{(i)}\) are the data sorted in descending order, and \(F\) is the fitted GPD.
For further information see Reference.
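To make the idea of an upper-tail weighted statistic concrete, here is a sketch that assumes \(AU^2\) is the upper-tail Anderson-Darling statistic, i.e. the squared EDF deviation weighted by \(1/(1-F)\). The function name `upper_tail_ad` is hypothetical, and the computing formula below is the textbook form for ascending order; the package may arrange its expression differently.

```python
import numpy as np

def upper_tail_ad(cdf_values):
    """Upper-tail Anderson-Darling statistic for model CDF values F(x_i), given in any order."""
    z = np.sort(np.asarray(cdf_values, dtype=float))   # ascending F(x_(i))
    n = z.size
    i = np.arange(1, n + 1)
    weights = 2.0 - (2.0 * i - 1.0) / n                 # larger weight on the log term for small i
    return n / 2.0 - 2.0 * z.sum() - np.sum(weights * np.log1p(-z))

rng = np.random.default_rng(seed=3)
good = rng.uniform(size=200)   # CDF values are uniform when the model is correct
bad = good ** 3                # CDF values that are no longer uniform, as for a poor fit
print(upper_tail_ad(good), upper_tail_ad(bad))
```

A well-fitting model yields a small value, while a mismatch between model and data inflates the statistic.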
How it works¶
The FTT algorithm consists of the following 9 steps.
- The loss data are sorted in descending order \(x_1 \geq x_2 \geq \ldots \geq x_n\).
- Starting with the two largest losses \(x_1, x_2\) (\(k = 2\)), the generalized Pareto distribution \(\hat{F}(x)\) (the estimate) is fitted to the data.
- With this distribution, the measure of deviation \(AU^2_k\) is calculated.
- Then the next smaller loss is added \((k \rightarrow k + 1)\) and the parameters of the generalized Pareto distribution are re-estimated.
- The procedure repeats from the third step until the last loss value has been processed \((k = n)\).
- This yields a series of the deviation measure as a function of \(k\): \(AU^2_k\).
- The minimal deviation at a particular \(k^* \in [1, n]\) indicates the best fit of a tail model to the given loss data, and the associated \(x_{k^*}\) corresponds to the sought threshold \(u\).
- To assess the result, the confidence level is determined. (It describes how likely a wrong decision would be if the fitted distribution, and thus the chosen threshold, were rejected.)
- For further assurance, the statistics of the standard goodness-of-fit tests (Cramér-von Mises and Anderson-Darling) are evaluated.
Note
- FTT strictly separates the procedure for detecting the threshold and the goodness of fit to evaluate the quality of the fit.
- There is no need for any external parameters; all information is gained from the data alone.
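The scan over \(k\) described in the steps above can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: the GPD is fitted by maximum likelihood with `scipy.stats.genpareto`, excesses are measured from \(x_{(k)}\), the deviation measure is taken to be the upper-tail Anderson-Darling statistic, and the scan starts at a hypothetical `k_min = 10` instead of \(k = 2\) to keep the fits numerically stable. The function names are made up for this sketch.

```python
import numpy as np
from scipy.stats import genpareto

def upper_tail_ad(z):
    """Upper-tail Anderson-Darling statistic for CDF values z (any order)."""
    z = np.sort(np.clip(z, 0.0, 1.0 - 1e-12))   # clipping keeps the logarithm finite
    n = z.size
    i = np.arange(1, n + 1)
    return n / 2.0 - 2.0 * z.sum() - np.sum((2.0 - (2.0 * i - 1.0) / n) * np.log1p(-z))

def scan_thresholds(data, k_min=10):
    """Return (k_star, threshold, au2), where au2[k] is the deviation for the k largest values."""
    x = np.sort(np.asarray(data, dtype=float))[::-1]   # descending order
    n = x.size
    au2 = np.full(n + 1, np.inf)                       # entries below k_min stay at inf
    for k in range(k_min, n + 1):
        excesses = x[:k] - x[k - 1]                    # top-k values measured from x_(k)
        shape, _, scale = genpareto.fit(excesses, floc=0)
        au2[k] = upper_tail_ad(genpareto.cdf(excesses, shape, loc=0, scale=scale))
    k_star = int(np.nanargmin(au2))                    # k with the smallest deviation
    return k_star, x[k_star - 1], au2

rng = np.random.default_rng(seed=7)
sample = rng.lognormal(size=80)                        # synthetic data for illustration
k_star, u, au2 = scan_thresholds(sample)
print(f"k* = {k_star}, threshold u = {u:.3f}, AU^2 = {au2[k_star]:.4f}")
```

The sketch covers only the threshold search; the confidence level and the additional goodness-of-fit statistics from the last two steps are not included.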
Installation and Requirements¶
Input/Output¶
Input¶
The input has to be a 1-dimensional time series in the form of a numpy array. There is no need to prepare the data by sorting it or by making sure that no value occurs more than once. All these preparations are done by the FTT procedure.
The preparation procedure consists of two steps:
- Check whether any value occurs more than once; if so, add a small random number below the significant digit of the data.
- Sort the data in descending order.
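The two preparation steps can be illustrated with a short sketch. The function `prepare`, its `decimals` argument and the choice of noise amplitude (one digit below the precision of the data) are assumptions made for this example; the package performs the preparation internally and may choose the noise differently.

```python
import numpy as np

def prepare(data, decimals, seed=None):
    """Jitter duplicated values below the last reported digit, then sort descending."""
    rng = np.random.default_rng(seed)
    x = np.asarray(data, dtype=float).copy()
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    duplicated = counts[inverse] > 1                 # mask of values occurring more than once
    amplitude = 10.0 ** -(decimals + 1)              # one digit below the data precision
    x[duplicated] += rng.uniform(-amplitude, amplitude, size=duplicated.sum())
    return np.sort(x)[::-1]

# Made-up values rounded to one decimal place, containing two ties.
raw = np.array([1.7, 2.2, 1.7, 14.4, 2.2, 0.6])
prepared = prepare(raw, decimals=1, seed=0)
print(prepared)
```

Because the noise stays below the last reported digit, rounding the prepared values recovers the original data.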
Output¶
From the results of the analysis, an HTML report is generated that contains all important information. For more detailed information on the report, see the next section.
Reports¶
The report consists of five parts.
- Tail Detection
- Goodness-of-Fit Tests
- Fit Values
- Risk assessment
- Plots
Tail Detection¶
This part contains the size of the data set that was analyzed, the minimal value of \(AU^2\), and information on the estimated tail.
Goodness-of-Fit Tests¶
Goodness-of-fit values for the Cramér-von Mises and Anderson-Darling test statistics.
Fit Values¶
Fitted values of the generalized Pareto distribution (GPD).
Risk assessment¶
This part contains the Value-at-Risk and the Conditional-Value-at-Risk for the 0.95, 0.97, 0.99 and 0.999 quantiles.
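For orientation, the sketch below computes both risk measures from a GPD tail model using the standard peaks-over-threshold formulas. It is an illustration only: the function name, the parameter names and the example fit values are hypothetical, and the package may compute these quantities differently.

```python
import math

def gpd_var_cvar(p, u, shape, scale, n_total, n_tail):
    """Return (VaR, CVaR) at level p for a GPD fitted to the excesses over u."""
    if shape == 0.0 or shape >= 1.0:
        raise ValueError("this sketch covers only shape != 0 and shape < 1")
    tail_fraction = n_tail / n_total
    if p < 1.0 - tail_fraction:
        raise ValueError("p lies in the body; the tail model does not apply")
    # Invert the tail model: tail_fraction * GPD survival(VaR - u) = 1 - p.
    var = u + scale / shape * (((1.0 - p) / tail_fraction) ** (-shape) - 1.0)
    # Mean loss beyond VaR (finite only for shape < 1).
    cvar = (var + scale - shape * u) / (1.0 - shape)
    return var, cvar

# Hypothetical fit results, chosen only for illustration.
params = dict(u=10.0, shape=0.2, scale=2.0, n_total=1000, n_tail=100)
for p in (0.95, 0.97, 0.99, 0.999):
    var, cvar = gpd_var_cvar(p, **params)
    print(f"p = {p}: VaR = {var:.3f}, CVaR = {cvar:.3f}")
```

The Conditional-Value-at-Risk always exceeds the Value-at-Risk at the same level, since it averages over the losses beyond it.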
Plots¶
A plot of the input data, a plot of the three test statistics (\(AU^2\), Cramér-von Mises and Anderson-Darling) for the different tails, and a plot of the fitted tail.
Examples¶
To show how the FTT procedure works, we provide two examples using public domain data sets from the field of hydrology, which are commonly used for model testing.
Load FindTheTail¶
[1]:
import findthetail.ftt as ftt
import pandas as pd
import numpy as np
[2]:
# disable warnings to keep the output clean
# the warnings result from divergences in the logarithms in the test statistics
# (np.warnings was removed in NumPy 1.24, so the standard library module is used)
import warnings
warnings.filterwarnings('ignore')
Read data¶
[3]:
river_nidd_data = pd.read_csv('data/river_nidd_data.csv', names=['x'])
Load data into model and run analysis¶
[4]:
# instantiate the Ftt (Find the tail) class with data,
# number of mc steps and a name for the report
mod = ftt.Ftt(river_nidd_data['x'].values, mc_steps=500, data_name='River Nidd')
mod.run_analysis()
Runnig fit
Running Montecarlo simulation
Calculating q and cvar
Generating plots
Example River Wheaton¶
The data consist of 72 exceedances of flood peaks in \(m^3 s^{-1}\) of the Wheaton River near Carcross in Yukon Territory, Canada. The 72 exceedances, for the years 1958 to 1984, are rounded to one decimal place. This data set is commonly used in hydrology for model testing.
[1]:
import findthetail.ftt as ftt
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
[2]:
# disable warnings to keep the output clean
# the warnings result from divergences in the logarithms in the test statistics
# (np.warnings was removed in NumPy 1.24, so the standard library module is used)
import warnings
warnings.filterwarnings('ignore')
Load data¶
[3]:
data1 = pd.read_excel('data/river_wheaton_data.xlsx', index_col='Position')
Load data into model and run analysis¶
[4]:
# instantiate the Ftt (Find the tail) class with data,
# number of mc steps, a name for the report
# and the number of threads to use for the montecarlo simulation
mod = ftt.Ftt(data1['x'].values, mc_steps=500, data_name='River Wheaton', threads=2)
# additional quantiles can be calculated by passing the run_analysis
# function an array with specific p-values
mod.run_analysis(p_values=np.array([0.95, 0.99, 0.9999]))
Runnig fit
Running Montecarlo simulation
Calculating q and cvar
Generating plots
Take a look at the p-value.
[5]:
mod.p_value_a2
[5]:
0.678
The error of the p-value is given by \(\sigma = \frac{1}{\sqrt{\#MC_{steps}}}\)
[6]:
mod.mc_error
[6]:
0.03162277660168379
Using the run_montecarlo_simulation function, the p-value can be calculated with a smaller error. The threads argument specifies the number of threads to use for the simulation. With a value of 4 for threads and 1000 mc_steps, 4000 additional Monte Carlo points will be generated.
[7]:
mod.run_montecarlo_simulation(mc_steps=1000, threads=4)
999/1000
[8]:
mod.mc_error
[8]:
0.01414213562373095
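The two error values shown above are consistent with \(\#MC_{steps}\) counting the total number of Monte Carlo points, with mc_steps applied per thread. The following arithmetic check illustrates this reading; it is an inference from the printed numbers rather than a documented guarantee, and the helper name `mc_error` is made up for this sketch.

```python
import math

def mc_error(total_points):
    """Error of the p-value for a given total number of Monte Carlo points."""
    return 1.0 / math.sqrt(total_points)

first_run = 2 * 500      # threads=2, mc_steps=500 when creating the Ftt object
second_run = 4 * 1000    # threads=4, mc_steps=1000 in run_montecarlo_simulation

print(mc_error(first_run))                # error after the initial analysis
print(mc_error(first_run + second_run))   # error after the additional simulation
```

Halving the error therefore requires four times as many Monte Carlo points.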
Cooperation Partners¶
FTT was developed in cooperation with:
Bergische Universität Wuppertal, Germany, Frederik Strothmann, Gaußstraße 20, 42119 Wuppertal, frederik.strothmann@uni-wuppertal.de
Heinrich-Heine-Universität Düsseldorf, Germany, Faculty of Business Administration and Economics, Dr. Ingo Hoffmann, Universitätsstraße 1, 40225 Düsseldorf, www.fidl.hhu.de, ingo.hoffmann@hhu.de
Heinrich-Heine-Universität Düsseldorf, Germany, Faculty of Business Administration and Economics, Prof. Dr. Christoph J. Börner, Universitätsstraße 1, 40225 Düsseldorf, www.fidl.hhu.de, christoph.boerner@hhu.de
Release Notes¶
1.1¶
Small changes to the plotting and some performance optimizations for the Monte Carlo simulation.
Plotting¶
Changed from the plt interface to the fig interface for all plots.
Montecarlo simulation¶
Added multiprocessing for the Monte Carlo simulation. __init__() now takes a new keyword argument threads. threads specifies how many threads should be used for the Monte Carlo simulation; the default value is 1.
The function run_montecarlo_simulation also has the keyword threads to specify the number of threads used. If no value is set, the default value set in __init__() will be used.
Examples¶
The River Wheaton example has been updated to show the use of multiprocessing for the Monte Carlo simulation.