Table of Contents
|
Introduction
actuar is a package for R which adds additional actuarial science functionality. This is a guide for actuar based on reference materials provided by R and the creators of the actuar package.12
Here is a companion script to go along with this article if you read it as a step-by-step tutorial. Save time by having the examples pre-typed and ready to execute!
Install and load the actuar package.
library(actuar)
Data
actuar includes some data from Klugman's Loss Models. Load it and check it out (we'll use it later)
data(dental)
data(gdental)
dental
gdental
Probability Laws
Functions
Arguments
Since all of the probability law functions share many of the same arguments, we will present them first for convenience and simplicity.
- x, q: vector of quantiles.
- p: vector of probabilities.
- n: number of observations. If length(n) > 1, the length is taken to be the number required.
- shape, scale parameters: Must be strictly positive.
- rate: an alternative way to specify the scale.
- log, log.p: logical; if TRUE, probabilities/densities p are returned as log(p).
- lower.tail: logical; if TRUE (default), probabilities are P[X x], otherwise, P[X > x].
- order: order of the moment. 1 will return the first moment, 1:4 will return the first 4 moments.
- limit: limit of the loss variable.
Probability Density Function (d)
Generates relative likelihood for this random variable to take on a given value.
droot(x, shape1, shape2, shape 3, rate = 1, scale = 1/rate, log = FALSE)
For example, to use this function to find out the probability that x = 1 for a pareto(10,5) distribution, execute the code:
dpareto(1.0, shape = 10, scale = 5)
Cumulative Probability Function (p)
Generates the probability that a real-valued random variable X with a given probability distribution will be found at a value less than or equal to x.
proot(q, shape1, shape2, shape 3, rate = 1, scale = 1/rate, lower.tail = TRUE, log.p = FALSE)
To find the cumulative probability at x=1 for the above example, execute:
ppareto(1.0, shape = 10, scale = 5)
Quantile Function (q)
Specifies, for a given probability, the value which the random variable will be at, or below, with that probability.
qroot(p, shape1, shape2, shape 3, rate = 1, scale = 1/rate, lower.tail = TRUE, log.p = FALSE)
For example, ppareto above returned .838:
> ppareto(1.0, shape = 10, scale = 5)
[1] 0.8384944
We can use the quantile function to find the 83.8th percentile:
qpareto(.838, shape = 10, scale = 5)
This should generate .99999…. = 1.
Random Data (r)
Generates random observations from the specified distribution.
rroot(n, shape1, shape2, shape 3, rate = 1, scale = 1/rate)
To generate 1000 observations from our pareto distribution and create an object:
r <- rpareto(1000, shape = 10, scale = 5)
Raw Moments (m)
mroot(order, shape1, shape2, shape 3, rate = 1, scale = 1/rate)
You can use the moments function to find the mean, standard deviation, and variance of your model:
m1r <- mpareto(1, shape = 10, scale = 5)
m2r <- mpareto(2, shape = 10, scale = 5)
variancer <- m2r - m1r^2
stdevr <- variancer^.5
m1r
stdevr
variancer
Limited Moments (lev)
levroot(limit, shape1, shape2, shape 3, rate = 1, scale = 1/rate,
order = 1)
Let's impose a theoretical claim limit of 1.0 to this pareto variable.
m1l <- levpareto(1.0, shape = 10, scale = 5, order = 1)
m2l <- levpareto(1.0, shape = 10, scale = 5, order = 2)
variancel <- m2l - m1l^2
stdevl <- variancel^.5
m1l
stdevl
variancel
Let's compare the raw and limited moments:
rvl <- matrix(c(m1r,m1l,stdevr,stdevl,variancer,variancel),ncol=2,byrow=TRUE)
colnames(rvl) <- c("Raw","Limited")
rownames(rvl) <- c("Mean","Std Dev","Variance")
rvl <- as.table(rvl)
rvl
> rvl
Raw Limited
Mean 0.5555556 0.4478852
Std Dev 0.6211300 0.3420482
Variance 0.3858025 0.1169970
Moment Generating Function (mgf)
mgfgamma(x, shape1, shape2, shape3, rate = 1, scale = 1/rate, log = FALSE)
mgf functionality is only available for the following distributions:
- chisq
- exp
- gamma
- invgamma
- invgauss
- norm
- unif
Distributions
Distributions will be presented as
root
- parameter 1
- paremeter 2…
showgraphs
showgraphs will produce several useful graphs for a distribution. This function will sometimes not work depending on which arguments are or are not specified, so it is probably best to use the showgraphs code that is listed with each distribution. The parameters currently listed in the code mainly serve as placeholders, and also so you can execute the code and view the graphs.
Before you can execute the graph code, you must execute this user-defined function (which is written in C, and an explanation is beyond the scope of this article—just run the code):
showgraphs <- function(fun, par, what = c("d", "p", "m", "lev"), xlim)
{
dist <- switch(fun,
trbeta = "TRANSFORMED BETA DISTRIBUTION",
genpareto = "GENERALIZED PARETO DISTRIBUTION",
burr = "BURR DISTRIBUTION",
invburr = "INVERSE BURR DISTRIBUTION",
pareto = "PARETO DISTRIBUTION",
invpareto = "INVERSE PARETO DISTRIBUTION",
llogis = "LOGLOGISTIC DISTRIBUTION",
paralogis = "PARALOGISTIC DISTRIBUTION",
invparalogis = "INVERSE PARALOGISTIC DISTRIBUTION",
trgamma = "TRANSFORMED GAMMA DISTRIBUTION",
invtrgamma = "INVERSE TRANSFORMED GAMMA DISTRIBUTION",
invgamma = "INVERSE GAMMA DISTRIBUTION",
weibull = "WEIBULL DISTRIBUTION",
invweibull = "INVERSE WEIBULL DISTRIBUTION",
invexp = "INVERSE EXPONENTIAL DISTRIBUTION",
pareto1 = "SINGLE PARAMETER PARETO DISTRIBUTION",
lgamma = "LOGGAMMA DISTRIBUTION",
genbeta = "GENERALIZED BETA DISTRIBUTION",
phtype = "PHASE-TYPE DISTRIBUTION",
gamma = "GAMMA DISTRIBUTION",
exp = "EXPONENTIAL DISTRIBUTION",
chisq = "CHI-SQUARE DISTRIBUTION",
lnorm = "LOGNORMAL DISTRIBUTION",
invgauss = "INVERSE GAUSSIAN DISTRIBUTION",
norm = "NORMAL DISTRIBUTION",
beta = "BETA DISTRIBUTION",
unif = "UNIFORM DISTRIBUTION")
if (missing(xlim))
{
qf <- match.fun(paste("q", fun, sep = ""))
formals(qf)[names(par)] <- par
xlim <- c(0, qf(0.999))
}
k <- seq.int(4)
limit <- seq(0, xlim[2], len = 10)
mfrow = c(ceiling(length(what) / 2), 2)
op <- par(mfrow = mfrow, oma = c(0, 0, 2, 0))
for (t in what)
{
f <- match.fun(paste(t, fun, sep = ""))
formals(f)[names(par)] <- par
main <- switch(t,
"d" = "Probability Density Function",
"p" = "Cumulative Distribution Function",
"m" = "Raw Moments",
"lev" = "Limited Expected Value Function",
"mgf" = "Moment Generating Function")
if (t == "m")
plot(k, f(k), type = "l", col = 4, lwd = 2, main = main)
else if (t == "lev")
plot(limit, f(limit), type = "l", col = 4, lwd = 2, main = main)
else if (t == "mgf")
curve(f(x), xlim = c(0, 2), col = 4, lwd = 2, main = main)
else
curve(f(x), xlim = xlim, col = 4, lwd = 2, main = main)
title(main = dist, outer = TRUE)
}
par(op)
}
Transformed Beta Family
Transformed Beta
trbeta
- shape1 (α)
- shape2 (γ)
- shape3 (τ)
- rate (λ = 1/θ)
- scale (θ)
showgraphs("trbeta", list(shape1 = 3, shape2 = 4, shape3 = 5, scale = 10))
Generalized Pareto
genpareto
- shape1 (α)
- shape2 (τ)
- rate (λ = 1/θ)
- scale (θ)
showgraphs("genpareto", list(shape1 = 10, shape2 = 4, scale = 10))
Burr
burr
- shape1 (α)
- shape2 (γ)
- rate (λ = 1/θ)
- scale (θ)
showgraphs("burr", list(shape1 = 3, shape2 = 4, scale = 10))
Inverse Burr
invburr
- shape1 (τ)
- shape2 (γ)
- rate (λ = 1/θ)
- scale (θ)
showgraphs("invburr", list(shape1 = 3, shape2 = 6, scale = 10))
Pareto
pareto
- shape (α)
- scale (θ)
showgraphs("pareto", list(shape = 10, scale = 10))
Inverse Pareto
invpareto
- shape (τ)
- scale (θ)
showgraphs("invpareto", list(shape = 4, scale = 1), what = c("d", "p"))
Loglogistic
llogis
- shape (γ)
- rate (λ = 1/θ)
- scale (θ)
showgraphs("llogis", list(shape = 6, scale = 10))
Paralogistic
paralogis
- shape (α)
- rate (λ = 1/θ)
- scale (θ)
showgraphs("paralogis", list(shape = 3, scale = 10))
Inverse Paralogistic
invparalogis
- shape (τ)
- rate (λ = 1/θ)
- scale (θ)
showgraphs("invparalogis", list(shape = 6, scale = 10))
Transformed Gamma Family
Transformed Gamma
trgamma
- shape1 (α)
- shape2 (τ)
- rate (λ = 1/θ)
- scale (θ)
showgraphs("trgamma", list(shape1 = 3, shape2 = 1, scale = 10))
Inverse Transformed Gamma
invtrgamma
- shape1 (α)
- shape2 (τ)
- rate (λ = 1/θ)
- scale (θ)
showgraphs("invtrgamma", list(shape1 = 3, shape2 = 2, scale = 10))
Inverse Gamma
invgamma
- shape (α)
- rate (λ = 1/θ)
- scale (θ)
showgraphs("invgamma", list(shape = 6, scale = 10))
Weibull
actuar *only includes* moments and limited moments functionality for the Weibull distribution.
weibull
- shape (τ)
- rate (λ = 1/θ)
- scale (θ)
showgraphs("weibull", list(shape = 1.5, scale = 10))
Inverse Weibull
invweibull
- shape (τ)
- rate (λ = 1/θ)
- scale (θ)
showgraphs("invweibull", list(shape = 6, scale = 10))
Inverse Exponential
invexp
- rate (λ = 1/θ)
- scale (θ)
showgraphs("invexp", list(rate = 1), what = c("d", "p"))
Other
Single Parameter Pareto
pareto1
- shape (α)
- min (θ)
showgraphs("pareto1", list(shape = 5, min = 10), xlim = c(0, 50))
Loggamma
lgamma
- shapelog (α)
- ratelog (λ)
showgraphs("lgamma", list(shapelog = 2, ratelog = 5))
Generalized Beta
genbeta
- shape1 (α)
- shape2 (β)
- shape3 (τ)
- rate (λ = 1/θ)
- scale (θ)
showgraphs("genbeta", list(shape1 = 1, shape2 = 2, shape3 = 3, scale = 2))
Phase-type
phtype(prob, rates)
- prob: probability vector
- rates: transition rate matrix — must be a matrix!
showgraphs("phtype", list(prob = c(0.5614, 0.4386), rates = matrix(c(-8.64, 0.101, 1.997, -1.095), 2, 2)), what = c("d", "p", "m", "mgf"), xlim = c(0.001, 5))
Included in R
Gamma
gamma
- shape (α)
- rate (β = 1/θ)
- scale (θ)
showgraphs("gamma", list(shape = 3, rate = 5), what = c("m", "lev", "mgf"))
Chi-square
chisq
- df (degrees of freedom)
showgraphs("chisq", list(df = 3), what = c("m", "lev", "mgf"))
Exponential
exp
- rate (λ = 1/θ)
- scale (θ)
showgraphs("exp", list(rate = 5), what = c("m", "lev", "mgf"))
Lognormal
lnorm
- meanlog (ln(μ))
- sdlog (ln(σ))
showgraphs("lnorm", list(meanlog = 1, sdlog = 1), what = c("m", "lev"))
Inverse Gaussian
(from package SuppDists)
invgauss
- nu (ν)
- lambda(λ)
showgraphs("invgauss", list(nu = 1, lambda = 10), what = c("m", "lev", "mgf"), xlim = c(0, 10))
Normal
norm
- mean (μ)
- sd (σ)
showgraphs("norm", list(mean = 0, sd = 1), what = c("m", "mgf"))
Beta
beta
- shape1 (α)
- shape2 (β)
showgraphs("beta", list(shape1 = 1, shape2 = 2), what = c("m", "lev"))
Uniform
unif
- min (a)
- max (b)
showgraphs("unif", list(min = 0, max = 1), what = c("m", "lev", "mgf"))
Grouped Data
actuar provides extensive support to store, manipulate and summarize grouped data. The function
group.data
creates a grouped data object, with the arguments being
- vector of group boundaries
- one or more vectors of grouped frequencies
For example:
x <- grouped.data(Group = c(0, 25, 50, 100, 150, 250, 500), Line.1 = c(30, 31, 57, 42, 65, 84), Line.2 = c(26, 33, 31, 19, 16, 11))
x
Extraction and Replacement
actuar supports common extraction and replacement methods for grouped data. Using the above example, we can perform:
Extraction of the vector of group boundaries (the first column):
x[, 1]
Extraction of the vector or matrix of group frequencies (the second and
third columns):
x[, -1]
Extraction of a subset of the whole object (first three lines):
x[1:3, ]
Replacement of one or more group frequencies:
x[1, 2] <- 22
x
x[1, c(2, 3)] <- c(22, 19)
x
Replacement of the boundaries of one or more groups:
x[1, 1] <- c(0, 20)
x
x[c(3, 4), 1] <- c(55, 110, 160)
x
It is not possible to replace the boundaries and the frequencies simultaneously.
Group Mean & Histograms
There is also built-in support for the mean:
mean(x)
As well as histograms. The function is
hist(x, freq = NULL, probability = !freq,
density = NULL, angle = 45, col = NULL, border = NULL,
main = paste("Histogram of" , xname),
xlim = range(cj), ylim = NULL, xlab = xname, ylab,
axes = TRUE, plot = TRUE, labels = FALSE, ...)
with arguments:
- x: an object of class "grouped.data"; only the first column of frequencies is used.
- freq: logical; if TRUE, the histogram graphic is a representation of frequencies, the counts component of the result; if FALSE, probability densities, component density, are plotted (so that the histogram has a total area of one). Defaults to TRUE iff group boundaries are equidistant (and probability is not specified).
- probability: an alias for !freq, for S compatibility.
- density: the density of shading lines, in lines per inch. The default value of NULL means that no shading lines are drawn. Non-positive values of density also inhibit the drawing of shading lines.
- angle: the slope of shading lines, given as an angle in degrees (counter-clockwise).
- col: a colour to be used to fill the bars. The default of NULL yields unfilled bars.
- border: the color of the border around the bars. The default is to use the standard foreground color.
- main, xlab, ylab: these arguments to title have useful defaults here.
- xlim, ylim: the range of x and y values with sensible defaults. Note that xlim is not used to define the histogram (breaks), but only for plotting (when plot = TRUE).
- axes: logical. If TRUE (default), axes are draw if the plot is drawn.
- plot: logical. If TRUE (default), a histogram is plotted, otherwise a list of breaks and counts is returned.
- labels: logical or character. Additionally draw labels on top of bars, if not FALSE; see plot.histogram.
- …: further graphical parameters passed to plot.histogram and their to title and axis (if plot=TRUE)
Here is code for a histogram of the first frequency column:
hist(x[, -3])
Ogive
Finally, actuar can create an ogive, which is a approximation of the empirical cdf for grouped data.
CDF <- ogive(x)
CDF
You can find the knots and plot the CDF:
knots(CDF)
plot(CDF)
Like the R function ecdf, ogive does not create a vector, but rather a function object to compute F(x) at any value. For example:
CDF(250)
The following code will produce the cumulative probability for each knot:
CDF(knots(CDF))
Empirical Moments
kth Empirical Moment
The emm function can calculate empirical moments from either individual or grouped data. emm takes the form:
emm(x, order = 1, ...)
with arguments:
- x: a vector or matrix of individual data, or an object of class "grouped data".
- order: order of the moment. Must be positive. "1" returns with first moment, "1:4" the first four, etc.
- …: further arguments passed to or from other methods.
For example, using the built-in dental data:
emm(dental, order = 1:3)
emm(gdental, order = 1:3)
Empirical Limited Expected Value
The elev function returns a function object (like ogive) to calculate the empirical limited expected value, or first limited moment, of a sample for any limit. There is functionality for individual and grouped data.
lev <- elev(dental)
lev
glev <- elev(gdental)
glev
We can use the same code for cumulative probability as earlier:
lev(knots(lev))
glev(knots(glev))
You can use R's plot function with a few specified parameters for a nice graph of the empirical limited value function:
plot(lev, type="o", pch = 19)
plot(glev, type="o", pch = 19)
Optimization
actuar has the ability to optimize the parameters of a fitted model—however, you MUST have a general idea of the underlying distribution of the data and the corresponding parameters. Reaching this "starting point" is largely predicated on the type and context of the data and an overview of the numerous methods for reaching this starting point is beyond the scope of this article.
Maximum Likelihood Estimation (MASS package)
The package MASS has support for optimization through maximum likelihood estimation through the fitdistr function.
fitdistr(x, densfun, start, ...)
with arguments:
- x: A numeric vector of length at least one containing only finite values.
- densfun: Either a character string or a function returning a density evaluated at its first argument. Distributions "beta", "cauchy", "chi-squared", "exponential", "f", "gamma", "geometric", "log-normal", "lognormal", "logistic", "negative binomial", "normal", "Poisson", "t" and "weibull" are recognised, case being ignored.
- start: A named list giving the parameters to be optimized with initial values. This can be omitted for some of the named distributions and must be for others.†
- …: Additional parameters, either for densfun or for optim. In particular, it can be used to specify bounds via lower or upper or both. If arguments of densfun (or the density function corresponding to a character-string specification) are included they will be held fixed.
†For the Normal, log-Normal, geometric, exponential and Poisson distributions the closed-form MLEs (and exact standard errors) are used, and start should not be supplied.
For all other distributions, direct optimization of the log-likelihood is performed using optim. The
estimated standard errors are taken from the observed information matrix, calculated by a numerical
approximation.
For one-dimensional problems the Nelder-Mead method is used and for multidimensional problems the BFGS method, unless arguments named lower or upper are supplied (when L-BFGS-B is used) or method is supplied explicitly.
For the "t" named distribution the density is taken to be the location-scale family with location m
and scale s.
For the following named distributions, reasonable starting values will be computed if start is omitted or only partially specified: "cauchy", "gamma", "logistic", "negative binomial" (parametrized by mu and size), "t" and "weibull". Note that these starting values may not be good enough if the fit is poor: in particular they are not resistant to outliers unless the fitted distribution is long-tailed.
fitdistr functions very similarly to actuar's minimum distance estimation, which is covered in the following section.
Minimum Distance Estimation
To optimize by the minimum distance estimation method, use the following function:
mde(x, fun, start, measure = ("CvM", "chi-square", "LAS"),
weights = NULL, ...)
with arguments:
- x: a vector or an object of class "grouped data" (in which case only the first column of frequencies is used).
- fun: function returning a cumulative distribution (for measure = "CvM" and measure = "chi-square") or a limited expected value (for measure = "LAS") evaluated at its first argument.
- start: a named list giving the parameters to be optimized with initial values measure. Either "CvM" for the Cramer-von Mises method, "chi-square" for the modified chi-square method, or "LAS" for the layer average severity method.
- weights: weights; All sum of squares can be weighted. If arguments weights is missing, weights default to 1 for measure = "CvM" and measure = "LAS"; for measure = "chi-square", weights default to (1/n), where n is the frequency in the specific group.
- …: Additional parameters, either for fun or for optim. In particular, it can be used to specify bounds via lower or upper or both. If arguments of fun are included they will be held fixed.
For example, fitting an exponential model with rate = .005 to the grouped dental data using the CvM method:
mde(gdental, pexp, start = list(rate = 1/200), measure = "CvM")
and LAS method. Remember to use a CDF for CvM and chi-square and a limited expected value for LAS.
mde(gdental, levexp, start = list(rate = 1/200), measure = "LAS")
Issues
actuar can encounter errors when attempting to optimize. For instance, when trying to fit a pareto distribution to the grouped dental data:
Error in mde(gdental, ppareto, start = list(shape = 3, scale = 600), measure = "CvM") :
optimization failed
Working in the log of the parameters often solves the problem since the
optimization routine can then flawlessly work with negative parameter values. You must first define a new function using the following code:
newdistname <- function(x, logparamter1, logparamter2, ...) existingdist(x, exp(logparameter1), exp(logparameter2), ...)
Using our example we have:
pparetolog <- function(x, logshape, logscale) ppareto(x,exp(logshape),exp(logscale))
You may then proceed with the mde function as normal:
pl <- mde(gdental, pparetolog, start = list(logshape = log(3),logscale = log(600)), measure = "CvM")
And obtain the actual estimates with the following code:
exp(pl$estimate)
Coverage Modifications
Let X be the random variable of the actual claim amount for an insurance
policy and Y be the random variable of the amount of the claim as it appears in the insurer’s database. These two random variables will differ if
any of the following coverage modifications are present for the policy: an
ordinary or a franchise deductible, a limit, coinsurance, inflation, etc.
Often, one will want to use data from the random variable Y to fit a model on the unobservable random variable X. This requires to express the PDF or CDF of Y in terms of the PDF or CDF of X. Function coverage of actuar does just that: given a pdf or cdf and any combination of the coverage modifications mentioned above, coverage returns a function object to compute the pdf or cdf of the modified random variable. The function can then be used in modeling like any other droot or proot function. It takes the form:
coverage(pdf, cdf, deductible = 0, franchise = FALSE, limit = Inf, coinsurance = 1, inflation = 0, per.loss = FALSE)
with arguments:
- pdf, cdf: function object or character string naming a function to compute, respectively, the probability density function and cumulative distribution function of a probability law.
- deductible: a unique positive numeric value.
- franchise: logical; TRUE for a franchise deductible, FALSE (default) for an ordinary deductible.
- limit: a unique positive numeric value larger than deductible.
- coinsurance: a unique value between 0 and 1; the proportion of coinsurance.
- inflation: a unique value between 0 and 1; the rate of inflation.
- per.loss: logical; TRUE for the per loss distribution, FALSE (default) for the per payment distribution.
NOTE: If pdf is specified, the pdf is returned; if pdf is missing or NULL, the cdf is returned. Note that cdf is needed if there is a deductible or a limit.
Assume X has a gamma distribution. To create a PDF and CDF with a deductible d = 1 and a limit u = 10 is obtained with coverage as follows:
mcovpdf <- coverage(pdf = dgamma, cdf = pgamma, deductible = 1, limit = 10)
mcovcdf <- coverage(cdf = pgamma, deductible = 1, limit = 10)
To find P(X=5), execute:
mcovpdf(5, shape = 5, scale = 3)
and the cumulative probability at X=5:
mcovcdf(5, shape = 5, scale = 3)