Big Data/Analytics Zone is brought to you in partnership with:

Arthur Charpentier, ENSAE, PhD in Mathematics (KU Leuven), Fellow of the French Institute of Actuaries, professor at UQàM in Actuarial Science. Former professor-assistant at ENSAE Paritech, associate professor at Ecole Polytechnique and professor assistant in economics at Université de Rennes 1. Arthur is a DZone MVB and is not an employee of DZone and has posted 159 posts at DZone. You can read more from them at their website. View Full User Profile

Non-observable vs. Observable Heterogeneity Factor

  • submit to reddit
Recently, in the ACT2040 class (on non-life insurance), we’ve discussed the difference between observable and non-observable heterogeneity in ratemaking (from an economic perspective). To illustrate that point (we will spend more time, later on, discussing observable and non-observable risk factors), we looked at the following simple example. Let  denote the height of a person. Consider the following dataset:
> Davis=read.table(
+ "

There is a small typo in the dataset, so let us make manual changes here.

> Davis[12,c(2,3)]=Davis[12,c(3,2)] 

Here, the variable of interest is the height of a given person:

> X=Davis$height 

If we look at the histogram, we have:

> hist(X,col="light green", border="white",proba=TRUE,xlab="",main="")

Can we assume that we have a Gaussian distribution?

Maybe not … here, if we fit a Gaussian distribution, plot it, and add a kernel based estimator, we get:

> (param <- fitdistr(X,"normal")$estimate) 
> f1 <- function(x) dnorm(x,param[1],param[2]) 
> x=seq(100,210,by=.2) 
> lines(x,f1(x),lty=2,col="red") 
> lines(density(X))

If you look at that black line, you might think of a mixture, something like:

(using standard mixture notations). Mixtures are obtained when we have a non-observable heterogeneity factor: with probability , we have a random variable  (call it type [1]), and with probability , a random variable  (call it type [2]). So far, nothing new. And we can fit such a mixture distribution, such as:

> library(mixtools) 
> mix <- normalmixEM(X)
 number of iterations= 335 
> (param12 <- c(mix$lambda[1],mix$mu,mix$sigma)) 
[1] 0.4002202 178.4997298 165.2703616 6.3561363 5.9460023

If we plot that mixture of two Gaussian distributions, we get:

> f2 <- function(x){ param12[1]*dnorm(x,param12[2],param12[4])
+ (1-param12[1])*dnorm(x,param12[3],param12[5]) }
> lines(x,f2(x),lwd=2, col="red") lines(density(X))

Not bad. Actually, we can try to maximize the likelihood with our own codes:

> logdf <- function(x,parameter){
+ p <- parameter[1]
+ m1 <- parameter[2]
+ s1 <- parameter[4]
+ m2 <- parameter[3]
+ s2 <- parameter[5]
+ return(log(p*dnorm(x,m1,s1)+(1-p)*dnorm(x,m2,s2)))
+ }
> logL <- function(parameter) -sum(logdf(X,parameter))
> Amat <- matrix(c(1,-1,0,0,0,0,
+ 0,0,0,0,1,0,0,0,0,0,0,0,0,1), 4, 5)
> bvec <- c(0,-1,0,0)
> constrOptim(c(.5,160,180,10,10), logL, NULL, ui = Amat, ci = bvec)$par

[1]   0.5996263 165.2690084 178.4991624   5.9447675   6.3564746

Here, we include some constraints, to insurance that the probability belongs to the unit interval, and that the variance parameters remain positive. Note that we have something close to the previous output.

Published at DZone with permission of Arthur Charpentier, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)