
Arthur Charpentier, ENSAE, PhD in Mathematics (KU Leuven), Fellow of the French Institute of Actuaries, professor at UQàM in Actuarial Science. Formerly assistant professor at ENSAE ParisTech, associate professor at École Polytechnique, and assistant professor in economics at Université de Rennes 1. Arthur is a DZone MVB, is not an employee of DZone, and has posted 160 posts at DZone. You can read more at his website.

# Non-observable vs. Observable Heterogeneity Factor

10.07.2013
Recently, in the ACT2040 class (on non-life insurance), we’ve discussed the difference between observable and non-observable heterogeneity in ratemaking (from an economic perspective). To illustrate that point (we will spend more time, later on, discussing observable and non-observable risk factors), we looked at the following simple example. Let $X$ denote the height of a person. Consider the following dataset:
```
> Davis=read.table(
+ "http://socserv.socsci.mcmaster.ca/jfox/Books/Applied-Regression-2E/datasets/Davis",
+ header=TRUE)
```

There is a small typo in the dataset (the height and weight of observation 12 were swapped), so let us fix it manually:

`> Davis[12,c(2,3)]=Davis[12,c(3,2)] `

Here, the variable of interest is the height of a given person:

`> X=Davis$height`

If we look at the histogram, we have:

`> hist(X,col="light green", border="white",proba=TRUE,xlab="",main="")`

Can we assume that we have a Gaussian distribution?

$X\sim\mathcal{N}(\mu,\sigma^2)$

Maybe not… Here, if we fit a Gaussian distribution, plot it, and add a kernel-based estimator, we get:

```
> library(MASS)
> (param <- fitdistr(X,"normal")$estimate)
> f1 <- function(x) dnorm(x,param[1],param[2])
> x=seq(100,210,by=.2)
> lines(x,f1(x),lty=2,col="red")
> lines(density(X))
```
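The visual comparison can be backed by a formal test (an addition, not in the original post). Since the Davis data requires a download, the sketch below runs a Shapiro–Wilk test on heights simulated from two sub-populations, with illustrative parameters chosen for the sketch; on such clearly bimodal data, the test rejects normality:

```r
# Hypothetical illustration: simulate heights from two sub-populations
# (means and standard deviations are made up for this sketch),
# then test the pooled sample for normality.
set.seed(1)
grp <- runif(200) < .4                               # latent group indicator
Z   <- ifelse(grp, rnorm(200, 178, 6.4), rnorm(200, 165, 5.9))
shapiro.test(Z)                                      # small p-value: not Gaussian
```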

If you look at that black line, you might think of a mixture, something like:

$X\sim p_1\cdot\mathcal{N}(\mu_1,\sigma_1^2)+p_2\cdot\mathcal{N}(\mu_2,\sigma_2^2)$

(using standard mixture notations). Mixtures arise when we have a non-observable heterogeneity factor: with probability $p_1$, we have a random variable $\mathcal{N}(\mu_1,\sigma_1^2)$ (call it type [1]), and with probability $p_2=1-p_1$, a random variable $\mathcal{N}(\mu_2,\sigma_2^2)$ (call it type [2]). So far, nothing new. And we can fit such a mixture distribution, for instance with the EM algorithm implemented in the mixtools package:

```
> library(mixtools)
> mix <- normalmixEM(X)
number of iterations= 335
> (param12 <- c(mix$lambda[1],mix$mu,mix$sigma))
[1] 0.4002202 178.4997298 165.2703616 6.3561363 5.9460023
```
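A quick sanity check (again, an addition, not in the original post): simulating from the fitted mixture should reproduce the shape of the data. Note that normalmixEM labels its components arbitrarily; here the first component, with weight about 0.40, is the taller group. The parameters below are hard-coded from the output above:

```r
# Simulate from the fitted two-component Gaussian mixture.
p  <- 0.4002202
mu <- c(178.4997298, 165.2703616)
sg <- c(6.3561363, 5.9460023)
set.seed(2)
k  <- sample(1:2, 1000, replace=TRUE, prob=c(p, 1-p))  # latent type
Xs <- rnorm(1000, mu[k], sg[k])                        # simulated heights
mean(Xs)   # close to the mixture mean p*mu[1]+(1-p)*mu[2], about 170.6
```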

If we plot that mixture of two Gaussian distributions, we get:

```
> f2 <- function(x){ param12[1]*dnorm(x,param12[2],param12[4])+
+ (1-param12[1])*dnorm(x,param12[3],param12[5]) }
> lines(x,f2(x),lwd=2, col="red")
> lines(density(X))
```

Not bad. Actually, we can try to maximize the likelihood with our own code:

```
> logdf <- function(x,parameter){
+ p  <- parameter[1]
+ m1 <- parameter[2]
+ m2 <- parameter[3]
+ s1 <- parameter[4]
+ s2 <- parameter[5]
+ return(log(p*dnorm(x,m1,s1)+(1-p)*dnorm(x,m2,s2)))
+ }
> logL <- function(parameter) -sum(logdf(X,parameter))
> Amat <- matrix(c(1,-1,0,0, 0,0,0,0, 0,0,0,0,
+ 0,0,1,0, 0,0,0,1), 4, 5)
> bvec <- c(0,-1,0,0)
> constrOptim(c(.5,160,180,10,10), logL, NULL, ui = Amat, ci = bvec)$par

[1]   0.5996263 165.2690084 178.4991624   5.9447675   6.3564746
```

Here, we include some constraints, to ensure that the probability belongs to the unit interval and that the standard deviation parameters remain positive. Note that the output is close to the previous one, with the two components swapped.
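To make the constraint set concrete (a small check added here, not in the original post), one can build a constraint matrix encoding, for $\theta=(p,\mu_1,\mu_2,\sigma_1,\sigma_2)$, the inequalities $p\geq 0$, $p\leq 1$, $\sigma_1\geq 0$ and $\sigma_2\geq 0$, and verify that the fitted parameters are feasible:

```r
# Spell out the linear constraints Amat %*% theta >= bvec,
# with theta = (p, m1, m2, s1, s2). Rows: p >= 0, -p >= -1
# (i.e. p <= 1), s1 >= 0, s2 >= 0. Matrix is filled column by column.
Amat  <- matrix(c(1,-1,0,0,  0,0,0,0,  0,0,0,0,
                  0,0,1,0,  0,0,0,1), 4, 5)
bvec  <- c(0,-1,0,0)
theta <- c(0.5996263, 165.2690084, 178.4991624, 5.9447675, 6.3564746)
all(Amat %*% theta >= bvec)   # TRUE: all four constraints hold
```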

Published at DZone with permission of Arthur Charpentier, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)

"Starting from scratch" is seductive but disease ridden