Big Data/Analytics Zone is brought to you in partnership with:

Arthur Charpentier, ENSAE, PhD in Mathematics (KU Leuven), Fellow of the French Institute of Actuaries, professor at UQàM in Actuarial Science. Former professor-assistant at ENSAE Paritech, associate professor at Ecole Polytechnique and professor assistant in economics at Université de Rennes 1. Arthur is a DZone MVB and is not an employee of DZone and has posted 156 posts at DZone. You can read more from them at their website. View Full User Profile

Regression on Categorical Variables

02.01.2013
| 2047 views |

This morning, Stéphane asked me tricky question about extracting coefficients from a regression with categorical explanatory variates. More precisely, he asked me if it was possible to store the coefficients in a nice table, with information on the variable and the modality (those two information being in two different columns). Here is some code I did to produce the table he was looking for, but I guess that some (much) smarter techniques can be used (comments – see below – are open). Consider the following dataset

> base
x sex   hair
1  1   H  Black
2  4   F  Brown
3  6   F  Black
4  6   H  Black
5 10   H  Brown
6  5   H Blonde

with two factors,

> levels(base$hair) [1] "Black" "Blonde" "Brown" > levels(base$sex)
[1] "F" "H"

Let us run a (standard linear) regression,

> reg=lm(x~hair+sex,data=base)

which is here

> summary(reg)

Call:
lm(formula = x ~ hair + sex, data = base)

Residuals:
1          2          3          4          5          6
-3.714e+00 -2.429e+00  2.429e+00  1.286e+00  2.429e+00 -2.220e-16

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.5714     3.4405   1.038    0.408
hairBlonde    0.2857     4.8655   0.059    0.959
hairBrown     2.8571     3.7688   0.758    0.528
sexH          1.1429     3.7688   0.303    0.790

Residual standard error: 4.071 on 2 degrees of freedom
Multiple R-squared: 0.2352,	Adjusted R-squared: -0.9121
F-statistic: 0.205 on 3 and 2 DF,  p-value: 0.886

If we want to extract the names of the factors (assuming here that there are no numbers in the name of the factor), and the values of the associated modality, one can use

> VARIABLE=c("",gsub("[-^0-9]", "", names(unlist(reg$xlevels)))) > MODALITY=c("",as.character(unlist(reg$xlevels)))
> names=data.frame(VARIABLE,MODALITY,NOMVAR=c(
+ "(Intercept)",paste(VARIABLE,MODALITE,sep="")[-1]))
> regression=data.frame(NOMVAR=names(coefficients(reg)),
+ COEF=as.numeric(coefficients(reg)))
> merge(names,regression,all.x=TRUE)
NOMVAR VARIABLE MODALITE      COEF
1 (Intercept)                   3.5714286
2   hairBlack     hair    Black        NA
3  hairBlonde     hair   Blonde 0.2857143
4   hairBrown     hair    Brown 2.8571429
5        sexF      sex        F        NA
6        sexH      sex        H 1.1428571

or, if we want modalities exluding references,

> merge(names,regression)
NOMVAR VARIABLE MODALITE      COEF
1 (Intercept)                   3.5714286
2  hairBlonde     hair   Blonde 0.2857143
3   hairBrown     hair    Brown 2.8571429
4        sexH      sex        H 1.1428571

In order to reproduce the table Stéphane sent me, let us use the following code to produce an html table,

> library(xtable)
> htlmtable <- xtable(merge(names,regression))
> print(htlmtable,type="html")
NOMVARVARIABLEMODALITYCOEF
1(Intercept)3.57
2hairBlondehairBlonde0.29
3hairBrownhairBrown2.86
4sexHsexH1.14

So yes, it is possible to build a table with the variable, modalities, and coefficients. This function can be interesting on prospective mortality, when we do have a large number of modalities per factor (years, ages and year of birth). Consider the following datasets

> DEATH=read.table(
+ "http://freakonometrics.free.fr/DeathsSwitzerland.txt",
+ "http://freakonometrics.free.fr/ExposuresSwitzerland.txt",
> DEATH$Age=as.numeric(as.character(DEATH$Age))
> DEATH=DEATH[-which(is.na(DEATH$Age)),] > EXPOSURE$Age=as.numeric(as.character(EXPOSURE$Age)) > EXPOSURE=EXPOSURE[-which(is.na(EXPOSURE$Age)),]
> base=data.frame(y=as.factor(DEATH$Year),a=as.factor(DEATH$Age),
+ c=as.factor(DEATH$Year-DEATH$Age),D=DEATH$Total,E= EXPOSURE$Total)
> base=base[base$E>0,] and the following nonlinear model, based on Lee-Carter model (including a cohort effect), $N_{x,t}\sim\mathcal{P}(E_{x,t}\cdot \exp[\alpha_x+\beta_x \kappa_t + \gamma_x \delta_{t-x}>)$ can be estimated using > library(gnm) > reg=gnm(D~a+Mult(a,y)+Mult(a,c),offset=log(E),family=poisson,data=base) In order to extract the 671 coefficients from the regresssion, > length(coefficients(reg)) [1] 671 (as properly as possible) we have to be careful: names of coefficients are not that simple to handle. For instance, we can see things like > coefficients(reg)[200] Mult(., year).age98 0.04203519 In order to extract them, define > na=length((reg$xlevels)$age) > ny=length((reg$xlevels)$year) > nc=length((reg$xlevels)$cohort) > VARIABLElong=c("",rep("age",na),rep("Mult(., year).age",na), + rep("Mult(a, .).y",ny), + rep("Mult(., cohort).age",na),rep("Mult(age, .).cohort",nc)) > VARIABLEshort=c("",rep("age",na),rep("age",na),rep("year",ny), + rep("age",na),rep("cohort",nc)) > MODALITY=c("",(reg$xlevels)$age,(reg$xlevels)$age, + (reg$xlevels)$year,(reg$xlevels)$age,(reg$xlevels)$cohort) > names=data.frame(VARIABLElong,VARIABLEshort, + MODALITY,NOMVAR=c("(Intercept)",paste(VARIABLElong,MODALITY,sep="")[-1])) > regression=data.frame(NOMVAR=names(coefficients(reg)), + COEF=as.numeric(coefficients(reg))) Here we go, now we have the coefficients from the regression in a nice table, > outputreg=merge(names,regression) > outputreg[1:10,] NOMVAR VARIABLElong VARIABLEshort MODALITY COEF 1 (Intercept) -8.22225458 2 age1 age age 1 -0.87495451 3 age10 age age 10 -1.67145704 4 age100 age age 100 4.91041650 5 age11 age age 11 -1.00186990 6 age12 age age 12 -1.05953497 7 age13 age age 13 -0.90952859 8 age14 age age 14 0.02880668 9 age15 age age 15 0.42830738 10 age16 age age 16 1.35961403 It is now possible to plot all the coefficients, as functions of the age, the year of observation, or the year of birth. For instance, for the standard average age effect (namely $\alpha_x$ as a function of $x$), we can use > typevariable=as.character(unique(outputreg$VARIABLElong))
> basegraph=outputreg[outputreg$VARIABLElong==typevariable[2],] > x=as.numeric(as.character(basegraph$MODALITY))
> y=basegraph$COEF > plot(x,y,type="p",col="blue",xlab="Age") while the cohort effect ($\delta_t$ as a function of $t$) is obtained using > basegraph=outputreg[outputreg$VARIABLElong==typevariable[5],]
> x=as.numeric(as.character(basegraph$MODALITY)) > y=basegraph$COEF
> plot(x,y,type="p",col="blue",xlab="Cohort (year of birth)",ylim=c(0,10))

Published at DZone with permission of Arthur Charpentier, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)

"Starting from scratch" is seductive but disease ridden
-Pithy Advice for Programmers