This afternoon, we’ve seen in the training on data science that it was possible to use AIC criteria for model selection.

> library(splines) > AIC(glm(dist ~ speed, data=train_cars, family=poisson(link="log"))) [1] 438.6314 > AIC(glm(dist ~ speed, data=train_cars, family=poisson(link="identity"))) [1] 436.3997 > AIC(glm(dist ~ bs(speed), data=train_cars, family=poisson(link="log"))) [1] 425.6434 > AIC(glm(dist ~ bs(speed), data=train_cars, family=poisson(link="identity"))) [1] 428.7195

And I’ve been asked why we don’t use a training sample to fit a model, and then use a validation sample to compare predictive properties of those models, penalizing by the complexity of the model. But it turns out that it is difficult to compute the AIC of those models on a different dataset. I mean, it is possible to write down the likelihood (since we have a Poisson model) but I want a code that could work for any model, any distribution….

Hopefully, Heather suggested a very clever idea, using her package

```
```@freakonometrics @pihive you could use gnm with `data = dat2, constrain = “*”, constrainTo = coef(mod1)’ then use AIC on result

— Heather Turner (@HeathrTurnr) 29 Juillet 2015

And actually, it works well.

Create here two datasets, one for the training, and one for validation

> set.seed(1) > idx = sample(1:50,size=10,replace=FALSE) > train_cars = cars[-idx,] > valid_cars = cars[idx,]

then use simply

> library(gnm) > reg1 = gnm(dist ~ speed, data=train_cars, family=poisson(link="log")) > reg2 = gnm(dist ~ speed, data=valid_cars, contrain = "*", contrainTo = coef(reg1),family=poisson(link="log"))

Here Akaike criteria on the validation sample is

> AIC(reg2) [1] 82.57612

Let us keep tracks of a prediction to plot it later on

> v_log=predict(reg1,newdata= data.frame(speed=u),type="response")

We can challenge that Poisson model with a log link with a Poisson with a linear link function

> reg1 = gnm(dist ~ speed, data=train_cars, family=poisson(link="identity")) > reg2 = gnm(dist ~ speed, data=valid_cars, contrain = "*", contrainTo = coef(reg1),family=poisson(link="identity")) > AIC(reg2) [1] 73.24745 > v_id=predict(reg1,newdata=data.frame(speed=u), type="response")

We can also try to include splines, either with the log link

> library(splines) > reg1 = gnm(dist ~ bs(speed), data=train_cars, family=poisson(link="log")) > reg2 = gnm(dist ~ speed, data=valid_cars, contrain = "*", contrainTo = coef(reg1),family=poisson(link="log")) > AIC(reg2) [1] 82.57612 > v_log_bs=predict(reg1,newdata= data.frame(speed=u),type="response")

or with the identity link (but always in the same family)

> reg1 = gnm(dist ~ bs(speed), data=train_cars, family=poisson(link="identity")) > reg2 = gnm(dist ~ speed, data=valid_cars, contrain = "*", contrainTo = coef(reg1),family=poisson(link="identity")) > AIC(reg2) [1] 73.24745 > v_id_bs=predict(reg1,newdata= data.frame(speed=u),type="response")

If we plot the predictions, we get

> plot(cars) > points(train_cars,pch=19,cex=.85,col="grey") > lines(u,v_log,col="blue") > lines(u,v_id,col="red") > lines(u,v_log_bs,col="blue",lty=2) > lines(u,v_id_bs,col="red",lty=2)

Now, the problem with this holdout technique is that we might get unlucky (or lucky) when creating the samples. So why not try some monte carlo study, where many samples are generated,

> four_aic=function(i){ + idx = sample(1:50,size=10,replace=FALSE) + train_cars = cars[-idx,] + valid_cars = cars[idx,] + reg1 = gnm(dist ~ speed, data=train_cars, family=poisson(link="log")) + reg2 = gnm(dist ~ speed, data=valid_cars, contrain = "*", contrainTo = coef(reg1),family=poisson(link="log")) + a1=AIC(reg2) + reg0 = lm(dist ~ speed, data=train_cars) + reg1 = gnm(dist ~ speed, data=train_cars, family=poisson(link="identity"), start=c(1,1)) + reg2 = gnm(dist ~ speed, data=valid_cars, contrain = "*", contrainTo = coef(reg1),family=poisson(link="identity"), start=c(1,1)) + a2=AIC(reg2) + reg1 = gnm(dist ~ bs(speed), data=train_cars, family=poisson(link="log")) + reg2 = gnm(dist ~ bs(speed), data=valid_cars, contrain = "*", contrainTo = coef(reg1),family=poisson(link="log")) + a3=AIC(reg2) + reg1 = gnm(dist ~ bs(speed), data=train_cars, family=poisson(link="identity")) + reg2 = gnm(dist ~ bs(speed), data=valid_cars, contrain = "*", contrainTo = coef(reg1),family=poisson(link="identity")) + a4=AIC(reg2) + return(c(a1,a2,a3,a4))}

Consider for instance 1,000 scenarios

> S = sapply(1:1000,four_aic)

The model that has, on average, the lowest AIC on a validation sample was the log-link with splines

> rownames(S) = c("log","id","log+bs","id+bs") > W = apply(S,2,which.min) > barplot(table(W)/10,names.arg=rownames(S))

And indeed,

> boxplot(t(S))

with that model, the AIC is usually lower with the spline model with a log link than the other one (or at least almost the same as the spline model with an identity link). Or at least, we can confirm here that a nonlinear model should be better than a nonlinear one.

And I ve been asked why we don t use a training sample to fit a model, and then use a validation sample to compare predictive properties of those models, penalizing by the complexity of the model. But it turns out that it is difficult to compute the AIC of those models on a different dataset.

Hi Arthur,

while you show that gnm can compute a value labelled AIC, you haven’t said what it actually does. I would assume that it computes the likelihood of the test data (which is simple, since the prediction (at the response scale) is the lambda of the Poisson likelihood, and then uses the rank of the fitted model as plugin for the number of parameters:

-2*dpois(testObs, lambda=testPreds) + 2*rank(fittedmodel)

If so, I wonder what the statistical rational behind this would be. Since cross-validation (specifically: leave-one-out CV) is asymptotically equivalent to AIC, we do not need a correction for model complexity: it is automatically corrected for by having lower log-likelihood on the test data. (I would even say: putting in the rank of the model overcorrects the performance on the test data if you use a repeated test/train split as you show in your code.)

Or, more plainly: Why bother with AIC (as approximated measure of performance on new data), if you can get the performance on a test data set directly from repeated validation?

thanks Carsten for the comment,

yes, we use the rank of the fitted model as plugin for the number of parameters

my experience is that leave-one-out cross validation could be slow, while just computing the loglikelihood (and the AIC criteria) at some different point should be easily computed. And it was quite a challenge to observe that it was difficult to compute.

What do you mean by “it was difficult to compute”? Are you talking about the technical computation (in R, forcing you to resort to gnm) or the statistical theory? I mean, the rank is contained in the model object in the case of a GLM (…$rank) , and the log-likelihood (for Bernoulli and Poisson as single-parameter distributions) is also obvious (see my comment; for normal, neg.bin. and alike one has to find a plug-in estimator for the variance terms).

I agree that LOOCV is costly, but then you had 1000 runs for the repeated AIC computation, which is costly, too. My (limited) experience with 1000 runs of a 75/25 train/test-split is that it comes close to a LOOCV, at the same expense as your example.

(I am struggling with a very similar problem to your post and thus am interested in your approach and point of view! I dislike that there is quite a lot of ad-hoc treatment in what I am doing, so I hope for at least a convergence of this ad-hockery.)

sorry, difficult to find a simple and efficient R function

I concur. Doing the AIC calculation ‘by hand’ is a nice demonstration, but I do not see much the point of applying it for cross validation.

Hey Guys, looking for some help. I have an insurance dataset with exposure, claims number and a few categorical predictors. Looking to predict claim frequency so setting it us as a poisson regression problem with offset = log (exposure). I’d like to compare the performance of Generalized Linear Model (using glm) and Poisson Regression with Lasoo (glmnet). I have split into train and test. I have used train to arrive at a final model, and have generated predicted values of claim frequency for the test set. I’m now looking to compute a test statistic which I can use to pick the winner. I was first thinking of deviance D=2∑ni=1{Yilog(Yi/μi)−(Yi−μi)} . I can’t get my head around what happens when Yi=0- I will get an error. Should I just use the likelihood instead? i.e. ∑{(Yilog(ui)-ui}

Hi Tony

I hope it can wait… I just landed from an intense work in Singapore, and about to leave for holidays with the family. But I will try to get back to your point !

Sorry