Sam Stites

BIC, DIC, and CV

October 9, 2017

Comparison of BIC, DIC, and CV:


Information criteria such as AIC, BIC, and DIC aim to provide a score that rewards high likelihood while penalizing model complexity. A model selected this way should both fit the data well and generalize beyond it. One of BIC’s advantages is that it prefers simpler models than its predecessor, AIC. BIC also has solid theoretical backing: it approximates the marginal likelihood of the data given the model, so choosing the best BIC score amounts to choosing the model with the highest posterior probability when the candidates are given equal prior weight. That said, BIC assumes that one of the candidate models is true and that you are trying to find the model most likely to be true in the Bayesian sense (Wasserman lecture notes). If the true model is not among the candidates, if the real evaluation is non-Bayesian, or if high model complexity is permissible, then BIC is not a good measure.
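For reference, the usual textbook forms of the two likelihood-based criteria can be written as follows. The symbols are assumptions introduced here, not notation from the cited notes: \hat{L} is the maximized likelihood, p the number of free parameters, and N the number of data points. Under this convention, lower scores are better.

```latex
\mathrm{AIC} = -2 \ln \hat{L} + 2p
\qquad
\mathrm{BIC} = -2 \ln \hat{L} + p \ln N
```

The first term rewards fit and the second penalizes complexity. The only difference between the two is the per-parameter cost: 2 for AIC and \ln N for BIC.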

How does BIC deal with overfitting?

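BIC deals with overfitting by penalizing models that have high complexity. Models that fit the training data very well only by being highly complex are likely overfit and will not generalize beyond the data they were trained on. BIC counters this in the same way AIC does, but with a more stringent penalty: each free parameter costs log N (where N is the number of data points) rather than AIC’s constant 2, so the penalty grows with the size of the dataset.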
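A minimal sketch of how the stricter penalty can change which model is selected. The log-likelihoods, parameter counts, and dataset size below are made-up illustrative numbers, and the function names are hypothetical. Lower scores are better.

```python
import math


def aic(log_likelihood, n_params):
    """AIC: each free parameter costs a constant 2."""
    return -2.0 * log_likelihood + 2.0 * n_params


def bic(log_likelihood, n_params, n_points):
    """BIC: each free parameter costs log(n_points)."""
    return -2.0 * log_likelihood + n_params * math.log(n_points)


# Hypothetical candidates: name -> (log-likelihood, number of free parameters).
# The larger model fits slightly better but uses twice as many parameters.
candidates = {"small": (-1050.0, 10), "large": (-1038.0, 20)}
n_points = 500  # assumed dataset size

best_by_aic = min(candidates, key=lambda name: aic(*candidates[name]))
best_by_bic = min(candidates, key=lambda name: bic(*candidates[name], n_points))

print("AIC picks:", best_by_aic)
print("BIC picks:", best_by_bic)
```

With these numbers the small gain in likelihood is enough to pay AIC's penalty for ten extra parameters, but not BIC's.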

Cross Validation:

Cross validation is useful in that, while both DIC and BIC are attempts at estimating how well a selector’s score will generalize, cross validation measures the selector score directly on held-out observed data. While it makes no claim about how the selector will perform on all future data, if our dataset contains everything in our selector’s world, it will yield the most accurate representation and the best predictions (Wasserman lecture notes). That said, cross validation is expensive to compute and is the slowest of the three selectors, since it requires training and scoring the model once for each fold. This cost makes runs very slow, and it may not be practical for large datasets.

How does CV deal with overfitting? Does it need more data, or can it work with less?

CV deals with overfitting by evaluating each model on data it was not trained on, using as many folds as are passed in as its hyperparameter. Any model that overfits its training folds is exposed by rotating the hold-out set. A k-fold strategy can work with less data, since it trains k times on the same dataset with a different hold-out set each time, so every observation is used for both training and validation.
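The rotation of the hold-out set can be sketched in plain Python as follows. The helper names `k_fold_indices` and `cv_score` are hypothetical, and the `fit` and `score` callables stand in for whatever model is being selected (for example, training a model with a given number of states and returning its held-out log-likelihood).

```python
def k_fold_indices(n_items, k):
    """Split range(n_items) into k contiguous folds whose sizes differ by at most one."""
    base, extra = divmod(n_items, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cv_score(data, k, fit, score):
    """Average held-out score over k folds; each item is held out exactly once."""
    total = 0.0
    for held_out in k_fold_indices(len(data), k):
        held = set(held_out)
        train = [x for i, x in enumerate(data) if i not in held]
        test = [data[i] for i in held_out]
        model = fit(train)            # train only on the remaining folds
        total += score(model, test)   # evaluate only on the held-out fold
    return total / k


# Toy usage: the "model" is just the training mean, scored by negative squared error.
def fit_mean(train):
    return sum(train) / len(train)


def neg_squared_error(mean, test):
    return -sum((x - mean) ** 2 for x in test)


result = cv_score([1.0, 2.0, 3.0, 4.0], 2, fit_mean, neg_squared_error)
print(result)
```

The model is refit from scratch for every fold, which is where the computational cost of cross validation comes from: k folds mean k training runs per candidate model.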


As discussed in “A Model Selection Criterion for Classification: Application to HMM Topology Optimization” (link), DIC outperformed BIC in its inaugural study, achieving a relative error-rate reduction of more than 18%. Furthermore, it seems to do well in a number of empirical cases. DIC is a more complex information criterion which has a more sophisticated means of finding the effective number of parameters. It uses a discriminative principle where the goal is to select the model less likely to have generated data belonging to the competing classification categories (link). That said, DIC has some theoretical limitations: its penalty term is not invariant to reparameterization, it lacks consistency, and it isn’t based on a proper predictive criterion (these criticisms were raised against the deviance information criterion, which shares the acronym with the discriminative criterion described above). In “The deviance information criterion: 12 years on” (link), Spiegelhalter and his co-authors acknowledged these problems.
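The discriminative principle can be sketched with a commonly used simplified form of the score: the log-likelihood a model assigns to its own class's data, minus the average log-likelihood it assigns to the data of the competing classes. This is an assumed simplification for illustration, not necessarily the exact formulation in the paper. The function name and the numbers are hypothetical. Higher scores are better.

```python
def discriminative_ic(log_likelihoods, target):
    """Simplified discriminative score for the model trained on class `target`.

    log_likelihoods maps each class label to the log-likelihood that this one
    model assigns to that class's data. With M classes, the competing term is
    averaged over the M - 1 other classes.
    """
    others = [ll for label, ll in log_likelihoods.items() if label != target]
    return log_likelihoods[target] - sum(others) / len(others)


# Hypothetical log-likelihoods from a single model trained on class "A".
scores = {"A": -100.0, "B": -300.0, "C": -500.0}
dic_a = discriminative_ic(scores, "A")
print(dic_a)
```

A model scores well here when it explains its own class and explains the competing classes poorly. Nothing in this form counts parameters directly.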

Does the DIC do better with more states or worse?

DIC does better than BIC with more states, in theory because it doesn’t operate under an Occam’s razor principle (link).