
Validating the model 
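
Before diving into the validation numbers, here is a minimal sketch of how the objects used in this section (training, Validation, and the fitted model modfit) might have been created earlier in the article. The 50/50 split, the seed, and the caret::train() call are assumptions made for illustration; the original setup may differ.

library(caret)

set.seed(123)                                    # assumed seed, purely for reproducibility
inTrain    <- createDataPartition(iris$Species, p = 0.5, list = FALSE)
training   <- iris[inTrain, ]                    # 75 observations
Validation <- iris[-inTrain, ]                   # 75 observations

# Fit a CART model predicting Species from the other four measurements
modfit <- train(Species ~ ., method = "rpart", data = training)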

Now, we need to check the predictive power of the CART model we just built. Here, we look at the misclassification rate (the proportion of observations the tree classifies incorrectly) as the decision criterion. We use the following code to do this:

# Predict species on the training set and cross-tabulate against the actual classes
train.cart <- predict(modfit, newdata = training)
table(train.cart, training$Species)
train.cart   setosa versicolor virginica
  setosa         25          0         0
  versicolor      0         22         0
  virginica       0          3        25

# Misclassification rate = 3/75
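
Rather than counting the off-diagonal cells by hand, we can compute the rate directly from the confusion matrix. The helper below (misclass_rate is just an illustrative name, not part of any package) is one minimal way to do it:

# Misclassification rate = off-diagonal count / total observations
misclass_rate <- function(tab) 1 - sum(diag(tab)) / sum(tab)

misclass_rate(table(train.cart, training$Species))   # 3/75 = 0.04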

Only 3 misclassified observations out of 75 signifies good predictive power. In general, a model with a misclassification rate below 30% is considered a good model, but what counts as "good" depends on the industry and the nature of the problem.

Once we have built the model, we validate it on a separate data set. This is done to make sure that we are not overfitting the model. If we have overfit the model, validation will show a sharp decline in predictive power. It is also recommended to do an out-of-time validation, to make sure the model is not time dependent. For instance, a model built during a festive season might not hold in regular times. For simplicity, we will only do an in-time validation here. We use the following code for the in-time validation:

# Predict species on the validation set and cross-tabulate against the actual classes
pred.cart <- predict(modfit, newdata = Validation)
table(pred.cart, Validation$Species)
pred.cart    setosa versicolor virginica
  setosa         25          0         0
  versicolor      0         22         1
  virginica       0          3        24

# Misclassification rate = 4/75

As we see from the above calculations, the predictive power decreased slightly in validation compared to training. This is true in most cases, because the model is fitted to the training data set and merely applied to the validation data set. But it hardly matters whether the predictive power on validation is slightly lower or higher than on training; what we need to check is that the two are close enough. In this case the misclassification rates (3/75 versus 4/75) are indeed very close to each other, so we have a stable CART model in this case study.
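
Using the same illustrative helper defined above, the comparison can be made explicit:

train_rate <- misclass_rate(table(train.cart, training$Species))    # 3/75 = 0.040
valid_rate <- misclass_rate(table(pred.cart,  Validation$Species))  # 4/75 ≈ 0.053

# A small gap between the two rates is what tells us the model is not overfitting
round(c(training = train_rate, validation = valid_rate), 3)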

Let’s now try to visualize the cases for which the prediction went wrong. Following is the code we use:

library(ggplot2)   # qplot() comes from ggplot2

# Flag which validation observations were predicted correctly
correct <- pred.cart == Validation$Species

qplot(Petal.Length, Petal.Width, colour = correct, data = Validation)

As you can see from the attached graph, the predictions that went wrong were the borderline cases. We have already discussed that these are the cases which make or break the comparison between models: most models can separate observations that sit far apart, but it takes a sharp model to distinguish the borderline cases.
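
If you also want to inspect those borderline observations directly rather than just plotting them, one simple way (using the correct flag computed above) is to subset the validation set:

# The handful of misclassified observations, all near the versicolor/virginica boundary
Validation[!correct, c("Petal.Length", "Petal.Width", "Species")]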
