The objectives of this section are:
to explore training and testing errors and how they are influenced by the complexity of the tree.
to present the balancing act needed for an optimal decision tree.
to explain the concept of overfitting and underfitting.
to introduce the measures used to estimate generalization errors.
to explore how overfitting is handled in decision tree induction algorithms.
Outcomes
By the time you have completed this section you will be able to:
define training errors, testing errors, overfitting and underfitting.
explain the balancing act needed to avoid either extreme.
give a brief synopsis of the measures used to estimate generalization errors.
explain how overfitting is handled in decision tree induction algorithms.
Let’s take a break and try to remember everything that has been covered up to this point. Classification is the ability to group records into predefined classes based on their attributes; this was the main point of Section 1. Section 2 introduced the importance of the Decision Tree Classifier and presented Occam’s Razor, which helps us decide which of several candidate decision trees is preferable. In the previous section, Section 3, we moved a little deeper into the world of data mining by providing you with the tools needed to construct an efficient decision tree.
So at this point you should be able to create a decision tree and use attribute selection measures to choose the attribute on which each node should be split. If for some reason all of this sounds like Latin (or, if you happen to speak Latin, like gibberish), please revisit the material and the exercises provided at the end of each section, because it is about to get a bit more complex.
This section deals with errors and the concepts of overfitting and underfitting. Life never goes quite the way we expect, so we have to adjust our plans from time to time, and the field of computer science is not exempt from these much-needed fixes.
There are two main types of errors that we can encounter when dealing with classification models: training errors and generalization errors. Training errors are misclassification errors committed on the training records, whereas generalization errors are the errors the newly built model makes on test or previously unseen records.
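To make the distinction concrete, here is a minimal Java sketch (not tied to any particular library) that computes the misclassification rate of a set of predictions. Applying it to the training records gives the training error, while applying it to previously unseen test records gives an estimate of the generalization error. The class name, method name and label arrays below are purely illustrative assumptions.

    public class ErrorRates {

        // Fraction of records whose predicted label differs from the true label.
        public static double misclassificationRate(int[] actual, int[] predicted) {
            if (actual.length == 0 || actual.length != predicted.length) {
                throw new IllegalArgumentException("Label arrays must be non-empty and of equal length");
            }
            int errors = 0;
            for (int i = 0; i < actual.length; i++) {
                if (actual[i] != predicted[i]) {
                    errors++;
                }
            }
            return (double) errors / actual.length;
        }

        public static void main(String[] args) {
            // Hypothetical labels: 1 = class "Yes", 0 = class "No".
            int[] trainActual    = {1, 0, 1, 1, 0, 0, 1, 0};
            int[] trainPredicted = {1, 0, 1, 1, 0, 0, 1, 0};  // the tree fits its training data perfectly
            int[] testActual     = {1, 0, 0, 1, 1, 0};
            int[] testPredicted  = {1, 1, 0, 0, 1, 0};        // two mistakes on unseen records

            System.out.println("Training error:       " + misclassificationRate(trainActual, trainPredicted));
            System.out.println("Generalization error: " + misclassificationRate(testActual, testPredicted));
        }
    }

Notice that a training error of zero says nothing, by itself, about how the model behaves on records it has never seen; that gap is exactly what the next paragraphs explore.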
It would be wonderful if we lived in an ideal world, but we don’t, so both of the aforementioned types of error exist in the models we create. In an attempt to reduce training errors you could build a decision tree with a large number of nodes and end up with a model that fits the training data too well; this in turn produces a high generalization error, a situation known as Model Overfitting. Overfitting can be caused by the presence of noise or by a lack of representative samples in the training data set. There is a lot of debate on this subject, but most experts agree that the complexity of a model has an impact on model overfitting.
The other extreme is a tree with only a small number of nodes. Because such a model has yet to learn the true structure of the data, it performs poorly even on the training records, and its generalization error is high as well; this is Model Underfitting. We must find a way to strike a balance between these two extremes.
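One common way to strike that balance, sketched below under assumed inputs, is to grow candidate trees of increasing size and keep the one with the lowest error on a held-out validation set, rather than the one with the lowest training error. The leaf counts and error values in the arrays are hypothetical placeholders standing in for measurements you would obtain from your own trees.

    public class ComplexitySelection {

        // Returns the index of the candidate tree with the smallest validation error.
        // Very small trees tend to show high training AND validation error (underfitting),
        // while very large trees show low training error but a rising validation error (overfitting).
        public static int bestCandidate(double[] validationErrors) {
            int best = 0;
            for (int i = 1; i < validationErrors.length; i++) {
                if (validationErrors[i] < validationErrors[best]) {
                    best = i;
                }
            }
            return best;
        }

        public static void main(String[] args) {
            // Hypothetical results for candidate trees of increasing complexity.
            int[]    leafCounts       = {2, 5, 10, 20, 40};
            double[] trainingErrors   = {0.30, 0.18, 0.10, 0.04, 0.00};
            double[] validationErrors = {0.32, 0.20, 0.15, 0.17, 0.24};

            int best = bestCandidate(validationErrors);
            System.out.println("Chosen tree: " + leafCounts[best] + " leaves"
                    + ", training error " + trainingErrors[best]
                    + ", validation error " + validationErrors[best]);
        }
    }

The point of the sketch is not the numbers themselves but the pattern: the tree chosen is neither the smallest nor the largest, which is exactly the balancing act this section goes on to formalize with proper estimates of the generalization error.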