Decision Tree in Python
The decision tree is a popular supervised learning algorithm that can be used for both regression and classification. Its pros and cons, copied from Machine Learning with Swift, can be summarized as follows:
Pros:
- Easy to visualize, understand and interpret.
- Can work with numerical and categorical features.
- Requires little data preprocessing (no need for one-hot encoding, dummy variables, etc.)
- Non-parametric model: no assumptions about the shape of data.
- Fast for inference.
- Feature selection happens automatically: unimportant features will not influence the result. The presence of multicollinearity also doesn’t affect the quality.
Cons:
- Tends to overfit.
- Unstable: small changes in data can dramatically affect the structure of the tree and hence influence the final prediction.
- Finding the globally optimal decision tree is NP-complete. That’s why we use different heuristics and greedy search.
- Inflexible, in the sense that you can’t incorporate new data into them easily. If you obtain new labeled data, you must retrain the tree from scratch on the whole dataset.
The full code is here
Data Import
We are going to use the cardiotocography data set. After downloading the data file, we use pandas’ read_excel() method to import the data into a DataFrame. We set the header parameter’s value to 0 so that the first row of the sheet is used for the column names. Since the raw data is located in the third sheet of the Excel workbook (sheets are 0-indexed), we set the sheet_name parameter’s value to 2. Since the last three rows at the bottom of the worksheet are irrelevant, we set the skipfooter parameter’s value to 3.
Data Preprocessing
Let’s split our data into training and test set using sklearn’s train_test_split() method.
The parameter test_size is set to 0.2, meaning the test set will be 20% of the whole dataset and the training set the remaining 80%. The random_state parameter seeds the pseudo-random number generator used for the sampling; if you want to replicate our results, use the same value of random_state.
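A sketch of the split, using a random matrix as a stand-in for the CTG features and labels (the variable names X and y are assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and labels.
rng = np.random.default_rng(0)
X = rng.random((100, 5))
y = rng.integers(1, 4, size=100)

# 80/20 split; fixing random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=100)
print(X_train.shape, X_test.shape)  # (80, 5) (20, 5)
```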
Decision Tree Training
Using Gini Index as criterion
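A minimal sketch of training with the Gini criterion. The iris data stands in for the CTG set here, and the hyperparameters (max_depth, min_samples_leaf) are illustrative assumptions, not values from the original tutorial:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Iris stands in for the cardiotocography data in this sketch.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=100)

# criterion="gini" selects the Gini impurity to score candidate splits.
clf_gini = DecisionTreeClassifier(criterion="gini", random_state=100,
                                  max_depth=3, min_samples_leaf=5)
clf_gini.fit(X_train, y_train)
print(clf_gini.get_depth())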
Using Information Gain as criterion
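Information gain corresponds to scikit-learn’s criterion="entropy"; everything else is identical to the Gini version. Again, iris and the hyperparameters are illustrative stand-ins:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=100)

# criterion="entropy" scores splits by information gain.
clf_entropy = DecisionTreeClassifier(criterion="entropy", random_state=100,
                                     max_depth=3, min_samples_leaf=5)
clf_entropy.fit(X_train, y_train)
print(clf_entropy.get_depth())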
Calculating Prediction Accuracy Score
The function accuracy_score() will be used to print the accuracy of the decision tree. Accuracy is the ratio of correctly predicted data points to all predicted data points, and it helps us gauge the effectiveness of the algorithm.
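A sketch of the evaluation step, again on iris as a stand-in for the CTG data (so the printed number will differ from the tutorial’s output):

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=100)

clf = DecisionTreeClassifier(criterion="gini", random_state=100)
clf.fit(X_train, y_train)

# accuracy_score compares predicted labels to the true test labels.
y_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred) * 100
print("Accuracy using gini index is", acc)
```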
Output:
Accuracy using gini index is 88.02816901408451
Accuracy using information gain is 87.32394366197182
References:
Building Decision Tree Algorithm in Python with Scikit Learn, by Rahul Saxena