
Decision tree

A decision tree is a way to represent an algorithm that contains only conditional control statements.

Building the decision tree

We build a decision tree with a divide-and-conquer algorithm: we split a node into several sub-nodes, and then keep splitting those sub-nodes in the same way, until each node contains samples from only one class.

When splitting, the rule of thumb is to split the node in the way that maximizes the “purity” of the resulting sub-nodes. That is to say, each sub-node should contain samples mostly from a single class.
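
As a rough illustration, here is a minimal sketch of that recursive procedure in R. The helper `choose_split(D, attrs)` is a hypothetical placeholder for the splitting criterion (for example, the ID3 or C4.5 criteria discussed below), and `class_col` names the column holding the class labels.

```r
# Minimal sketch of divide-and-conquer tree construction (not a production
# implementation). `choose_split` is a hypothetical helper that returns the
# attribute to split on, e.g. by information gain or gain ratio.
build_tree <- function(D, attrs, choose_split, class_col = "class") {
  # Stop when the node is pure (one class) or no attributes are left to split on.
  if (length(unique(D[[class_col]])) == 1 || length(attrs) == 0) {
    return(list(leaf = TRUE, class = names(which.max(table(D[[class_col]])))))
  }
  a <- choose_split(D, attrs)                 # pick the splitting attribute
  children <- lapply(split(D, D[[a]]), function(Dv) {   # one sub-node per value of a
    build_tree(Dv, setdiff(attrs, a), choose_split, class_col)
  })
  list(leaf = FALSE, attribute = a, children = children)
}
```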

Information entropy is a measure of the “purity” of a set. Assume a set \(D\) contains samples from \(N\) classes with proportions \(p_1,p_2,\dots,p_N\). The information entropy is defined as \[ H(D)=-\sum_{k=1}^Np_k\log p_k. \] The smaller \(H(D)\) is, the higher the purity of the set \(D\).
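
As a quick sanity check, entropy is easy to compute directly from a vector of class labels. The following is a small sketch in base R, using log base 2 and the convention \(0\log 0=0\):

```r
# Information entropy H(D) of a set, given its class labels.
entropy <- function(labels, base = 2) {
  p <- table(labels) / length(labels)   # class proportions p_1, ..., p_N
  p <- p[p > 0]                         # convention: 0 * log(0) = 0
  -sum(p * log(p, base = base))
}

entropy(c("a", "a", "a", "a"))       # 0: a pure set has the highest purity
entropy(c("a", "a", "b", "b"))       # 1: an even 50/50 split (log base 2)
```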

  1. ID3 algorithm (Quinlan, 1986)

Among all possible splits, choose the attribute \(a\) that maximizes the information gain: \[ Gain(D,a)=H(D)-\sum_{v=1}^V\frac{|D^v|}{|D|}H(D^v), \]

where \(V\) is the number of possible values of attribute \(a\), and \(D^v\) is the subset of samples in \(D\) that take value \(v\) on attribute \(a\).

This criterion makes some sense; however, it favors attributes that split the node into more sub-nodes. Specifically, if we include an identity attribute among our features, i.e. each sample has a unique identity, then splitting on it reduces the information entropy of every sub-node to zero. Thus the identity attribute will always be the one chosen for splitting. But the generalization ability of such a decision tree is poor, which is undesirable.
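
To make the criterion concrete, here is a sketch of the information gain computation in R, reusing the `entropy()` helper above. The toy data frame is purely hypothetical, included only to show how a unique identity column wins under this criterion:

```r
# Information gain Gain(D, a) of splitting data frame D on attribute a.
info_gain <- function(D, a, class_col = "class", base = 2) {
  subsets <- split(D, D[[a]])                       # the D^v, one per value of a
  weighted <- sum(sapply(subsets, function(Dv) {
    (nrow(Dv) / nrow(D)) * entropy(Dv[[class_col]], base)
  }))
  entropy(D[[class_col]], base) - weighted
}

# Hypothetical toy data illustrating the identity-attribute problem:
D <- data.frame(id      = 1:6,                                   # unique identity
                outlook = c("sunny", "sunny", "rain", "rain", "rain", "sunny"),
                class   = c("no", "no", "yes", "yes", "no", "no"))
info_gain(D, "id")       # equals H(D): every sub-node is pure, so ID3 picks it
info_gain(D, "outlook")  # strictly smaller, despite generalizing far better
```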

  2. C4.5 algorithm (Quinlan, 1993)

To avoid splitting the node into small fragments, the C4.5 algorithm takes the number of sub-nodes into account. It maximizes the gain ratio instead of the information gain. The gain ratio is defined as \[ GainRatio(D,a)=\frac{Gain(D,a)}{IV(a)}, \] where \(IV(a)\) is the intrinsic value of attribute \(a\), defined by \[ IV(a)=-\sum_{v=1}^V\frac{|D^v|}{|D|}\log \frac{|D^v|}{|D|}. \]
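
A sketch of the intrinsic value and gain ratio, again reusing the helpers and the hypothetical toy data above. Note that \(IV(a)\) is simply the entropy of the attribute column itself:

```r
# Intrinsic value IV(a): the entropy of the split itself (how many sub-nodes,
# and how evenly the samples spread over them).
intrinsic_value <- function(D, a, base = 2) {
  entropy(D[[a]], base)
}

# Gain ratio used by C4.5.
gain_ratio <- function(D, a, class_col = "class", base = 2) {
  info_gain(D, a, class_col, base) / intrinsic_value(D, a, base)
}

# On the toy data above, the identity attribute is penalized:
gain_ratio(D, "id")       # divided by IV = log2(6), so no longer the winner
gain_ratio(D, "outlook")  # IV = 1 for an even two-way split
```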

A notable fact is that maximizing the gain ratio tends to favor attributes with fewer sub-nodes, exactly the opposite of the ID3 algorithm. The C4.5 algorithm takes both criteria into account: it maximizes the gain ratio only among those attributes whose information gain is above the average.
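
Putting the heuristic together, here is a sketch of this C4.5-style attribute selection, relying on the `info_gain()` and `gain_ratio()` sketches above:

```r
# C4.5-style selection: restrict to attributes whose information gain is at
# least the average gain, then pick the largest gain ratio among them.
select_attribute <- function(D, attrs, class_col = "class") {
  gains  <- sapply(attrs, function(a) info_gain(D, a, class_col))
  ratios <- sapply(attrs, function(a) gain_ratio(D, a, class_col))
  candidates <- attrs[gains >= mean(gains)]
  candidates[which.max(ratios[candidates])]
}
```

The above-average filter keeps an attribute with low information gain from being chosen merely because its intrinsic value happens to be small.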