
Estimating mean variance and mean absolute bias of a regression tree by bootstrapping using foreach and rpart packages


by Błażej Moska, computer science student and data science intern 

One of the most important things in predictive modelling is how our algorithm copes with various datasets, both training and testing (previously unseen).

Roughly speaking, the variance of an estimator describes how the estimator's value varies from dataset to dataset. It is defined as follows:

\[ \textrm{Var}[\widehat{f}(x)] = E[(\widehat{f}(x) - E[\widehat{f}(x)])^{2}] \]

\[ \textrm{Var}[\widehat{f}(x)] = E[\widehat{f}(x)^{2}] - E[\widehat{f}(x)]^{2} \]

Bias is defined as follows:

\[ \textrm{Bias}[\widehat{f}(x)] = E[\widehat{f}(x) - f(x)] = E[\widehat{f}(x)] - f(x) \]

One could think of bias as measuring the estimator's ability to approximate the true function. Typically, reducing bias increases variance, and vice versa.

\(E[X]\) is the expected value; it can be estimated using the sample mean, since the mean is an unbiased estimator of the expected value.

We can estimate variance and bias by bootstrapping the original training dataset, that is, by sampling with replacement the row indexes of the original data frame and then drawing the rows corresponding to those indexes, obtaining new data frames. This operation is repeated nsampl times, where nsampl is a parameter giving the number of bootstrap samples.
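A minimal sketch of this step (mtcars stands in for the article's dataset, which is not shown here):

    df     <- mtcars   # stand-in dataset; the article's own data is not shown
    nsampl <- 500      # number of bootstrap samples

    # Sample row indexes with replacement and draw the corresponding rows;
    # repeating this nsampl times yields a list of bootstrap data frames
    boot_samples <- lapply(seq_len(nsampl), function(i) {
      idx <- sample(nrow(df), replace = TRUE)
      df[idx, ]
    })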

Variance and bias are estimated per value, that is to say, per observation/row of the original dataset (we calculate variance and bias over the predictions that the bootstrap models make for that row). We thus obtain a vector of variances and a vector of biases, each of the same length as the number of observations in the original dataset. For the purpose of this article, a mean value was calculated for each of these two vectors, and we treat these two means as our estimates of mean bias and mean variance. If we don't want to measure the direction of the bias, we can take the absolute values of the biases.
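A small self-contained sketch of this step, with a toy prediction matrix standing in for the real bootstrap predictions (preds[s, i] holds the s-th tree's prediction for the i-th row; all names here are my own, not the article's):

    set.seed(1)
    preds <- matrix(rnorm(20), nrow = 5)   # toy: 5 bootstrap predictions for 4 rows
    y     <- rnorm(4)                      # toy true values of the decision variable

    mean_pred     <- colMeans(preds)                  # estimates E[f_hat(x)] per row
    variances     <- colMeans(preds^2) - mean_pred^2  # Var = E[f^2] - E[f]^2, per row
    biases        <- mean_pred - y                    # Bias = E[f_hat(x)] - f(x), per row
    mean_variance <- mean(variances)
    mean_abs_bias <- mean(abs(biases))                # absolute value drops the direction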

Because bias and variance can be controlled by parameters passed to the rpart function, we can also survey how these parameters affect the tree's variance. The most commonly used parameters are cp (the complexity parameter), which describes how much each split must decrease the overall variance of the decision variable in order to be attempted, and minsplit, which defines the minimum number of observations a node must contain for a split to be attempted.
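Both parameters are passed to rpart through rpart.control; for example (mtcars again as a stand-in):

    library(rpart)

    # Smaller cp and minsplit grow a deeper tree (lower bias, higher variance);
    # larger values stop splitting earlier (higher bias, lower variance)
    tree <- rpart(mpg ~ ., data = mtcars,
                  control = rpart.control(cp = 0.01, minsplit = 5))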

The operations mentioned above are computationally rather expensive: we need to create nsampl bootstrap samples, grow nsampl trees, calculate nsampl prediction vectors, nrow variances and nrow biases, and repeat all of this for each parameter value (the length of the vector of cp or minsplit values). For that reason the foreach package was used, to take advantage of parallelism. The procedure still can't be considered fast, but it was much faster than without the foreach package.
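The skeleton of the parallel setup might look like this (the doParallel backend is my assumption; any registered foreach backend works):

    library(foreach)
    library(doParallel)

    # Register a parallel backend so that %dopar% spreads iterations over cores;
    # without a registered backend, foreach runs sequentially with a warning
    cl <- makeCluster(max(1, parallel::detectCores() - 1))
    registerDoParallel(cl)

    squares <- foreach(i = 1:4, .combine = c) %dopar% i^2   # toy check: 1 4 9 16

    stopCluster(cl)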

So, summing up, the procedure looks as follows:

  1. Create bootstrap samples (by resampling the original dataset with replacement)
  2. Train a model on each of these bootstrap datasets
  3. Calculate the mean of these trees' predictions (for each observation) and compare these means with the values of the original dataset (in other words, calculate the bias for each row)
  4. Calculate the variance of the predictions for each row (this estimates the variance of the estimator, i.e. the regression tree)
  5. Calculate the mean bias/absolute bias and the mean variance

Code
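The article's own listing did not survive here, so below is a sketch reconstructing steps 1-5 under stated assumptions: mtcars/mpg and a small cp grid stand in for the original data and settings, which are not shown.

    library(rpart)
    library(foreach)
    library(doParallel)

    # Hypothetical stand-ins for the article's dataset and parameters
    df      <- mtcars
    nsampl  <- 500
    cp_grid <- c(0.001, 0.01, 0.05, 0.1)    # cp values to survey

    cl <- makeCluster(max(1, parallel::detectCores() - 1))
    registerDoParallel(cl)

    # One parallel iteration per cp value; each returns the mean variance
    # and mean absolute bias estimated from nsampl bootstrap trees
    results <- foreach(cp = cp_grid, .combine = rbind,
                       .packages = c("rpart", "foreach")) %dopar% {

      # Steps 1-2: bootstrap the rows and grow one tree per bootstrap sample,
      # keeping each tree's predictions for the original rows
      # (one row of preds per bootstrap sample)
      preds <- foreach(s = seq_len(nsampl), .combine = rbind) %do% {
        idx  <- sample(nrow(df), replace = TRUE)
        tree <- rpart(mpg ~ ., data = df[idx, ],
                      control = rpart.control(cp = cp))
        predict(tree, newdata = df)
      }

      # Steps 3-4: per-row bias and variance of the bootstrap predictions
      mean_pred <- colMeans(preds)
      biases    <- mean_pred - df$mpg
      variances <- colMeans(preds^2) - mean_pred^2

      # Step 5: aggregate to mean variance and mean absolute bias for this cp
      c(cp = cp, mean_variance = mean(variances),
        mean_abs_bias = mean(abs(biases)))
    }

    stopCluster(cl)
    results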


 

