Random forest

This is a short tutorial demonstrating the basic use of Random Forests (RF).

Keywords: Random Forests, Machine Learning, Statistics

Objectives

  • Demonstrate a typical use case for RF
  • Evaluate outputs of RF
  • Visualisations

Further reading

  • Breiman, L., 2001. Random Forests. Machine Learning 45, 5–32. https://doi.org/10.1023/A:1010933404324

The original academic paper describing RF

  • Genuer, R., Poggi, J.-M., Tuleau-Malot, C., 2010. Variable selection using random forests. Pattern Recognition Letters 31, 2225–2236. https://doi.org/10.1016/j.patrec.2010.03.014

A practical article talking about using RF to objectively select a subset of columns from a big dataset for further analysis (in R)

  • James, G., Witten, D., Hastie, T., Tibshirani, R., 2021. An Introduction to Statistical Learning: with Applications in R, 2nd ed. 2021 edition. ed. Springer, New York, NY.

A great practical textbook on machine learning methods - Ch 08 is about RF and related methods

Why use Random Forests?

  • There are two principal uses of RF: describing a large dataset, and making predictions for new data using a model trained on existing data

  • The dependent variable in RF can be either categorical (classification) or numeric (regression)

  • RF is free from the burden of parametric assumptions and is fast and easy to use

  • Dimension reduction - Once important variables are identified, they can be used for further analysis

A simple RF example for classification

We will use the famous iris** dataset to classify iris species based on several morphological measurements (hence, this is a classification application of RF using a categorical dependent variable).

**Fisher, R. A. (1936) The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, Part II, 179–188. doi:10.1111/j.1469-1809.1936.tb02137.x.

#| edit: false
data(iris) # load the data
head(iris) # print the first few lines of data

We can get an idea of how the morphological variables relate to the species variable graphically.

# NB you can edit and re-run this code...
plot(Sepal.Length ~ Sepal.Width, col = Species, data = iris)
legend(x = "topleft", legend = unique(iris$Species), col = unique(iris$Species), pch = 1)

Let’s perform RF and look at some outputs for this dataset.

library(randomForest)

myrf <- randomForest(Species ~ ., data = iris)
print(myrf) # for classification, we get a "confusion matrix"
# variable importance plot
varImpPlot(myrf)

Mean Decrease Gini is a measure used in random forests to indicate how important a particular feature (i.e. predictor variable, independent variable, etc.) is for making accurate classifications. It is the total reduction in node impurity (the Gini index) achieved by splits on that variable, summed over all trees in the forest, with higher values indicating more important features.
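If you want the numbers behind the plot, the importance() function in the randomForest package returns them as a matrix; for a classification forest fitted with the defaults used above, the only column is MeanDecreaseGini. A minimal sketch:

imp <- importance(myrf) # matrix of importance scores, one row per predictor
imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE] # most important first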

Here, we can see that petal width and petal length are more important than sepal width and sepal length. Let's look at this graphically and compare it with our earlier plot.

plot(Petal.Length ~ Petal.Width, col = Species, data = iris)
legend(x = "topleft", legend = unique(iris$Species), col = unique(iris$Species), pch = 1)

A practical way to think about variable importance: if you could only choose two explanatory variables for further analysis, you would clearly choose petal width and petal length. For this trivial example with only four variables that is not so important, but imagine if you had 100 or more potential explanatory variables…
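As a quick, hypothetical follow-up, we could refit the forest using only the two most important variables and compare the out-of-bag (OOB) error rate with the full model printed above:

myrf2 <- randomForest(Species ~ Petal.Width + Petal.Length, data = iris) # reduced model
print(myrf2) # compare the OOB error estimate with the full model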

Regression example

We will use the built-in mtcars dataset to predict fuel efficiency (mpg, miles per gallon) from other vehicle characteristics (hence, this is a regression application of RF using a numeric dependent variable).
data(mtcars)
head(mtcars)

For comparison, we first fit an ordinary linear model and use the visreg package to plot the raw data and the marginal effect of each predictor on mpg.

library(visreg)

mylm <- lm(mpg ~ disp + hp + wt, data = mtcars)

par(mfrow = c(1,3))
plot(mpg ~ disp, mtcars)
plot(mpg ~ hp, mtcars, main = "Plots of raw data")
plot(mpg ~ wt, mtcars)

par(mfrow = c(1,3))
visreg(mylm, "disp")
visreg(mylm, "hp", main = "Plots of marginal effects")
visreg(mylm, "wt")

Now let's fit a random forest regression to the same data and look at variable importance.

set.seed(4543)
rf.fit <- randomForest(mpg ~ ., data=mtcars, ntree=1000,
                       keep.forest=FALSE, importance=TRUE)
rf.fit # print the fit summary, including % variance explained
varImpPlot(rf.fit) # variable importance plot

The %IncMSE (percent increase in mean squared error) in random forests measures how much the prediction error increases when the values of a feature are randomly permuted, with higher values indicating more important features.

Node purity (IncNodePurity) indicates how well a feature splits the data within a tree; it is the total increase in node purity (for regression, the reduction in residual sum of squares) from splits on that feature, with higher values meaning the splits it creates result in more homogeneous groups.
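As before, the numbers behind varImpPlot() can be extracted with importance(); because we fitted with importance = TRUE, both measures are returned. A minimal sketch:

imp <- importance(rf.fit) # columns: %IncMSE and IncNodePurity
imp[order(imp[, "%IncMSE"], decreasing = TRUE), ] # sort by permutation importance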

Refining and using random forests

There are a lot of options; the Usage section of the randomForest() help page lists them all:

Usage

randomForest(x, y = NULL, xtest = NULL, ytest = NULL, ntree = 500,
             mtry = if (!is.null(y) && !is.factor(y))
                      max(floor(ncol(x)/3), 1) else floor(sqrt(ncol(x))),
             weights = NULL, replace = TRUE, classwt = NULL, cutoff, strata,
             sampsize = if (replace) nrow(x) else ceiling(.632 * nrow(x)),
             nodesize = if (!is.null(y) && !is.factor(y)) 5 else 1,
             maxnodes = NULL, importance = FALSE, localImp = FALSE, nPerm = 1,
             proximity, oob.prox = proximity, norm.votes = TRUE, do.trace = FALSE,
             keep.forest = !is.null(y) && is.null(xtest), corr.bias = FALSE,
             keep.inbag = FALSE, ...)
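As a hypothetical example, here is a call that sets a few of these options explicitly rather than relying on the defaults (the object name rf.opts and the particular values are arbitrary):

set.seed(1) # for reproducibility
rf.opts <- randomForest(Species ~ ., data = iris,
                        ntree = 250,       # grow 250 trees instead of the default 500
                        mtry = 2,          # try 2 candidate variables at each split
                        importance = TRUE) # also compute permutation importance
print(rf.opts)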

Important considerations:

  • how many trees do you need? (see the sketch after this list)
  • is set.seed() important to you?
  • should you scale your variables?
  • should you create new “features” derived from existing variables?
  • how can you deal with missing values in your data?
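As one example of checking whether you have grown enough trees: the out-of-bag error stored in a fitted randomForest object can be plotted against the number of trees, and if the curve flattens well before ntree, adding more trees is unlikely to help. A minimal sketch reusing rf.fit from the regression example above:

plot(rf.fit, main = "OOB error vs number of trees") # OOB error (MSE) as trees are added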