by Boxuan Cui, Data Scientist at Smarter Travel
Once upon a time, there was a joke:
In Data Science, 80% of time spent prepare data, 20% of time spent complain about need for prepare data.
— Big Data Borat (@BigDataBorat) February 27, 2013
According to a Forbes article, cleaning and organizing data is the most time-consuming and least enjoyable data science task. Among the many resources that address this, DataExplorer has a single mission: to shrink that 80% and make it enjoyable. As a result, one fundamental design principle is to be extremely user-friendly. Most of the time, one function call is all you need.
Data manipulation is powered by data.table, so tasks involving big datasets usually complete in a few seconds. In addition, the package is flexible about input data classes, so you should be able to pass in any data.frame-like object. However, certain functions require a data.table object as input because they rely on its update-by-reference feature, which I will cover later in the post.
Now, enough said. Let's look at some code, shall we?
Take the BostonHousing dataset from the mlbench library:
library(mlbench)
data("BostonHousing", package = "mlbench")
Initial Visualization
Without knowing anything about the data, my first 3 tasks are almost always:
library(DataExplorer)
plot_missing(BostonHousing) ## Are there missing values, and what is the missing data profile?
plot_bar(BostonHousing) ## What does the categorical frequency of each discrete variable look like?
plot_histogram(BostonHousing) ## What is the distribution of each continuous variable?
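For readers curious what a missing-data profile boils down to, here is a minimal base R sketch on a made-up data frame. The toy object and its columns are hypothetical, and this is not how DataExplorer computes or draws its plot; it only shows the underlying idea of a per-column share of missing values.

```r
# Hypothetical toy data with a few missing values (not BostonHousing)
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c("x", "y", NA, NA),
  c = c(TRUE, FALSE, TRUE, TRUE)
)

# is.na() on a data frame gives a logical matrix;
# column means of that matrix are the share of missing values per column
missing_share <- colMeans(is.na(toy))
missing_share
# a = 0.25, b = 0.50, c = 0.00
```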
While there are not many interesting insights from plot_missing and plot_bar, below is the output from plot_histogram.
Upon scrutiny, the variable rad looks discrete, and I want to group crim, zn, indus and b into bins as well. Let's do so:
## Set `rad` to factor
BostonHousing$rad <- as.factor(BostonHousing$rad)
## Create new discrete variables
for (col in c("crim", "zn", "indus", "b")) {
  BostonHousing[[paste0(col, "_d")]] <- as.factor(ggplot2::cut_interval(BostonHousing[[col]], 2))
}
## Plot bar chart for all discrete variables
plot_bar(BostonHousing)
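The ggplot2 documentation describes cut_interval(x, n) as making n groups of equal range. As a rough illustration of what that means for n = 2, here is a base R sketch on made-up values. The vector x is hypothetical, and cut() with evenly spaced breaks is my stand-in for the idea, not ggplot2's implementation.

```r
# Hypothetical values, not taken from BostonHousing
x <- c(0, 2, 4, 10)

# Two equal-width bins over the range of x: breaks at 0, 5 and 10
breaks <- seq(min(x), max(x), length.out = 3)
x_binned <- cut(x, breaks = breaks, include.lowest = TRUE)

# Count how many values land in each bin
table(x_binned)
# 3 values fall in the lower bin, 1 in the upper bin
```

Note that equal-width bins can be very unbalanced when a variable is skewed, which is worth keeping in mind when reading the resulting bar charts.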
At this point, we have a much better understanding of the data distribution. Now assume we are interested in medv (median value of owner-occupied homes in USD 1000's), and would like to build a model to predict it. Let's plot it against all other variables:
plot_boxplot(BostonHousing, by = "medv")
plot_scatterplot(
  subset(BostonHousing, select = -c(crim, zn, indus, b)),
  by = "medv", size = 0.5)
plot_correlation(BostonHousing)
And this is how you slice and dice your data and analyze correlations with merely three function calls.
Feature Engineering
Feature engineering is a crucial step in building better models. DataExplorer provides a few functions to ease the process. All of them require a data.table as the input object, because data.table is lightning fast. However, if you don't feel like coding in data.table syntax, you may adopt the following process:
## Remember the original class, then convert to `data.table`
original_class <- class(your_data)
your_data <- data.table(your_data)
## Apply DataExplorer functions
group_category(your_data, ...)
drop_columns(your_data, ...)
set_missing(your_data, ...)
## Set data back to the original class
class(your_data) <- original_class
Let's return to the BostonHousing dataset. For the rest of this section, we'll assume the data has already been converted to a data.table.
library(data.table)
BostonHousingDT <- data.table(BostonHousing)
Remember those transformed continuous variables? Let's drop them:
drop_columns(BostonHousingDT, c("crim", "zn", "indus", "b"))
Note: Because data.table updates by reference, the original object is updated without the need to re-assign a returned object.
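To make the update-by-reference behavior concrete, here is a small sketch that uses only data.table itself. The toy table and the drop_in_place helper are hypothetical names introduced for illustration; they are not part of DataExplorer, and this is not a claim about how drop_columns is written.

```r
library(data.table)

# Hypothetical toy table
dt <- data.table(id = 1:3, value = c(10, 20, 30))

# Hypothetical helper: removes columns in place with data.table's `:=`
drop_in_place <- function(x, cols) {
  x[, (cols) := NULL]  # modifies the caller's table, no copy is made
  invisible(x)
}

drop_in_place(dt, "value")  # note: no `dt <- ...` re-assignment
names(dt)
# "id"
```

The same call on a plain data.frame would not work this way, which is why these functions ask for a data.table.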
Let's take a look at the discrete variable rad:
plot_bar(BostonHousingDT$rad)
I think categories other than 4, 5 and 24 are too sparse, and might skew my model fit. How could I group all the sparse categories together?
group_category(BostonHousingDT, "rad", 0.25, update = FALSE)
#    rad cnt       pct   cum_pct
# 1:  24 132 0.2608696 0.2608696
# 2:   5 115 0.2272727 0.4881423
# 3:   4 110 0.2173913 0.7055336
It looks like grouping the bottom 25% of rad would give me what I need. Let's do so:
group_category(BostonHousingDT, "rad", 0.25, update = TRUE)
plot_bar(BostonHousingDT$rad)
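The output above suggests the logic behind the threshold: categories are ranked by frequency, and those whose cumulative share stays within 1 minus the threshold are kept, while the rest are lumped together. Below is a minimal data.table sketch of that idea on made-up data. It reflects my reading of the output, not the package's source code; the toy table, the grp column and the "OTHER" label are assumptions for illustration.

```r
library(data.table)

# Hypothetical categorical column, not taken from BostonHousing
toy <- data.table(grp = c(rep("a", 4), rep("b", 3), rep("c", 2), "d"))
threshold <- 0.25

# Frequency table, most frequent first, with cumulative share
freq <- toy[, .(cnt = .N), by = grp][order(-cnt)]
freq[, pct := cnt / sum(cnt)]
freq[, cum_pct := cumsum(pct)]

# Keep categories inside the top (1 - threshold) of cumulative share
keep <- freq[cum_pct <= 1 - threshold, grp]

# Lump everything else together, updating the table by reference
toy[!(grp %in% keep), grp := "OTHER"]
toy[, .N, by = grp]
# a: 4, b: 3, OTHER: 3
```

Here "a" and "b" together cover 70% of the rows, so they are kept, and the sparse "c" and "d" are merged into one category.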
In addition to categorical frequency, you may also use the measure argument to group by the sum of a different variable. See ?group_category for more example use cases.
Data Report
To generate a report of your data:
create_report(BostonHousing)
Currently, the report cannot be customized much, but I plan to support customization of the generated report, so stay tuned for more features!
I hope you enjoyed exploring the Boston housing data with me. Finally, here are some additional resources about the DataExplorer package: