Telco customer churn data is used throughout this tutorial. The data describes customers of a telecommunications platform through features such as gender, whether they use paperless billing, and whether they have stopped subscribing (churn).
Supervised machine learning techniques require data to be split between a training set and a testing set. A model is built from the data in the training set and its performance is characterised using the test set.
Initially, data are separated into testing and training sets.
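library(rsample)  # provides initial_split(), training() and testing()
set.seed(123)  # make the random split reproducible
# hold out a quarter of the rows for testing
data_split <- initial_split(data, prop = 3/4)
data_test <- testing(data_split)
data_train <- training(data_split)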
If the training set takes too large a share of the data, too few observations remain in the test set to estimate performance reliably or to detect over-fitting. On the other hand, if the share is too small, you will obtain poor estimates of the model parameters. A split of 75% training and 25% testing is a common default and is sufficient for the following examples. When a dataset is large (thousands of observations with far fewer features), increasing the size of the training set results in only minor model improvements. In such cases, the training set can be reduced to decrease processing time.
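The trade-off can be made concrete with a little arithmetic. The sketch below takes the 7043 observations of the churn data and, for a few candidate training fractions, computes the resulting set sizes and the binomial standard error of an accuracy estimate on the test set. The true accuracy of 0.8 and the list of fractions are assumptions chosen purely for illustration, and rounding down with `floor()` is a choice of this sketch.

```r
# Number of observations in the churn data.
n <- 7043

# Hypothetical true accuracy, assumed purely for illustration.
acc <- 0.8

# Candidate training fractions to compare (illustrative values).
train_prop <- c(0.5, 0.75, 0.9, 0.99)

# Set sizes for each fraction.
n_train <- floor(train_prop * n)
n_test <- n - n_train

# Binomial standard error of an accuracy estimate based on n_test rows.
se_acc <- sqrt(acc * (1 - acc) / n_test)

split_summary <- data.frame(train_prop, n_train, n_test,
                            se_acc = round(se_acc, 4))
print(split_summary)
```

As the training fraction grows, the test set shrinks and the uncertainty of the performance estimate increases.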
You must ensure the training data is representative of the initial dataset in order to build a generalizable model. Generally, random sampling of the initial data is sufficient to achieve this. However, if a value is infrequent (< 10% of observations), random sampling could omit it or artificially reduce its impact in the training process. Multiple methods exist to ensure the training data is representative. This page describes two such methods: stratification and up/down sampling.
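The risk of losing an infrequent value is easiest to see on a small dataset. The following sketch uses a hypothetical set of 40 labels, 3 of which are rare (7.5%), repeats a random 75/25 split many times, and counts how often the test set ends up with no rare observation at all. The labels, sizes, and number of repetitions are all assumptions made for illustration. The simulated share is compared with the exact hypergeometric probability.

```r
set.seed(123)

# Hypothetical small dataset: 3 rare observations out of 40 (7.5%).
labels <- rep(c("rare", "common"), times = c(3, 37))

# One random 75/25 split; returns the number of rare rows in each set.
rare_counts <- function(x, prop = 0.75) {
  train_idx <- sample(length(x), size = floor(prop * length(x)))
  c(train = sum(x[train_idx] == "rare"),
    test = sum(x[-train_idx] == "rare"))
}

# Repeat the split many times; rows of the result are "train" and "test".
counts <- replicate(5000, rare_counts(labels))

# Share of splits whose test set contains no rare observation.
sim_missing_test <- mean(counts["test", ] == 0)

# Exact probability of drawing 0 rare rows in a test set of 10.
exact_missing_test <- dhyper(0, m = 3, n = 37, k = 10)

c(simulated = sim_missing_test, exact = exact_missing_test)
```

The exact probability is 24360/59280, about 0.41, so in this small example roughly four splits in ten leave the rare value out of the test set entirely.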
To produce a higher fidelity model, the model can be built multiple times, each time on a different subset of the training data. This process is known as resampling.
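One common form of resampling is k-fold cross-validation. The sketch below implements it by hand in base R on simulated data so that each step is visible. The data, the logistic regression model, and the choice of five folds are assumptions for illustration and are not taken from this tutorial's analysis.

```r
set.seed(123)

# Hypothetical training data: a binary outcome driven by one numeric feature.
n <- 200
train <- data.frame(x = rnorm(n))
train$y <- rbinom(n, size = 1, prob = plogis(0.5 + 1.5 * train$x))

# Randomly assign each row to one of k folds of equal size.
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))

# Fit on k - 1 folds, then measure accuracy on the held-out fold.
fold_accuracy <- vapply(seq_len(k), function(i) {
  fit <- glm(y ~ x, data = train[fold != i, ], family = binomial())
  held_out <- train[fold == i, ]
  prob <- predict(fit, newdata = held_out, type = "response")
  mean(as.integer(prob > 0.5) == held_out$y)
}, numeric(1))

fold_accuracy
mean(fold_accuracy)
```

The spread of `fold_accuracy` gives a sense of how much performance depends on which rows were used for fitting. In practice, the rsample package provides `vfold_cv()` to create such folds.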
When exploring data it is important to note the number of responses in each category. One way to do this is with the describe() function from the Hmisc package:
Hmisc::describe(data)
## data
##
## 21 Variables 7043 Observations
## --------------------------------------------------------------------------------
## customerID
## n missing distinct
## 7043 0 7043
##
## lowest : 0002-ORFBO 0003-MKNFE 0004-TLHLJ 0011-IGKFF 0013-EXCHZ
## highest: 9987-LUTYD 9992-RRAMN 9992-UJOEL 9993-LHIEB 9995-HOTOH
## --------------------------------------------------------------------------------
## gender
## n missing distinct
## 7043 0 2
##
## Value Female Male
## Frequency 3488 3555
## Proportion 0.495 0.505
## --------------------------------------------------------------------------------
## SeniorCitizen
## n missing distinct Info Sum Mean Gmd
## 7043 0 2 0.408 1142 0.1621 0.2717
##
## --------------------------------------------------------------------------------
## Partner
## n missing distinct
## 7043 0 2
##
## Value No Yes
## Frequency 3641 3402
## Proportion 0.517 0.483
## --------------------------------------------------------------------------------
## Dependents
## n missing distinct
## 7043 0 2
##
## Value No Yes
## Frequency 4933 2110
## Proportion 0.7 0.3
## --------------------------------------------------------------------------------
## tenure
## n missing distinct Info Mean Gmd .05 .10
## 7043 0 73 0.999 32.37 28.08 1 2
## .25 .50 .75 .90 .95
## 9 29 55 69 72
##
## lowest : 0 1 2 3 4, highest: 68 69 70 71 72
## --------------------------------------------------------------------------------
## PhoneService
## n missing distinct
## 7043 0 2
##
## Value No Yes
## Frequency 682 6361
## Proportion 0.097 0.903
## --------------------------------------------------------------------------------
## MultipleLines
## n missing distinct
## 7043 0 3
##
## Value No No phone service Yes
## Frequency 3390 682 2971
## Proportion 0.481 0.097 0.422
## --------------------------------------------------------------------------------
## InternetService
## n missing distinct
## 7043 0 3
##
## Value DSL Fiber optic No
## Frequency 2421 3096 1526
## Proportion 0.344 0.440 0.217
## --------------------------------------------------------------------------------
## OnlineSecurity
## n missing distinct
## 7043 0 3
##
## Value No No internet service Yes
## Frequency 3498 1526 2019
## Proportion 0.497 0.217 0.287
## --------------------------------------------------------------------------------
## OnlineBackup
## n missing distinct
## 7043 0 3
##
## Value No No internet service Yes
## Frequency 3088 1526 2429
## Proportion 0.438 0.217 0.345
## --------------------------------------------------------------------------------
## DeviceProtection
## n missing distinct
## 7043 0 3
##
## Value No No internet service Yes
## Frequency 3095 1526 2422
## Proportion 0.439 0.217 0.344
## --------------------------------------------------------------------------------
## TechSupport
## n missing distinct
## 7043 0 3
##
## Value No No internet service Yes
## Frequency 3473 1526 2044
## Proportion 0.493 0.217 0.290
## --------------------------------------------------------------------------------
## StreamingTV
## n missing distinct
## 7043 0 3
##
## Value No No internet service Yes
## Frequency 2810 1526 2707
## Proportion 0.399 0.217 0.384
## --------------------------------------------------------------------------------
## StreamingMovies
## n missing distinct
## 7043 0 3
##
## Value No No internet service Yes
## Frequency 2785 1526 2732
## Proportion 0.395 0.217 0.388
## --------------------------------------------------------------------------------
## Contract
## n missing distinct
## 7043 0 3
##
## Value Month-to-month One year Two year
## Frequency 3875 1473 1695
## Proportion 0.550 0.209 0.241
## --------------------------------------------------------------------------------
## PaperlessBilling
## n missing distinct
## 7043 0 2
##
## Value No Yes
## Frequency 2872 4171
## Proportion 0.408 0.592
## --------------------------------------------------------------------------------
## PaymentMethod
## n missing distinct
## 7043 0 4
##
## Value Bank transfer (automatic) Credit card (automatic)
## Frequency 1544 1522
## Proportion 0.219 0.216
##
## Value Electronic check Mailed check
## Frequency 2365 1612
## Proportion 0.336 0.229
## --------------------------------------------------------------------------------
## MonthlyCharges
## n missing distinct Info Mean Gmd .05 .10
## 7043 0 1585 1 64.76 34.39 19.65 20.05
## .25 .50 .75 .90 .95
## 35.50 70.35 89.85 102.60 107.40
##
## lowest : 18.25 18.40 18.55 18.70 18.75, highest: 118.20 118.35 118.60 118.65 118.75
## --------------------------------------------------------------------------------
## TotalCharges
## n missing distinct Info Mean Gmd .05 .10
## 7032 11 6530 1 2283 2449 49.6 84.6
## .25 .50 .75 .90 .95
## 401.4 1397.5 3794.7 5976.6 6923.6
##
## lowest : 18.80 18.85 18.90 19.00 19.05
## highest: 8564.75 8594.40 8670.10 8672.45 8684.80
## --------------------------------------------------------------------------------
## Churn
## n missing distinct
## 7043 0 2
##
## Value No Yes
## Frequency 5174 1869
## Proportion 0.735 0.265
## --------------------------------------------------------------------------------
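For a quick look at a single feature, base R's `table()` and `prop.table()` return the same counts and proportions. The sketch below rebuilds a stand-in for the Churn column from the frequencies reported above, since the full dataset is not loaded here. The 10% cut-off used in the last line is the rule of thumb mentioned earlier.

```r
# Stand-in for the Churn column, rebuilt from the counts reported above.
churn <- factor(rep(c("No", "Yes"), times = c(5174, 1869)))

# Counts per category.
churn_counts <- table(churn)
churn_counts

# Proportions per category, rounded to three decimal places.
churn_props <- round(prop.table(churn_counts), 3)
churn_props

# Categories that fall below the 10% rule of thumb (none for Churn).
rare_levels <- names(churn_props)[churn_props < 0.10]
rare_levels
```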
Note the large imbalance between the two values of the PhoneService feature:
data$PhoneService %>% Hmisc::describe()
## .
## n missing distinct
## 7043 0 2
##
## Value No Yes
## Frequency 682 6361
## Proportion 0.097 0.903
Less than 10% of the responses are No. Randomly sampling this feature may result in a training set containing few No responses, which in turn affects the estimated model parameters.
set.seed(123)
data_split <- initial_split(data, prop = 3/4)
data_test <- testing(data_split)
data_train <- training(data_split)
data_test$PhoneService %>%
  Hmisc::describe()
## .
## n missing distinct
## 1761 0 2
##
## Value No Yes
## Frequency 163 1598
## Proportion 0.093 0.907
data_train$PhoneService %>%
  Hmisc::describe()
## .
## n missing distinct
## 5282 0 2
##
## Value No Yes
## Frequency 519 4763
## Proportion 0.098 0.902
To ensure the training and test sets contain comparable distributions of observations you can employ stratified sampling. Stratification works by splitting the data into homogeneous groups, known as strata, based on the values of the stratifying variable. A sample is then taken from each stratum in an amount proportional to its size. The selected values are pooled to produce the final, representative sample. Note that the observation proportions in the training and test sets remain close to those in the original data:
data_split <- initial_split(data, prop = 3/4, strata = PhoneService)
data_test <- testing(data_split)
data_train <- training(data_split)
data$PhoneService %>%
  Hmisc::describe()
## .
## n missing distinct
## 7043 0 2
##
## Value No Yes
## Frequency 682 6361
## Proportion 0.097 0.903
data_test$PhoneService %>%
  Hmisc::describe()
## .
## n missing distinct
## 1761 0 2
##
## Value No Yes
## Frequency 152 1609
## Proportion 0.086 0.914
data_train$PhoneService %>%
  Hmisc::describe()
## .
## n missing distinct
## 5282 0 2
##
## Value No Yes
## Frequency 530 4752
## Proportion 0.1 0.9
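The stratification procedure described above (group, sample within each group, pool) can also be sketched by hand in base R. This is a minimal illustration of the idea and not the implementation used by `initial_split()`. The `phone` vector is a stand-in rebuilt from the PhoneService counts reported above, and rounding the per-stratum sample size is a choice of this sketch.

```r
set.seed(123)

# Stand-in for the PhoneService column, rebuilt from the reported counts.
phone <- factor(rep(c("No", "Yes"), times = c(682, 6361)))

# Fraction of each stratum to place in the training set.
prop <- 3/4

# Row indices grouped by stratum (one group per level).
strata_idx <- split(seq_along(phone), phone)

# Sample the same fraction of rows within each stratum, then pool them.
train_idx <- unlist(lapply(strata_idx, function(idx) {
  idx[sample.int(length(idx), size = round(prop * length(idx)))]
}), use.names = FALSE)

train <- phone[train_idx]
test <- phone[-train_idx]

# Proportion of "No" in the full data, the training set and the test set.
prop_no <- vapply(list(full = phone, train = train, test = test),
                  function(x) mean(x == "No"), numeric(1))
round(prop_no, 3)
```

Because every level is sampled at the same rate, the proportion of No in both sets stays within a fraction of a percentage point of the full data. Note that this sketch samples within every level regardless of its size, whereas `initial_split()` has a `pool` argument (default 0.1) under which groups making up less than that fraction of the data may be pooled with others, so its results can differ for a level as infrequent as No.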