Telco customer churn data is used throughout this tutorial. The data describe customers of a telecommunications platform through features such as gender, whether they use paperless billing, and whether they have stopped subscribing (churn).

Supervised machine learning techniques require the data to be split into a training set and a testing set. A model is fitted to the data in the training set, and its performance is characterised using the testing set.

Initially, data are separated into testing and training sets.

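library(rsample)  # provides initial_split(), training() and testing()

set.seed(123)  # make the random split reproducible
data_split <- initial_split(data, prop = 3/4)  # 75% training, 25% testing
data_test <- testing(data_split)
data_train <- training(data_split)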

The fraction of data in the training set must not be too large: the larger it is, the smaller the test set, and a small test set gives an unreliable estimate of performance, so over-fitting can go undetected. On the other hand, if the fraction is too small, you will obtain poor estimates of the model parameters. Typically, 75% training and 25% testing is a good starting point, and it is sufficient for the following examples. When a dataset is large (thousands of observations with far fewer features), increasing the size of the training set further yields only minor model improvements. In such cases, the training set can be reduced to decrease processing time.
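As a quick check of what a 75/25 split means in practice, the set sizes can be worked out by hand. This is a minimal sketch in base R: it takes the 7043 rows reported for this dataset and assumes the training size is rounded down, which agrees with the set sizes printed for the split data later on this page.

```r
n_total <- 7043   # number of observations in the churn data
prop    <- 3/4    # training fraction passed to initial_split()

# Assumption: the training set size is the product rounded down.
n_train <- floor(n_total * prop)
n_test  <- n_total - n_train

c(train = n_train, test = n_test)
```

This gives 5282 training rows and 1761 testing rows.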

You must ensure the training data are representative of the initial dataset in order to build a generalizable model. Generally, random sampling of the initial data is sufficient to achieve this. However, if a category is infrequent (< 10% of observations), random sampling could omit it from the training set or under-represent it, reducing its impact on the training process. Multiple methods exist to ensure the training data are representative. This page describes two such methods: stratification and up/down sampling.
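The effect of random sampling on a rare category can be illustrated with a small simulation. The sketch below uses hypothetical toy data (200 observations, of which 10 are labelled "rare"), not the churn data, and repeatedly draws a 75% training sample to see how much the rare-class count varies from draw to draw.

```r
set.seed(123)

# Hypothetical toy data: 200 observations, 10 of which (5%) are "rare".
labels <- rep(c("rare", "common"), times = c(10, 190))

# Draw 2000 random 75% training samples and count the rare cases in each.
n_train <- floor(0.75 * length(labels))
rare_counts <- replicate(2000, sum(sample(labels, n_train) == "rare"))

# Spread of the rare-class share across the simulated training sets.
summary(rare_counts / n_train)

# Fraction of draws in which the training set holds 5 or fewer rare cases.
mean(rare_counts <= 5)
```

On average a 75% sample should contain 7.5 of the 10 rare cases, but individual draws can land well below that, which is the situation stratification is intended to prevent.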

To produce a higher fidelity model, subsets of the training data can be used to build a model multiple times. This process is known as resampling.
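One common resampling scheme is v-fold cross-validation, in which the training rows are divided into v groups and each group is held out once. The following is a minimal base R sketch of the row bookkeeping only. The fold count and the training set size are assumptions chosen for illustration, and this is not necessarily the scheme used elsewhere in this tutorial.

```r
set.seed(123)

n_train <- 5282  # assumed training set size (75% of 7043 rows, rounded down)
v <- 5           # number of folds, chosen only for illustration

# Randomly assign each training row to one of v folds of near-equal size.
fold_id <- sample(rep(seq_len(v), length.out = n_train))

# For each fold, the model is built on the other folds and assessed on the held-out fold.
fold_sizes <- sapply(seq_len(v), function(k) {
  analysis_rows   <- which(fold_id != k)  # rows used to build the model
  assessment_rows <- which(fold_id == k)  # rows used to measure performance
  c(analysis = length(analysis_rows), assessment = length(assessment_rows))
})
fold_sizes
```

Each column of fold_sizes describes one resample. Every training row appears in exactly one assessment set.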

Stratified Sampling

When exploring data, it is important to note the number of responses in each category. One way to do this is with the describe function from the Hmisc package:

Hmisc::describe(data)
## data 
## 
##  21  Variables      7043  Observations
## --------------------------------------------------------------------------------
## customerID 
##        n  missing distinct 
##     7043        0     7043 
## 
## lowest : 0002-ORFBO 0003-MKNFE 0004-TLHLJ 0011-IGKFF 0013-EXCHZ
## highest: 9987-LUTYD 9992-RRAMN 9992-UJOEL 9993-LHIEB 9995-HOTOH
## --------------------------------------------------------------------------------
## gender 
##        n  missing distinct 
##     7043        0        2 
##                         
## Value      Female   Male
## Frequency    3488   3555
## Proportion  0.495  0.505
## --------------------------------------------------------------------------------
## SeniorCitizen 
##        n  missing distinct     Info      Sum     Mean      Gmd 
##     7043        0        2    0.408     1142   0.1621   0.2717 
## 
## --------------------------------------------------------------------------------
## Partner 
##        n  missing distinct 
##     7043        0        2 
##                       
## Value         No   Yes
## Frequency   3641  3402
## Proportion 0.517 0.483
## --------------------------------------------------------------------------------
## Dependents 
##        n  missing distinct 
##     7043        0        2 
##                     
## Value        No  Yes
## Frequency  4933 2110
## Proportion  0.7  0.3
## --------------------------------------------------------------------------------
## tenure 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     7043        0       73    0.999    32.37    28.08        1        2 
##      .25      .50      .75      .90      .95 
##        9       29       55       69       72 
## 
## lowest :  0  1  2  3  4, highest: 68 69 70 71 72
## --------------------------------------------------------------------------------
## PhoneService 
##        n  missing distinct 
##     7043        0        2 
##                       
## Value         No   Yes
## Frequency    682  6361
## Proportion 0.097 0.903
## --------------------------------------------------------------------------------
## MultipleLines 
##        n  missing distinct 
##     7043        0        3 
##                                                              
## Value                    No No phone service              Yes
## Frequency              3390              682             2971
## Proportion            0.481            0.097            0.422
## --------------------------------------------------------------------------------
## InternetService 
##        n  missing distinct 
##     7043        0        3 
##                                               
## Value              DSL Fiber optic          No
## Frequency         2421        3096        1526
## Proportion       0.344       0.440       0.217
## --------------------------------------------------------------------------------
## OnlineSecurity 
##        n  missing distinct 
##     7043        0        3 
##                                                                       
## Value                       No No internet service                 Yes
## Frequency                 3498                1526                2019
## Proportion               0.497               0.217               0.287
## --------------------------------------------------------------------------------
## OnlineBackup 
##        n  missing distinct 
##     7043        0        3 
##                                                                       
## Value                       No No internet service                 Yes
## Frequency                 3088                1526                2429
## Proportion               0.438               0.217               0.345
## --------------------------------------------------------------------------------
## DeviceProtection 
##        n  missing distinct 
##     7043        0        3 
##                                                                       
## Value                       No No internet service                 Yes
## Frequency                 3095                1526                2422
## Proportion               0.439               0.217               0.344
## --------------------------------------------------------------------------------
## TechSupport 
##        n  missing distinct 
##     7043        0        3 
##                                                                       
## Value                       No No internet service                 Yes
## Frequency                 3473                1526                2044
## Proportion               0.493               0.217               0.290
## --------------------------------------------------------------------------------
## StreamingTV 
##        n  missing distinct 
##     7043        0        3 
##                                                                       
## Value                       No No internet service                 Yes
## Frequency                 2810                1526                2707
## Proportion               0.399               0.217               0.384
## --------------------------------------------------------------------------------
## StreamingMovies 
##        n  missing distinct 
##     7043        0        3 
##                                                                       
## Value                       No No internet service                 Yes
## Frequency                 2785                1526                2732
## Proportion               0.395               0.217               0.388
## --------------------------------------------------------------------------------
## Contract 
##        n  missing distinct 
##     7043        0        3 
##                                                        
## Value      Month-to-month       One year       Two year
## Frequency            3875           1473           1695
## Proportion          0.550          0.209          0.241
## --------------------------------------------------------------------------------
## PaperlessBilling 
##        n  missing distinct 
##     7043        0        2 
##                       
## Value         No   Yes
## Frequency   2872  4171
## Proportion 0.408 0.592
## --------------------------------------------------------------------------------
## PaymentMethod 
##        n  missing distinct 
##     7043        0        4 
##                                                               
## Value      Bank transfer (automatic)   Credit card (automatic)
## Frequency                       1544                      1522
## Proportion                     0.219                     0.216
##                                                               
## Value               Electronic check              Mailed check
## Frequency                       2365                      1612
## Proportion                     0.336                     0.229
## --------------------------------------------------------------------------------
## MonthlyCharges 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     7043        0     1585        1    64.76    34.39    19.65    20.05 
##      .25      .50      .75      .90      .95 
##    35.50    70.35    89.85   102.60   107.40 
## 
## lowest :  18.25  18.40  18.55  18.70  18.75, highest: 118.20 118.35 118.60 118.65 118.75
## --------------------------------------------------------------------------------
## TotalCharges 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     7032       11     6530        1     2283     2449     49.6     84.6 
##      .25      .50      .75      .90      .95 
##    401.4   1397.5   3794.7   5976.6   6923.6 
## 
## lowest :   18.80   18.85   18.90   19.00   19.05
## highest: 8564.75 8594.40 8670.10 8672.45 8684.80
## --------------------------------------------------------------------------------
## Churn 
##        n  missing distinct 
##     7043        0        2 
##                       
## Value         No   Yes
## Frequency   5174  1869
## Proportion 0.735 0.265
## --------------------------------------------------------------------------------

Note the imbalance between the two values of the PhoneService feature:

data$PhoneService %>% Hmisc::describe()
## . 
##        n  missing distinct 
##     7043        0        2 
##                       
## Value         No   Yes
## Frequency    682  6361
## Proportion 0.097 0.903

Less than 10% of the responses are No. Randomly sampling this feature may result in a training set containing few No responses, which would affect the estimated model parameters.

set.seed(123)
data_split <- initial_split(data, prop = 3/4)
data_test <- testing(data_split)
data_train <- training(data_split)

data_test$PhoneService %>% 
  Hmisc::describe()
## . 
##        n  missing distinct 
##     1761        0        2 
##                       
## Value         No   Yes
## Frequency    163  1598
## Proportion 0.093 0.907
data_train$PhoneService %>% 
  Hmisc::describe()
## . 
##        n  missing distinct 
##     5282        0        2 
##                       
## Value         No   Yes
## Frequency    519  4763
## Proportion 0.098 0.902

To ensure the training and test sets contain comparable distributions of observations, you can employ stratified sampling. Stratification splits the data into homogeneous groups, known as strata, based on the values of a chosen variable. A sample is then taken from each stratum in proportion to its size, and the samples are pooled to produce the final, representative sample. Passing strata = PhoneService to initial_split() stratifies the split on that feature. The proportions in each set can then be compared with those of the original data:

data_split <- initial_split(data, prop = 3/4, strata = PhoneService)
data_test <- testing(data_split)
data_train <- training(data_split)

data$PhoneService %>% 
  Hmisc::describe()
## . 
##        n  missing distinct 
##     7043        0        2 
##                       
## Value         No   Yes
## Frequency    682  6361
## Proportion 0.097 0.903
data_test$PhoneService %>% 
  Hmisc::describe()
## . 
##        n  missing distinct 
##     1761        0        2 
##                       
## Value         No   Yes
## Frequency    152  1609
## Proportion 0.086 0.914
data_train$PhoneService %>% 
  Hmisc::describe()
## . 
##        n  missing distinct 
##     5282        0        2 
##                     
## Value        No  Yes
## Frequency   530 4752
## Proportion  0.1  0.9
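
The stratification procedure described above can also be sketched by hand in base R. The example below builds a toy stand-in for the PhoneService column from the counts reported earlier (682 No, 6361 Yes) and samples the same fraction from each stratum. The data frame phone and the rounding rule are assumptions made for illustration, and rsample's internal implementation may differ.

```r
set.seed(123)

# Toy stand-in for the PhoneService column, using the counts reported above.
phone <- data.frame(
  id = seq_len(7043),
  PhoneService = rep(c("No", "Yes"), times = c(682, 6361))
)
prop <- 3/4

# Split the row indices by stratum, then sample the same fraction from each.
idx_by_stratum <- split(seq_len(nrow(phone)), phone$PhoneService)
train_idx <- unlist(lapply(idx_by_stratum, function(idx) {
  sample(idx, size = floor(prop * length(idx)))  # assumption: round down
}))

# Pool the per-stratum samples; the remaining rows form the test set.
train <- phone[train_idx, ]
test  <- phone[-train_idx, ]

# Share of each value in the two sets.
round(prop.table(table(train$PhoneService)), 3)
round(prop.table(table(test$PhoneService)), 3)
```

Because each stratum is sampled separately, the share of No responses in both sets is fixed by construction at roughly 682/7043, rather than left to chance.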

Upsampling and Downsampling

Data Resampling