Real-world data is never perfect, and you will frequently encounter missing observations. Sometimes a missing value is itself informative. Consider a pair of conditional features: asking someone whether they had lunch and what they ate for lunch. Values in the food column will be missing whenever lunch was not eaten. In this case, the missing data can be re-categorised as “None” and passed into the training process. The second category of missing data, random missingness, is less useful. It can arise for any number of reasons.
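The lunch example can be sketched in a few lines of base R. The `survey` data frame and its column names are hypothetical, invented purely for illustration; the point is only that a structurally missing value becomes an explicit category.

```r
# Hypothetical survey data: `food` is only recorded when lunch was eaten.
survey <- data.frame(
  had_lunch = c(TRUE, FALSE, TRUE, FALSE),
  food      = c("Soup", NA, "Salad", NA),
  stringsAsFactors = FALSE
)

# Recode the informative missing values as an explicit "None" category.
# The second condition guards against overwriting NAs for people who did eat.
survey$food[is.na(survey$food) & !survey$had_lunch] <- "None"
survey$food <- factor(survey$food)

survey
```

Restricting the recode to rows where `had_lunch` is `FALSE` keeps any genuinely random gaps (someone ate lunch but the food was not recorded) as `NA`, so the two kinds of missingness stay distinguishable.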
Some machine learning algorithms, such as decision trees, accommodate missing values, but many (linear regression, for example) do not. You will need to decide how to deal with these values. Two options are available: removal and imputation.
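As a rough illustration of the difference between the two options, the sketch below applies each to a small, made-up data frame (`toy` is hypothetical). Mean imputation is used only because it is the simplest strategy to show; it is an assumption of this example, not a recommendation.

```r
# Toy data frame (hypothetical) with one missing value per column.
toy <- data.frame(
  height = c(1.60, NA, 1.75, 1.82),
  weight = c(55, 70, NA, 90)
)

# Option 1: removal. Drop every row that contains at least one NA.
toy_removed <- na.omit(toy)

# Option 2: imputation. Replace each NA with the mean of its column.
toy_imputed <- toy
for (col in names(toy_imputed)) {
  col_mean <- mean(toy_imputed[[col]], na.rm = TRUE)
  toy_imputed[[col]][is.na(toy_imputed[[col]])] <- col_mean
}

nrow(toy_removed)   # rows left after removal
toy_imputed         # same number of rows as `toy`, no NAs
```

Removal discards half of this tiny dataset, while imputation keeps every row at the cost of inventing plausible values.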
To begin dealing with missing values, you must first visualise them. This page uses the iris dataset, from the R datasets package, with observations randomly removed and stored in an object named data.
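The exact procedure used to remove observations is not shown here. One way a comparable dataset could be built is sketched below; the seed and the choice of 15 missing values per column are assumptions, so the counts will not match the output shown on this page.

```r
# Assumption: 15 values (10% of the rows) are set to NA in every column.
set.seed(42)  # arbitrary seed, used only to make the example reproducible

data <- iris
n_missing <- 15

for (col in names(data)) {
  # Pick a different random set of rows for each column.
  rows <- sample(nrow(data), size = n_missing)
  data[rows, col] <- NA
}

head(data)
```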
The first step is to determine whether any data is missing:
any(is.na(data))
## [1] TRUE
…and if so, how is that missing data distributed?
# Count the missing values in each column
purrr::map_df(.x = data,
              .f = function(.x){sum(is.na(.x))})
## # A tibble: 1 × 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <int>       <int>        <int>       <int>   <int>
## 1           22          17           15          16      24
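If purrr is not available, the same per-column counts can be obtained in base R. The sketch below uses a small stand-in, `data_demo` (a hypothetical copy of iris with a handful of values removed by hand), so that it runs on its own.

```r
# Stand-in dataset: iris with a few values deliberately set to NA.
data_demo <- iris
data_demo[c(1, 5, 9), "Sepal.Length"] <- NA
data_demo[c(2, 4), "Species"] <- NA

# Number of missing values per column.
na_counts <- colSums(is.na(data_demo))

# Proportion of missing values per column.
na_props <- colMeans(is.na(data_demo))

na_counts
na_props
```

Proportions are often easier to compare than raw counts when columns are judged against a threshold for removal.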
The visdat package simplifies this process further with the vis_miss function, resulting in an intuitive visualisation of missing data.
visdat::vis_miss(data)