Vertikal Willis - Missing values handling

On this occasion, I would like to discuss missing values. There are various methods for imputing them, from descriptive statistics such as the mean, median, or mode to techniques such as the KNN Imputer. However, it is essential to first understand the domain context and the data itself. Before imputing, we should classify the missing values into one of three categories:

Missing Completely At Random (MCAR)

This means the values are missing without any discernible pattern or relation to other variables or factors. They are simply missing. For example, people leave a field in a form blank out of carelessness or negligence. There should not be many missing values of this type.

Missing At Random (MAR)

This means the missingness is related to other columns or variables in the dataset, so the missing value can often be inferred from them. In some cases the value is not truly absent but is really 0 or None. For example, people selling an iPhone online may leave the ‘Operating system’ field empty because iPhones always run iOS and never Android. A second example is people selling a house online who don’t specify the number of floors because it can be seen in the photos.

Missing Not At Random (MNAR)

This means the missingness cannot be explained by other variables in the dataset; it depends on the missing value itself or on external factors. For example, individuals with significant debt may choose not to disclose it. Imputing this type of missing value is more challenging because the cause is not recorded in the data.

After classifying them, let’s proceed to the practical demonstration of how to impute missing values.

Example 1

Let’s say you want to retrieve employee information from the servers to build a salary prediction model.

Name	Age	Position	Salary
Sarah Connor	35	Accounting	50k
John Connor	32	Marketing	35k
Kyle Reese	28	IT	45k
Vertikal		IT	45k

You can see that there’s a missing value in the ‘Age’ column for ‘Vertikal’. Since this data comes from employee records stored on the server, and nothing else in the table explains the gap, the missing value is most likely MCAR rather than MAR or MNAR. Because the share of missing values is small, an appropriate imputation method is the mean or median of the ‘Age’ column. If the share of missing values is large, it is better to report it to the database administrator, because in this context ‘Age’ has no relation to the other columns and can only be imputed from its own values.
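The median imputation described above can be sketched with pandas. This is a minimal illustration, not a prescribed workflow: the DataFrame is a hand-typed copy of the table, and the column name `Salary_k` (salary in thousands) is an assumption made so the values stay numeric.

```python
import pandas as pd

# Hand-typed copy of the employee table; Salary_k is a hypothetical
# column name holding the salary in thousands.
employees = pd.DataFrame({
    "Name": ["Sarah Connor", "John Connor", "Kyle Reese", "Vertikal"],
    "Age": [35, 32, 28, None],
    "Position": ["Accounting", "Marketing", "IT", "IT"],
    "Salary_k": [50, 35, 45, 45],
})

# Check how much of the column is missing before deciding to impute.
missing_share = employees["Age"].isna().mean()

# Fill the gap with the median of the observed ages.
age_median = employees["Age"].median()
employees["Age"] = employees["Age"].fillna(age_median)

print(f"Missing share: {missing_share:.0%}, imputed with median {age_median}")
```

The median is used here rather than the mean because it is less sensitive to a few unusually old or young employees; with only three observed ages the difference is small either way.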

Example 2

Imagine you are conducting data collection on an online vehicle listing website.

Brand	Type	Transmission	Color	Price
Toyota	Innova	Manual	White	200Jt
Honda	Brio	Automatic	Orange	125Jt
Honda	Brio	Manual	Black	110Jt
Suzuki	Ertiga	Manual	Red	140Jt
Suzuki	Ertiga		White	142Jt
Suzuki	Ertiga	Automatic	Black	165Jt

There’s a missing value in the ‘Transmission’ column, and upon investigation, it’s apparent that users mentioned the transmission type in the listing title rather than in the ‘Vehicle Transmission Type’ field. In this case, the missing value is of type MAR. Three possible approaches to impute it are: inspecting the ad directly, re-scraping the data to extract the ‘title’ information and applying text mining, or using a KNN Imputer based on the contextual knowledge that the same vehicle with automatic transmission is typically more expensive than with manual transmission. The third method, the KNN Imputer, is preferable because it takes less time and fewer resources than the other two.
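As a rough sketch of the KNN approach, the snippet below uses scikit-learn’s `KNNImputer`. That class only works on numeric data, so several choices here are assumptions rather than part of the original example: the 0/1 encoding of the transmission, the one-hot encoding of ‘Type’, the column name `Price_jt`, and `n_neighbors=1`. The price is left unscaled, which is acceptable on this toy table but would usually need attention on real data.

```python
import pandas as pd
from sklearn.impute import KNNImputer

# Hand-typed copy of the listing table; Price_jt is a hypothetical
# column name for the price written as "Jt" in the table.
cars = pd.DataFrame({
    "Brand": ["Toyota", "Honda", "Honda", "Suzuki", "Suzuki", "Suzuki"],
    "Type": ["Innova", "Brio", "Brio", "Ertiga", "Ertiga", "Ertiga"],
    "Transmission": ["Manual", "Automatic", "Manual", "Manual", None, "Automatic"],
    "Color": ["White", "Orange", "Black", "Red", "White", "Black"],
    "Price_jt": [200, 125, 110, 140, 142, 165],
})

# KNNImputer needs numbers: one-hot encode the vehicle type and
# map the transmission to 0/1 (assumed encoding, missing stays NaN).
codes = {"Manual": 0.0, "Automatic": 1.0}
features = pd.get_dummies(cars["Type"], dtype=float)
features["Price_jt"] = cars["Price_jt"].astype(float)
features["Transmission"] = cars["Transmission"].map(codes)

# With one neighbour, the gap takes the value of the most similar listing.
imputer = KNNImputer(n_neighbors=1)
imputed = imputer.fit_transform(features)

# Transmission is the last feature column; map the codes back to labels.
labels = {0.0: "Manual", 1.0: "Automatic"}
cars["Transmission"] = pd.Series(imputed[:, -1]).round().map(labels)

print(cars[["Type", "Transmission", "Price_jt"]])
```

On this toy table the closest listing to the 142Jt Ertiga is the 140Jt manual Ertiga, so the gap is filled with ‘Manual’, which matches the intuition that the automatic version costs noticeably more.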

Example 3

Let’s say you’re thinking about surveying gym-goers for some fitness insights.

Gender	Weight	Age	Membership
Male	95Kg	25	6M
Female		23	1Y
Male	85Kg	35	6M
Male	80Kg	40	2Y
Female		28	6M
Male	102Kg	38	3M
Female	55Kg	41	1Y

There are missing values in the ‘Weight’ column, and initially, you might think that they are MCAR. However, you come to realize that the missing values mostly occur for individuals with a gender of ‘Female’. This could be because females may be more reluctant to disclose their weight when they are overweight. If that is the reason, the missingness depends on the undisclosed weight itself, so you can consider these missing values as type MNAR. If there are only a few missing values, you can impute the weight for females by using the ‘max’ value for weight among females. However, if the missing values are more extensive, it’s essential to validate the assumption first before proceeding with imputation.
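The pattern check and the group-wise ‘max’ imputation described above might look like this in pandas. It is a sketch on a hand-typed copy of the table; the column name `Weight_kg` is an assumption made so the weights stay numeric.

```python
import pandas as pd

# Hand-typed copy of the survey table; Weight_kg is a hypothetical
# column name holding the weight in kilograms.
gym = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Male", "Female", "Male", "Female"],
    "Weight_kg": [95, None, 85, 80, None, 102, 55],
    "Age": [25, 23, 35, 40, 28, 38, 41],
    "Membership": ["6M", "1Y", "6M", "2Y", "6M", "3M", "1Y"],
})

# Share of missing weights per gender: a large gap between groups
# suggests the values are not missing completely at random.
missing_by_gender = gym["Weight_kg"].isna().groupby(gym["Gender"]).mean()
print(missing_by_gender)

# Fill each gap with the maximum observed weight within the same gender.
group_max = gym.groupby("Gender")["Weight_kg"].transform("max")
gym["Weight_kg"] = gym["Weight_kg"].fillna(group_max)
```

Note that the ‘max’ is computed only from the weights that were actually reported, so with a single reported female weight the imputed value is simply that one number. This is one reason the assumption needs validating when more values are missing.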

Example 4

Let’s imagine you want to build a machine learning model to predict whether an email is spam or not.

Domain	Words_count	From	Spam
Public	235	IND	Yes
Public	238	USA	Yes
Public	255	IND	Yes
Public	267	DMK	No
Public	310	EUR	No
	320	AUS	No

Let’s assume you have 10,000 rows like the ones above, and 98% of them have ‘Public’ in the ‘Domain’ column. In such cases, you don’t need to impute the missing values: the column is nearly constant, carries little information, and is likely to be dropped later in the analysis. Therefore, before imputing missing values, first choose the features you want to include in your analysis to avoid unnecessary effort.
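One way to spot such a column before spending time on imputation is to measure the share of its most frequent value. The sketch below uses synthetic data built to match the 98% figure; the helper `near_constant_columns`, the 0.95 threshold, and the generated word counts are all illustrative assumptions.

```python
import pandas as pd

# Synthetic data shaped like the scenario: 10,000 rows, 98% 'Public',
# the remaining 2% missing. The word counts are made up.
n = 10_000
emails = pd.DataFrame({
    "Domain": ["Public"] * 9_800 + [None] * 200,
    "Words_count": [200 + i % 150 for i in range(n)],
})


def near_constant_columns(df, threshold=0.95):
    """Return columns whose most frequent value (NaN included)
    covers at least `threshold` of the rows."""
    flagged = []
    for col in df.columns:
        top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share >= threshold:
            flagged.append(col)
    return flagged


near_constant = near_constant_columns(emails)

# Drop the flagged columns instead of imputing them.
reduced = emails.drop(columns=near_constant)
print(near_constant, list(reduced.columns))
```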

Example 5

Look at the dataset below:

Name	Age	Position	Salary
Sarah Connor	35	Unknown	50k
John Connor	32	Marketing	35k
Kyle Reese	28	Unknown	45k
Vertikal	32	IT	45k
John Cena	33	Marketing	35k
Tom	28	IT	45k

You may not observe any missing values in the data, but take a closer look at the ‘Position’ column, which contains the value ‘Unknown’. This should still be treated as missing data, because ‘Unknown’ simply signifies that the position information is not available. It’s a good reminder that missing values can take various forms and may not always be represented as NULL or NaN, so it’s crucial to stay vigilant when identifying and handling missing data in different formats.
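A simple way to handle such disguised missing values is to convert the placeholders to real missing values before any analysis. In this sketch the list of placeholder strings is an assumption; only ‘Unknown’ appears in the table, and a real dataset should be inspected to find its own variants.

```python
import pandas as pd

# Hand-typed copy of the table; Salary_k is a hypothetical column name
# holding the salary in thousands.
staff = pd.DataFrame({
    "Name": ["Sarah Connor", "John Connor", "Kyle Reese", "Vertikal", "John Cena", "Tom"],
    "Age": [35, 32, 28, 32, 33, 28],
    "Position": ["Unknown", "Marketing", "Unknown", "IT", "Marketing", "IT"],
    "Salary_k": [50, 35, 45, 45, 35, 45],
})

# Before cleaning, pandas reports no missing values at all.
missing_before = staff["Position"].isna().sum()

# Assumed list of strings that stand in for "no value".
placeholders = ["Unknown", "unknown", "N/A", "-", ""]

# mask() blanks out every entry that matches a placeholder.
staff["Position"] = staff["Position"].mask(staff["Position"].isin(placeholders))

missing_positions = staff["Position"].isna().sum()
print(f"Missing before: {missing_before}, after: {missing_positions}")
```

Once the placeholders are real missing values, the usual tools such as `isna()` and `fillna()` treat them the same way as any other gap.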

Conclusions

Understanding the data is crucial before you start imputing missing values. Carelessly applying imputation techniques without a clear understanding of the data context can lead to incorrect results, which could have significant consequences.