import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import gaussian_kde
import matplotlib as mpl
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.metrics import confusion_matrix
from pdpbox import pdp, info_plots
'figure.dpi'] = 300 #set figure dpi
mpl.rcParams[set() #set figure styling sns.
Table Of Contents
• Introduction
• Data Introduction
• Data Preparation (Import libraries, data cleaning & data wrangling)
• Basic exploratory data analysis
* Descriptive statistic
* Univariate analysis
* Multivariate analysis
• Deep exploratory data analysis
* Services and internet services analysis
* Monthly charges analysis
* Customer analysis
* Benefits analysis
* Churn analysis
• Modelling
* Features selection and encoding
* Splits data and define custom metrics
* Model building combination 1
* Model building combination 2
* Model building combination 3
• Model interpretation
* PDP and ICE plots
* Checking prediction’s confidence using confidence interval
Introduction
This project is a final project created as one of the requirements to pass the Dibimbing Data Science Bootcamp. The dataset I used is the telco customer churn dataset, which is popular and commonly used for practice. I chose this dataset because it has numerous columns and it represents a ‘real-life’ dataset from an actual telco company. The problem faced by the company is customer churn, which results in financial loss. The solution involves providing prospects and promotions to potential churned customers, but this action also incurs costs for the company. Therefore, accurately targeting the churned customers is crucial. Based on this problem, I set two main goals:
1.Look for useful information, insights, and patterns by performing exploratory data analysis.
2.Create an ML model that best fits the company’s goal of reducing losses.
Data Introduction
This dataset contain informations about customers that churn and not churned in a telco company. Below is the column informations:
1.customerID = customer unique ID.
2.gender = customer gender (M/F).
3.SeniorCitizen = old / young customer.
4.Partner = either a customer has partners or not.
5.Dependents = either a customer has dependents or not.
6.tenure = how long the customer subscribed (in month).
7.MultipleLines = either a customer using multiple lines or not (phone lines).
8.InternetService = either a customer using InternetService lines or not.
9.OnlineSecurity = either a customer has OnlineSecurity or not.
10.OnlineBackup = either a customer has OnlineBackup or not.
11.DeviceProtection = either a customer has DeviceProtection or not.
12.TechSupport = either a customer has TechSupport or not.
13.StreamingTV = either a customer has StreamingTV or not.
14.StreamingMovies = either a customer has StreamingMovie or not.
15.Contract = types of contract.
16.PaperlessBilling = either a customer has PaperlessBilling or not.
17.PaymentMethod = types of the payment method.
18.MonthlyCharges = how much charges per month.
19.TotalCharges = total charges of all time.
20.Churn = either a customer churn or not.
21.Hobby = customer hobby.
Data Preparation (Import libraries, data cleaning & data wrangling)
#dataset link
= 'https://raw.githubusercontent.com/vertikalwil/Data-Analyst/main/telco.csv' github
#import dataset
= pd.read_csv('telco.csv')
df df.head()
customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | Hobby | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7590-VHVEG | Female | 0 | Yes | No | 135 | No | No phone service | DSL | No | ... | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No | Swimming |
1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No | Running |
2 | 3668-QPYBK | Male | 0 | No | No | 140 | Yes | No | DSL | Yes | ... | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 7560 | Yes | Hiking |
3 | 7795-CFOCW | Male | 0 | No | No | 136 | No | No phone service | DSL | Yes | ... | Yes | No | No | One year | No | Bank transfer (automatic) | 42.45 | 1840.75 | No | Swimming |
4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes | Running |
5 rows × 22 columns
#take a look at the dataframe
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 22 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 customerID 7043 non-null object
1 gender 7043 non-null object
2 SeniorCitizen 7043 non-null int64
3 Partner 7043 non-null object
4 Dependents 7043 non-null object
5 tenure 7043 non-null int64
6 PhoneService 7043 non-null object
7 MultipleLines 7043 non-null object
8 InternetService 7043 non-null object
9 OnlineSecurity 7043 non-null object
10 OnlineBackup 7043 non-null object
11 DeviceProtection 6627 non-null object
12 TechSupport 7043 non-null object
13 StreamingTV 7043 non-null object
14 StreamingMovies 7043 non-null object
15 Contract 6798 non-null object
16 PaperlessBilling 7043 non-null object
17 PaymentMethod 7043 non-null object
18 MonthlyCharges 7043 non-null float64
19 TotalCharges 4859 non-null object
20 Churn 7043 non-null object
21 Hobby 4201 non-null object
dtypes: float64(1), int64(2), object(19)
memory usage: 1.2+ MB
There are 4 columns with missing values which are DeviceProtection, Contract, TotalCharges and Hobby.
#checking percentage of missing values
= ['DeviceProtection','Contract','TotalCharges','Hobby']
missingkolom for x in missingkolom:
print(f'Missing value of column {x} (%) : {round(df[x].isna().sum()/len(df) * 100,2)}')
Missing value of column DeviceProtection (%) : 5.91
Missing value of column Contract (%) : 3.48
Missing value of column TotalCharges (%) : 31.01
Missing value of column Hobby (%) : 40.35
#impute missing values with univariate imputation by value proportion
'DeviceProtection'] = df['DeviceProtection'].fillna(
df['No','Yes','No internet service'],
pd.Series(np.random.choice([= list(df['DeviceProtection'].value_counts(normalize=True)), size=len(df))))
p
'Contract'] = df['Contract'].fillna(
df['Month-to-month','Two year','One year'],
pd.Series(np.random.choice([= list(df['Contract'].value_counts(normalize=True)), size=len(df)))) p
Here I impute DeviceProtection and Contract with univariate imputation by value proportion for the following reasons:
1.The missing values is not that much (<10%).
2.The columns don’t have any relationship with other columns so that multivariate imputation is not possible.
3.Using proportion is more precise in this case rather than use ‘mode’.
#delete column Hobby
=['Hobby'],inplace=True) df.drop(columns
Reasons to delete:
1.Missing values is too many.
2.By business context, Hobby doesn’t give enough useful informations.
3.Cannot be imputed by multivariate imputation.
#impute TotalCharges from tenure and MonthlyCharges
'TotalCharges'] = df['TotalCharges'].fillna(df['tenure'] * df['MonthlyCharges']) df[
Even this column has so many missing values, I decided to impute it with multivariate imputation because:
1.By business context, TotalCharges is more or less tenure * MonthlyCharges.
2.So even the missing values are high, it can still be imputed with a strong justification.
#there's a space in the total charges column.
for x in df.TotalCharges:
try:
float(x)
except:
print(f'Unable to convert to float with this value : {x}')
Unable to convert to float with this value :
Unable to convert to float with this value :
Unable to convert to float with this value :
Unable to convert to float with this value :
Unable to convert to float with this value :
Unable to convert to float with this value :
Unable to convert to float with this value :
Unable to convert to float with this value :
Unable to convert to float with this value :
Unable to convert to float with this value :
Unable to convert to float with this value :
Unable to convert to float with this value :
== ' '].head() df[df.TotalCharges
customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
488 | 4472-LVYGI | Female | 0 | Yes | Yes | 0 | No | No phone service | DSL | Yes | ... | Yes | Yes | Yes | No | Month-to-month | Yes | Bank transfer (automatic) | 52.55 | No | |
753 | 3115-CZMZD | Male | 0 | No | Yes | 0 | Yes | No | No | No internet service | ... | No internet service | No internet service | No internet service | No internet service | Two year | No | Mailed check | 20.25 | No | |
936 | 5709-LVOEQ | Female | 0 | Yes | Yes | 0 | Yes | No | DSL | Yes | ... | No | No | Yes | Yes | Two year | No | Mailed check | 80.85 | No | |
1082 | 4367-NUYAO | Male | 0 | Yes | Yes | 0 | Yes | Yes | No | No internet service | ... | Yes | No internet service | No internet service | No internet service | Two year | No | Mailed check | 25.75 | No | |
1334 | 1768-ZAIFU | Female | 1 | No | No | 0 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 25.20 | Yes |
5 rows × 21 columns
When the tenure value is 0, the TotalCharges value is empty (‘space’). This is because customers who have just joined (less than a month) have not been charged yet, resulting in a TotalCharges value of 0. Since there are only 12 rows with this condition, I will delete them.
#drop rows that has empty TotalCharges.
= df.drop(df.index[df.TotalCharges == ' ']).reset_index(drop=True) df
#check for duplicate data, if True then there's no duplicate.
== len(df) df.customerID.nunique()
True
#feature engineered 2 new features for the sake of easier analysis.
'Services'] = df[['PhoneService','InternetService']].apply(
df[lambda x: 'Both' if list(x).count('No') == 0 else
'Internet Only' if x[0] == 'No' else 'Phone Only', axis=1)
'TotalBenefits'] = df.loc[:,'OnlineSecurity':'StreamingMovies']\
df[apply(lambda x: list(x).count('Yes'), axis=1) .
New features explanation:
1.Services = Combined values of PhoneService and InternetService.
2.TotalBenefits = Sum of benefits taken on OnlineSecurity until StreamingMovies.
#Change values of 1 and 0 to 'Yes' and 'No'
'SeniorCitizen'] = df.SeniorCitizen.apply(lambda x: 'Yes' if x == 1 else 'No') df[
Change numerical value to strings for simpler and consistent analysis.
#drop useless column
=['customerID'], inplace=True) df.drop(columns
#change columns object data type to numerical
= df.tenure.astype('int64')
df.tenure = df.MonthlyCharges.astype('float64')
df.MonthlyCharges = df.TotalCharges.astype('float64') df.TotalCharges
#checking values of real totalcharges and calculated totalcharges
'TotalChargesDiff'] = df[['tenure','MonthlyCharges','TotalCharges']].apply(
df[lambda x: round(abs(1 - (x[0] * x[1] / x[2])) * 100, 3), axis=1)
Here, I have created a new column called TotalChargesDiff to check the differences (%) between the actual TotalCharges value (from the dataset) and the calculated TotalCharges value (obtained by multiplying tenure with MonthlyCharges). If the difference is above 40%, I will consider those rows as invalid because the values of tenure and MonthlyCharges cannot be trusted.
'TotalChargesDiff'].sort_values(ascending=False).head(10) df[
0 13400.000
5 1357.404
18 788.048
19 330.214
3 213.633
128 73.511
47 72.615
4631 64.286
5802 63.380
20 58.263
Name: TotalChargesDiff, dtype: float64
You can observe that some data points have a TotalChargesDiff that reaches hundreds or even thousands.
#removing rows that have > 40% TotalChargesDiff.
= df[df.TotalChargesDiff < 40].reset_index(drop=True)
df =['TotalChargesDiff'], inplace=True) df.drop(columns
def numericategoric(df):
= len(df._get_numeric_data().columns)
num = len(df.columns) - num
cat print("TotalNumericalData = " + str(num))
print("TotalCategoricalData = " + str(cat))
print("Numerical = " + str(list(df._get_numeric_data().columns )))
print("Categorical = " + str(list(df.drop(df._get_numeric_data().columns, axis=1).columns)))
numericategoric(df)
TotalNumericalData = 4
TotalCategoricalData = 18
Numerical = ['tenure', 'MonthlyCharges', 'TotalCharges', 'TotalBenefits']
Categorical = ['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'Churn', 'Services']
Show numerical and categorical columns
#assign categorical and numerical columns on different dataframe for easier analysis.
= df._get_numeric_data()
dfnum = df.drop(columns = dfnum.columns) dfcat
Basic Exploratory Data Analysis.
Descriptive Statistics
#numerical columns describe.
dfnum.describe()
tenure | MonthlyCharges | TotalCharges | TotalBenefits | |
---|---|---|---|---|
count | 7012.000000 | 7012.000000 | 7012.000000 | 7012.000000 |
mean | 32.506560 | 64.732760 | 2286.410207 | 2.041358 |
std | 24.564234 | 30.109753 | 2265.759401 | 1.835792 |
min | 1.000000 | 12.000000 | 13.500000 | 0.000000 |
25% | 9.000000 | 35.450000 | 402.437500 | 0.000000 |
50% | 29.000000 | 70.300000 | 1397.250000 | 2.000000 |
75% | 56.000000 | 89.850000 | 3784.125000 | 3.000000 |
max | 140.000000 | 118.750000 | 8684.800000 | 6.000000 |
1.All columns seems to have a normal min-max values. Nothing weird here.
2.Average tenure is about 30 months which is pretty low.
3.Average MonthlyCharge is about 65-70 USD which is pretty good.
4.Out of 6 benefits available, the average taken by customer is around 2, which is pretty low.
dfcat.describe()
gender | SeniorCitizen | Partner | Dependents | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | Churn | Services | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 7012 | 7012 | 7012 | 7012 | 7012 | 7012 | 7012 | 7012 | 7012 | 7012 | 7012 | 7012 | 7012 | 7012 | 7012 | 7012 | 7012 | 7012 |
unique | 2 | 2 | 2 | 2 | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 2 | 4 | 2 | 3 |
top | Male | No | No | No | Yes | No | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Electronic check | No | Both |
freq | 3542 | 5874 | 3624 | 4920 | 6336 | 3372 | 3087 | 3486 | 3075 | 3072 | 3459 | 2798 | 2769 | 3851 | 4155 | 2354 | 5155 | 4821 |
1.Male and Female has the same proportion.
2.Most of customers is non SeniorCitizen with no Partner and No Dependents.
3.Favorite InternetService is Fiber optic.
4.Majority of customers is subscribed with ‘Month-to-month’ Contract.
5.5155 out of 7012 is non-Churn customers which make this dataset is imbalanced.
6.All these columns have low cardiality values.
for col in dfcat.columns:
print(f'Value counts for column {col}:')
print(df[col].value_counts())
print('---'*10)
print('\n')
Value counts for column gender:
Male 3542
Female 3470
Name: gender, dtype: int64
------------------------------
Value counts for column SeniorCitizen:
No 5874
Yes 1138
Name: SeniorCitizen, dtype: int64
------------------------------
Value counts for column Partner:
No 3624
Yes 3388
Name: Partner, dtype: int64
------------------------------
Value counts for column Dependents:
No 4920
Yes 2092
Name: Dependents, dtype: int64
------------------------------
Value counts for column PhoneService:
Yes 6336
No 676
Name: PhoneService, dtype: int64
------------------------------
Value counts for column MultipleLines:
No 3372
Yes 2964
No phone service 676
Name: MultipleLines, dtype: int64
------------------------------
Value counts for column InternetService:
Fiber optic 3087
DSL 2410
No 1515
Name: InternetService, dtype: int64
------------------------------
Value counts for column OnlineSecurity:
No 3486
Yes 2011
No internet service 1515
Name: OnlineSecurity, dtype: int64
------------------------------
Value counts for column OnlineBackup:
No 3075
Yes 2422
No internet service 1515
Name: OnlineBackup, dtype: int64
------------------------------
Value counts for column DeviceProtection:
No 3072
Yes 2416
No internet service 1524
Name: DeviceProtection, dtype: int64
------------------------------
Value counts for column TechSupport:
No 3459
Yes 2038
No internet service 1515
Name: TechSupport, dtype: int64
------------------------------
Value counts for column StreamingTV:
No 2798
Yes 2699
No internet service 1515
Name: StreamingTV, dtype: int64
------------------------------
Value counts for column StreamingMovies:
No 2769
Yes 2728
No internet service 1515
Name: StreamingMovies, dtype: int64
------------------------------
Value counts for column Contract:
Month-to-month 3851
Two year 1697
One year 1464
Name: Contract, dtype: int64
------------------------------
Value counts for column PaperlessBilling:
Yes 4155
No 2857
Name: PaperlessBilling, dtype: int64
------------------------------
Value counts for column PaymentMethod:
Electronic check 2354
Mailed check 1599
Bank transfer (automatic) 1539
Credit card (automatic) 1520
Name: PaymentMethod, dtype: int64
------------------------------
Value counts for column Churn:
No 5155
Yes 1857
Name: Churn, dtype: int64
------------------------------
Value counts for column Services:
Both 4821
Phone Only 1515
Internet Only 676
Name: Services, dtype: int64
------------------------------
Show all value counts for each categorical columns.
Univariate Analysis
= plt.subplots(1,4, figsize=(10, 4))
fig, axarr for x in dfnum.columns:
=True)
axarr[dfnum.columns.get_loc(x)].boxplot(df[x],patch_artist
axarr[dfnum.columns.get_loc(x)].set_xlabel(x)'Outliers checking on numeric columns')
plt.suptitle(=1)
fig.tight_layout(pad plt.show()
Will drop outlier in tenure.
#drop outlier in tenure
= df[df.tenure < 125] df
#plot distribution for numerical columns
= plt.subplots(1,4, figsize=(10, 4))
fig, axarr for x in dfnum.columns:
=dfnum[x],
sns.histplot(data='skyblue',
color=True,
kde='none',
edgecolor=axarr[dfnum.columns.get_loc(x)])
ax'Distribution plot', weight='bold')
plt.suptitle(=1) fig.tight_layout(pad
Above plots are distribution plots on all numerical columns.
1.tenure and MonthlyCharges have a ‘U-shaped’ distribution.
2.TotalCharges has a positive-skew distribution.
#count plot for categorical columns
=(15,12))
plt.figure(figsize
= dfcat.columns
features for i in np.arange(1, len(features)+1):
5, len(features)//3 - 2, i)
plt.subplot(=df[features[i-1]], color='green')
sns.countplot(x=10)
plt.xticks(rotation-1])
plt.xlabel(features[i'CountPlot', size=19, weight='bold')
plt.suptitle(= 1) plt.tight_layout(pad
Here I count plotted all categorical columns.
Multivariate Analysis
#count plots against 'churn'
=(15,12))
plt.figure(figsize
= dfcat.columns
features for i in np.arange(1, len(features)+1):
5, len(features)//3 - 2, i)
plt.subplot(=df, x=df[features[i-1]], hue='Churn')
sns.countplot(data={'size': 8})
plt.legend(prop=10)
plt.xticks(rotation-1])
plt.xlabel(features[i'CountPlot vs Churned', size=19, weight='bold')
plt.suptitle(= 1) plt.tight_layout(pad
I added Churn count into the categorical plots.
1.You can see for column Gender, the values and Churn count is pretty equal which make this column will have a very low predictive power.
2.For InternetService, fiber optic has way higher in churn probability compare to DSL.
3.Same ways also applied on Month-to-month Contract and Electronic-check PaymentMethod.
#distribution plots against 'churn'
= plt.subplots(1,4, figsize=(10, 4))
fig, axarr for x in dfnum.columns:
=df,
sns.histplot(data= dfnum[x],
x ='skyblue',
color=True,
kde='none',
edgecolor=axarr[dfnum.columns.get_loc(x)],
ax='Churn')
hue
"Distribution plot", weight='bold')
plt.suptitle(=1) fig.tight_layout(pad
Let’s also compare Churn distribution on numerical columns. Customers tend to churn when the tenure is low and not churn when the MonthlyCharges is very low. I will do further analysis about these columns later.
#change binary column into numerical
= df.copy()
dfcorr = ['gender','SeniorCitizen','Partner','Dependents','PhoneService','PaperlessBilling','Churn']
binary = {
value_mapping 'No': 0,
'Yes' : 1,
'Male' : 1,
'Female' : 0
}
for col in binary:
= dfcorr[col].map(value_mapping).astype('int64') dfcorr[col]
=(10,8))
plt.figure(figsize=True, fmt='.2f')
sns.heatmap(dfcorr.corr(), annot plt.show()
Correlation!
1.TotalCharges and tenure have a high positive correlation (causation : the longer the customers subscribed, the more they paid).
2.TotalBenefits also has a strong correlation with MonthlyCharges and TotalCharges (causation : more benefits taken also make the MonthlyCharges higher).
Note : Correlation doesn’t indicate causation. Understanding the specific context, industry knowledge, and conducting further analysis or experiments can help determine if there is a causal relationship between the variables or if other factors are influencing the observed correlations.
Deep-Dive Exploratory Data Analysis
#create a function to plot churn probability for numerical columns.
def prob_plot(df,colom,x):
= df[colom].mean()
means = df[colom].median()
medians = df[df.Churn == 'Yes'][colom].astype('float64')
data = df[df.Churn == 'No'][colom].astype('float64')
data1
= gaussian_kde(data)
kde = gaussian_kde(data1)
kde1 = np.linspace( min(data), max(data), 200)
dist_space = np.linspace( min(data1), max(data1), 200)
dist_space1 ='Churned', color='orange' )
axarr[x].plot( dist_space, kde(dist_space), label='Not churn', color='blue')
axarr[x].plot( dist_space1, kde1(dist_space1), label= means, linestyle = '--', color='g', label='Mean')
axarr[x].axvline(x = medians, linestyle = '--', color='r', label='Median')
axarr[x].axvline(x 'Probability', fontweight='bold', size=12)
axarr[x].set_title(set(ylabel = 'Probability', xlabel = colom)
axarr[x]. axarr[x].legend()
Services & InternetService analysis
#count plot for Services.
= plt.subplots(1,2, figsize=(12, 6))
fig, axarr =df[df.Services == 'Both'],
sns.countplot(data='InternetService',
x=axarr[0])
ax
=df[df.Services == 'Internet Only'],
sns.countplot(data='InternetService',
x=axarr[1])
ax
0].set_title("Both phone service & internet service", weight='bold')
axarr[1].set_title("Internet service only", weight='bold')
axarr[
"Comparison of internet services on product services")
plt.suptitle(=1)
plt.tight_layout(pad plt.show()
It is observed that customers tend to prefer Fiber optic over DSL for ‘phone & internet service’. However, when considering ‘internet service only’ without phone, there is no option for Fiber optic available. This suggests that in order to utilize Fiber optic, a phone connection (or phone service) is required.
#plot churn probability for 'services' & 'internet services'
= plt.subplots(1,2, figsize=(12, 6))
fig, axarr
'Services')['Churn'].agg(lambda x: list(x).count('Yes') * 100 / len(list(x)))\
df.groupby(='bar', width=0.3,rot=True, ax=axarr[0])
.plot(kind
!= 'Phone Only'].groupby(['InternetService'])['Churn'].agg(lambda x: list(x).count('Yes') * 100 / len(x))\
df[df.Services ='bar', width=0.3,rot=True, ax=axarr[1])
.plot(kind
0].set_title('Services churn probability', weight='bold')
axarr[1].set_title('Internet service churn probability', weight='bold')
axarr[
=1)
plt.tight_layout(pad plt.show()
Customers that used ‘Both’ services and InternetService fiber optic tends to churn more.
MonthlyCharges analysis
#checking InternetService price.
= df[(df.Services == 'Both') & (df.TotalBenefits == 0) & (df.MultipleLines == 'No')].copy()
df_filtered 'InternetService')['MonthlyCharges'].mean() df_filtered.groupby(
InternetService
DSL 44.963824
Fiber optic 70.083186
Name: MonthlyCharges, dtype: float64
The price of Fiber optic is higher, around 25 USD, compared to DSL. However, it’s important to keep in mind that these prices include a phone service with a single line.
= df[(df.Services == 'Phone Only') & (df.TotalBenefits == 0)].copy()
df_filtered 'MultipleLines')['MonthlyCharges'].mean() df_filtered.groupby(
MultipleLines
No 19.953819
Yes 24.973716
Name: MonthlyCharges, dtype: float64
=(10,5))
plt.figure(figsize=df_filtered,
sns.histplot(data='MonthlyCharges',
x='MultipleLines',
hue='stack')
multiple
'Phone service distribution', weight='bold')
plt.title(
plt.tight_layout() plt.show()
We can see that the price for a ‘phone service’ with a single line is around 20 USD, while the price for a ‘phone service’ with multiple lines is around 25 USD. This also means that the price for DSL is around 25 USD, while the price for Fiber optic is around 50 USD, which is twice as much as DSL.
= plt.subplots(1,2, figsize=(12, 6))
fig, axarr = sns.histplot(data=df,
mc_dist = 'MonthlyCharges',
x ='Churn',
hue=axarr[0],
ax='stack')
multiple
0].set_title('Distribution', fontweight='bold', size=12)
axarr[
'MonthlyCharges',1)
prob_plot(df,1].legend(loc='upper right')
axarr[ plt.show()
At a MonthlyCharges range of approximately +- 20 USD, the ratio of non-churn customers is very high. It is known that products within this price range are typically ‘phone service only’. However, between the price range of 60 - 100 USD, the churn probability increases significantly. I am planning to conduct further analysis specifically for customers within this price range.
= df[(df.InternetService == 'Fiber optic') & (df.Services == 'Both') & (df.MultipleLines == 'No')]
df_filtered = df_filtered.groupby('TotalBenefits')['MonthlyCharges'].agg('mean').reset_index()
df_agg 'MonthlyCharges'] = df_agg.MonthlyCharges.round() df_agg[
= plt.subplots(1,2, figsize=(12, 6))
fig, axarr =df_filtered,
sns.scatterplot(data='MonthlyCharges',
x='TotalBenefits',
y=35,
s=axarr[0])
ax
= 'TotalBenefits', y = 'MonthlyCharges')
sns.barplot(df_agg, x 0].set_title('MonthlyCharges vs Totalbenefits', weight='bold')
axarr[1].set_title('Average MonthlyCharges vs Totalbenefits', weight='bold')
axarr[= 1)
fig.tight_layout(pad plt.show()
Here, you can observe that as more TotalBenefits are taken, the MonthlyCharges also increase. On the left plot, you can see that there are 5 outlier data points, which will be removed later.
= df[(df.InternetService == 'Fiber optic') & (df.Services == 'Both')\
df_fil & (df.MultipleLines == 'No') & (df.TotalBenefits == 1)]
= ['OnlineSecurity','OnlineBackup','DeviceProtection','TechSupport','StreamingTV','StreamingMovies']
benefits
= pd.DataFrame()
df_benefit for x in benefits:
= pd.DataFrame([x,
df_value == 'Yes']['MonthlyCharges'].min(),
df_fil[df_fil[x] == 'Yes']['MonthlyCharges'].max(),
df_fil[df_fil[x] == 'Yes']['MonthlyCharges'].mean(),
df_fil[df_fil[x] == 'Yes']['MonthlyCharges'].median()
df_fil[df_fil[x]
]).transpose()
= pd.concat([df_benefit, df_value])
df_benefit = ['Benefit','MinCharges','MaxCharges','MeanCharges','MedianCharges'] df_benefit.columns
df_benefit
Benefit | MinCharges | MaxCharges | MeanCharges | MedianCharges | |
---|---|---|---|---|---|
0 | OnlineSecurity | 73.2 | 80.3 | 74.952857 | 75.0 |
0 | OnlineBackup | 72.75 | 76.65 | 74.708511 | 74.65 |
0 | DeviceProtection | 69.55 | 76.65 | 74.258889 | 74.8 |
0 | TechSupport | 73.85 | 76.55 | 75.045455 | 74.7 |
0 | StreamingTV | 77.65 | 81.9 | 79.746825 | 79.75 |
0 | StreamingMovies | 12.0 | 86.45 | 79.16746 | 80.0 |
Above are the prices of benefits with Fiber optic and a single line phone connection. You can see that StreamingTV and StreamingMovies are more expensive compared to other benefits, approximately +- 5 USD.
Now that we know the prices of every product, here’s a recap:
DSL = approximately 25 USD.
Fiber optic = approximately 50 USD.
Phone service (single line) = approximately 20 USD.
Phone service (multiple lines) = approximately 25 USD.
OnlineSecurity - TechSupport = approximately 5 USD.
StreamingTV - StreamingMovies = approximately 10 USD.
With this data, we can perform a simple ‘anomaly detection’ by manually calculating the MonthlyCharges and comparing them with the actual MonthlyCharges, similar to how we calculated the TotalChargesDiff above.
#checking for MonthlyCharges values with the calculated one (similar with checking TotalCharges difference).
def MonthlyChargesDiff(x):
= 0
estimation if x['PhoneService'] == 'Yes':
+= 20
estimation if x['MultipleLines'] == 'Yes':
+= 5
estimation if x['InternetService'] == 'DSL':
+= 25
estimation if x['InternetService'] == 'Fiber optic':
+= 50
estimation
if (x['StreamingTV'] == 'Yes') & (x['StreamingMovies'] == 'Yes'):
+= 20 + (x['TotalBenefits'] - 2) * 5
estimation elif (x['StreamingTV'] == 'Yes') | (x['StreamingMovies'] == 'Yes'):
+= 10 + (x['TotalBenefits'] - 1) * 5
estimation else:
+= x['TotalBenefits'] * 5
estimation
return abs(1 - (estimation / x['MonthlyCharges'])) * 100
'MonthlyChargesEstimationDifference'] = df.apply(MonthlyChargesDiff, axis=1) df[
> 40][['MonthlyCharges','MonthlyChargesEstimationDifference']] df[df.MonthlyChargesEstimationDifference
MonthlyCharges | MonthlyChargesEstimationDifference | |
---|---|---|
12 | 29.00 | 296.551724 |
389 | 12.00 | 733.333333 |
666 | 12.00 | 566.666667 |
859 | 26.41 | 278.644453 |
1439 | 18.26 | 447.645126 |
2185 | 21.63 | 362.320851 |
4090 | 31.26 | 219.897633 |
5848 | 15.00 | 466.666667 |
6718 | 21.00 | 304.761905 |
You can see that there are 9 rows with extreme MonthlyCharges values. These are considered as ‘anomalies’, so let’s remove them.
#remove MonthlyCharges extreme values.
= df[df.MonthlyChargesEstimationDifference < 40].reset_index(drop=True) df
Customer analysis
#creating a function to engineered a new feature.
def statuss(x):
= list(x)
x if (x[0] == 'Yes') & (x[1] == 'Yes'):
return 'Both'
elif (x[0] == 'Yes') & (x[1] == 'No'):
return 'Partner Only'
elif (x[0] == 'No') & (x[1] == 'Yes'):
return 'Dependent Only'
else:
return 'Single'
'Status'] = df[['Partner','Dependents']].apply(statuss, axis=1) df[
I have created a new feature called ‘Status’. This feature is derived from the columns Partner and Dependents.
1.If customers have both Partner and Dependents, it will be labeled as ‘Both’.
2.If customers have Partner but no Dependents, it will be labeled as ‘Partner Only’.
3.If customers have Dependents but no Partner, it will be labeled as ‘Dependent Only’.
4.If customers have neither Partner nor Dependents, it will be labeled as ‘Single’.
=(10,5))
plt.figure(figsize'Status', ascending=True), x='Status')
sns.countplot(df.sort_values('Status Count', size=16, weight='bold')
plt.title(
plt.tight_layout() plt.show()
Majority of customers are Single.
= plt.subplots(1,3, figsize=(15, 6))
fig, axarr = ['tenure','MonthlyCharges','TotalBenefits']
k for x in k:
=df.groupby(['Status'])[[x]].mean().reset_index(),
sns.barplot(data='Status',
x=x,
y=axarr[k.index(x)],
ax=['grey', 'g','m','b'])
palette
f'{x} average', weight='bold', size=15)
axarr[k.index(x)].set_title(
fig.tight_layout() plt.show()
Customers labeled as ‘Partner Only’ are considered the best since they have the longest tenure and the highest MonthlyCharges. The second-best group is ‘Both’, although these customers may not have MonthlyCharges as high as those in the ‘Single’ group, their tenure is almost double that of the ‘Single’ group.
=(10,5))
plt.figure(figsize=df.groupby('Status')[['Churn']].agg(lambda x: list(x).count('Yes') / len(x)).reset_index(),
sns.barplot(data='Status',
x='Churn')
y
'Churn Probability', weight='bold', size=16)
plt.title(
plt.tight_layout() plt.show()
Single customers have the highest churn probability.
=True) df.SeniorCitizen.value_counts(normalize
No 0.837903
Yes 0.162097
Name: SeniorCitizen, dtype: float64
The majority of customers are young people.
= df.groupby('SeniorCitizen')[['Status']]
df_status = df_status.agg(Single = ('Status', lambda x: list(x).count('Single') * 100 / len(x)),
df_status = ('Status', lambda x: list(x).count('Partner Only') * 100 / len(x)),
PartnerOnly = ('Status', lambda x: list(x).count('Both') * 100 / len(x)),
Both = ('Status', lambda x: list(x).count('Dependent Only') * 100 / len(x)))
DependentOnly = df_status.reset_index().melt(id_vars='SeniorCitizen')
df_status = df_status.rename(columns={'variable':'Status'}) df_status
=(10,5))
plt.figure(figsize
=df_status,
sns.barplot(data='SeniorCitizen',
x ='value',
y='Status')
hue
"Status comparison between senior citizen", size=15, weight='bold')
plt.title(
plt.tight_layout() plt.show()
For non-Senior citizens, ‘Single’ customers have the highest frequency, followed by ‘Both’. For Senior citizens, ‘Single’ is also the highest category, but the difference with ‘PartnerOnly’ is not as significant. From the plots above, we can also conclude that young people tend to have dependents more than older people.
= df.groupby('Status')[['OnlineSecurity','OnlineBackup',
df_status 'DeviceProtection','TechSupport','StreamingTV','StreamingMovies']]\
lambda x: list(x).count('Yes'))
.agg(
'total'] = df_status.apply('sum',axis=1)
df_status[
for x in df_status.drop(columns='total').columns:
= (df_status[x] * 100 / df_status.total).round()
df_status[x] ='total', inplace=True) df_status.drop(columns
='bar', rot=0)
df_status.plot(kind'Benefit count comparison between status (%)', size=12, weight='bold')
plt.title(=(1, 1))
plt.legend(bbox_to_anchor plt.show()
Single and PartnerOnly customers tend to prefer entertainment benefits such as StreamingTV and StreamingMovies compared to other customers. Additionally, these customers show a lower preference for using TechSupport and OnlineSecurity.
Benefits analysis
= plt.subplots(1,2, figsize=(12, 6))
fig, axarr != 'Phone Only'],
sns.countplot(df[df.Services = 'TotalBenefits', ax=axarr[0])
x
'OnlineSecurity','OnlineBackup','DeviceProtection','TechSupport','StreamingTV','StreamingMovies']]\
sns.barplot(df[[apply(lambda x: list(x).count('Yes')).reset_index(), x = 'index', y = 0, ax=axarr[1])
.
0].set_title('TotalBenefits Count', size=12, weight='bold')
axarr[0].set(ylabel=None)
axarr[
1].set(ylabel=None)
axarr[1].set_title('Benefits Count', size=12, weight='bold')
axarr[1].tick_params(axis='x', rotation=25)
axarr[
fig.tight_layout() plt.show()
The average number of TotalBenefits taken by customers is around 3, with StreamingTV and StreamingMovies being the most popular choices.
= pd.DataFrame()
df_total for x in ['OnlineSecurity','OnlineBackup','DeviceProtection','TechSupport','StreamingTV','StreamingMovies']:
= list(df[df[x] == 'Yes']['Churn']).count('Yes') / len(df[df[x] == 'Yes']['Churn'])
total = pd.concat([df_total, pd.DataFrame([x],[total])]) df_total
= plt.subplots(1,2, figsize=(12, 6))
fig, axarr 'TotalBenefits')[['Churn']]\
sns.barplot(df.groupby(lambda x: list(x).count('Yes') * 100 / len(x)).round().reset_index(),
.agg(='TotalBenefits',
x='Churn', ax=axarr[0])
y
sns.barplot(df_total.reset_index(), =0,
x='index',
y=axarr[1])
ax
0].set_title('TotalBenefits Churn probability', size=12, weight='bold')
axarr[0].set(ylabel=None)
axarr[
1].set(ylabel=None, xlabel='Benefits')
axarr[1].set_title('Benefits Churn probability', size=12, weight='bold')
axarr[1].tick_params(axis='x', rotation=25)
axarr[
fig.tight_layout() plt.show()
While StreamingTV and StreamingMovies are the most favored choices, the churn probability associated with them is also the highest.
Churn analysis
= plt.subplots(1, figsize=(10, 6))
fig, axarr =df,
sns.scatterplot(data= 'MonthlyCharges',
x = 'tenure',
y ='Churn',
hue =20)
s
68 , 97),20, alpha=0.2, color='green')
plt.fill_between((68 , 97),52.5, 72.5, alpha=0.2, color='blue')
plt.fill_between((
'MonthlyCharges vs tenure vs Churn', size = 15, weight = 'bold')
plt.title( plt.show()
Here, I have created two areas, denoted by green and blue, both focusing on MonthlyCharges in the range of 70 - 95 USD. This price range corresponds to the highest churn probability. The green area represents customers with low tenure and is predominantly occupied by churned customers, while the blue area represents customers with high tenure and is predominantly occupied by non-churned customers.
1, figsize=(15, 8))
plt.subplots(=df,
sns.scatterplot(data= 'MonthlyCharges',
x = 'tenure',
y =35,
s='TotalBenefits',
hue='InternetService',
style='coolwarm')
palette
68 , 97),20, alpha=0.15, color='green')
plt.fill_between((68 , 97),52.5, 72.5, alpha=0.15, color='blue')
plt.fill_between((
'MonthlyCharges vs tenure vs TotalBenefits vs InternetService', size = 15, weight = 'bold')
plt.title( plt.show()
Still on the same plot, I have added TotalBenefits and InternetService. It can be observed that in the green area, Fiber optic is the dominant InternetService with low TotalBenefits. On the other hand, the blue area is dominated by DSL with high TotalBenefits. This indicates that customers, at the same price point, tend to choose DSL with high TotalBenefits rather than Fiber optic with low TotalBenefits. Note that Fiber optic prices are doubled than DSL.
With the observed pattern above, we can create an important new feature, which we will refer to as ‘FO_LB’ (Fiber optic_Low benefit). I will assign a value of ‘1’ to indicate that the internet service is Fiber optic and the Totalbenefits taken are less than or equal to 3. For other cases, I will assign ‘0’.
'FO_LB'] = df[['InternetService','TotalBenefits']].apply(
df[lambda x: 1 if (x['InternetService'] == 'Fiber optic') & (x['TotalBenefits'] <= 3) else 0, axis=1)
Modelling
One important thing to address before we proceed is considering the types of errors to make this project as realistic as possible. Typically, there are two types of errors: false positive (FP) and false negative (FN). However, in this project, I will introduce three types of errors.
1.FP: False positive
2.FN1: False negative for customers with MonthlyCharges below 95 USD
3.FN2: False negative for customers with MonthlyCharges above 95 USD (VIP customers)
Let’s agree on the misclassification ratio, which is FP:FN1:FN2 = 1:3:5
It’s important to note that this dataset is imbalanced, meaning there is a significant difference in the number of samples between the classes.
Based on these problems, we can set up our model’s parameters as follows:
1.Hyperparameter tuning.
2.Decision threshold tuning.
3.Oversampling data using SMOTE.
4.Applying weights to the models.
I will be using Random Forest, XGBoost, and Logistic Regression.
Metrics:
Custom scoring based on sample misclassification.
Precision.
Recall.
F1_score.
Once the models are evaluated using these metrics, I will interpret the best model.
Feature selection & encoding
= df.copy() df1
= df1.copy() df
#drop columns
= ['Services','MonthlyChargesEstimationDifference','Status','PaperlessBilling','PaymentMethod'], inplace=True) df.drop(columns
I have dropped the columns from ‘Services’ to ‘Status’ as these columns were engineered features created for simpler exploratory data analysis (EDA). Additionally, I have also dropped the ‘PaperlessBilling’ and ‘PaymentMethod’ columns because, in the business context, these columns are considered irrelevant for determining customer churn since they represent optional ‘features’ for customers.
#converting 'No internet service' to 'No' in benefit columns.
= ['OnlineSecurity','OnlineBackup','DeviceProtection','TechSupport','StreamingTV','StreamingMovies']
KolomBenefit
for x in KolomBenefit:
= df[x].apply(lambda x: 'No' if x == 'No internet service' else x) df[x]
#converting 'No phone service' to 'No'
'MultipleLines'] = df[x].apply(lambda x: 'No' if x == 'No phone service' else x) df[
#dict to mapping string to numerical.
= {
value_mapping 'No': 0,
'Yes' : 1,
'Male' : 1,
'Female' : 0
}
#binary encoding
= list(df.drop(columns=['tenure','InternetService','MonthlyCharges','TotalCharges','TotalBenefits','Contract','FO_LB']).columns)
binary
for col in binary:
= df[col].map(value_mapping).astype('int64') df[col]
#label encoding
'Contract'] = df['Contract'].apply(lambda x: 1 if x == 'Month-to-month' else 2 if x == 'One year' else 3) df[
#one hot encoding
= pd.get_dummies(df, columns=['InternetService']) df
#feature selection
= df[['Contract','tenure','InternetService_Fiber optic','MonthlyCharges','FO_LB','InternetService_No','Churn']]
df = df.rename(columns={
df 'InternetService_Fiber optic':'Fiber_optic',
'InternetService_No':'No_internet'})
Here I only choose a feature that have a strong predictive power (by using feature of importances)
Splits data and define custom function
#split train-test
= train_test_split(df.drop(columns='Churn'), df.Churn.to_numpy(), test_size = 0.2, random_state=123)
X_train, X_test, y_train, y_test = X_train.reset_index(drop=True)
X_train = X_test.reset_index(drop=True) X_test
#custom function to do threshold tuning and custom metrics (used in GridSearchCV)
def my_scorer_2(clf, X, y_true, thres = np.arange(0.1,1,0.1)):
= {}
result_dict for threshold in np.atleast_1d(thres):
= (clf.predict_proba(X)[:,1] > threshold).astype(int)
y_pred = (X['MonthlyCharges'] > 95).to_numpy().astype(int)
X_segment = np.column_stack((X_segment, y_pred, y_true))
y_stack = y_stack[y_stack[:,0] == 0], y_stack[y_stack[:,0] == 1]
y_stack_reg, y_stack_vip = confusion_matrix(y_stack_reg[:,2], y_stack_reg[:,1])
cm_reg = confusion_matrix(y_stack_vip[:,2], y_stack_vip[:,1])
cm_vip = cm_reg[1][0], cm_vip[1][0]
fn_reg, fn_vip = cm_reg[0][1] + cm_vip[0][1]
fp = (fp * 1) + (fn_reg * 3) + (fn_vip * 5)
loss_score = np.array([loss_score, metrics.precision_score(y_true, y_pred, zero_division = 0),
result_dict[threshold]
metrics.recall_score(y_true, y_pred), metrics.f1_score(y_true, y_pred)])
= np.array([np.insert(value, 0, key) for key, value in result_dict.items()])
result_np = result_np[result_np[:,1] == np.min(result_np[:,1])][0]
best_np return best_np
def my_scorer_threshold(clf, X, y_true):
return my_scorer_2(clf, X, y_true)[0]
def my_scorer_ls(clf, X, y_true):
return my_scorer_2(clf, X, y_true)[1]
def my_scorer_precision(clf, X, y_true):
return my_scorer_2(clf, X, y_true)[2]
def my_scorer_recall(clf, X, y_true):
return my_scorer_2(clf, X, y_true)[3]
def my_scorer_f1(clf, X, y_true):
return my_scorer_2(clf, X, y_true)[4]
#Grid scoring parameter
= {
grid_scoring 'threshold': my_scorer_threshold,
'loss_score': my_scorer_ls,
'precision': my_scorer_precision,
'recall': my_scorer_recall,
'f1': my_scorer_f1
}
#define weight by missclassification cost which is FP:FN1:FN2 = 1:3:5
= (X_train['MonthlyCharges'] > 95).to_numpy().astype(int)
X_segment = np.column_stack((X_segment, y_train))
arr_weight = np.apply_along_axis(lambda x: 1 if x[1] == 0 else 5 if x[0] == 1 else 3 , axis=1, arr=arr_weight) weight
Model building 1 / Hyperparameter tuning + threshold tuning
#RANDOM FOREST MODELLING
#define parameter for tuning
= {
param_grid_rf 'n_estimators': [250 , 400],
'max_depth': [10, 25, 50],
'min_samples_split': [25, 50, 70, 120],
'min_samples_leaf': [50, 75, 120],
'bootstrap' : [True, False]
}
#run grid search cv
= GridSearchCV(estimator = RandomForestClassifier(),
rf = param_grid_rf,
param_grid =5,
cv= grid_scoring, refit=False, n_jobs = -1)
scoring
#fit grid search cv with train data
rf.fit(X_train, y_train)
#selecting the best parameter and metrics
= pd.DataFrame(rf.cv_results_)
grid_result = grid_result[grid_result['mean_test_loss_score'] == grid_result['mean_test_loss_score'].min()][['params','mean_test_threshold','mean_test_loss_score','mean_test_precision','mean_test_recall','mean_test_f1']]
grid_result
#train model with the best hyperparameter
= RandomForestClassifier(**grid_result.iloc[0,0])
rf
#train model with train data
= rf.fit(X_train, y_train)
rf
#evaluate the model
= my_scorer_2(rf, X_test, y_test, grid_result.iloc[0,1])
mb1_rf = np.append(mb1_rf, grid_result.iloc[0,0]) mb1_rf
#XGBOOST MODELLING
#define parameter for tuning
= {
param_grid_xg 'learning_rate': [0.1, 0.01, 0.001],
'n_estimators': [100, 500],
'max_depth': [5, 10, 25],
'subsample': [0.8, 0.9, 1.0],
'colsample_bytree': [0.8, 0.9, 1.0]
}
#run grid search cv
= GridSearchCV(estimator = XGBClassifier(),
xg = param_grid_xg,
param_grid =5,
cv= grid_scoring, refit=False, n_jobs = -1)
scoring
#fit grid search cv with train data
xg.fit(X_train, y_train)
#selecting the best parameter and metrics
= pd.DataFrame(xg.cv_results_)
grid_result = grid_result[grid_result['mean_test_loss_score'] == grid_result['mean_test_loss_score'].min()][['params','mean_test_threshold','mean_test_loss_score','mean_test_precision','mean_test_recall','mean_test_f1']]
grid_result
#train model with the best hyperparameter
= XGBClassifier(**grid_result.iloc[0,0])
xg
#train model with train data
= xg.fit(X_train, y_train)
xg
#evaluate the model
= my_scorer_2(xg, X_test, y_test, grid_result.iloc[0,1])
mb1_xg = np.append(mb1_xg, grid_result.iloc[0,0]) mb1_xg
#LOGISTIC REGRESSION MODELLING
#define parameter for tuning
= {
param_grid_lg 'penalty': ['l1', 'l2'],
'C': [0.1, 1.0, 10.0],
'solver': ['liblinear'],
'max_iter': [50,100,200]
}
#run grid search cv
= GridSearchCV(estimator = LogisticRegression(),
lg = param_grid_lg,
param_grid =5,
cv= grid_scoring, refit=False, n_jobs = -1)
scoring
#fit grid search cv with train data
lg.fit(X_train, y_train)
#selecting the best parameter and metrics
= pd.DataFrame(lg.cv_results_)
grid_result = grid_result[grid_result['mean_test_loss_score'] == grid_result['mean_test_loss_score'].min()][['params','mean_test_threshold','mean_test_loss_score','mean_test_precision','mean_test_recall','mean_test_f1']]
grid_result
#train model with the best hyperparameter
= LogisticRegression(**grid_result.iloc[0,0])
lg
#train model with train data
= lg.fit(X_train, y_train)
lg
#evaluate the model
= my_scorer_2(lg, X_test, y_test, grid_result.iloc[0,1])
mb1_lg = np.append(mb1_lg, grid_result.iloc[0,0]) mb1_lg
= pd.DataFrame([mb1_rf,mb1_xg,mb1_lg])
result_mb1 = ['threshold','score','precision','recall','f1','params']
result_mb1.columns 'model'] = ['MB1_RF','MB1_XG','MB1_Log_Reg']
result_mb1[ result_mb1
threshold | score | precision | recall | f1 | params | model | |
---|---|---|---|---|---|---|---|
0 | 0.24 | 539.0 | 0.473101 | 0.828255 | 0.602216 | {'bootstrap': True, 'max_depth': 25, 'min_samp... | MB1_RF |
1 | 0.20 | 558.0 | 0.449190 | 0.844875 | 0.586538 | {'colsample_bytree': 0.8, 'learning_rate': 0.0... | MB1_XG |
2 | 0.22 | 541.0 | 0.467492 | 0.836565 | 0.599801 | {'C': 0.1, 'max_iter': 50, 'penalty': 'l2', 's... | MB1_Log_Reg |
Model building 2 / Hyperparameter tuning + threshold tuning + SMOTE
#RANDOM FOREST MODELLING
#define parameter for tuning
= {
param_grid_rf_smote 'class__n_estimators': [250 , 400],
'class__max_depth': [10, 25, 50],
'class__min_samples_split': [25, 50, 70, 120],
'class__min_samples_leaf': [50, 75, 120],
'class__bootstrap' : [True, False]
}
#create imbalanced pipeline to SMOTE
= Pipeline([
pipelinerf 'sampling', SMOTE()),
('class', RandomForestClassifier())])
(
#run grid search cv
= GridSearchCV(estimator = pipelinerf,
rf = param_grid_rf_smote,
param_grid =5,
cv= grid_scoring, refit=False, n_jobs = -1)
scoring
#fit grid search cv with train data
rf.fit(X_train, y_train)
#selecting the best parameter and metrics
= pd.DataFrame(rf.cv_results_)
grid_result = grid_result[grid_result['mean_test_loss_score'] == grid_result['mean_test_loss_score'].min()][['params','mean_test_threshold','mean_test_loss_score','mean_test_precision','mean_test_recall','mean_test_f1']]
grid_result = grid_result.iloc[0,0]
params
= {}
params_update for key, value in params.items():
= key.replace('class__', '')
new_key = value
params_update[new_key] = params_update
params
#train model with the best hyperparameter
= RandomForestClassifier(**params)
rf
#train model with SMOTE train data
= SMOTE().fit_resample(X_train, y_train)
X_train1, y_train1 = rf.fit(X_train1, y_train1)
rf
#evaluate the model
= my_scorer_2(rf, X_test, y_test, grid_result.iloc[0,1])
mb2_rf = np.append(mb2_rf, params) mb2_rf
#XGBOOST MODELLING
#define parameter for tuning
= {
param_grid_xg_smote 'class__learning_rate': [0.1, 0.01, 0.001],
'class__n_estimators': [100, 500],
'class__max_depth': [5, 10, 25],
'class__subsample': [0.8, 0.9, 1.0],
'class__colsample_bytree': [0.8, 0.9, 1.0]
}
#create imbalanced pipeline to SMOTE
= Pipeline([
pipelinexg 'sampling', SMOTE()),
('class', XGBClassifier())])
(
#run grid search cv
= GridSearchCV(estimator = pipelinexg,
xg = param_grid_xg_smote,
param_grid =5,
cv= grid_scoring, refit=False, n_jobs = -1)
scoring
#fit grid search cv with train data
xg.fit(X_train, y_train)
#selecting the best parameter and metrics
= pd.DataFrame(xg.cv_results_)
grid_result = grid_result[grid_result['mean_test_loss_score'] == grid_result['mean_test_loss_score'].min()][['params','mean_test_threshold','mean_test_loss_score','mean_test_precision','mean_test_recall','mean_test_f1']]
grid_result = grid_result.iloc[0,0]
params
= {}
params_update for key, value in params.items():
= key.replace('class__', '')
new_key = value
params_update[new_key] = params_update
params
#train model with the best hyperparameter
= XGBClassifier(**params)
xg
#train model with SMOTE train data
= SMOTE().fit_resample(X_train, y_train)
X_train1, y_train1 = xg.fit(X_train1, y_train1)
xg
#evaluate the model
= my_scorer_2(xg, X_test, y_test, grid_result.iloc[0,1])
mb2_xg = np.append(mb2_xg, params) mb2_xg
#LOGISTIC REGRESSION MODELLING
#define parameter for tuning
= {
param_grid_lg_smote 'class__penalty': ['l1', 'l2'],
'class__C': [0.1, 1.0, 10.0],
'class__solver': ['liblinear'],
'class__max_iter': [50,100,200]
}
#create imbalanced pipeline to SMOTE
= Pipeline([
pipelinelg 'sampling', SMOTE()),
('class', LogisticRegression())])
(
#run grid search cv
= GridSearchCV(estimator = pipelinelg,
lg = param_grid_lg_smote,
param_grid =5,
cv= grid_scoring, refit=False, n_jobs = -1)
scoring
#fit grid search cv with train data
lg.fit(X_train, y_train)
#selecting the best parameter and metrics
= pd.DataFrame(lg.cv_results_)
grid_result = grid_result[grid_result['mean_test_loss_score'] == grid_result['mean_test_loss_score'].min()][['params','mean_test_threshold','mean_test_loss_score','mean_test_precision','mean_test_recall','mean_test_f1']]
grid_result = grid_result.iloc[0,0]
params
= {}
params_update for key, value in params.items():
= key.replace('class__', '')
new_key = value
params_update[new_key] = params_update
params
#train model with the best hyperparameter
= LogisticRegression(**params)
lg
#train model with SMOTE train data
= SMOTE().fit_resample(X_train, y_train)
X_train1, y_train1 = lg.fit(X_train1, y_train1)
lg
#evaluate the model
= my_scorer_2(lg, X_test, y_test, grid_result.iloc[0,1])
mb2_lg = np.append(mb2_lg, params) mb2_lg
= pd.DataFrame([mb2_rf,mb2_xg,mb2_lg])
result_mb2 = ['threshold','score','precision','recall','f1','params']
result_mb2.columns 'model'] = ['MB2_RF','MB2_XG','MB2_Log_Reg']
result_mb2[ result_mb2
threshold | score | precision | recall | f1 | params | model | |
---|---|---|---|---|---|---|---|
0 | 0.42 | 557.0 | 0.454955 | 0.839335 | 0.590068 | {'bootstrap': False, 'max_depth': 25, 'min_sam... | MB2_RF |
1 | 0.44 | 546.0 | 0.472266 | 0.825485 | 0.600806 | {'colsample_bytree': 1.0, 'learning_rate': 0.0... | MB2_XG |
2 | 0.48 | 546.0 | 0.472178 | 0.822715 | 0.600000 | {'C': 1.0, 'max_iter': 50, 'penalty': 'l2', 's... | MB2_Log_Reg |
Model building 3 / Hyperparameter tuning + threshold tuning + custom weight
#RANDOM FOREST MODELLING
#define parameter for tuning
= {
param_grid_rf 'n_estimators': [250 , 400],
'max_depth': [10, 25, 50],
'min_samples_split': [25, 50, 70, 120],
'min_samples_leaf': [50, 75, 120],
'bootstrap' : [True, False]
}
#run grid search cv
= GridSearchCV(estimator = RandomForestClassifier(),
rf = param_grid_rf,
param_grid =5,
cv= grid_scoring, refit=False, n_jobs = -1)
scoring
#fit grid search cv with train data
= weight)
rf.fit(X_train, y_train, sample_weight
#selecting the best parameter and metrics
= pd.DataFrame(rf.cv_results_)
grid_result = grid_result[grid_result['mean_test_loss_score'] == grid_result['mean_test_loss_score'].min()][['params','mean_test_threshold','mean_test_loss_score','mean_test_precision','mean_test_recall','mean_test_f1']]
grid_result
#train model with the best hyperparameter
= RandomForestClassifier(**grid_result.iloc[0,0])
rf
#train model with train data
= rf.fit(X_train, y_train, sample_weight = weight)
rf
#evaluate the model
= my_scorer_2(rf, X_test, y_test, grid_result.iloc[0,1])
mb3_rf = np.append(mb3_rf, grid_result.iloc[0,0]) mb3_rf
#XGBOOST MODELLING
#define parameter for tuning
= {
param_grid_xg 'learning_rate': [0.1, 0.01, 0.001],
'n_estimators': [100, 500],
'max_depth': [5, 10, 25],
'subsample': [0.8, 0.9, 1.0],
'colsample_bytree': [0.8, 0.9, 1.0]
}
#run grid search cv
= GridSearchCV(estimator = XGBClassifier(),
xg = param_grid_xg,
param_grid =5,
cv= grid_scoring, refit=False, n_jobs = -1)
scoring
#fit grid search cv with train data
= weight)
xg.fit(X_train, y_train, sample_weight
#selecting the best parameter and metrics
= pd.DataFrame(xg.cv_results_)
grid_result = grid_result[grid_result['mean_test_loss_score'] == grid_result['mean_test_loss_score'].min()][['params','mean_test_threshold','mean_test_loss_score','mean_test_precision','mean_test_recall','mean_test_f1']]
grid_result
#train model with the best hyperparameter
= XGBClassifier(**grid_result.iloc[0,0])
xg
#train model with train data
= xg.fit(X_train, y_train, sample_weight = weight)
xg
#evaluate the model
= my_scorer_2(xg, X_test, y_test, grid_result.iloc[0,1])
mb3_xg = np.append(mb3_xg, grid_result.iloc[0,0]) mb3_xg
#LOGISTIC REGRESSION MODELLING
#define parameter for tuning
= {
param_grid_lg 'penalty': ['l1', 'l2'],
'C': [0.1, 1.0, 10.0],
'solver': ['liblinear'],
'max_iter': [50,100,200]
}
#run grid search cv
= GridSearchCV(estimator = LogisticRegression(),
lg = param_grid_lg,
param_grid =5,
cv= grid_scoring, refit=False, n_jobs = -1)
scoring
#fit grid search cv with train data
= weight)
lg.fit(X_train, y_train, sample_weight
#selecting the best parameter and metrics
= pd.DataFrame(lg.cv_results_)
grid_result = grid_result[grid_result['mean_test_loss_score'] == grid_result['mean_test_loss_score'].min()][['params','mean_test_threshold','mean_test_loss_score','mean_test_precision','mean_test_recall','mean_test_f1']]
grid_result
#train model with the best hyperparameter
= LogisticRegression(**grid_result.iloc[0,0])
lg
#train model with train data
= lg.fit(X_train, y_train, sample_weight = weight)
lg
#evaluate the model
= my_scorer_2(lg, X_test, y_test, grid_result.iloc[0,1])
mb3_lg = np.append(mb3_lg, grid_result.iloc[0,0]) mb3_lg
= pd.DataFrame([mb3_rf,mb3_xg,mb3_lg])
result_mb3 = ['threshold','score','precision','recall','f1','params']
result_mb3.columns 'model'] = ['MB3_RF','MB3_XG','MB3_Log_Reg']
result_mb3[ result_mb3
threshold | score | precision | recall | f1 | params | model | |
---|---|---|---|---|---|---|---|
0 | 0.48 | 549.0 | 0.456193 | 0.836565 | 0.590420 | {'bootstrap': True, 'max_depth': 25, 'min_samp... | MB3_RF |
1 | 0.48 | 551.0 | 0.465409 | 0.819945 | 0.593781 | {'colsample_bytree': 0.8, 'learning_rate': 0.0... | MB3_XG |
2 | 0.44 | 571.0 | 0.433803 | 0.853186 | 0.575163 | {'C': 0.1, 'max_iter': 50, 'penalty': 'l2', 's... | MB3_Log_Reg |
= pd.concat([result_mb1, result_mb2, result_mb3]).sort_values('score')
final_result final_result
threshold | score | precision | recall | f1 | params | model | |
---|---|---|---|---|---|---|---|
0 | 0.24 | 539.0 | 0.473101 | 0.828255 | 0.602216 | {'bootstrap': True, 'max_depth': 25, 'min_samp... | MB1_RF |
2 | 0.22 | 541.0 | 0.467492 | 0.836565 | 0.599801 | {'C': 0.1, 'max_iter': 50, 'penalty': 'l2', 's... | MB1_Log_Reg |
1 | 0.44 | 546.0 | 0.472266 | 0.825485 | 0.600806 | {'colsample_bytree': 1.0, 'learning_rate': 0.0... | MB2_XG |
2 | 0.48 | 546.0 | 0.472178 | 0.822715 | 0.600000 | {'C': 1.0, 'max_iter': 50, 'penalty': 'l2', 's... | MB2_Log_Reg |
0 | 0.48 | 549.0 | 0.456193 | 0.836565 | 0.590420 | {'bootstrap': True, 'max_depth': 25, 'min_samp... | MB3_RF |
1 | 0.48 | 551.0 | 0.465409 | 0.819945 | 0.593781 | {'colsample_bytree': 0.8, 'learning_rate': 0.0... | MB3_XG |
0 | 0.42 | 557.0 | 0.454955 | 0.839335 | 0.590068 | {'bootstrap': False, 'max_depth': 25, 'min_sam... | MB2_RF |
1 | 0.20 | 558.0 | 0.449190 | 0.844875 | 0.586538 | {'colsample_bytree': 0.8, 'learning_rate': 0.0... | MB1_XG |
2 | 0.44 | 571.0 | 0.433803 | 0.853186 | 0.575163 | {'C': 0.1, 'max_iter': 50, 'penalty': 'l2', 's... | MB3_Log_Reg |
Analysis of how the models performed:
1.RandomForest performs best with hyperparameter tuning and a lower decision threshold. However, this model performs worst when using SMOTE.
2.LogisticRegression performs worst when using ‘sample weighting’.
3.XGBoost performs best when using SMOTE.
4.It’s important to note that all models produce similar results when using their best parameters and conditions.
5.In my opinion, the greatest impact is achieved by using hyperparameter tuning and decision threshold tuning, rather than using SMOTE and weighting techniques.
Model Interpretation
In this section, I want to show you how to interpret a tree-based model, such as Random Forest, so we can have a better understanding of how the model actually works.
#selecting the best parameter for random forest
= final_result[final_result['model'] == 'MB1_RF']['params'][0]
rf_param rf_param
{'bootstrap': True,
'max_depth': 25,
'min_samples_leaf': 50,
'min_samples_split': 50,
'n_estimators': 250}
= RandomForestClassifier(**rf_param)
rf rf.fit(X_train, y_train)
RandomForestClassifier(max_depth=25, min_samples_leaf=50, min_samples_split=50, n_estimators=250)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestClassifier(max_depth=25, min_samples_leaf=50, min_samples_split=50, n_estimators=250)
Partial dependence plot (PDP) & Individual conditional expectation (ICE)
What is ICE? It is a plot that shows how a model makes predictions based on changing the value of one or more features, while keeping the values of other features constant. This provides us with more insights and understanding of how the model treats features to make predictions. ICE works per row (or per customer in this case), and PDP is simply the average of ICE.
def pdp_ice_plot(model, df_test, column, clusters=True):
= df_test.copy()
df_test = pdp.PDPIsolate(model = model, df = df_test,
pdp_isolate =15,
num_grid_points= df_test.columns,
model_features = column, feature_name=column)
feature
pdp_isolate.plot(=True, plot_lines=True,
center=clusters, n_cluster_centers=5,cluster_method='accurate',
cluster=False, to_bins=True,
plot_pts_dist=False, which_classes=None,
show_percentile=(10,6), dpi=300,
figsize=2, plot_params={"pdp_hl": True},
ncols='matplotlib', template='plotly_white')
engine
plt.show()
'MonthlyCharges', False) pdp_ice_plot(rf, X_test,
obtain pred_func from the provided model.
The light blue lines represent ICE (Individual Conditional Expectation), and the yellowish blue line represents PDP (Partial Dependence Plot). The X-axis represents MonthlyCharges, while the Y-axis represents the change in prediction probability. At MonthlyCharges of 60.7 USD, you can observe that some customers experience a significant increase in churn probability as the MonthlyCharges increase. However, it is important to note that not all customers have the same response. Some customers are minimally affected, and some may not be affected at all.
'MonthlyCharges') pdp_ice_plot(rf, X_test,
obtain pred_func from the provided model.
This is the same plot as above, but I have grouped the ICE into 5 clusters for easier viewing and analysis. You can see that there are some customers who experience a significant increase in churn probability as the MonthlyCharges increase.
'tenure') pdp_ice_plot(rf, X_test,
obtain pred_func from the provided model.
The longer the tenure, the lower the churn probability. However, the effect is not the same for all customers. Some customers are greatly affected, while others are barely affected.
'Contract') pdp_ice_plot(rf, X_test,
obtain pred_func from the provided model.
The same also goes with Contract. Longer contract means lower churn probability.
= X_test.copy()
X_test_copy = pdp.PDPInteract(model=rf, df=X_test_copy,
pdp_interact =6,
num_grid_points= X_test.columns,
model_features =['MonthlyCharges','tenure'],
features=['MonthlyCharges','tenure'])
feature_names
pdp_interact.plot(="grid", plot_pdp=True,
plot_type=True, show_percentile=False,
to_bins=None, figsize=(10,6),
which_classes=300, ncols=2, plot_params=None,
dpi='matplotlib', template='plotly_white',
engine
) plt.show()
obtain pred_func from the provided model.
obtain pred_func from the provided model.
obtain pred_func from the provided model.
Let’s analyze the combination of MonthlyCharges and tenure. We can observe a spike in churn probability for MonthlyCharges ranging from 56.3 USD to 79.2 USD, particularly for customers with a tenure of less than 7 months.
= X_test.copy()
X_test_copy = pdp.PDPInteract(model=rf, df=X_test_copy,
pdp_interact =10,
num_grid_points= X_test.columns,
model_features =['tenure','Contract'],
features=['tenure','Contract'])
feature_names
pdp_interact.plot(="grid", plot_pdp=True,
plot_type=True, show_percentile=False,
to_bins=None, figsize=(10,6),
which_classes=300, ncols=2, plot_params=None,
dpi='matplotlib', template='plotly_white',
engine
) plt.show()
obtain pred_func from the provided model.
obtain pred_func from the provided model.
obtain pred_func from the provided model.
The model produces similar churn probabilities for customers with a combination of a 2-year contract and low tenure compared to those with a month-to-month contract and medium tenure (24-36 months).
You can see how PDP and ICE plots can be very beneficial in understanding how the model utilizes features to make predictions. In the next section, I will demonstrate how to assess the model’s prediction confidence level.
Confidence level based on trees’ standard deviation and confidence interval.
Tree-based models like RandomForest make predictions by using the mean of all the trees’ prediction probabilities. However, instead of solely relying on the mean, we can also calculate the standard deviation. A higher standard deviation indicates lower confidence in the predictions. Additionally, we can utilize confidence intervals, such as 95% or even more extreme at 99%.
#extract all trees' prediction probability per row
= np.stack([x.predict_proba(X_test)[:,1] for x in rf.estimators_]) predict
#assign mean and std. deviation of trees' prediction probability.
= X_test.copy()
df_pred 'avg'] = np.round(np.mean(predict, axis = 0) * 100, 2)
df_pred['std_dev'] = np.round(np.std(predict, axis = 0) * 100, 2) df_pred[
=(10,6))
plt.figure(figsize'std_dev'],color='skyblue', kde=True, edgecolor='none')
sns.histplot(df_pred['Standard deviation distribution', weight='bold')
plt.title( plt.show()
Most of predictions have std.deviation under 10%. Let’s calculate confidence interval with 99%.
'CI-99%'] = (2.576 * df_pred['std_dev'] / np.sqrt(len(predict))) * 100 / (df_pred['avg']) df_pred[
> 40].sort_values('CI-99%', ascending=False).head(5) df_pred[df_pred.avg
Contract | tenure | Fiber_optic | MonthlyCharges | FO_LB | No_internet | avg | std_dev | CI-99% | |
---|---|---|---|---|---|---|---|---|---|
852 | 2 | 7 | 1 | 94.05 | 1 | 0 | 40.06 | 25.08 | 10.199818 |
499 | 1 | 58 | 1 | 98.70 | 1 | 0 | 41.05 | 18.24 | 7.239149 |
1186 | 1 | 59 | 1 | 101.10 | 1 | 0 | 40.37 | 16.73 | 6.751699 |
1137 | 1 | 15 | 1 | 96.30 | 0 | 0 | 48.14 | 19.69 | 6.663701 |
1184 | 1 | 10 | 1 | 92.50 | 0 | 0 | 48.17 | 19.12 | 6.466765 |
Let’s consider the example of row 1. The model predicts a 40% probability of churn for the customer, with a confidence interval of +- 10%. By default, the model’s output indicates that the customer will not churn. However, due to the high confidence interval, it is safer to assume that the customer will churn.
Checking the standard deviation and confidence interval of the trees is extremely useful, particularly when the cost of ‘False Negative’ is significant and can have severe consequences.