Tangerang house price prediction

Predicting house prices in Tangerang, a city with more than 40 regions, based on a real dataset scraped from one of the largest online house listing sites in Indonesia.
regression
machine-learning
data-science
web-scrap
Author

Vertikal

Published

October 10, 2023

This is my first end-to-end project with a scraped dataset. I used Scrapy as my scraping tool to collect data from one of the largest house listing websites in Indonesia. The most challenging part of this project was cleaning the data, as it had many missing values that required careful imputation.

One of the most important lessons I learned in this project is the significance of sample selection. When training a machine learning model, we typically split the data into a training and a test set. If the model performs well on both sets, we may consider it a good model, and that could be true. However, when we apply the model to new test data drawn from a different sample, the results may fall short of our expectations. This issue doesn’t stem from the machine learning algorithm itself but from the initial sample selection, which may not accurately represent the population.
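
As a small illustration of this point, the sketch below compares the error on a held-out split of the same scraped sample against the error on an independently collected sample. This is hypothetical code only: `df` (the scraped sample), `df_new` (the independent sample), and the feature subset are illustrative, not the project's actual variables.

Code
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

features = ['LT', 'LB']  #hypothetical feature subset
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df['price'], test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)

#error on a held-out split drawn from the SAME scraped sample
mae_split = mean_absolute_error(y_test, model.predict(X_test))
#error on data collected independently; a large gap suggests the
#original sample does not represent the population well
mae_new = mean_absolute_error(df_new['price'], model.predict(df_new[features]))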

For the Scrapy installation and setup used in this project, please visit: Github link

Presentation slide: (go full screen by clicking the 3-dot option button and choosing ‘Full Screen’)

Table Of Contents

• Introduction
• Data Preparation
• Basic exploratory data analysis
    • Descriptive statistics
    • Univariate analysis
    • Multivariate analysis
• Deep-dive exploratory data analysis
• Modelling
    • Model building combination 1
    • Model building combination 2
    • Model building combination 3
    • Model building combination 4
    • Model building combination 5
    • Model building combination 6
• Choosing the best model combination
• Model evaluation and interpretation
    • Residuals plot
    • PDP & ICE plots
• Conclusion

Introduction

This project revolves around predicting house prices in Tangerang City, which encompasses more than 40 regions. The data was gathered by scraping one of Indonesia’s largest online real estate listing platforms. After this exploratory phase, I plan to deploy a public application to help people estimate house prices.

Data Preparation

Importing libraries, cleaning data and choosing features.

Code
#import libraries
import numpy as np
import pandas as pd
import ast
import re
from sklearn.impute import KNNImputer
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
import warnings
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error, mean_absolute_percentage_error
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures
from xgboost import XGBRegressor
import xgboost as xgb
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
import plotly.express as px
import matplotlib.patches as mpatches
from pdpbox import pdp

#set some library settings
warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', None)
sns.set()
sns.set_style('darkgrid')
mpl.rcParams['figure.dpi'] = 150 
plt.rc('legend',**{'fontsize':10})
Code
#read csv
df = pd.read_csv('tangerang.csv')
Code
#first look of data
df.head()
harga alamat fasilitas spesifikasi
0 Rp 2,6 Miliar BSD City, Tangerang Tempat Jemuran,Keamanan 24 jam,Playground,Wast... [['Kamar Tidur', '3'], ['Kamar Mandi', '2'], [...
1 Rp 20 Miliar BSD City, Tangerang NaN [['Kamar Tidur', '5'], ['Kamar Mandi', '5'], [...
2 Rp 1,68 Miliar BSD City, Tangerang Taman,Tempat Jemuran,Keamanan 24 jam [['Kamar Tidur', '2'], ['Kamar Mandi', '2'], [...
3 Rp 1,61 Miliar BSD City, Tangerang Taman,Tempat Jemuran,Keamanan 24 jam [['Kamar Tidur', '4'], ['Kamar Mandi', '3'], [...
4 Rp 2,14 Miliar BSD City, Tangerang Taman,Tempat Jemuran,Keamanan 24 jam [['Kamar Tidur', '3'], ['Kamar Mandi', '3'], [...
Code
#change 'spesifikasi' to type list and explode it to become columns
df['spesifikasi'] = df.spesifikasi.apply(ast.literal_eval)
df = df.explode('spesifikasi')
Code
#split into 'keterangan' and 'qty'
df['keterangan'] = df.spesifikasi.str[0]
df['qty'] = df.spesifikasi.str[1]
Code
df_columns = df[['harga','alamat','fasilitas']].reset_index().drop_duplicates('index').set_index('index')
df_pivot = df.pivot(columns='keterangan', values='qty').rename_axis(None, axis=1)
df = pd.concat([df_columns, df_pivot], axis = 1)

The code above transforms the ‘keterangan’ entries into separate columns.

Code
#new transformed dataframe
df.head()
harga alamat fasilitas Carport Dapur Daya Listrik Garasi Hadap Hook ID Iklan Jumlah Lantai Kamar Mandi Kamar Mandi Pembantu Kamar Pembantu Kamar Tidur Kondisi Perabotan Kondisi Properti Konsep dan Gaya Rumah Lebar Jalan Luas Bangunan Luas Tanah Material Bangunan Material Lantai Nomor Lantai Pemandangan Periode Sewa Ruang Makan Ruang Tamu Sertifikat Sumber Air Tahun Di Renovasi Tahun Dibangun Terjangkau Internet Tipe Properti
0 Rp 2,6 Miliar BSD City, Tangerang Tempat Jemuran,Keamanan 24 jam,Playground,Wast... 1 NaN NaN NaN NaN Tidak hos14814642 1 2 2 1 3 NaN Bagus Minimalis Modern 3 Mobil 91 m² 91 m² Batako Granit NaN Taman Kota NaN Ya Ya Lainnya (PPJB,Girik,Adat,dll) PAM atau PDAM NaN NaN Ya Rumah
1 Rp 20 Miliar BSD City, Tangerang NaN NaN 2 7700 Watt NaN NaN Tidak hos13101318 2 5 1 1 5 Unfurnished Bagus NaN NaN 465 m² 396 m² NaN NaN NaN NaN NaN Ya Ya Lainnya (PPJB,Girik,Adat,dll) PAM atau PDAM NaN NaN Tidak Rumah
2 Rp 1,68 Miliar BSD City, Tangerang Taman,Tempat Jemuran,Keamanan 24 jam 1 1 1300 Watt NaN Utara Tidak hos14850708 1 2 NaN NaN 2 Unfurnished Bagus NaN NaN 70 m² 105 m² NaN NaN NaN NaN NaN Ya Ya HGB - Hak Guna Bangunan PAM atau PDAM NaN NaN Tidak Rumah
3 Rp 1,61 Miliar BSD City, Tangerang Taman,Tempat Jemuran,Keamanan 24 jam 1 1 2200 Watt NaN Timur Laut Tidak hos14850789 2 3 NaN NaN 4 Unfurnished Bagus NaN NaN 174 m² 144 m² NaN NaN NaN NaN NaN Ya Ya SHM - Sertifikat Hak Milik PAM atau PDAM NaN NaN Tidak Rumah
4 Rp 2,14 Miliar BSD City, Tangerang Taman,Tempat Jemuran,Keamanan 24 jam 1 1 5500 Watt NaN Utara Tidak hos14003997 3 3 NaN 1 3 Unfurnished Bagus NaN NaN 170 m² 105 m² NaN NaN NaN NaN NaN Ya Ya SHM - Sertifikat Hak Milik PAM atau PDAM NaN NaN Tidak Rumah
Code
#remove duplicate data / 'iklan' (ads)
df = df.drop_duplicates('ID Iklan').reset_index(drop=True)
Code
df.insert(3, 'komplek', df[['fasilitas']].fillna('none').apply(lambda z: 'ya' if any([x in z.fasilitas.lower() for x in ['lapangan','gym','jogging','playground','one gate system']]) else 'tidak', axis = 1))
df.insert(1, 'Kecamatan', df.alamat.str.split(',').str[0])
df.insert(2, 'Kota', df.alamat.str.split(',').str[1])
df.drop(columns=['alamat','fasilitas'], inplace=True)

Create three new columns:
1. ‘komplek’ = whether the house is inside a housing complex (‘komplek’) or not.
2. ‘Kecamatan’ = the district (‘kecamatan’) part of the address.
3. ‘Kota’ = the city (‘kota’) part of the address.

Code
#ensure the selected city is Tangerang
df = df[df.Kota == ' Tangerang']
df.drop(columns = ['Kota'], inplace=True)
Code
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 22667 entries, 0 to 22673
Data columns (total 34 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   harga                  22667 non-null  object
 1   Kecamatan              22667 non-null  object
 2   komplek                22667 non-null  object
 3   Carport                14104 non-null  object
 4   Dapur                  12157 non-null  object
 5   Daya Listrik           18287 non-null  object
 6   Garasi                 6450 non-null   object
 7   Hadap                  8528 non-null   object
 8   Hook                   18764 non-null  object
 9   ID Iklan               22667 non-null  object
 10  Jumlah Lantai          20569 non-null  object
 11  Kamar Mandi            21871 non-null  object
 12  Kamar Mandi Pembantu   8945 non-null   object
 13  Kamar Pembantu         9597 non-null   object
 14  Kamar Tidur            21864 non-null  object
 15  Kondisi Perabotan      16139 non-null  object
 16  Kondisi Properti       19797 non-null  object
 17  Konsep dan Gaya Rumah  9493 non-null   object
 18  Lebar Jalan            12045 non-null  object
 19  Luas Bangunan          22624 non-null  object
 20  Luas Tanah             22658 non-null  object
 21  Material Bangunan      7709 non-null   object
 22  Material Lantai        8380 non-null   object
 23  Nomor Lantai           0 non-null      object
 24  Pemandangan            9987 non-null   object
 25  Periode Sewa           1 non-null      object
 26  Ruang Makan            11288 non-null  object
 27  Ruang Tamu             18765 non-null  object
 28  Sertifikat             22489 non-null  object
 29  Sumber Air             13236 non-null  object
 30  Tahun Di Renovasi      2856 non-null   object
 31  Tahun Dibangun         8878 non-null   object
 32  Terjangkau Internet    18761 non-null  object
 33  Tipe Properti          22667 non-null  object
dtypes: object(34)
memory usage: 6.1+ MB

There are 29 columns with missing values. After carefully reviewing the listings in context, I categorized these columns into three groups (a quick missingness summary is sketched after the list):

1. Missing Completely At Random
The values were not entered by users due to accident, forgetfulness, or oversight. These columns are:
Daya Listrik, Kamar Mandi, Kamar Tidur, Luas Bangunan, Luas Tanah, Sertifikat, Sumber Air, Hook

2. Missing At Random
The values were not entered by users because they can be obtained from other sources, such as photos or other listing information. These columns are:
Carport, Dapur, Garasi, Kamar Mandi Pembantu, Kamar Pembantu, Kondisi Perabotan, Kondisi Properti, Jumlah Lantai, Ruang Makan, Ruang Tamu

3. Missing Not At Random
The values were not entered by users because they either do not know the value or consider it unimportant to enter. These columns are:
Konsep dan Gaya Rumah, Hadap, Lebar Jalan, Material Bangunan, Material Lantai, Nomor Lantai, Pemandangan, Periode Sewa, Tahun Di Renovasi, Tahun Dibangun, Terjangkau Internet
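
As a sanity check before choosing imputation strategies, the share of missing values per column can be summarized directly; a minimal sketch on the dataframe shown above:

Code
#share of missing values per column, highest first (sketch)
missing_pct = df.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct[missing_pct > 0].round(1))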

Code
#create a function to categorize multiple 'perabot' conditions into more general category
def perabot(kondisi):
    kondisi = str(kondisi)
    if kondisi.lower() in ['unfurnished','butuh renovasi']:
        return 'Unfurnished'
    elif kondisi.lower() in ['furnished','bagus','bagus sekali','baru','sudah renovasi','semi furnished']:
        return 'Furnished'
    else:
        return np.nan

df['Kondisi Perabotan'] = df[['Kondisi Perabotan']].apply(lambda x: perabot(x['Kondisi Perabotan']), axis = 1)
Code
#keep only houses, excluding other property types
df = df[df['Tipe Properti'] == 'Rumah']
Code
#extract prices from column 'harga' into integer
def price_extract(price):
    if "Triliun" in price:
        numbers = re.findall(r'\d+\.\d+|\d+', price)
        numbers = float(".".join(numbers))
        return int(numbers * 1000000)
    elif "Miliar" in price:
        numbers = re.findall(r'\d+\.\d+|\d+', price)
        numbers = float(".".join(numbers))
        return int(numbers * 1000)
    elif "Juta" in price:
        numbers = re.findall(r'\d+\.\d+|\d+', price)
        numbers = float(".".join(numbers))
        return int(numbers * 1)      

df.insert(1, 'price',  df[['harga']].apply(lambda x: price_extract(x.harga), axis = 1))
df.drop(columns = ['harga'], inplace = True)

Here, I converted the ‘harga’ strings into integer prices expressed in millions of Rupiah (the Rupiah price divided by one million), which makes the values easier to view and analyze.
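
To make the unit conversion concrete, here is a quick check of price_extract on illustrative listing strings (the resulting prices are in millions of Rupiah):

Code
print(price_extract('Rp 2,6 Miliar'))  # -> 2600
print(price_extract('Rp 20 Miliar'))   # -> 20000
print(price_extract('Rp 750 Juta'))    # -> 750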

Code
#move 'ID Iklan' column to left
df.insert(3, 'ID',  df['ID Iklan'])
Code
#filter Sertifikat to SHM only
df = df[df.Sertifikat == 'SHM - Sertifikat Hak Milik']
Code
#drop useless columns
df.drop(columns=['Kondisi Properti','Sertifikat','Hook','ID Iklan','Lebar Jalan','Dapur','Tipe Properti','Terjangkau Internet','Sumber Air','Ruang Tamu','Ruang Makan','Carport','Garasi','Tahun Di Renovasi','Tahun Dibangun','Hadap','Konsep dan Gaya Rumah','Material Bangunan','Material Lantai','Nomor Lantai','Pemandangan','Periode Sewa'], inplace = True, errors='ignore')
Code
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 17746 entries, 3 to 22673
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   price                 17746 non-null  int64 
 1   Kecamatan             17746 non-null  object
 2   komplek               17746 non-null  object
 3   ID                    17746 non-null  object
 4   Daya Listrik          14640 non-null  object
 5   Jumlah Lantai         16312 non-null  object
 6   Kamar Mandi           17110 non-null  object
 7   Kamar Mandi Pembantu  6714 non-null   object
 8   Kamar Pembantu        7296 non-null   object
 9   Kamar Tidur           17103 non-null  object
 10  Kondisi Perabotan     12942 non-null  object
 11  Luas Bangunan         17719 non-null  object
 12  Luas Tanah            17743 non-null  object
dtypes: int64(1), object(12)
memory usage: 1.9+ MB
Code
df = df[~df['Luas Tanah'].isna()].reset_index(drop=True)
df = df[~((df['Kamar Mandi'].isna()) & (df['Kamar Mandi Pembantu'].isna()) & (df['Kamar Pembantu'].isna()) & (df['Kamar Tidur'].isna()))].reset_index(drop=True)

I removed rows that lacked critical information, such as ‘Luas Tanah’ and the total number of ‘Kamar’.

Code
for kol in ['Kamar Mandi','Kamar Tidur']:
    df[kol] = df[kol].fillna(1)

df['Luas Bangunan'] = df['Luas Bangunan'].fillna(df['Luas Tanah'])
df['Kamar Mandi Pembantu'] = df['Kamar Mandi Pembantu'].fillna(0)
df['Kamar Pembantu'] = df['Kamar Pembantu'].fillna(0)

The columns mentioned above have only a small number of missing values. I filled the missing ‘Kamar Mandi’ and ‘Kamar Tidur’ values with 1, since every house has at least one of each, and the missing ‘Kamar Pembantu’ and ‘Kamar Mandi Pembantu’ values with 0.

Code
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17149 entries, 0 to 17148
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   price                 17149 non-null  int64 
 1   Kecamatan             17149 non-null  object
 2   komplek               17149 non-null  object
 3   ID                    17149 non-null  object
 4   Daya Listrik          14208 non-null  object
 5   Jumlah Lantai         15731 non-null  object
 6   Kamar Mandi           17149 non-null  object
 7   Kamar Mandi Pembantu  17149 non-null  object
 8   Kamar Pembantu        17149 non-null  object
 9   Kamar Tidur           17149 non-null  object
 10  Kondisi Perabotan     12890 non-null  object
 11  Luas Bangunan         17149 non-null  object
 12  Luas Tanah            17149 non-null  object
dtypes: int64(1), object(12)
memory usage: 1.7+ MB
Code
#If LB (building area) is larger than LT (land area), I assume the house has more than 1 floor.
def conditions(x):
    if pd.isnull(x['Jumlah Lantai']):
        if x['Luas Bangunan'] > x['Luas Tanah']:
            return '2'
        else:
            return '1'
    else:
        return x['Jumlah Lantai']

df['Jumlah Lantai'] = df[['Jumlah Lantai','Luas Bangunan','Luas Tanah']].apply(lambda x: conditions(x) , axis = 1 )
Code
#replace ambiguous values in the 'Daya Listrik' column with NaN so they can be imputed with the KNN Imputer
condition = ((df['Daya Listrik'] == 'Lainnya Watt') | (df['Daya Listrik'] == 'lainnya Watt'))
df.loc[condition, 'Daya Listrik'] = np.nan
Code
#change data types
df['Daya Listrik'] = df['Daya Listrik'].str.replace('Watt','').astype('Float64')
df['Luas Bangunan'] = df['Luas Bangunan'].str.replace('m²','').astype(int)
df['Luas Tanah'] = df['Luas Tanah'].str.replace('m²','').astype(int)
Code
#impute with KNN
df['avg_bangunan'] = df.groupby('Daya Listrik')['Luas Bangunan'].transform('median').fillna(df['Luas Bangunan'])
listrik_impute = pd.DataFrame(KNNImputer(n_neighbors=1).fit_transform(df[['Daya Listrik','avg_bangunan']]))
df['Daya Listrik'] = listrik_impute[0]
df.drop(columns=['avg_bangunan'], inplace=True)
Code
#I selected only the 'Kecamatan' with more than 100 samples to ensure a reliable result
df = df[ df.groupby('Kecamatan')['price'].transform('count') > 100]
Code
df['price/m'] = df.price / (df['Luas Tanah'] + df['Luas Bangunan'])
Code
#create dictionary to map 'Kondisi Perabotan'
value_mapping = {
    'Unfurnished': 1,
    'Furnished' : 2,
}

reverse_mapping = {
    1.0 : 'Unfurnished',
    2.0 : 'Furnished',
}
Code
df['Kondisi Perabotan'] = df['Kondisi Perabotan'].map(value_mapping)
Code
for kec in df.Kecamatan.unique() :
    df1 = df[df.Kecamatan == kec][['Kondisi Perabotan','price/m']].copy().reset_index(drop=True)
    df1['price/m'] = df1.groupby(['Kondisi Perabotan'])['price/m'].transform('median').fillna(df1['price/m'])

    imputer = KNNImputer(n_neighbors = 1)
    imputer.fit(df1[['Kondisi Perabotan','price/m']])
    missing_values = df.loc[df[df.Kecamatan == kec].index,['Kondisi Perabotan','price/m']]
    df.loc[df[df.Kecamatan == kec].index,['Kondisi Perabotan','price/m']] = imputer.transform(missing_values)
        
df.reset_index(drop=True, inplace=True)
df['Kondisi Perabotan'] = df['Kondisi Perabotan'].map(reverse_mapping)

To impute missing values in ‘Kondisi Perabotan’, I first gather the median ‘price/m’ per ‘Kecamatan’ and furnishing condition. From this we can see that prices vary significantly by furnishing condition, with furnished properties considerably more expensive than unfurnished ones. The KNN Imputer then fills each missing value with the value of the nearest neighbor in ‘price/m’.

Code
df.drop(columns = ['price/m'], inplace = True)
Code
#change data types
df = df.convert_dtypes(convert_string = False)
for column in ['Kamar Mandi','Kamar Mandi Pembantu','Kamar Pembantu','Kamar Tidur','price','Daya Listrik','Luas Bangunan','Luas Tanah']:
    df[column] = df[column].astype(int)
Code
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17050 entries, 0 to 17049
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   price                 17050 non-null  int64 
 1   Kecamatan             17050 non-null  object
 2   komplek               17050 non-null  object
 3   ID                    17050 non-null  object
 4   Daya Listrik          17050 non-null  int64 
 5   Jumlah Lantai         17050 non-null  object
 6   Kamar Mandi           17050 non-null  int64 
 7   Kamar Mandi Pembantu  17050 non-null  int64 
 8   Kamar Pembantu        17050 non-null  int64 
 9   Kamar Tidur           17050 non-null  int64 
 10  Kondisi Perabotan     17050 non-null  object
 11  Luas Bangunan         17050 non-null  int64 
 12  Luas Tanah            17050 non-null  int64 
dtypes: int64(8), object(5)
memory usage: 1.7+ MB
Code
df.drop(columns = 'ID', inplace = True)
df.duplicated().sum()
2877

After getting rid of the ‘ID’ column, you’ll notice there are still lots of duplicates left. It seems like the same ads are being posted multiple times. Let’s go ahead and clean those up.

Code
df = df.drop_duplicates()
Code
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 14173 entries, 0 to 17049
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   price                 14173 non-null  int64 
 1   Kecamatan             14173 non-null  object
 2   komplek               14173 non-null  object
 3   Daya Listrik          14173 non-null  int64 
 4   Jumlah Lantai         14173 non-null  object
 5   Kamar Mandi           14173 non-null  int64 
 6   Kamar Mandi Pembantu  14173 non-null  int64 
 7   Kamar Pembantu        14173 non-null  int64 
 8   Kamar Tidur           14173 non-null  int64 
 9   Kondisi Perabotan     14173 non-null  object
 10  Luas Bangunan         14173 non-null  int64 
 11  Luas Tanah            14173 non-null  int64 
dtypes: int64(8), object(4)
memory usage: 1.4+ MB
Code
df = df[df['Jumlah Lantai'].astype(int) <= 4]
Code
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 14161 entries, 0 to 17049
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   price                 14161 non-null  int64 
 1   Kecamatan             14161 non-null  object
 2   komplek               14161 non-null  object
 3   Daya Listrik          14161 non-null  int64 
 4   Jumlah Lantai         14161 non-null  object
 5   Kamar Mandi           14161 non-null  int64 
 6   Kamar Mandi Pembantu  14161 non-null  int64 
 7   Kamar Pembantu        14161 non-null  int64 
 8   Kamar Tidur           14161 non-null  int64 
 9   Kondisi Perabotan     14161 non-null  object
 10  Luas Bangunan         14161 non-null  int64 
 11  Luas Tanah            14161 non-null  int64 
dtypes: int64(8), object(4)
memory usage: 1.4+ MB
Code
df5 = df.copy()
Code
#rename columns
df.rename(columns={'Daya Listrik': 'Listrik', 
                   'Jumlah Lantai': 'Lantai',
                   'Kamar Mandi': 'KM',
                   'Kamar Mandi Pembantu': 'KMP',
                   'Kamar Pembantu': 'KP',
                   'Kamar Tidur': 'KT',
                   'Kondisi Perabotan': 'Kondisi',
                   'Luas Bangunan': 'LB',
                   'Luas Tanah': 'LT'}, inplace=True)
Code
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 14161 entries, 0 to 17049
Data columns (total 12 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   price      14161 non-null  int64 
 1   Kecamatan  14161 non-null  object
 2   komplek    14161 non-null  object
 3   Listrik    14161 non-null  int64 
 4   Lantai     14161 non-null  object
 5   KM         14161 non-null  int64 
 6   KMP        14161 non-null  int64 
 7   KP         14161 non-null  int64 
 8   KT         14161 non-null  int64 
 9   Kondisi    14161 non-null  object
 10  LB         14161 non-null  int64 
 11  LT         14161 non-null  int64 
dtypes: int64(8), object(4)
memory usage: 1.4+ MB
Code
df.Kecamatan.unique()
array(['BSD City', 'Poris', 'Karang Tengah', 'Pondok Benda',
       'Tangerang Kota', 'BSD Delatinos', 'Serua', 'Karawaci', 'Jombang',
       'Cikupa', 'Curug', 'Suradita', 'Panongan', 'BSD Residence One',
       'Kelapa Dua', 'Pondok Ranji', 'Ciledug', 'Pondok Cabe',
       'Gading Serpong', 'Pagedangan', 'Legok', 'Pinang', 'Cikokol',
       'Dadap', 'BSD Foresta', 'BSD Green Wich', 'Cireundeu',
       'BSD Anggrek Loka', 'Cipondoh', 'BSD Nusaloka', 'BSD Puspita Loka',
       'Gading Serpong Pondok Hijau Golf', 'BSD The Green', 'Cipadu',
       'Larangan', 'BSD Giri Loka', 'Cikupa Citra Raya', 'Modernland',
       'Benda', 'Rempoa', 'Alam Sutera', 'BSD', 'Cisauk', 'Graha Raya',
       'Lippo Karawaci'], dtype=object)

Basic Exploratory Data Analysis

Code
#Let's split the categorical and numerical columns into separate dataframes for easier analysis.
dfnum = df._get_numeric_data()
dfcat = df.drop(columns = dfnum.columns)

Descriptive Statistics

Code
#numerical columns describe.
dfnum.describe()
price Listrik KM KMP KP KT LB LT
count 1.416100e+04 14161.000000 14161.000000 14161.000000 14161.000000 14161.000000 14161.000000 1.416100e+04
mean 4.525861e+03 3892.131205 2.717958 0.419603 0.419603 3.521715 187.472848 4.820822e+04
std 1.518706e+05 6346.367405 2.124905 0.535776 0.535776 2.272806 209.443483 5.714284e+06
min 7.000000e+00 -1300.000000 1.000000 0.000000 0.000000 1.000000 1.000000 7.000000e+00
25% 1.200000e+03 2200.000000 2.000000 0.000000 0.000000 3.000000 81.000000 8.300000e+01
50% 2.000000e+03 2200.000000 2.000000 0.000000 0.000000 3.000000 135.000000 1.260000e+02
75% 3.500000e+03 4400.000000 3.000000 1.000000 1.000000 4.000000 225.000000 2.050000e+02
max 1.800000e+07 95000.000000 99.000000 10.000000 10.000000 99.000000 8032.000000 6.800000e+08

All of the columns above contain outliers. The ‘Listrik’ column even has a negative value, caused by incorrect user input.

Code
df['Listrik'] = abs(df['Listrik'])
Code
dfcat.describe()
Kecamatan komplek Lantai Kondisi
count 14161 14161 14161 14161
unique 45 2 4 2
top Poris tidak 2 Unfurnished
freq 503 7852 8818 8014

There are 45 ‘Kecamatan’ represented in this sample.

Univariate Analysis

Code
fig, axarr = plt.subplots(2,4, figsize=(11, 8))
for col in dfnum.columns:
    loc = dfnum.columns.get_loc(col)
    y = 0 if loc < 4 else 1
    x = loc if loc < 4 else loc - 4
    sns.boxplot(y = dfnum[col], ax = axarr[y,x])
    axarr[y, x].set_ylabel(None)
    axarr[y, x].set_xlabel(col)
plt.suptitle('Outliers checking on numeric columns', weight='bold')
fig.tight_layout(pad=1)
plt.show()

Code
#removing extreme values
df = df[(df.price < 10000)]
df = df[(df['Listrik'] < 20000)]
df = df[(df['KM'] < 10)]
df = df[(df['KMP'] <= 2)]
df = df[(df['KP'] <= 2)]
df = df[(df['KT'] <= 20)]
df = df[(df['LB'] <= 1000)]
df = df[(df['LT'] <= 1000)]
Code
dfnum = df._get_numeric_data()
dfcat = df.drop(columns = dfnum.columns)
Code
fig, axarr = plt.subplots(2,4, figsize=(11, 8))
for col in dfnum.columns:
    loc = dfnum.columns.get_loc(col)
    y = 0 if loc < 4 else 1
    x = loc if loc < 4 else loc - 4
    sns.histplot(data=dfnum[col], 
                 color='skyblue', 
                 kde=True, 
                 edgecolor='none', 
                 ax=axarr[y, x])
plt.suptitle('Distribution checking on numeric columns', weight='bold')
fig.tight_layout(pad=1)
plt.show()

We can clearly see that the data is right-skewed.
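
The skew can also be quantified numerically; a quick sketch using pandas’ skew() on the numeric columns:

Code
#positive values confirm the right skew seen in the histograms above
print(dfnum.skew().sort_values(ascending=False))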

Code
fig, axarr = plt.subplots(2,2, figsize=(11, 8))
for col in dfcat.columns:
    loc = dfcat.columns.get_loc(col)
    y = 0 if loc < 2 else 1
    x = loc if loc < 2 else loc - 2
    sns.countplot(x=df[col], color='green', ax = axarr[y,x], order=df[col].value_counts().iloc[:4].index)

plt.suptitle('Countplot of categorical columns', weight='bold')
fig.tight_layout(pad=1)
plt.show()

Multivariate Analysis

Code
plt.figure(figsize=(10,8))
sns.heatmap(dfnum.corr(), annot=True, fmt='.2f')
plt.title('Correlation plot', weight ='bold')
plt.xticks(rotation = 0)
plt.show()

price, LB, and LT exhibit strong positive correlations with one another. Although the correlation is quite strong, it does not lead to multicollinearity issues here.
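
The multicollinearity claim can be checked explicitly with variance inflation factors. This is only a sketch using statsmodels (which is not imported in the original notebook); VIF values far above roughly 10 would signal a problem:

Code
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

#VIF for each numeric predictor (excluding the target 'price')
X_vif = add_constant(dfnum.drop(columns=['price']))
vif = pd.Series([variance_inflation_factor(X_vif.values, i)
                 for i in range(X_vif.shape[1])], index=X_vif.columns)
print(vif.drop('const').round(2))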

Deep-Dive Exploratory Data Analysis

Code
df['KMR'] = df.KP + df.KT
df['KMD'] = df.KM + df.KMP
df['price/m'] = df.price / (df.LT + df.LB) 
df['Lantai'] = df.Lantai.astype(int)
df.drop(columns=['KM','KMP','KP','KT'], inplace=True)

I’ve developed a new feature named ‘price/m’ by dividing the price by the sum of LT and LB. This feature is more suitable for analysis compared to using just the price alone.

Code
sns.histplot(df.groupby('Kecamatan').size(), alpha=0.8)
plt.xlabel('Kecamatan Frequency')
plt.title('Kecamatan distribution count', weight='bold')
plt.show()

The plot above illustrates the distribution of sample counts per Kecamatan. The counts vary, from roughly 100 at the low end to about 500 at the high end. For the smallest groups, estimates such as the median price may be unreliable, so further analysis is needed to determine whether those samples are adequate.

Code
fig, axarr = plt.subplots(1,2, figsize=(10, 4))
data = df.groupby('Kecamatan').size().reset_index(drop=False).sort_values(0, ascending=False)
data.columns = ['Kecamatan','count']
sns.barplot(data = data.head(15), x='count', y='Kecamatan', ax = axarr[0])
sns.barplot(data = data.tail(15), x='count', y='Kecamatan', ax = axarr[1], palette='RdBu')
axarr[0].set_title('Top 15 highest kecamatan count', weight='bold')
axarr[1].set_title('Top 15 lowest kecamatan count', weight='bold')
axarr[0].set_position([0.1, 0.1, 0.3, 0.8])
axarr[1].set_position([0.6, 0.1, 0.3, 0.8])

Code
df['Kecamatan'] = df.Kecamatan.str.replace('Gading Serpong Pondok Hijau Golf' , 'Gading Serpong PHG')
Code
fig, axarr = plt.subplots(1,2, figsize=(10, 4))
data = df.groupby('Kecamatan')[['price']].agg('median').reset_index().sort_values('price', ascending=False)
sns.barplot(data = data.head(15), x='price', y='Kecamatan', ax = axarr[0], palette='plasma')
sns.barplot(data = data.tail(15), x='price', y='Kecamatan', ax = axarr[1], palette='coolwarm')
plt.title('harga per kecamatan')
axarr[0].set_title('Top 15 highest kecamatan price', weight='bold')
axarr[1].set_title('Top 15 lowest kecamatan price', weight='bold')
axarr[0].set_position([0.1, 0.1, 0.3, 0.8])
axarr[1].set_position([0.6, 0.1, 0.3, 0.8])

BSD is one of the most expensive regions in Tangerang.

Code
fig, axarr = plt.subplots(1,2, figsize=(10, 4))
data = df.groupby('Kecamatan')[['price/m']].agg('median').reset_index().sort_values('price/m', ascending=False)
sns.barplot(data = data.head(15), x='price/m', y='Kecamatan', ax = axarr[0], palette = 'YlGn')
sns.barplot(data = data.tail(15), x='price/m', y='Kecamatan', ax = axarr[1], palette = 'YlGnBu')
plt.title('harga meteran per kecamatan')
axarr[0].set_title('Top 15 highest kecamatan price/m', weight='bold')
axarr[1].set_title('Top 15 lowest kecamatan price/m', weight='bold')
axarr[0].set_position([0.1, 0.1, 0.3, 0.8])
axarr[1].set_position([0.6, 0.1, 0.3, 0.8])

Measured by price/m, BSD still ranks among the highest regions.

Code
df.Kondisi = df.Kondisi.map(value_mapping)
Code
df['TK'] = df.KMR + df.KMD
df.drop(columns=['KMR','KMD'], inplace=True)
Code
df['avg_price'] = df.groupby(['Kecamatan','Kondisi'])['price/m'].transform('median')
Code
df['rstd'] = df.groupby(['Kecamatan','Kondisi'])['price/m'].transform(lambda x: x.std() * 100 / x.mean())
Code
df1 = df[df.Kondisi == 1]
df2 = df[df.Kondisi == 2]

fig, axarr = plt.subplots(1,2, figsize=(10, 4))
data = df1[['Kecamatan','rstd']].drop_duplicates().sort_values('rstd', ascending=False).head()
sns.barplot(data = data, x='rstd', y='Kecamatan', ax = axarr[0])
sns.histplot(data = df1.rstd.drop_duplicates(), ax = axarr[1])
sns.histplot(data = df2.rstd.drop_duplicates(), ax = axarr[1], alpha = 0.5, legend='a')
axarr[0].set_title('Top 5 highest kecamatan price variability', weight='bold')
axarr[1].set_title('r-std.dev price/m per kecamatan', weight='bold')
axarr[0].yaxis.set_tick_params(labelsize=10)

hand2 = [mpatches.Rectangle((0, 0), 1, 1, color='blue', label='Unfurnished')]
hand1 = [mpatches.Rectangle((0, 0), 1, 1, color='orange', label='Furnished')]
legend1 = plt.legend(handles = hand2, loc='upper right', labelcolor='blue')
legend2 = plt.legend(handles = hand1, loc='upper left', labelcolor='orange')
plt.gca().add_artist(legend1)
plt.show()

R-std, short for relative standard deviation, is a useful tool for assessing variability and comparing groups. In the plot above, the r-std values are spread over a substantial range of roughly 30%. A high r-std for a group’s price typically indicates either too few samples or critical price-driving factors that have not been accounted for.

Code
#create a variable to contain avg_price information from the data to be used by test data
avg = df[['Kecamatan','Kondisi','avg_price']].drop_duplicates()
Code
#remove columns that are no longer needed
df9 = df.copy()
df = df.drop(columns=['Kecamatan','price/m','rstd'])
Code
#encoding
df['komplek'] =  df[['komplek']].apply(lambda x: 1 if x.komplek == 'ya' else 0, axis = 1)
Code
#create a dataframe to collect the results from every combination
final_result = pd.DataFrame(index = ['mae','mape','rmse','r2'])

Modelling Combination 1

For the first combination, I will utilize all available features and apply linear regression.

Code
dfc1 = df.copy().reset_index(drop=True)
Code
dfc1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13428 entries, 0 to 13427
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   price      13428 non-null  int64  
 1   komplek    13428 non-null  int64  
 2   Listrik    13428 non-null  int64  
 3   Lantai     13428 non-null  int64  
 4   Kondisi    13428 non-null  int64  
 5   LB         13428 non-null  int64  
 6   LT         13428 non-null  int64  
 7   TK         13428 non-null  int64  
 8   avg_price  13428 non-null  float64
dtypes: float64(1), int64(8)
memory usage: 944.3 KB
Code
X_train, X_test, y_train, y_test = train_test_split(dfc1.drop(columns='price'), dfc1.price.to_numpy(), test_size = 0.2, random_state=129)
X_train = X_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)
Code
model = LinearRegression()
model.fit(X_train, y_train)
LinearRegression()
Code
y_pred_test = model.predict(X_test)
y_pred_train = model.predict(X_train)
Code
result = pd.DataFrame()
mae_test = mean_absolute_error(y_test, y_pred_test)
mape_test = mean_absolute_percentage_error(y_test, y_pred_test)
rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_test))
r2_test = r2_score(y_test, y_pred_test)

mae_train = mean_absolute_error(y_train, y_pred_train)
mape_train = mean_absolute_percentage_error(y_train, y_pred_train)
rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_train))
r2_train = r2_score(y_train, y_pred_train)

result['test_OLS'] = [mae_test, mape_test, rmse_test, r2_test]
result['train_OLS'] = [mae_train, mape_train, rmse_train, r2_train]
result.index = ['mae','mape','rmse','r2']
Code
result
test_OLS train_OLS
mae 532.715149 541.468771
mape 0.278840 0.329897
rmse 800.262693 805.231411
r2 0.797440 0.794544

Above is the result of the base model.

Code
final_result = pd.concat([final_result, result], axis = 1)

Modelling Combination 2

In this second combination, I intend to enhance model complexity by incorporating polynomial features. If overfitting concerns arise, I will address them by applying ridge regression.

Code
dfc2 = df.copy().reset_index(drop=True) 
Code
dfc2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13428 entries, 0 to 13427
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   price      13428 non-null  int64  
 1   komplek    13428 non-null  int64  
 2   Listrik    13428 non-null  int64  
 3   Lantai     13428 non-null  int64  
 4   Kondisi    13428 non-null  int64  
 5   LB         13428 non-null  int64  
 6   LT         13428 non-null  int64  
 7   TK         13428 non-null  int64  
 8   avg_price  13428 non-null  float64
dtypes: float64(1), int64(8)
memory usage: 944.3 KB
Code
scaler = MinMaxScaler()
df1 = scaler.fit_transform(dfc2.drop(columns=['price']))
df1 = pd.DataFrame(df1, columns = dfc2.drop(columns=['price']).columns)
df1['price'] = dfc2.price
dfc2 = df1.copy()
Code
X_train, X_test, y_train, y_test = train_test_split(dfc2.drop(columns='price'), dfc2.price.to_numpy(), test_size=0.2, random_state=129)
Code
result = pd.DataFrame()
for x in [2,3,4]:
    poly_features = PolynomialFeatures(degree=x)
    X_poly = poly_features.fit_transform(X_train)
    X_train_poly, X_test_poly, y_train_poly, y_test_poly = train_test_split(X_poly, y_train, test_size=0.2, random_state=129)

    model = LinearRegression()
    model.fit(X_train_poly, y_train_poly)

    y_pred_test = model.predict(X_test_poly)
    y_pred_train = model.predict(X_train_poly)

    mae_test = mean_absolute_error(y_test_poly, y_pred_test)
    rmse_test = np.sqrt(mean_squared_error(y_test_poly, y_pred_test))
    r2_test = r2_score(y_test_poly, y_pred_test)

    mae_train = mean_absolute_error(y_train_poly, y_pred_train)
    rmse_train = np.sqrt(mean_squared_error(y_train_poly, y_pred_train))
    r2_train = r2_score(y_train_poly, y_pred_train)

    result[f'test_{x}'] = [mae_test, rmse_test, r2_test]
    result[f'train_{x}'] = [mae_train, rmse_train, r2_train]

result.index = ['mae','rmse','r2']
result
test_2 train_2 test_3 train_3 test_4 train_4
mae 472.915901 481.615889 472.519472 464.954008 501.537211 442.182946
rmse 712.046550 738.563833 719.862027 705.295298 834.318214 660.588687
r2 0.837082 0.827754 0.833486 0.842922 0.776326 0.862204

I tested a polynomial degree of 3, and the results indicate a minor degree of overfitting. However, when I increased it to degree 4, the model exhibited a severe overfitting issue.

Code
result = pd.DataFrame()
for x in [3,4]:
    poly_features = PolynomialFeatures(degree=x)
    X_poly = poly_features.fit_transform(X_train)
    X_train_poly, X_test_poly, y_train_poly, y_test_poly = train_test_split(X_poly, y_train, test_size=0.2, random_state=129)

    for alpha in [0.01, 0.1, 1., 10]:
        model = Ridge(alpha=alpha)
        model.fit(X_train_poly, y_train_poly)

        y_pred_test = model.predict(X_test_poly)
        y_pred_train = model.predict(X_train_poly)

        mae_test = mean_absolute_error(y_test_poly, y_pred_test)
        rmse_test = np.sqrt(mean_squared_error(y_test_poly, y_pred_test))
        r2_test = r2_score(y_test_poly, y_pred_test)

        mae_train = mean_absolute_error(y_train_poly, y_pred_train)
        rmse_train = np.sqrt(mean_squared_error(y_train_poly, y_pred_train))
        r2_train = r2_score(y_train_poly, y_pred_train)

        result[f'test_{x}_{alpha}'] = [mae_test, rmse_test, r2_test]
        result[f'train_{x}_{alpha}'] = [mae_train, rmse_train, r2_train]

result.index = ['mae','rmse','r2']
result
test_3_0.01 train_3_0.01 test_3_0.1 train_3_0.1 test_3_1.0 train_3_1.0 test_3_10 train_3_10 test_4_0.01 train_4_0.01 test_4_0.1 train_4_0.1 test_4_1.0 train_4_1.0 test_4_10 train_4_10
mae 467.342704 466.439950 464.299064 470.035509 470.187015 478.290378 484.975821 490.051383 464.840133 456.548113 462.897803 463.601116 466.596826 473.665648 481.134291 487.347859
rmse 704.461373 708.507262 698.403249 717.934342 707.437585 731.792609 726.156035 749.558649 706.885343 690.605679 695.774749 704.905074 700.618287 723.246246 720.228905 743.215049
r2 0.840534 0.841488 0.843265 0.837242 0.839184 0.830898 0.830561 0.822587 0.839435 0.849397 0.844443 0.843096 0.842270 0.834824 0.833316 0.825578

I attempted to use ridge regression to tackle the overfitting of the higher-degree polynomial features, but the outcome was no better than simply using a polynomial degree of 2 without any regularization.

Code
poly_features = PolynomialFeatures(degree=2)
X_train = poly_features.fit_transform(X_train)
X_test = poly_features.transform(X_test)
Code
model = LinearRegression()
model.fit(X_train, y_train)
LinearRegression()
Code
y_pred_test = model.predict(X_test)
y_pred_train = model.predict(X_train)
Code
result = pd.DataFrame()
mae_test = mean_absolute_error(y_test, y_pred_test)
mape_test = mean_absolute_percentage_error(y_test, y_pred_test)
rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_test))
r2_test = r2_score(y_test, y_pred_test)

mae_train = mean_absolute_error(y_train, y_pred_train)
mape_train = mean_absolute_percentage_error(y_train, y_pred_train)
rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_train))
r2_train = r2_score(y_train, y_pred_train)

result['test_PolyRidgeOLS'] = [mae_test, mape_test, rmse_test, r2_test]
result['train_PolyRidgeOLS'] = [mae_train, mape_train, rmse_train, r2_train]
result.index = ['mae','mape','rmse','r2']
Code
result
test_PolyRidgeOLS train_PolyRidgeOLS
mae 471.090238 478.917020
mape 0.227019 0.276395
rmse 729.230240 732.438213
r2 0.831804 0.830011

The model is not overfitting, and the result is better than the base model.

Code
final_result = pd.concat([final_result, result], axis = 1)

Modelling Combination 3

Let’s try to log transform skewed features and use linear regression.

Code
dfc3 = df.copy().reset_index(drop=True)
Code
column = ['LB','LT','TK']
for col in column:
    dfc3[col] = np.log(dfc3[col])
Code
dfc3.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13428 entries, 0 to 13427
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   price      13428 non-null  int64  
 1   komplek    13428 non-null  int64  
 2   Listrik    13428 non-null  int64  
 3   Lantai     13428 non-null  int64  
 4   Kondisi    13428 non-null  int64  
 5   LB         13428 non-null  float64
 6   LT         13428 non-null  float64
 7   TK         13428 non-null  float64
 8   avg_price  13428 non-null  float64
dtypes: float64(4), int64(5)
memory usage: 944.3 KB
Code
X_train, X_test, y_train, y_test = train_test_split(dfc3.drop(columns='price'), dfc3.price.to_numpy(), test_size = 0.2, random_state=129)
X_train = X_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)
Code
model = LinearRegression()
model.fit(X_train, y_train)
LinearRegression()
Code
y_pred_test = model.predict(X_test)
y_pred_train = model.predict(X_train)
Code
result = pd.DataFrame()
mae_test = mean_absolute_error(y_test, y_pred_test)
mape_test = mean_absolute_percentage_error(y_test, y_pred_test)
rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_test))
r2_test = r2_score(y_test, y_pred_test)

mae_train = mean_absolute_error(y_train, y_pred_train)
mape_train = mean_absolute_percentage_error(y_train, y_pred_train)
rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_train))
r2_train = r2_score(y_train, y_pred_train)

result['test_LogOLS'] = [mae_test, mape_test, rmse_test, r2_test]
result['train_LogOLS'] = [mae_train, mape_train, rmse_train, r2_train]
result.index = ['mae','mape','rmse','r2']
Code
result
test_LogOLS train_LogOLS
mae 583.179251 595.413840
mape 0.329967 0.394927
rmse 835.080845 841.121663
r2 0.779431 0.775821

The results show a performance decline compared to using Linear Regression on the original, untransformed features.

Code
final_result = pd.concat([final_result, result], axis = 1)

Modelling Combination 4

For this combination, let’s try XGBRegressor

Code
dfc4 = df.copy()
Code
dfc4.info()
<class 'pandas.core.frame.DataFrame'>
Index: 13428 entries, 0 to 17049
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   price      13428 non-null  int64  
 1   komplek    13428 non-null  int64  
 2   Listrik    13428 non-null  int64  
 3   Lantai     13428 non-null  int64  
 4   Kondisi    13428 non-null  int64  
 5   LB         13428 non-null  int64  
 6   LT         13428 non-null  int64  
 7   TK         13428 non-null  int64  
 8   avg_price  13428 non-null  float64
dtypes: float64(1), int64(8)
memory usage: 1.0 MB
Code
X_train, X_test, y_train, y_test = train_test_split(dfc4.drop(columns='price'), dfc4.price.to_numpy(), test_size = 0.2, random_state=129)
X_train = X_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)
Code
xgb_regressor = XGBRegressor(
    objective='reg:squarederror',  
    eval_metric='rmse',           
    random_state=42               
)
Code
param_grid = {  
    'learning_rate': [0.001, 0.01],  
    'n_estimators': [100, 400, 900, 1000],  
    'max_depth': [3, 4],  
    'eval_metric': ['rmse'],  
}
Code
grid_search = GridSearchCV(
    estimator=xgb_regressor,
    param_grid=param_grid,
    scoring='neg_mean_squared_error',  
    cv=5,                              
    verbose=1,                         
    n_jobs=-1                           
)
Code
grid_search.fit(X_train, y_train)
Fitting 5 folds for each of 16 candidates, totalling 80 fits
GridSearchCV(cv=5,
             estimator=XGBRegressor(base_score=None, booster=None,
                                    callbacks=None, colsample_bylevel=None,
                                    colsample_bynode=None,
                                    colsample_bytree=None, device=None,
                                    early_stopping_rounds=None,
                                    enable_categorical=False,
                                    eval_metric='rmse', feature_types=None,
                                    gamma=None, grow_policy=None,
                                    importance_type=None,
                                    interaction_constraints=None,
                                    learning_rate=None...
                                    max_depth=None, max_leaves=None,
                                    min_child_weight=None, missing=nan,
                                    monotone_constraints=None,
                                    multi_strategy=None, n_estimators=None,
                                    n_jobs=None, num_parallel_tree=None,
                                    random_state=42, ...),
             n_jobs=-1,
             param_grid={'eval_metric': ['rmse'],
                         'learning_rate': [0.001, 0.01], 'max_depth': [3, 4],
                         'n_estimators': [100, 400, 900, 1000]},
             scoring='neg_mean_squared_error', verbose=1)
Code
best_params_xgb = grid_search.best_params_
Code
best_params_xgb
{'eval_metric': 'rmse',
 'learning_rate': 0.01,
 'max_depth': 4,
 'n_estimators': 1000}
Code
xgboost = XGBRegressor(**best_params_xgb, random_state=42)
Code
xgboost.fit(X_train, y_train)
XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric='rmse', feature_types=None,
             gamma=None, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=0.01, max_bin=None,
             max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=4, max_leaves=None,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             multi_strategy=None, n_estimators=1000, n_jobs=None,
             num_parallel_tree=None, random_state=42, ...)
Code
y_pred_test = xgboost.predict(X_test)
y_pred_train = xgboost.predict(X_train)
Code
result = pd.DataFrame()
mae_test = mean_absolute_error(y_test, y_pred_test)
mape_test = mean_absolute_percentage_error(y_test, y_pred_test)
rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_test))
r2_test = r2_score(y_test, y_pred_test)

mae_train = mean_absolute_error(y_train, y_pred_train)
mape_train = mean_absolute_percentage_error(y_train, y_pred_train)
rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_train))
r2_train = r2_score(y_train, y_pred_train)

result['test_XGB'] = [mae_test, mape_test, rmse_test, r2_test]
result['train_XGB'] = [mae_train, mape_train, rmse_train, r2_train]
result.index = ['mae','mape','rmse','r2']
Code
result
test_XGB train_XGB
mae 438.290716 419.337764
mape 0.213093 0.253128
rmse 678.516114 626.639712
r2 0.854384 0.875573

The model is slightly overfit, but it is the best model so far.
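
One hedge against this mild overfitting would be early stopping on a validation split, so XGBoost keeps only as many trees as actually help. The sketch below is not part of the original run and assumes an xgboost version where early_stopping_rounds is a constructor argument (as in the estimator repr above):

Code
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train,
                                            test_size=0.2, random_state=42)
xgb_es = XGBRegressor(**best_params_xgb, random_state=42,
                      early_stopping_rounds=50)
xgb_es.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
print(xgb_es.best_iteration)  #number of boosting rounds actually kept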

Code
final_result = pd.concat([final_result, result], axis = 1)
Code
xgb.plot_importance(xgboost)  # You can also use 'gain' or 'cover' for importance_type
plt.show()

Modelling Combination 5

For this combination I will use RandomForest. However, RandomForest is computationally expensive in this case because of the many continuous features.

Code
dfc5 = df.copy().reset_index(drop=True) #df[['LT','avg_price','distance','LB','Listrik','price']]
Code
dfc5.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13428 entries, 0 to 13427
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   price      13428 non-null  int64  
 1   komplek    13428 non-null  int64  
 2   Listrik    13428 non-null  int64  
 3   Lantai     13428 non-null  int64  
 4   Kondisi    13428 non-null  int64  
 5   LB         13428 non-null  int64  
 6   LT         13428 non-null  int64  
 7   TK         13428 non-null  int64  
 8   avg_price  13428 non-null  float64
dtypes: float64(1), int64(8)
memory usage: 944.3 KB
Code
X_train, X_test, y_train, y_test = train_test_split(dfc5.drop(columns='price'), dfc5.price.to_numpy(), test_size = 0.2, random_state=129)
X_train = X_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)
Code
randomforest = RandomForestRegressor(random_state=42)
Code
param_grid = {
    'n_estimators': [600],
    'max_depth': [8, 7],
    'min_samples_split': [10, 50, 100],
}
Code
grid_search = GridSearchCV(estimator=randomforest, param_grid=param_grid, cv=5, n_jobs=-1)
Code
grid_search.fit(X_train, y_train)
GridSearchCV(cv=5, estimator=RandomForestRegressor(random_state=42), n_jobs=-1,
             param_grid={'max_depth': [8, 7],
                         'min_samples_split': [10, 50, 100],
                         'n_estimators': [600]})
Code
best_params_rf = grid_search.best_params_
Code
best_params_rf
{'max_depth': 8, 'min_samples_split': 10, 'n_estimators': 600}
Code
rf = RandomForestRegressor(**best_params_rf, random_state=42)
Code
rf.fit(X_train, y_train)
RandomForestRegressor(max_depth=8, min_samples_split=10, n_estimators=600,
                      random_state=42)
Code
y_pred_test = rf.predict(X_test)
y_pred_train = rf.predict(X_train)
Code
result = pd.DataFrame()
mae_test = mean_absolute_error(y_test, y_pred_test)
mape_test = mean_absolute_percentage_error(y_test, y_pred_test)
rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_test))
r2_test = r2_score(y_test, y_pred_test)

mae_train = mean_absolute_error(y_train, y_pred_train)
mape_train = mean_absolute_percentage_error(y_train, y_pred_train)
rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_train))
r2_train = r2_score(y_train, y_pred_train)

result['test_RF'] = [mae_test, mape_test, rmse_test, r2_test]
result['train_RF'] = [mae_train, mape_train, rmse_train, r2_train]
result.index = ['mae','mape','rmse','r2']
Code
result
test_RF train_RF
mae 438.359894 399.306010
mape 0.216506 0.243069
rmse 687.059730 599.451504
r2 0.850694 0.886136

The result is more or less the same as the XGBoost model.

Code
final_result = pd.concat([final_result, result], axis = 1)

Modelling Combination 6

For this last combination, let’s see how AdaBoost performs.

Code
dfc6 = df.copy().reset_index(drop=True) #df[['LT','avg_price','distance','LB','Listrik','price']]
Code
dfc6.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13428 entries, 0 to 13427
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   price      13428 non-null  int64  
 1   komplek    13428 non-null  int64  
 2   Listrik    13428 non-null  int64  
 3   Lantai     13428 non-null  int64  
 4   Kondisi    13428 non-null  int64  
 5   LB         13428 non-null  int64  
 6   LT         13428 non-null  int64  
 7   TK         13428 non-null  int64  
 8   avg_price  13428 non-null  float64
dtypes: float64(1), int64(8)
memory usage: 944.3 KB
Code
X_train, X_test, y_train, y_test = train_test_split(dfc6.drop(columns='price'), dfc6.price.to_numpy(), test_size = 0.2, random_state=129)
X_train = X_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)
Code
adaboost = AdaBoostRegressor(random_state = 42)
Code
param_grid = {
    'n_estimators': [50, 150, 250],
    'learning_rate': [0.01, 0.1, 1, 0.001]
}
Code
grid_search = GridSearchCV(estimator=adaboost, param_grid=param_grid, cv=5, n_jobs=-1)
Code
grid_search.fit(X_train, y_train)
GridSearchCV(cv=5, estimator=AdaBoostRegressor(random_state=42), n_jobs=-1,
             param_grid={'learning_rate': [0.01, 0.1, 1, 0.001],
                         'n_estimators': [50, 150, 250]})
Code
best_params_ada = grid_search.best_params_
Code
best_params_ada
{'learning_rate': 0.1, 'n_estimators': 50}
Code
ada = AdaBoostRegressor(**best_params_ada, random_state=42)
Code
ada.fit(X_train, y_train)
AdaBoostRegressor(learning_rate=0.1, random_state=42)
Code
y_pred_test = ada.predict(X_test)
y_pred_train = ada.predict(X_train)
Code
result = pd.DataFrame()
mae_test = mean_absolute_error(y_test, y_pred_test)
mape_test = mean_absolute_percentage_error(y_test, y_pred_test)
rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_test))
r2_test = r2_score(y_test, y_pred_test)

mae_train = mean_absolute_error(y_train, y_pred_train)
mape_train = mean_absolute_percentage_error(y_train, y_pred_train)
rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_train))
r2_train = r2_score(y_train, y_pred_train)

result['test_ADA'] = [mae_test, mape_test, rmse_test, r2_test]
result['train_ADA'] = [mae_train, mape_train, rmse_train, r2_train]
result.index = ['mae','mape','rmse','r2']
Code
result
test_ADA train_ADA
mae 574.532983 580.901639
mape 0.329233 0.398783
rmse 838.365578 816.508925
r2 0.777692 0.788748
Code
final_result = pd.concat([final_result, result], axis = 1)

Choosing The Best Combination

Let’s compare the combinations and choose the best one.

Code
final_result
test_OLS train_OLS test_PolyRidgeOLS train_PolyRidgeOLS test_LogOLS train_LogOLS test_XGB train_XGB test_RF train_RF test_ADA train_ADA
mae 532.715149 541.468771 471.090238 478.917020 583.179251 595.413840 438.290716 419.337764 438.359894 399.306010 574.532983 580.901639
mape 0.278840 0.329897 0.227019 0.276395 0.329967 0.394927 0.213093 0.253128 0.216506 0.243069 0.329233 0.398783
rmse 800.262693 805.231411 729.230240 732.438213 835.080845 841.121663 678.516114 626.639712 687.059730 599.451504 838.365578 816.508925
r2 0.797440 0.794544 0.831804 0.830011 0.779431 0.775821 0.854384 0.875573 0.850694 0.886136 0.777692 0.788748

Looking at the results, the two best models are XGBoost and RandomForest. I’ll go with XGBoost because it is far less computationally expensive than RandomForest.
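
For reference, the top features can also be read programmatically from the fitted model instead of the importance plot; a sketch assuming the `xgboost` regressor fitted in combination 4 and matching feature columns, which is how one could justify the columns kept below:

Code
#rank features by the fitted booster's importance scores (sketch)
importances = pd.Series(xgboost.feature_importances_,
                        index=X_train.columns).sort_values(ascending=False)
print(importances.head(4))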

Code
#choosing only the most important features for XGB
df = df[['LT','LB','Listrik','price','avg_price']]

X_train, X_test, y_train, y_test = train_test_split(df.drop(columns='price'), df.price.to_numpy(), test_size = 0.2, random_state=123)
X_train = X_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)
Code
best_xgb_regressor = XGBRegressor(**best_params_xgb, random_state=42)
Code
best_xgb_regressor.fit(X_train, y_train)
XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric='rmse', feature_types=None,
             gamma=None, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=0.01, max_bin=None,
             max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=4, max_leaves=None,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             multi_strategy=None, n_estimators=1000, n_jobs=None,
             num_parallel_tree=None, random_state=42, ...)
Code
y_pred_test = best_xgb_regressor.predict(X_test)
y_pred_train = best_xgb_regressor.predict(X_train)
Code
result = pd.DataFrame()
mae_test = mean_absolute_error(y_test, y_pred_test)
mape_test = mean_absolute_percentage_error(y_test, y_pred_test)
rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_test))
r2_test = r2_score(y_test, y_pred_test)

mae_train = mean_absolute_error(y_train, y_pred_train)
mape_train = mean_absolute_percentage_error(y_train, y_pred_train)
rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_train))
r2_train = r2_score(y_train, y_pred_train)

result['test_XGB'] = [mae_test, mape_test, rmse_test, r2_test]
result['train_XGB'] = [mae_train, mape_train, rmse_train, r2_train]
result.index = ['mae','mape','rmse','r2']
Code
result
test_XGB train_XGB
mae 459.897612 429.131317
mape 0.307846 0.244483
rmse 699.862902 643.969481
r2 0.846041 0.868387

The results remain favorable and consistent.
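
Since the introduction mentions deploying a public app, the chosen regressor can be persisted at this point and loaded later without retraining; a minimal sketch with joblib (the filename is illustrative):

Code
import joblib
joblib.dump(best_xgb_regressor, 'xgb_tangerang.joblib')  #save the fitted model
#later, inside the app:
model = joblib.load('xgb_tangerang.joblib')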

Model Evaluation & Interpretation

Here, I will assess and interpret the model’s performance using a new set of sample data that I collected separately from the main dataset.

Code
test = pd.read_csv('test.csv')
test = pd.merge(test, avg , on=['Kecamatan','Kondisi']).convert_dtypes()

Add the avg_price column to test data.

Code
test = test[(test.price < 10000) & (test.price > 100)]
test = test[test.LB < 900]
test = test[test.LT < 500]
test = test[['LT','avg_price','LB','Listrik','price']]

Clean the test data

Code
test_x = test[['LT','LB','Listrik','avg_price']]
test_y = test.price
Code
y_pred_test = best_xgb_regressor.predict(test_x)
y_test = test_y
Code
result = pd.DataFrame()
mae_test = mean_absolute_error(y_test, y_pred_test)
mape_test = mean_absolute_percentage_error(y_test, y_pred_test)
rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_test))
r2_test = r2_score(y_test, y_pred_test)

mae_train = mean_absolute_error(y_train, y_pred_train)
mape_train = mean_absolute_percentage_error(y_train, y_pred_train)
rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_train))
r2_train = r2_score(y_train, y_pred_train)

result['test_XGB'] = [mae_test, mape_test, rmse_test, r2_test]
result['train_XGB'] = [mae_train, mape_train, rmse_train, r2_train]
result.index = ['mae','mape','rmse','r2']
Code
result
test_XGB train_XGB
mae 531.780306 429.131317
mape 0.214570 0.244483
rmse 885.063409 643.969481
r2 0.785323 0.868387

The model performs worse on the separately scraped test data.

Residuals Plot

Code
res = pd.DataFrame([list(y_train), list(y_pred_train)]).transpose()
res.columns = ['test','pred']
res['residual'] = res.test - res.pred
res['residualp'] = abs(res.test - res.pred) * 100 / res.test
Code
fig = px.scatter(x=res['test'], y=res['residualp'])
fig.update_layout( 
    height = 800,
    title = 'Residual\'s absolute percentage plot'
)
fig.add_hline(y=res['residualp'].mean(), line_dash="dot", line_color="red")
fig

From the residual percentages above we can see that (a sketch of how these shares are computed follows the list):
1. 7.5% of the data has an APE above 50%.
2. 59% of the data has an APE below 20%.
3. 32.5% of the data has an APE between 20% and 50%.
4. Beyond a price of roughly 4,000 (million Rupiah), the model gets progressively worse as prices keep rising.
5. There are extreme APE errors in the 0 to 2,000 price range, which indicates anomalous listings in the data.
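
A sketch of how those APE shares can be computed from the `res` dataframe above:

Code
ape = res['residualp']
print(f"APE > 50%  : {(ape > 50).mean():.1%}")
print(f"APE < 20%  : {(ape < 20).mean():.1%}")
print(f"APE 20-50% : {ape.between(20, 50).mean():.1%}")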

Code
fig = px.scatter(x=res['test'], y=res['residual'])
fig.update_layout(  
    height = 700,
    title = 'Residuals plot'
)
fig.add_hline(y=0, line_dash="dot", line_color="red")
fig

The unequal variance in the data implies the presence of heteroscedasticity: higher prices are associated with greater variability. This is expected, since more expensive houses involve more factors in determining their prices. The model also consistently underpredicts as house prices increase. I think we need to gather more variables or predictors that influence house prices.
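
The heteroscedasticity can also be checked more formally, for example with a Breusch-Pagan test applied descriptively to these residuals. This sketch uses statsmodels (not part of the original analysis) and assumes `res` and the reduced-feature `X_train` from the cells above:

Code
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.tools.tools import add_constant

#regresses squared residuals on the predictors; a tiny p-value means the
#residual variance changes with the predictors (heteroscedasticity)
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(
    res['residual'], add_constant(X_train))
print(f"Breusch-Pagan LM p-value: {lm_pvalue:.4g}")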

PDP & ICE Plots

Code
def pdp_ice_plot(model, df_test, column, clusters=True):
    df_test = df_test.copy()
    pdp_isolate = pdp.PDPIsolate(model = model, df = df_test, 
                      num_grid_points=20,
                      model_features = df_test.columns, 
                      feature = column, feature_name=column, n_classes=0)
    pdp_isolate.plot(
                center=True, plot_lines=True,
                cluster=clusters, n_cluster_centers=5,cluster_method='accurate',
                plot_pts_dist=False, to_bins=True,
                show_percentile=False, which_classes=None,
                figsize=(10,6), dpi=300,
                ncols=2, plot_params={"pdp_hl": True},
                engine='matplotlib', template='plotly_white')
    
    plt.show()
    return pdp_isolate
Code
pdp_ice_plot(best_xgb_regressor, test_x, 'LT', clusters=False)
obtain pred_func from the provided model.

<pdpbox.pdp.PDPIsolate at 0x7f92d07a8a30>

The larger the LT, the higher the price. But as LT grows, the predicted price also becomes more dispersed.

Code
test1 = test_x.copy()
pdp_interact = pdp.PDPInteract(model=best_xgb_regressor, df=test1,
                              num_grid_points=10,
                              model_features = test1.columns, 
                              features=['LB','avg_price'], 
                              feature_names=['LB','avg_price'], n_classes=0)
obtain pred_func from the provided model.
obtain pred_func from the provided model.
obtain pred_func from the provided model.
Code
pdp_ice_plot(best_xgb_regressor, test_x, 'avg_price', clusters=True)
obtain pred_func from the provided model.

<pdpbox.pdp.PDPIsolate at 0x7f92c44449d0>

The higher the avg_price, the higher the price, but not all observations are affected equally: some are highly affected, others less so.

Code
fig, axes = pdp_interact.plot(
    plot_type="grid",
    to_bins=True,
    plot_pdp=True,
    show_percentile=True,
    which_classes=None,
    figsize=None,
    dpi=300,
    ncols=2,
    plot_params=None,
    engine="plotly",
    template="plotly_white",
)
fig

Combining avg_price with LB shows that houses with a small LB are affected far less than those with a large LB; the predicted price is more volatile at higher LB.

Conclusion

I have concerns regarding the model’s accuracy, primarily because the residuals are still relatively high. This can be attributed to the limited sample size and the exclusion of critical factors influencing prices. Among these factors, ‘avg_price’ is of utmost importance. To enhance the accuracy of ‘avg_price’, it is imperative to expand the sample size and consider a broader spectrum of variables affecting prices. While I’ve used ‘Kecamatan’ and ‘Kondisi Perabotan’ here to estimate ‘avg_price’, I believe these variables alone are insufficient. Moving forward, I plan to further develop this project by gathering additional data from various real estate websites to create a more robust and accurate model.