To predict future house prices in Pakistan, we evaluated a wide range of candidate models and, after an in-depth evaluation, selected the four found most appropriate for this data set. These four models were preferred over the other available options because they handle diverse data relationships and other complexities in the data efficiently, and can process data with different characteristics. The amount of available data, the quality of that data, computational resources, and the desired level of interpretability also factored into the decision. We chose four models rather than a single one so that their performance could be compared using common evaluation metrics, as described later. The Linear Regression model is well known for its simplicity, whereas the KNN Regression model can capture complex nonlinear patterns while accommodating noisy data and outliers. The Random Forest method uses an ensemble of trees, providing robustness against overfitting and insight into feature importance. The Decision Tree model offers its own perks, such as interpretability and adaptability to various feature types. Together, these four models cover a diverse range of strengths, enabling us to address the different aspects that matter when making predictions on a given data set.

Two commonly used evaluation metrics, applied across all four models, are the Root Mean Squared Error (RMSE) and the Coefficient of Determination (R-squared). These metrics provide valuable insights into the performance of the predictive models. The mathematical formulations for these metrics are as follows:

### Root Mean Squared Error (RMSE):

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(y_{\mathrm{actual}_i} - y_{\mathrm{pred}_i}\right)^2}$$

### Coefficient of Determination (R-squared):

$$R^2 = 1 - \frac{\sum_{i=1}^{n} \left(y_{\mathrm{actual}_i} - y_{\mathrm{pred}_i}\right)^2}{\sum_{i=1}^{n} \left(y_{\mathrm{actual}_i} - \bar{y}\right)^2}$$

where:

• $y_{\mathrm{actual}_i}$ represents the actual observed value of the dependent variable for the i-th data point.

• $y_{\mathrm{pred}_i}$ represents the predicted value of the dependent variable for the i-th data point.

• $\bar{y}$ represents the mean of the actual observed values of the dependent variable.
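As a sketch, the two metrics above can be computed directly from their definitions; the values below are synthetic and purely illustrative, not drawn from the study's data set:

```python
import numpy as np

def rmse(y_actual, y_pred):
    """Root Mean Squared Error: sqrt of the mean squared residual."""
    y_actual, y_pred = np.asarray(y_actual), np.asarray(y_pred)
    return np.sqrt(np.mean((y_actual - y_pred) ** 2))

def r_squared(y_actual, y_pred):
    """Coefficient of Determination: 1 - SS_res / SS_tot."""
    y_actual, y_pred = np.asarray(y_actual), np.asarray(y_pred)
    ss_res = np.sum((y_actual - y_pred) ** 2)
    ss_tot = np.sum((y_actual - np.mean(y_actual)) ** 2)
    return 1 - ss_res / ss_tot

# Toy example: four observations and their predictions
y_true = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.5, 5.0, 7.5, 9.0]
print(round(rmse(y_true, y_hat), 4))       # → 0.3536
print(round(r_squared(y_true, y_hat), 4))  # → 0.975
```

An R-squared close to 1 indicates the model explains most of the variance in the actual values, while a lower RMSE indicates smaller prediction errors in the target's own units.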

### Linear Regression

Linear Regression, which was also found to be very commonly used in previous similar studies, serves several purposes in house price prediction. Joby (2021) states that Linear Regression is a simple and interpretable statistical technique; here it was used to model the relationship between the target variable, which in our case was Price, and all the previously stated independent variables. The main purpose of using it is to find the best-fitting linear equation, which explains how changes in the stated independent variables are associated with changes in the target variable (Price).

A simple linear regression model can be expressed mathematically as:

Y = β0 + β1 · x

In the above equation:

• Y represents the predicted target variable (which in our case is the house price).

• x represents the input feature (e.g., total area, number of bedrooms, etc.).

• β0 is the y-intercept, representing the value of Y when x is 0.

• β1 is the coefficient of the feature x, representing the change in Y for a unit change in x.

Our case consists of multiple features, and therefore we developed an extended equation, written below:

Y = β0 + β1 · latitude + β2 · longitude + β3 · baths + β4 · bedrooms + β5 · Total_Area + β6 · property_type_new + β7 · city_new + β8 · province_name_new + β9 · purpose_new

Where latitude, longitude, baths, bedrooms, Total_Area, property_type_new, city_new, province_name_new, and purpose_new represent the values of the corresponding features, and β0 to β9 are the coefficients associated with each feature, indicating their respective influence on the predicted house price. Note that each coefficient β tells us how much the predicted house price changes for a unit change in the corresponding feature, assuming all other features remain constant. This equation also assumes a linear relationship between the target variable and the independent variables, which may or may not hold in reality. The equation above represents the hypothesis of the linear regression model, allowing us to make predictions from the coefficients learned during training. In cases where the relationship between the dependent and independent variables is more complex, we would instead switch to other techniques, such as polynomial regression or more advanced machine learning algorithms, which are better suited and can produce more accurate predictions.
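A minimal sketch of fitting such a model with scikit-learn is shown below. The data here is synthetic (the price-generating coefficients 120 and 300,000 and the two stand-in features are assumptions for illustration, not results from the Pakistani data set); in practice the model would be fitted on all nine features listed above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins for two of the features (Total_Area, bedrooms)
total_area = rng.uniform(500, 5000, n)
bedrooms = rng.integers(1, 6, n).astype(float)

# Hypothetical linear price process with Gaussian noise
price = 50_000 + 120 * total_area + 300_000 * bedrooms + rng.normal(0, 50_000, n)

X = np.column_stack([total_area, bedrooms])
model = LinearRegression().fit(X, price)

# Learned beta_0 (intercept) and [beta_1, beta_2] (per-feature coefficients)
print(model.intercept_, model.coef_)
```

Because the synthetic data really is linear, the learned coefficients land close to the generating values, illustrating the "unit change in a feature, all else constant" interpretation of each β.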

### K-Nearest Neighbor Regression

KNN is another very useful machine learning tool, commonly used for predicting house prices due to the simplicity it offers. It enables us to understand the relationship between the predictor and the independent variables. However, when using KNN, we need to make sure that we carefully tune the hyperparameters, handle data preprocessing, and consider the computational requirements, as mentioned in the study by Analytics Vidhya (2018).

KNN is used for both classification and regression tasks. It helps us find the desired results in cases where there is no particular mathematical relationship between the dependent and independent variables, or when dealing with nonlinear data. Moreover, KNN holds great significance for predicting house prices because it predicts the price of a property based on the prices of its closest neighbors. Enjoy Algorithms (n.d.) emphasizes the uniqueness of this method, as it does not explicitly map the input variables to the target variable. Unlike other methods, KNN does not rely on learning parameters from the training data and fitting a function; instead, new test samples are predicted using the memorized training data.

Mathematically, the predicted output $y_{\mathrm{pred}}$ for a new data point is the average of the output values of its k nearest neighbors:

$$y_{\mathrm{pred}} = \frac{1}{k} \sum_{i=1}^{k} y_{\mathrm{neighbor}_i}$$

Where:

• k is the number of neighbors.

• $y_{\mathrm{neighbor}_i}$ is the output value of the i-th nearest neighbor.
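A minimal sketch with scikit-learn's `KNeighborsRegressor` follows, using synthetic single-feature data (the feature, the linear price signal, and k = 5 are illustrative assumptions). Feature scaling is included because KNN's distance computations are sensitive to feature magnitudes:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic stand-in for a single feature (e.g. Total_Area) and a noisy price
total_area = rng.uniform(500, 5000, (300, 1))
price = 120 * total_area[:, 0] + rng.normal(0, 30_000, 300)

# Scale features first: distance-based methods like KNN require it
knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
knn.fit(total_area, price)

# Prediction = average price of the 5 nearest training points
pred = knn.predict([[2500.0]])
print(pred)
```

Nothing is "learned" at fit time beyond storing the training data; the averaging over neighbors happens at prediction time, which is what makes KNN a memorization-based method.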

### Decision Tree Regressor Model

According to Coursera (2023), the Decision Tree Regressor is another type of machine learning model for regression tasks, and it was implemented in this study for house price prediction on the Pakistani data set. It is an extension of the conventional decision tree algorithm, which is mainly used for classification. However, as a supervised learning algorithm, the decision tree can be used for both regression and classification, with the regressor variant focused on predicting continuous numerical values (regression) rather than discrete classes (classification).
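A minimal sketch of a Decision Tree Regressor on synthetic single-feature data is shown below (the data, the `max_depth=4` setting, and the linear price signal are illustrative assumptions, not the study's actual configuration). Limiting depth is one common way to keep a single tree from overfitting:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)

# Synthetic stand-in feature and noisy continuous target (price)
X = rng.uniform(500, 5000, (300, 1))
y = 120 * X[:, 0] + rng.normal(0, 20_000, 300)

# max_depth caps the number of splits, trading fit for generalization
tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)

# Prediction = mean target value of the leaf the input falls into
pred = tree.predict([[3000.0]])
print(pred)
```

Unlike a classifier, each leaf stores the mean of the continuous target values that reached it, which is what makes this a regression tree.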

### Random Forest Regression

The Random Forest method was also included in our study. Like KNN, it is very commonly used for both classification and regression tasks. As the name suggests, the method builds decision trees that split the data into a tree-like structure based on the values of the features the data possesses; each tree consists of various nodes, each based on data features and leading to a certain prediction. The potential issue with a single decision tree is that it risks overfitting the data, capturing noise and failing to generalize appropriately to new data. Therefore, we used the Random Forest method, in which a collection of decision trees is trained, each on a different subset of the data and features, enabling us to escape the issues that a single decision tree can pose.
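A minimal sketch with scikit-learn's `RandomForestRegressor` on synthetic two-feature data follows (the features, their coefficients, and `n_estimators=100` are illustrative assumptions). Each of the 100 trees is trained on a bootstrap sample of the rows and a random subset of features, and the forest also exposes the feature-importance insights mentioned above:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)

# Synthetic stand-ins for two features (e.g. Total_Area, bedrooms-like scale);
# the first feature is deliberately given a much stronger effect on price
X = rng.uniform(500, 5000, (300, 2))
y = 120 * X[:, 0] + 50 * X[:, 1] + rng.normal(0, 20_000, 300)

# An ensemble of trees, each fit on a bootstrap sample with feature subsampling
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Importances sum to 1; the dominant feature should rank first here
print(forest.feature_importances_)
```

Averaging the predictions of many decorrelated trees is what gives the forest its robustness against the overfitting that a single deep tree exhibits.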