Data Analytics for Finance on House Price Data, Assignments of Data Analysis & Statistical Methods

The task is to estimate house price based on several explanatory variables like “lotSize”, “age”, “landValue”, etc. You may choose to use only some of the variables to explain house price if you like.

Typology: Assignments

2022/2023

Uploaded on 03/05/2024

thai-aye-captain
Data Analytics for Finance (Assignment)
1. Summary of the Process of Analysis
The task is to determine how several factors (explanatory, or independent, variables) affect house price (the
dependent variable). The hypothesis is that house price can be predicted from the 15 factors provided, such as
lotSize and age. To test this hypothesis, I used multiple regression analysis: I fitted a multiple regression model
for the house price (y) on its explanatory variables (x). The final model uses 13 explanatory variables, the
combination that gave the best estimate of house price. Based on the multiple regression analysis, the key findings
are as follows:
a. The explanatory variables lotSize, landValue, livingArea, rooms, and bathrooms, the dummy variables waterfront
and centralAir, and the interaction term fireplaces*livingArea have a positive relationship with house price.
b. In contrast, the explanatory variables fireplaces and bedrooms, the dummy variable newConstruction, the
transformed variable SQRT(Age), and the interaction term lotSize*landValue have a negative relationship with
house price.
After running diagnostics on the initial model, I refined it by removing outliers from the sample, transforming
variables, and adding interaction terms to build a more robust regression model.
2. Multiple Linear Regression Analysis
a. Exploratory Data Analysis (EDA)
- Exploring the dataset showed several categorical variables (heating, fuel, sewer, waterfront,
newConstruction, and centralAir). I recoded these as numerical dummy variables derived from the original
variables. For example, waterfront is a yes/no variable indicating whether the house has a waterfront, and it
is recoded as 1/0. Similarly, the heating variable is split into three dummy variables, heating_electric,
heating_hotair, and heating_hotwater, each coded 1/0.
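The 1/0 recoding described above can be sketched in a few lines of Python. This is only an illustration, not the procedure actually used for the assignment: the raw labels ("Yes"/"No", "electric", "hot air", "hot water") and the helper name encode_row are assumptions, while the output column names follow the text.

```python
# Assumed raw heating labels mapped to the dummy column names given in the text.
HEATING_LEVELS = {
    "electric": "heating_electric",
    "hot air": "heating_hotair",
    "hot water": "heating_hotwater",
}
YES_NO_COLUMNS = ("waterfront", "newConstruction", "centralAir")


def encode_row(row):
    """Return the 1/0 dummy coding for one observation (hypothetical helper)."""
    encoded = {}
    # Yes/no variables become a single 1/0 column each.
    for name in YES_NO_COLUMNS:
        encoded[name] = 1 if row[name].strip().lower() == "yes" else 0
    # A multi-level variable becomes one 1/0 column per level.
    heating = row["heating"].strip().lower()
    for level, column in HEATING_LEVELS.items():
        encoded[column] = 1 if heating == level else 0
    return encoded


# Made-up example observation.
example = {"waterfront": "No", "newConstruction": "No",
           "centralAir": "Yes", "heating": "hot air"}
encoded = encode_row(example)
```

If the three heating levels are exhaustive, entering all three dummies together with an intercept would make them perfectly collinear, so one level would normally be left out as the baseline.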
Figure 1.1-1.4. Scatterplots of several explanatory variables
The outliers
Starting from all 1,728 observations, I cleaned the data by removing outliers. Graphical EDA, specifically
scatterplots of lotSize, landValue, Age, and livingArea (x-axis) against house price (y-axis), as shown in
Figures 1.1-1.4, revealed 18 outliers. Some of these were also confirmed numerically by calculating the IQR of
the house prices. After their removal, 1,710 observations remained.
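As a rough illustration of the numerical check, the sketch below flags prices outside the usual 1.5 x IQR fences. The 1.5 multiplier, the quartile method, and the sample prices are all assumptions; the text only states that the IQR of the house prices was calculated.

```python
import statistics


def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up prices (in thousands), not the assignment data.
prices = [100, 120, 130, 150, 160, 180, 200, 900]
lower, upper = iqr_fences(prices)

# Observations outside the fences are candidate outliers.
outliers = [p for p in prices if p < lower or p > upper]
cleaned = [p for p in prices if lower <= p <= upper]
```

A flagged point is only a candidate: as in the text, it is worth confirming it against the scatterplots before dropping it.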
- Based on the correlation analysis, only a few variables show a strong relationship with house price: livingArea
(0.71), landValue (0.56), bathrooms (0.60), and rooms (0.53) are highly correlated with house price.
However, rooms, bedrooms, bathrooms, and livingArea are also correlated with each other.
b. Multiple Regression
- I first ran a multiple regression using all the explanatory variables. Some variables, such as age, pctCollege,
sewer, fireplaces, heating, and fuel, had high P-values (> 0.05), which suggests that they are not statistically
significant in a linear relationship with price, so I eliminated some of them (pctCollege, sewer, heating, and
fuel). I kept age and fireplaces, however, because I expected them to be meaningfully related to price.
- Because age and fireplaces are believed to be related to price, I transformed age to SQRT(Age) using the
square root function. This transformation made the relationship more linear and improved the model's fit. For
fireplaces, I found that it interacts with livingArea, so I created an interaction term between livingArea and
fireplaces. The two variables are positively correlated: houses with more fireplaces tend to be bigger.
- I then ran the multiple regression again. The final model's overall regression was statistically significant
(R^2 = 0.65), and every explanatory variable significantly predicted house price (P-value < 0.05 for all
variables).
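The engineered terms used in the final model (the square-root transform of age and the two interaction products) are simple to construct. The sketch below shows one way to do it, assuming each observation is a dictionary of numeric values; the new column names and the helper add_engineered_features are hypothetical.

```python
import math


def add_engineered_features(row):
    """Return a copy of the observation with the transformed and interaction terms added."""
    out = dict(row)
    # Square-root transform of age, used to make its relationship with price more linear.
    out["sqrt_age"] = math.sqrt(row["age"])
    # An interaction term is the product of the two original variables.
    out["fireplaces_x_livingArea"] = row["fireplaces"] * row["livingArea"]
    out["lotSize_x_landValue"] = row["lotSize"] * row["landValue"]
    return out


# Made-up observation for illustration.
house = {"age": 16, "fireplaces": 2, "livingArea": 1500,
         "lotSize": 0.5, "landValue": 40000}
features = add_engineered_features(house)
```

The original variables stay in the row next to the new columns, which matches the final model: fireplaces, livingArea, lotSize, and landValue appear both on their own and inside the interaction terms.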

Figure 1.5. Regression statistics results

lotSize = x_1, Age = x_2, landValue = x_3, livingArea = x_4, bedrooms = x_5, fireplaces = x_6, bathrooms = x_7,
rooms = x_8, waterfront = x_9, newConstruction = x_10, centralAir = x_11

c. More Regression Diagnostics

I analyzed the residual output by plotting a histogram of the residuals (Figure 1.6) and a residuals vs. fitted
plot (Figure 1.7) to check the regression model further. Three key assumptions underlie regression modeling, so
I checked whether the residuals satisfy them. As Figures 1.6 and 1.7 show, the residuals satisfy all three:

- The residuals follow a normal distribution (see Figure 1.6).

- The residuals are uncorrelated with each other, i.e. random (see Figure 1.7).

- The variance of the residuals appears relatively constant, i.e. homoscedastic (see Figure 1.7).
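The checks above are visual. As a possible numerical companion (not something done in the original analysis), the sketch below computes two simple summaries from the residuals: the Durbin-Watson statistic for first-order correlation, and the ratio of residual variance between the upper and lower halves of the fitted values. The function names and the sample numbers are hypothetical.

```python
import statistics


def durbin_watson(residuals):
    """Durbin-Watson statistic; values near 2 suggest no first-order autocorrelation."""
    numerator = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


def spread_ratio(fitted, residuals):
    """Residual variance for high fitted values divided by that for low fitted values."""
    pairs = sorted(zip(fitted, residuals))
    half = len(pairs) // 2
    low = [r for _, r in pairs[:half]]
    high = [r for _, r in pairs[-half:]]
    # A ratio far from 1 hints at non-constant variance (heteroscedasticity).
    return statistics.pvariance(high) / statistics.pvariance(low)


# Made-up values for illustration only.
dw = durbin_watson([1, -1, 1, -1])
ratio = spread_ratio([1, 2, 3, 4], [1, -1, 2, -2])
```

Because the Durbin-Watson statistic depends on the order of the rows, it is most informative when the observations have a natural ordering; for cross-sectional house data it is at best a rough sanity check alongside the plots.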

Figure 1.6. Residuals histogram

Figure 1.7. Residuals scatter plot

Figure 1.8. Residuals QQ plot

3. Conclusions

In my analysis, I constructed a multiple linear regression model to predict house prices from 13 key factors. The independent variables significantly predict house price, F(13, 1696) = 247, p < 0.001, which indicates that the 13 factors under study jointly have a significant effect on house prices. Moreover, R^2 = 0.646 indicates that the model explains 64.6% of the variance in house price. The overall estimated regression equation is:

Predicted house price = 34,374 + 16,709(lotSize) - 2,910(SQRT(Age)) + 0.96(landValue) - 0.18(lotSize*landValue) + 61.53(livingArea) - 7,019(bedrooms) - 18,206(fireplaces) + 10.53(fireplaces*livingArea) + 19,637(bathrooms) + 3,013(rooms) + 124,765(waterfront) - 45,621(newConstruction) + 14,802(centralAir)

Each coefficient gives the change in predicted price per one-unit increase in that term, with the other terms held fixed.
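To show how the estimated equation is applied, here is a minimal sketch that plugs one house into it. The coefficients are the ones reported above; the function name predict_price and the example house are hypothetical, and the inputs are assumed to be in the same units as the original data.

```python
import math

INTERCEPT = 34_374
# Coefficients copied from the estimated regression equation.
COEFFICIENTS = {
    "lotSize": 16_709,
    "sqrtAge": -2_910,
    "landValue": 0.96,
    "lotSize*landValue": -0.18,
    "livingArea": 61.53,
    "bedrooms": -7_019,
    "fireplaces": -18_206,
    "fireplaces*livingArea": 10.53,
    "bathrooms": 19_637,
    "rooms": 3_013,
    "waterfront": 124_765,
    "newConstruction": -45_621,
    "centralAir": 14_802,
}


def predict_price(house):
    """Evaluate the regression equation for one house (dict of raw variables)."""
    terms = {
        "lotSize": house["lotSize"],
        "sqrtAge": math.sqrt(house["age"]),
        "landValue": house["landValue"],
        "lotSize*landValue": house["lotSize"] * house["landValue"],
        "livingArea": house["livingArea"],
        "bedrooms": house["bedrooms"],
        "fireplaces": house["fireplaces"],
        "fireplaces*livingArea": house["fireplaces"] * house["livingArea"],
        "bathrooms": house["bathrooms"],
        "rooms": house["rooms"],
        "waterfront": house["waterfront"],
        "newConstruction": house["newConstruction"],
        "centralAir": house["centralAir"],
    }
    return INTERCEPT + sum(COEFFICIENTS[name] * value for name, value in terms.items())


# Made-up house, not an observation from the data set.
example_house = {
    "lotSize": 0.5, "age": 25, "landValue": 50_000, "livingArea": 2_000,
    "bedrooms": 3, "fireplaces": 1, "bathrooms": 2, "rooms": 7,
    "waterfront": 0, "newConstruction": 0, "centralAir": 1,
}
predicted = predict_price(example_house)
```

Because fireplaces enters both on its own and through the interaction term, the effect of one additional fireplace is -18,206 + 10.53 x livingArea. By the reported coefficients, that effect turns positive once livingArea exceeds roughly 1,729.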

Variables such as lotSize, landValue, livingArea, bathrooms, rooms, and centralAir have a positive effect on house price. Conversely, bedrooms and fireplaces have negative coefficients, indicating that, with the other variables held fixed, additional bedrooms and fireplaces tend to reduce the predicted value. The interaction terms add nuance: the combination of a larger livingArea and more fireplaces increases house price, while the product of lotSize and landValue has a dampening effect. Notably, waterfront properties command a significant premium. These findings offer a comprehensive view of house price determinants and provide a useful tool for estimating house values. The regression model has improved; however, some questions remain. The negative coefficient on newConstruction contradicts the common expectation that new houses command higher prices. This needs to be investigated further with more samples and a more robust regression analysis.
