Data Analytics for Finance (Assignment)
1. Summary of the Process of Analysis
The task is to determine how several factors (explanatory, or independent, variables) affect house price (the dependent
variable). The hypothesis is that house price can be predicted from the 15 factors provided, such as lotSize and age.
To test this hypothesis, I used multiple regression analysis: I fitted a multiple regression model for the house price (y)
on its explanatory variables (x) using the available data. The final model uses 13 explanatory variables, counting the
transformed and interaction terms listed below. Based on the multiple regression analysis, the key findings are as
follows:
a. The explanatory variables (lotSize, landValue, livingArea, rooms, and bathrooms), the dummy variables (waterfront and
centralAir), and the interaction term fireplaces*livingArea have a positive relationship with the house price.
b. In contrast, the explanatory variables (fireplaces and bedrooms), the dummy variable (newConstruction), the
transformed variable SQRT(Age), and the interaction term lotSize*landValue have a negative relationship with the
house price.
After running diagnostics on the model, I refined it by omitting outliers from the sample, transforming variables, and
including interaction terms, in order to build a more robust regression model.
2. Multiple Linear Regression Analysis
a. Exploratory Data Analysis (EDA)
- Exploring the dataset showed several categorical variables (heating, fuel, sewer, waterfront,
newConstruction, and centralAir). These were converted to numerical form by creating new dummy variables
from the originals. For example, waterfront is a yes/no variable indicating whether the house has a waterfront,
and it is recoded as 1/0. Similarly, the heating variable is transformed into three dummy variables,
heating_electric, heating_hotair, and heating_hotwater, each coded 1/0.
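The 1/0 coding described above can be sketched in Python. This is a minimal illustration, not the procedure actually used: the raw category labels ("Yes"/"No", "electric", "hot air", "hot water") and the function name encode are assumptions, and only the dummy names come from the text.

```python
# Hypothetical raw rows; the category labels are assumed for illustration.
rows = [
    {"waterfront": "No", "heating": "hot air"},
    {"waterfront": "Yes", "heating": "electric"},
    {"waterfront": "No", "heating": "hot water"},
]


def encode(row):
    """Return the 1/0 dummy coding for one observation."""
    return {
        "waterfront": 1 if row["waterfront"] == "Yes" else 0,
        "heating_electric": 1 if row["heating"] == "electric" else 0,
        "heating_hotair": 1 if row["heating"] == "hot air" else 0,
        "heating_hotwater": 1 if row["heating"] == "hot water" else 0,
    }


encoded = [encode(r) for r in rows]
for e in encoded:
    print(e)
```

If the three heating levels are exhaustive, one of the three dummies has to be left out as the baseline whenever they enter a regression with an intercept; otherwise the dummies sum to the intercept column and the coefficients are not identified.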
- The outliers: Looking across all 1,728 observations, I cleaned the data by removing outliers. Graphical EDA,
specifically scatterplots of lotSize, landValue, Age, and livingArea (x-axis) against house price (y-axis), as
shown in Figures 1.1-1.4, revealed 18 outliers. Some of these were also confirmed numerically by calculating
the IQR of the house prices. After removal, 1,710 observations remained.
Figure 1.1-1.4. Scatterplots of several explanatory variables
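The numerical IQR check can be sketched as follows. The text does not state which cut-off was applied, so the conventional 1.5 x IQR fences are an assumption here, and the prices are made-up values rather than observations from the dataset.

```python
import statistics


def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles.

    k = 1.5 is the conventional choice and an assumption here.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical house prices for illustration only.
prices = [120_000, 135_000, 150_000, 160_000, 175_000, 190_000, 210_000, 775_000]

low, high = iqr_fences(prices)
outliers = [p for p in prices if p < low or p > high]
print(low, high, outliers)
```

Points flagged this way are candidates for review rather than automatic deletions, which is consistent with confirming them against the scatterplots.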
- Based on the correlation analysis, a few variables show a strong relationship with house price: livingArea
(0.71), landValue (0.56), bathrooms (0.60), and rooms (0.53) are highly correlated with house prices.
However, rooms, bedrooms, bathrooms, and livingArea are also correlated with each other.
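The correlations quoted above are Pearson coefficients of this kind. The sketch below computes one by hand on made-up numbers; the data and the function name pearson are hypothetical, and the values are not expected to reproduce the figures in the text.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical observations for illustration only.
living_area = [900, 1200, 1500, 1800, 2400, 3000]
rooms = [4, 5, 6, 7, 9, 10]
price = [110_000, 150_000, 165_000, 210_000, 260_000, 340_000]

print("livingArea vs price:", round(pearson(living_area, price), 2))
print("livingArea vs rooms:", round(pearson(living_area, rooms), 2))
```

The second call is the kind of predictor-to-predictor check behind the remark that rooms, bedrooms, bathrooms, and livingArea move together; strong correlations of that sort can make individual coefficients unstable.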
b. Multiple Regression
- I ran the first multiple regression using all the explanatory variables. Some variables, namely age,
pctCollege, sewer, fireplaces, heating, and fuel, had P-values above 0.05, which suggests that they are not
statistically significant in the linear relationship with price. I therefore eliminated some of them
(pctCollege, sewer, heating, and fuel). However, I retained age and fireplaces, because they are believed to
be meaningfully related to price.
- Because age and fireplaces are believed to be related to price, I kept both but changed how they enter the
model. I transformed age to SQRT(Age) using the square root function; this made the relationship more linear
and improved the model's fit. For fireplaces, I found that livingArea interacts with fireplaces, so I created an
interaction term between livingArea and fireplaces. The two variables are positively correlated: more
fireplaces tend to indicate a bigger house.
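The transformed and interaction terms can be built as shown in this short sketch. The dictionary keys for the new columns and the example house are hypothetical; the lotSize*landValue product is the second interaction term named in the summary of findings.

```python
import math


def add_features(house):
    """Return a copy of the record with the derived regression terms added."""
    out = dict(house)
    # Square-root transform of age, written SQRT(Age) in the text.
    out["sqrt_age"] = math.sqrt(house["age"])
    # Interaction terms: simple products of the two original variables.
    out["fireplaces_x_livingArea"] = house["fireplaces"] * house["livingArea"]
    out["lotSize_x_landValue"] = house["lotSize"] * house["landValue"]
    return out


# Hypothetical house for illustration only.
house = {"age": 16, "fireplaces": 2, "livingArea": 1500,
         "lotSize": 0.5, "landValue": 40_000}

print(add_features(house))
```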
- I then ran the multiple regression analysis again. The final model's overall regression was statistically
significant, with 𝑅^2 = 0.65, and every explanatory variable significantly predicted the house price (all
P-values < 0.05).
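For readers who want to reproduce this kind of fit outside a spreadsheet, the sketch below fits an ordinary least squares model with NumPy and computes 𝑅^2. It runs on synthetic data with known coefficients, not on the house price data, and all variable names are hypothetical.

```python
import numpy as np

# Synthetic data: three predictors with known coefficients plus small noise.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
true_beta = np.array([2.0, -1.0, 0.5])
y = 10.0 + X @ true_beta + rng.normal(scale=0.1, size=n)

# Add an intercept column and solve the least squares problem.
A = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

# R^2 = 1 - SS_residual / SS_total.
fitted = A @ beta_hat
residuals = y - fitted
ss_res = float(residuals @ residuals)
ss_tot = float(((y - y.mean()) ** 2).sum())
r_squared = 1.0 - ss_res / ss_tot

print("intercept and slopes:", np.round(beta_hat, 2))
print("R^2:", round(r_squared, 3))
```

This sketch only recovers coefficients and 𝑅^2; the P-values discussed above would need standard errors as well, which a dedicated statistics package reports directly.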
Figure 1.5. Regression statistics results
lotSize = 𝑥_1, Age = 𝑥_2, landValue = 𝑥_3, livingArea = 𝑥_4, bedrooms = 𝑥_5, fireplaces = 𝑥_6, bathrooms = 𝑥_7,
rooms = 𝑥_8, waterfront = 𝑥_9, newConstruction = 𝑥_10, centralAir = 𝑥_11
c. More Regression Diagnostics
I analyzed the residual output by making a histogram of the residuals (Figure 1.6) and a residuals vs. fitted
plot (Figure 1.7) to check my regression model further. Three key assumptions underlie regression modeling, so
I checked whether the residuals satisfy them. As Figures 1.6 and 1.7 show, the residuals appear to satisfy all
three:
- The residuals follow a normal distribution (see Figure 1.6).
- The residuals are uncorrelated with each other, i.e., random (see Figure 1.7).
- The variance of the residuals appears relatively constant, i.e., homoscedastic (see Figure 1.7).
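The checks above were made visually. As a rough numerical complement, and not the method used in this analysis, the sketch below computes the skewness of the residuals (near 0 for a symmetric, roughly normal shape) and the ratio of residual spread between the upper and lower halves of the fitted values (near 1 under constant variance). The data are simulated and the function names are hypothetical.

```python
import random
import statistics


def skewness(residuals):
    """Sample skewness; values near 0 indicate a symmetric distribution."""
    mean = statistics.fmean(residuals)
    sd = statistics.pstdev(residuals)  # assumes the residuals are not all equal
    return sum(((r - mean) / sd) ** 3 for r in residuals) / len(residuals)


def spread_ratio(fitted, residuals):
    """Residual std. dev. for the upper half of fitted values over the lower half."""
    pairs = sorted(zip(fitted, residuals))
    half = len(pairs) // 2
    low = [r for _, r in pairs[:half]]
    high = [r for _, r in pairs[half:]]
    return statistics.pstdev(high) / statistics.pstdev(low)


# Simulated, well-behaved residuals for illustration only.
random.seed(0)
fitted = [random.uniform(100_000, 400_000) for _ in range(2000)]
resid = [random.gauss(0, 20_000) for _ in range(2000)]

sim_skew = skewness(resid)
sim_ratio = spread_ratio(fitted, resid)
print("skewness:", round(sim_skew, 3))
print("spread ratio:", round(sim_ratio, 3))
```

A spread ratio well above 1 would suggest that the residual variance grows with the predicted price, which is the funnel shape one looks for in a residuals vs. fitted plot.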
Figure 1.6. Residuals histogram
Figure 1.7. Residuals scatter plot
Figure 1.8. Residuals QQ plot
3. Conclusions
In my analysis, I constructed a multiple linear regression model to predict house prices from 13 key factors. The
independent variables significantly predict the house price, 𝐹(13, 1696) = 247, 𝑝 < 0.001, which indicates that the
13 factors under study have a significant impact on house prices. Moreover, 𝑅^2 = 0.646 indicates that the model
explains 64.6% of the variance in house price. The overall estimated regression equation is as follows:
The predicted house price is equal to 34,374 + 16,709(lotSize) – 2,910(SQRT(Age)) + 0.96(landValue) –
0.18(lotSize*landValue) + 61.53(livingArea) – 7,019(bedrooms) – 18,206(fireplaces) + 10.53(fireplaces*livingArea)
+ 19,637(bathrooms) + 3,013(rooms) + 124,765(waterfront) – 45,621(newConstruction) + 14,802(centralAir), where each
coefficient is the change in predicted price per one-unit increase in the corresponding term, holding the other
terms constant.
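The estimated equation can be written as a small Python function to make predictions and to read off the interaction effect. The coefficients are taken from the equation above; the function names and the argument names are hypothetical, and the dummy variables are assumed to be coded 1/0 as in the EDA section.

```python
import math


def predict_price(lot_size, age, land_value, living_area, bedrooms,
                  fireplaces, bathrooms, rooms, waterfront,
                  new_construction, central_air):
    """Predicted house price from the estimated regression equation."""
    return (
        34_374
        + 16_709 * lot_size
        - 2_910 * math.sqrt(age)
        + 0.96 * land_value
        - 0.18 * lot_size * land_value
        + 61.53 * living_area
        - 7_019 * bedrooms
        - 18_206 * fireplaces
        + 10.53 * fireplaces * living_area
        + 19_637 * bathrooms
        + 3_013 * rooms
        + 124_765 * waterfront
        - 45_621 * new_construction
        + 14_802 * central_air
    )


def fireplace_effect(living_area):
    """Change in predicted price from one extra fireplace at a given livingArea."""
    return -18_206 + 10.53 * living_area


print(fireplace_effect(1000))  # negative for a smaller house
print(fireplace_effect(2000))  # positive for a larger house
```

Because of the interaction term, the effect of an extra fireplace is not a single number: it is –18,206 + 10.53(livingArea), which changes sign at a livingArea of roughly 1,729 (in whatever units livingArea is recorded).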
Variables such as lotSize, landValue, livingArea, bathrooms, rooms, and centralAir positively impact house prices.
Conversely, bedrooms and fireplaces have negative coefficients, indicating that, with the other variables held
constant, more bedrooms and fireplaces tend to reduce the predicted value. Moreover, the interaction terms are
informative: the combination of a larger livingArea and more fireplaces increases house prices, while the product
of lotSize and landValue has a diminishing effect. Notably, waterfront properties command a significant premium.
These findings offer a comprehensive view of house price determinants and provide a useful tool for estimating
house values. The regression model has improved; however, some questions remain. The negative coefficient on the
newConstruction variable runs contrary to the common-sense expectation that new houses command higher prices.
Consequently, this needs to be investigated further with more samples and a more robust regression analysis.
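The overall F statistic reported in the Conclusions is linked to 𝑅^2 by the standard identity F = (𝑅^2 / k) / ((1 – 𝑅^2) / (n – k – 1)). The sketch below evaluates it, assuming n = 1,710 observations (after outlier removal) and k = 13 predictors as stated in the text; the function name is hypothetical.

```python
def f_statistic(r_squared, n_obs, n_predictors):
    """Overall F statistic and its degrees of freedom, computed from R^2."""
    df_model = n_predictors
    df_resid = n_obs - n_predictors - 1
    f_value = (r_squared / df_model) / ((1.0 - r_squared) / df_resid)
    return f_value, df_model, df_resid


f_value, df_model, df_resid = f_statistic(0.646, 1710, 13)
print(df_model, df_resid, round(f_value, 1))
```

The degrees of freedom match the reported 𝐹(13, 1696). With the rounded 𝑅^2 = 0.646, the identity gives an F of roughly 238 rather than the reported 247, so the reported values are worth re-checking against the regression output in Figure 1.5.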