This is Part 1 of a series discussing basics of Linear Regression. This post will be updated as other parts become available.
In this tutorial, we are going to discuss the basics of linear regression. As a working example, we will use the Boston dataset, which contains the median prices of houses in the Boston area. We will not go deep into the mechanics of building a predictive model; there are many other important steps in that process that I cannot cover in this tutorial, for example training/validation/test splits, test error, regularisation and the like. Here we are just testing the waters of linear regression, in line with the chapters I cover in my free Introduction to Machine Learning course on IQmates (posts are coming on other ways to improve this model, in accordance with the chapters of that course). Okay, now that that is out of the way, let’s start.
As an exercise, think about which factors might be important in predicting a house’s price. Maybe:
- Square area of the land/stand of the house
- Number of bedrooms
- Proximity to a school
- Crime rate in the area
We are going to use the Boston dataset, which lives in the MASS library, together with the ISLR package, which contains other datasets. So let’s load those up.
(Remember, if you get the error “Error in library(ISLR) : there is no package called ‘ISLR’”, it means you haven’t installed the required package yet, so just run install.packages("ISLR") first.)
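A minimal sketch of this setup step looks like the following (the commented install line only needs to run once per machine):

```r
# install the packages first if you do not have them yet:
# install.packages(c("MASS", "ISLR"))

library(MASS)  # provides the Boston dataset
library(ISLR)  # provides other datasets used in the course
```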
Great, now we have loaded the packages. Let’s load the dataset itself:
boston_data <- Boston
What factors did the people who put together this dataset think would be important for predicting the price of a house? Let’s see, using names(boston_data):
 [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"
 [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"
If you are building your own dataset, it is highly recommended that you use descriptive column names that are easy to understand. If not, make sure the dataset is well documented so you do not get lost; for example, from the list above, what does the variable “zn” mean? Luckily, the people who made this dataset included a way for us to learn about these variables: simply type ?Boston to see the documentation:
crim - per capita crime rate by town.
zn - proportion of residential land zoned for lots over 25,000 sq.ft.
indus - proportion of non-retail business acres per town.
chas - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox - nitrogen oxides concentration (parts per 10 million).
rm - average number of rooms per dwelling.
age - proportion of owner-occupied units built prior to 1940.
dis - weighted mean of distances to five Boston employment centres.
rad - index of accessibility to radial highways.
tax - full-value property-tax rate per $10,000.
ptratio - pupil-teacher ratio by town.
black - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
lstat - lower status of the population (percent).
medv - median value of owner-occupied homes in $1000s.
Some of our guesses at the beginning of this post were not far off, but clearly there were other variables we missed that may be important factors in determining the median price of a house in the Boston area.
To see how many data points we have, run dim(boston_data):
[1] 506  14
So we have 506 observations (or rows) and 14 variables (or columns). You can view the data frame containing these observations by typing View(boston_data).
This tutorial is not about extensive data exploration; I will write another post dedicated to different methods of data cleaning and data exploration. For now, I will highlight a few aspects to give you a sense of the process. For example, let us create a scatterplot of each town’s crime rate (crim) against the median house price (medv) using the plot() function that comes installed with R.
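One way to draw that scatterplot (assuming the boston_data variable created above; the axis labels are my own choice):

```r
# crime rate on the x-axis, median house price on the y-axis
plot(boston_data$crim, boston_data$medv,
     xlab = "Per capita crime rate (crim)",
     ylab = "Median house price in $1000s (medv)")
```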
The plot makes sense, right? Areas that have low crime rates (closer to the 0 point on the x-axis) have high median house prices (y-values). Conversely, areas with high crime rates (further to the right on the x-axis) have low median house prices (y-values). This fits very well with what we understand about the real world.
Let us do the same with the lstat variable, which is the percentage of the area’s population with a lower socioeconomic status. What you will see is that as lstat increases, medv decreases. Again, this makes sense, doesn’t it? We wouldn’t expect house prices in poorer areas to be higher than in areas with a smaller proportion of less resourced families. If lstat were the only variable in our data, we could conclude it is a good way to estimate house prices in an area and try to fit a predictive model with it. Let us fit a simple linear regression model with medv as the response we want to predict (or target) and lstat as the predictor (or feature). [NOTE: I have discussed this in detail in the Introduction to Machine Learning course on IQmates, so if you do not understand something, feel free to refer to that course as it covers the theoretical concepts of linear regression.]
lm.fit1 = lm(medv ~ lstat, data = boston_data)
This syntax says we are going to fit a linear model (lm) by regressing the response medv onto the predictor lstat from the boston_data dataset. This is a simple linear regression fit because we do not have any powers of the variable involved, i.e. we do not have an (lstat)^2 term.
Let’s see the simple linear model that was fit to this data, by typing lm.fit1:
Call:
lm(formula = medv ~ lstat, data = boston_data)

Coefficients:
(Intercept)        lstat
      34.55        -0.95
As simple linear regression has the format y = mx + c , we see fitting this model to this dataset or, in other words, using lstat to predict medv, the straight line that comes out of that will have an intercept of 34.55 (in the units of the y-axis) and a gradient of -0.95 (in the units of y-axis / units of x-axis).
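To make that equation concrete, we can plug a value of lstat into y = mx + c by hand, using the coefficients printed above (the variable names here are my own):

```r
# predicted medv = intercept + gradient * lstat
intercept <- 34.55
gradient  <- -0.95
lstat_val <- 10
intercept + gradient * lstat_val  # roughly 25, i.e. about $25,000
```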
Let’s see some more information regarding this model:
summary(lm.fit1)

Call:
lm(formula = medv ~ lstat, data = boston_data)

Residuals:
    Min      1Q  Median      3Q     Max
-15.168  -3.990  -1.318   2.034  24.500

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.55384    0.56263   61.41   <2e-16 ***
lstat       -0.95005    0.03873  -24.53   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.216 on 504 degrees of freedom
Multiple R-squared:  0.5441,	Adjusted R-squared:  0.5432
F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16
Okay wait, where is all this coming from? Let’s look at the pieces of information we can get back after fitting a linear model, by calling names(lm.fit1):
 [1] "coefficients"  "residuals"     "effects"       "rank"
 [5] "fitted.values" "assign"        "qr"            "df.residual"
 [9] "xlevels"       "call"          "terms"         "model"
We can extract any piece we want from this pack of information about the linear model, for example the coefficients:
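Here are two equivalent ways to pull out the coefficients; the coef() accessor is generally preferred over reaching into the list directly:

```r
coef(lm.fit1)         # named vector with the (Intercept) and lstat values
lm.fit1$coefficients  # the same values, accessed as a list element
```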
If you are following my free Introduction to Machine Learning course on IQmates, you will have learnt about confidence intervals and prediction intervals. In R, we can use the predict() function to produce these for the prediction of medv at a given value of lstat:
predict(lm.fit1, data.frame(lstat = c(5, 10, 15)), interval = "confidence")
       fit      lwr      upr
1 29.80359 29.00741 30.59978
2 25.05335 24.47413 25.63256
3 20.30310 19.73159 20.87461
predict(lm.fit1, data.frame(lstat = c(5, 10, 15)), interval = "prediction")
       fit       lwr      upr
1 29.80359 17.565675 42.04151
2 25.05335 12.827626 37.27907
3 20.30310  8.077742 32.52846
These lines of code are identical except for the interval we ask for (confidence or prediction). Essentially they say: “Using the model lm.fit1 we have already fit to predict medv from lstat, what are the confidence and prediction intervals for three distinct values of lstat: 5, 10 and 15?” The fit column in the output of predict() gives the medv value the linear model predicts for a given lstat. For instance, when lstat = 5, the model predicts medv to be 29.80359, with a confidence interval of [29.00741, 30.59978] and a prediction interval of [17.565675, 42.04151].
Let us see visually how well our model captured the trend in the data (that is the whole point of building a prediction model, right?):
plot(boston_data$lstat, boston_data$medv) # scatterplot of lstat against medv
abline(lm.fit1, lwd = 3, col = "red") # draw a line with the coefficients of the lm.fit1 model and width 3
The red line is supposed to give us a way to predict medv given any lstat but, as you can see, it does not do a very good job because many data points do not lie even close to the line. This tells you that using only lstat to predict medv is probably not the best approach; maybe we need to include other predictors. You can confirm this suspicion by plotting the residuals. If you see some form of trend in that plot, it means your model is not accurately capturing the trend in the data; maybe you need a more complex one, for example. I covered the theory behind this in the course. Here is how to plot the residuals:
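A minimal sketch of the residual plot, assuming the lm.fit1 model fit above (the zero reference line is my own addition to make any trend easier to spot):

```r
# predicted values on the x-axis, residuals on the y-axis
plot(predict(lm.fit1), residuals(lm.fit1),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)  # dashed horizontal reference line at zero
```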
What this line does is create a scatterplot of the values predicted by lm.fit1 on the x-axis and the residuals of that model on the y-axis. Any sign of a trend is a sign that your model is not complex enough, as is the case in this tutorial’s example.
Great! You have made it to the end of the first tutorial in this series. I hope you were able to follow along. What we did was use only one variable (lstat) to predict the median price of a house (medv) in the Boston area. As we saw, our dataset has 13 other variables, some of which we believe do affect the price of a house, for example the crime rate (crim). In the next tutorial post, we will see how to incorporate these in our linear model.