# Classification Part 1: Logistic Regression

Chapter 4 of the free Introduction to Machine Learning course I am teaching on IQmates introduces Classification methods. The videos touch on four methods: Logistic Regression, Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA) and k-Nearest Neighbours (kNN). In this post, we discuss the lab for Logistic Regression. We will use the Stock Market dataset (Smarket) that is part of the ISLR package.

According to the description, the data “consists of percentage returns of the S&P 500 stock market index over 1 250 days, from the beginning of 2001 until the end of 2005. For each date, the data has the percentage return for each of the five previous trading days, Lag1 through Lag5. There is also Volume (the number of shares traded on the previous day in billions), Today (the percentage return on the date in question) and Direction (whether the market was Up or Down on this date).”

Our goal is to use a machine learning algorithm to predict Direction, that is, whether the index moves Up or Down on a given day.

library(ISLR)
names(Smarket)

[1] "Year" "Lag1" "Lag2" "Lag3" "Lag4" "Lag5"
[7] "Volume" "Today" "Direction"

dim(Smarket)

[1] 1250 9

summary(Smarket)

      Year           Lag1                Lag2                Lag3
 Min.   :2001   Min.   :-4.922000   Min.   :-4.922000   Min.   :-4.922000
 1st Qu.:2002   1st Qu.:-0.639500   1st Qu.:-0.639500   1st Qu.:-0.640000
 Median :2003   Median : 0.039000   Median : 0.039000   Median : 0.038500
 Mean   :2003   Mean   : 0.003834   Mean   : 0.003919   Mean   : 0.001716
 3rd Qu.:2004   3rd Qu.: 0.596750   3rd Qu.: 0.596750   3rd Qu.: 0.596750
 Max.   :2005   Max.   : 5.733000   Max.   : 5.733000   Max.   : 5.733000
      Lag4                Lag5              Volume           Today
 Min.   :-4.922000   Min.   :-4.92200   Min.   :0.3561   Min.   :-4.922000
 1st Qu.:-0.640000   1st Qu.:-0.64000   1st Qu.:1.2574   1st Qu.:-0.639500
 Median : 0.038500   Median : 0.03850   Median :1.4229   Median : 0.038500
 Mean   : 0.001636   Mean   : 0.00561   Mean   :1.4783   Mean   : 0.003138
 3rd Qu.: 0.596750   3rd Qu.: 0.59700   3rd Qu.:1.6417   3rd Qu.: 0.596750
 Max.   : 5.733000   Max.   : 5.73300   Max.   :3.1525   Max.   : 5.733000
 Direction
 Down:602
 Up  :648

To fit a logistic regression model, we use the glm() function, which fits generalised linear models, a class of models that includes logistic regression. The syntax is very similar to the linear regression syntax we saw in previous posts covering Chapter 3 of the course. The family = binomial argument tells R that we want Logistic Regression. Remember that logistic regression is designed for predicting one of two options (a binary outcome), for example Yes/No or, in our case, Up/Down. Let’s see how to do that:

glm.fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Smarket, family = binomial)
summary(glm.fit)

Call:
glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
Volume, family = binomial, data = Smarket)

Deviance Residuals:
   Min      1Q  Median      3Q     Max
-1.446  -1.203   1.065   1.145   1.326

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.126000   0.240736  -0.523    0.601
Lag1        -0.073074   0.050167  -1.457    0.145
Lag2        -0.042301   0.050086  -0.845    0.398
Lag3         0.011085   0.049939   0.222    0.824
Lag4         0.009359   0.049974   0.187    0.851
Lag5         0.010313   0.049511   0.208    0.835
Volume       0.135441   0.158360   0.855    0.392

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 1731.2 on 1249 degrees of freedom
Residual deviance: 1727.6 on 1243 degrees of freedom
AIC: 1741.6

Number of Fisher Scoring iterations: 3

Let’s look at the p-values of our logistic model:

summary(glm.fit)$coef[, 4]

(Intercept)    Lag1      Lag2      Lag3      Lag4      Lag5    Volume
0.6006983 0.1452272 0.3983491 0.8243333 0.8514445 0.8349974 0.3924004

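Lag1 has the smallest p-value at 0.1452272, but that is still well above the usual 0.05 cut-off, so there is no clear evidence of a real association between any of these predictors and Direction. With that caveat in mind, we can take a look at the coefficients themselves: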

coef(glm.fit)

(Intercept)          Lag1         Lag2        Lag3        Lag4        Lag5     Volume
-0.126000257 -0.073073746 -0.042301344 0.011085108 0.009358938 0.010313068 0.135440659

A negative coefficient indicates a negative relationship between the predictor (for example Lag1) and the probability that Direction is Up. Since Lag1 is the previous day’s percentage return, its negative coefficient suggests that if the market had a positive return yesterday, it is slightly less likely to go up today.
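
One way to make these coefficients more tangible is to exponentiate them, which turns each one into an odds ratio: the factor by which the odds of “Up” are multiplied when that predictor increases by one unit, with the other predictors held fixed. The sketch below types in the estimates from the output above instead of reading them from glm.fit, so that it runs on its own; the names coefs and odds.ratios are only for illustration.

```r
# Coefficient estimates copied from the coef(glm.fit) output above
coefs <- c(Lag1 = -0.073073746, Lag2 = -0.042301344, Lag3 = 0.011085108,
           Lag4 = 0.009358938, Lag5 = 0.010313068, Volume = 0.135440659)

# exp() converts a log-odds coefficient into an odds ratio
odds.ratios <- exp(coefs)
print(round(odds.ratios, 3))
```

Odds ratios below 1 (here Lag1 and Lag2) correspond to the negative coefficients, and values above 1 to the positive ones. With the fitted model in your session, exp(coef(glm.fit)) should give the same numbers, plus one for the intercept.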

If you recall from the videos, when you apply logistic regression, you get back probabilities of the observation being in each of the two classes. In R, you can use the predict() function, which predicts the probability of a “Yes” or, in our case, the probability that the market will go up, given some values of the variables. The type = "response" option tells R to output probabilities of the form P(Y = 1 | X) as opposed to outputting other information such as the logit (also discussed in the videos).

In later posts, we are going to separate our dataset into training, validation and test sets. The goal of machine learning is to build the algorithm using the training set and then check how well it is doing by using the test set. For now, we have not done the split, so we are going to use predict() on the training data we already gave the model, just to get our heads around the predict() function and how it works (in R, it will automatically use the training set if you do not provide a dataset to the predict() function). Again, I emphasise, this is just for illustrative purposes. Please do not do this on your projects. You need to test how well your algorithm is doing on a dataset it has never seen before (the test set). So without providing a dataset to predict on, let’s just see what the predict() function does:

glm.probabilities <- predict(glm.fit, type = "response")

This line of code simply says: use the predict() function to output probabilities by applying the model to each observation in our training data (remember, this is the default since we didn’t provide predict() with data to use). Remember that glm.fit is a logistic regression model and so is of this form:

$p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$
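
The formula above is written for a single predictor X. Our model has six predictors (Lag1 to Lag5 and Volume), so the same idea extends to one coefficient per predictor. Writing the predictors as $X_1, \dots, X_p$ (notation introduced here just for illustration), the general form is:

$p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}$

Rearranging gives the logit (log-odds), which is linear in the predictors:

$\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$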

The betas are the coefficients the model estimates, which we displayed above with the summary() and coef() functions. When we use the predict() function, we are asking R to apply these coefficients to every observation in the data we give to the function and come up with probabilities using that p(X) equation. For our case, let us see the first ten calculated probabilities P(Y = 1 | X), i.e. the probability that Direction = 1 given that particular observation’s values of the variables Lag1, Lag2, etc.:

glm.probabilities[1:10]

        1         2         3         4         5         6         7         8         9        10
0.5070841 0.4814679 0.4811388 0.5152224 0.5107812 0.5069565 0.4926509 0.5092292 0.5176135 0.4888378
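
To see that predict() is doing nothing mysterious, here is a minimal sketch that reproduces the first probability by hand. The coefficients are copied from the coef(glm.fit) output above. The predictor values for day 1 are not shown anywhere in this post, so treat them as an assumption and check them against Smarket[1, ] yourself; the names beta, x1, log.odds and p.up are only for illustration.

```r
# Coefficient estimates copied from the coef(glm.fit) output above
# (intercept first, then Lag1..Lag5 and Volume)
beta <- c(-0.126000257, -0.073073746, -0.042301344, 0.011085108,
          0.009358938, 0.010313068, 0.135440659)

# ASSUMED values of Lag1..Lag5 and Volume for the first day in Smarket;
# verify with Smarket[1, ] before relying on them
x1 <- c(Lag1 = 0.381, Lag2 = -0.192, Lag3 = -2.624,
        Lag4 = -1.055, Lag5 = 5.010, Volume = 1.1913)

# Linear predictor (log-odds): intercept plus coefficient-weighted predictors
log.odds <- beta[1] + sum(beta[-1] * x1)

# plogis() is the logistic function e^z / (1 + e^z) from base R
p.up <- plogis(log.odds)
print(p.up)
```

If the assumed values match your copy of the data, the result should land very close to the 0.5070841 that predict() reported for observation 1.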

You might be wondering how I know that these probabilities are for the “Up” direction. We said R is calculating P(Y = 1 | X), the probability of Direction = 1 given X. For Logistic Regression, the predict() function will always predict the probability that Y = 1. The question is: how do you know which class is Y = 1? Is Yes the “1” or is it the No? In our case, is Up = 1 or is Down = 1? That’s a very good question, and when you are modelling like we did here, you definitely need to check what your “1” is.

It is important to do so because your interpretation of the results strongly depends on knowing this. For example, for the first observation, P(Y = 1 | X) is 0.5070841. What probability is this? Is it the probability that the Direction was Up on that day (if Up = 1), or the probability that the Direction was Down (if Down = 1)? Without knowing what Y = 1 means, you run the risk of misinterpreting your results.

How do we check this? Luckily, R is a very helpful tool, and you might have figured this out if you followed my posts on Linear Regression. When R creates dummy variables (Direction is changed into a dummy variable as it is a qualitative variable), you can use the contrasts() function to see what coding R used. Let’s do that:

contrasts(Smarket$Direction)

        Up
Down    0
Up      1

What this shows us is that R has chosen Up = 1. So those probabilities it calculated are probabilities of the market going up on each day. Again, to emphasise the need to know the coding, imagine we had assumed that Down = 1. We would be saying, “Okay, so the probability that the market was down on day 1 is 0.5070841”, which is entirely wrong because, according to the actual coding R used, that is the probability of the market being Up. Not knowing the coding also affects the decisions we make. Let’s see how:

Now that we have these probabilities, we want to change them back to Up or Down. What we can do is say: if the probability we get back from the predict() function (the probability of an Up) is greater than 0.5, then say “Up”. Otherwise, say the market is “Down”. [We would have got this completely the wrong way round if we didn’t have the right coding that R used for Direction!] How do we quickly get the Directions back?

We can first create a vector with all “Down” for each observation in our data (we have 1 250 days so let’s create a 1 250 long vector):

glm.predictions <- rep("Down", 1250)

The code says create a vector called glm.predictions with the word “Down” repeated 1250 times. Now that we have done that, for each index in that vector, we want to see what the corresponding probability is in our glm.probabilities vector that we got above there from the predict() function. The thinking is, if the probability at position 1 in our glm.probabilities vector is greater than 0.5, then the prediction at position 1 in our glm.predictions vector should change to “Up”; if not, just keep it as “Down” and move on to the next position and so on for all 1 250 positions. We can easily do this using this code:

glm.predictions[glm.probabilities > 0.5] <- "Up"

This code first evaluates the glm.probabilities > 0.5 part: it goes through each entry in the glm.probabilities vector, checking whether it is greater than 0.5 or not. If the entry is greater than 0.5, it puts a TRUE at that position, and a FALSE if it is not. This means that the expression inside the square brackets is a vector of TRUEs and FALSEs, with TRUEs at the positions where the condition holds, say position (index) 99. The outer part then changes the words in glm.predictions at all positions where there is a TRUE; for example, position 99’s default “Down” is changed to “Up”.
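
If the two-step logic still feels abstract, a toy example may help. The sketch below uses five made-up probabilities (they are not from Smarket) and hypothetical names starting with toy, so you can print every intermediate step. It also shows ifelse(), a base R function that does the same job in one call.

```r
# Five made-up probabilities, just to show the mechanics
toy.probabilities <- c(0.51, 0.48, 0.62, 0.50, 0.39)

# Step 1: the comparison gives one TRUE/FALSE per entry
is.up <- toy.probabilities > 0.5
print(is.up)

# Step 2: start with all "Down", then overwrite the TRUE positions
toy.predictions <- rep("Down", length(toy.probabilities))
toy.predictions[is.up] <- "Up"
print(toy.predictions)

# ifelse() does both steps in a single call
toy.predictions2 <- ifelse(toy.probabilities > 0.5, "Up", "Down")
print(identical(toy.predictions, toy.predictions2))
```

Note that the fourth entry, exactly 0.50, stays “Down” because the comparison is strictly greater than 0.5.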

To get a sense of how well our model has done, the correct and incorrect classifications, we can use the table() function to create a confusion matrix (discussed at length in the videos):

table(glm.predictions, Smarket$Direction)

glm.predictions     Down    Up
Down                 145   141
Up                   457   507

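Correct predictions are the diagonal elements, so our model correctly predicted that the market would go up on 507 days and down on 145 days. That is 652 correct predictions out of 1 250, i.e. 52.2% accuracy. Keep in mind that this was measured on the same data used to fit the model, so it is likely to be optimistic.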
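
The arithmetic can be checked in a few lines. This sketch rebuilds the confusion matrix by typing in the four counts from the table above, so it runs without the dataset; the names confusion, accuracy and baseline are only for illustration. As a point of comparison, it also computes how often you would be right by always guessing the more common class.

```r
# Confusion matrix copied from the table() output above:
# rows are predictions, columns are the actual Direction
confusion <- matrix(c(145, 457, 141, 507), nrow = 2,
                    dimnames = list(predicted = c("Down", "Up"),
                                    actual = c("Down", "Up")))

# Correct predictions sit on the diagonal
accuracy <- sum(diag(confusion)) / sum(confusion)
print(accuracy)

# Accuracy of always guessing the more common actual class
baseline <- max(colSums(confusion)) / sum(confusion)
print(baseline)
```

With the vectors from this post in your session, mean(glm.predictions == Smarket$Direction) should give the same accuracy without building the matrix by hand.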

These are the basics of how Logistic Regression works. How can we improve this accuracy? Let us go to the next post, where we start splitting our dataset into training and test sets.