Classification Part 2: Logistic Regression with Test Data Set

In the previous post, we discussed the basics of Logistic Regression using the Smarket dataset from the ISLR package. Like that post, this one follows the free-to-access Introduction to Machine Learning course I am teaching on IQmates. If anything does not make sense, I bet I covered it in the videos. Classification is Chapter 4 of that course.

We didn’t have a test set to check how well our model was doing, so we just used the whole dataset. Our accuracy was 52.2%, and I said that in this post we would create a training set and a test set to start us on the journey of improving the accuracy of our models. Why should we create a training and a test set, you might ask? Because we want our model to capture the overall trend in the data, so that it is useful as an estimator or predictor when we give it data it has never seen. If our model correctly captured the relationship between the direction of the stock market and the variables we gave it, then tomorrow we could load it up and use it to predict the direction of movement of the stock market, right?

As it stands, we would be right 52.2% of the time. That may look acceptable for a simple model, but it is barely better than guessing, and we can do better. The corresponding 47.8% error rate (100% - 52.2%) is the training error. I am sure it is the training error because, if you recall, when we used the predict() function in the previous post, we simply predicted on every data point we had. In other words, the dataset we used to train our model is the same one we used to predict and test accuracy. If our model were superb, it should have done much better than 52.2%, because it had already seen the data we tested it on. So even on familiar data the model is wrong 47.8% of the time, which is good news in a way because it means there is room to improve it.

What else do we know about training error? It is often an overly optimistic estimate of the test error, i.e. it often underestimates what the actual test error is (test error being the error we get when we apply our model to situations it has never seen). So we might end up with a model that is wrong more than 47.8% of the time, because the training error has lied to us. We need a way to estimate the test error more accurately, hence the need for a test dataset.
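To make the optimism of the training error concrete, here is a minimal simulated sketch (it does not use the Smarket data). The sample size, the number of predictors, the seed, and all object names are my own assumptions for illustration. The labels are generated independently of the predictors, so no model can genuinely beat 50% on new data.

```r
set.seed(1)  # arbitrary seed, only so the run is repeatable

n <- 400  # assumed number of observations
p <- 20   # assumed number of pure-noise predictors

sim <- as.data.frame(matrix(rnorm(n * p), nrow = n))
# Labels drawn at random, unrelated to the predictors
sim$y <- factor(sample(c("Down", "Up"), n, replace = TRUE))

# First half for training, second half held out
in_train <- seq_len(n) <= n / 2

fit <- glm(y ~ ., data = sim, family = binomial, subset = in_train)

# Error rate of the fitted model on a chosen set of rows
error_rate <- function(rows) {
  probs <- predict(fit, sim[rows, ], type = "response")
  preds <- ifelse(probs > 0.5, "Up", "Down")
  mean(preds != sim$y[rows])
}

train_error <- error_rate(in_train)
test_error  <- error_rate(!in_train)
c(train = train_error, test = test_error)
```

In a setup like this the training error typically comes out below the held-out error, even though there is nothing real to learn, which is exactly why it cannot be trusted on its own.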

What we will do is split the dataset in two: one part for training and one that we hold out and use for testing. The held-out set will yield a better estimate of the test error, which is the quantity we are most interested in because it shows how accurately our model handles situations it has never seen before, in our case future dates.

One strategy we can use to split this data is to take the observations from 2001 to 2004 as the training set and the remaining ones as the test set (this is a rather crude way to do it, but it will suffice for our explanations).

train <- (Year < 2005) # logical vector: TRUE where Year < 2005, FALSE otherwise (assumes attach(Smarket) was run, as the bare column names imply)
test_set <- Smarket[!train, ] # rows for which train is FALSE, i.e. Year is 2005
dim(test_set)

[1] 252   9
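If you have not attached the data frame, the same split can be written with explicit column references. This is a sketch assuming the ISLR package is installed; the names is_train and train_set are my own and do not appear elsewhere in this post.

```r
library(ISLR)  # provides the Smarket data frame

# Logical index: TRUE for observations before 2005
is_train <- Smarket$Year < 2005

train_set <- Smarket[is_train, ]   # 2001 to 2004
test_set  <- Smarket[!is_train, ]  # 2005 only

# Sanity checks: the two pieces should cover every row exactly once
nrow(train_set) + nrow(test_set) == nrow(Smarket)
unique(test_set$Year)
dim(test_set)
```

Referring to Smarket$Year directly avoids relying on attach(), which can silently pick up a different variable called Year if one exists in your workspace.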

So we have 252 observations in our test set. Let’s get the directions associated with these test observations:

test_directions <- Direction[!train]

Let’s fit a Logistic Regression model using only the subset of observations with dates before 2005. We can do this using the subset argument.

glm.fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Smarket, family = binomial, subset = train)

Now let us use this trained model to get predicted probabilities for the observations in the test set, a set with data points the model has never seen before:

glm.probabilities <- predict(glm.fit, test_set, type = "response")

Great! So for each of our 252 observations in the test set, we now have the probability that Y = 1, in other words that the day’s movement was “Up” according to R’s coding of the Direction variable (you can use the contrasts() function as in the previous post to check this). Let’s convert the probabilities into “Up” or “Down” labels using a probability of 0.5 as the threshold, as we did previously:

glm.predictions <- rep("Down", 252)
glm.predictions[glm.probabilities > 0.5] = "Up"
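The same thresholding step can also be written with ifelse(). The sketch below uses a handful of made-up probabilities rather than real model output, and the object names are hypothetical.

```r
# Made-up probabilities, purely for illustration (not model output)
example_probs <- c(0.48, 0.52, 0.50, 0.61)

# Strictly greater than 0.5 becomes "Up"; everything else,
# including exactly 0.5, becomes "Down"
example_preds <- ifelse(example_probs > 0.5, "Up", "Down")

# Fixing the levels keeps both classes in later tables
# even if one of them is never predicted
example_preds <- factor(example_preds, levels = c("Down", "Up"))

example_preds
table(example_preds)
```

Note that a probability of exactly 0.5 falls on the “Down” side with this rule, matching the `> 0.5` comparison used in the post.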

Confusion matrix time! How did our model do this time? We need to compare what we predicted with the actual directions. Previously we didn’t have to worry about which data to use at this step, but here we do: we have the test set. We want to compare what we predicted for our 252 observations with what their actual directions were. A few lines above, we created the variable “test_directions” for exactly this reason. We are going to say, “Okay, observation number 20, we predicted that you would be a “Down”, but you are actually an “Up” because we can see your direction in the test dataset, so we will mark you as an error of our model.” The principles of the confusion matrix are well explained in the videos. For now, let’s see what we have got:

table(glm.predictions, test_directions)

               test_directions
glm.predictions Down Up
           Down   77 97
           Up     34 44

This is disappointing 🙁

Our test error is 52% (1 - (77 + 44) / 252). You might think the previous error we got, the 47.8%, was much better than this, but that’s not the case. That one was the training error, and we know it tends to underestimate the test error, which is what we measured here explicitly by holding out some observations and using them for testing.
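The test error can be read straight off the confusion matrix. Here is a small sketch with the four counts typed in by hand from the table for the six-predictor model; the object names are my own.

```r
# Rows are predictions, columns are actual directions.
# matrix() fills column by column, so the first two values
# are the "actual Down" column and the last two the "actual Up" column.
conf_mat <- matrix(
  c(77, 34, 97, 44),
  nrow = 2,
  dimnames = list(predicted = c("Down", "Up"), actual = c("Down", "Up"))
)

# Correct predictions sit on the diagonal
accuracy   <- sum(diag(conf_mat)) / sum(conf_mat)  # 121 / 252, roughly 0.48
test_error <- 1 - accuracy                         # roughly 0.52

round(c(accuracy = accuracy, test_error = test_error), 3)
```

With the real vectors at hand, mean(glm.predictions == test_directions) gives the same accuracy without building the matrix first.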

What we can do is play around with the variables we are using for our logistic regression. If you remember from the previous post, Lag1 had the lowest p-value, and even that wasn’t all that low. This is an indication that it is difficult to find a real relationship between the variables we have and the direction of movement of the stock market. If it were easy, I strongly doubt I would be publishing these posts rather than sitting on a beach in Malibu! To improve our model, maybe we can take away the variables that had high p-values. Let’s model with only Lag1 and Lag2:

glm.fit <- glm(Direction ~ Lag1 + Lag2, data = Smarket, family = binomial, subset = train)
glm.probabilities <- predict(glm.fit, test_set, type = "response")
glm.predictions <- rep("Down", 252)
glm.predictions[glm.probabilities > 0.5] = "Up"
table(glm.predictions, test_directions)

               test_directions
glm.predictions Down  Up
           Down   35  35
           Up     76 106

Now that looks promising. We have correctly predicted the daily movement for 56% of the observations in the test data set ((35 + 106) / 252), compared to the previous 48%. If we dig deeper into the confusion matrix, we see that the logistic regression model predicts an “Up” for a total of 106 + 76 = 182 observations. Of these 182 data points, the model correctly predicted an “Up” for 106 (they had “Up” in the test data set), meaning that on days when the logistic regression model predicts an increase in the market, it has a 58% accuracy rate (106 / 182). This might tell you to buy when your model says the market will be up, because you would be right 58% of the time!
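To check figures like these yourself, it is safer to compute them from the fitted model than to do the arithmetic by hand (operator precedence is easy to get wrong). The sketch below assumes the ISLR package is installed and refits the Lag1 + Lag2 model; names such as is_train, preds, and up_precision are my own.

```r
library(ISLR)  # provides the Smarket data frame

is_train <- Smarket$Year < 2005
test_set <- Smarket[!is_train, ]

# Two-predictor model, trained on pre-2005 observations only
fit <- glm(Direction ~ Lag1 + Lag2, data = Smarket,
           family = binomial, subset = is_train)

probs <- predict(fit, test_set, type = "response")
preds <- factor(ifelse(probs > 0.5, "Up", "Down"), levels = c("Down", "Up"))

conf_mat <- table(predicted = preds, actual = test_set$Direction)

# Share of all test days predicted correctly
accuracy <- mean(preds == test_set$Direction)

# Of the days predicted "Up", the share that really were "Up"
up_precision <- conf_mat["Up", "Up"] / sum(conf_mat["Up", ])

conf_mat
round(c(accuracy = accuracy, up_precision = up_precision), 3)
```

The second quantity is what the "buy when the model says Up" argument relies on, so it is worth computing separately from the overall accuracy.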

That’s logistic regression in a nutshell, well, over two posts. It has helped us explain the concept of holding out a set for testing purposes. Some people hold out a third set, called the validation set, which they use to tune model parameters, keeping the test set to check the accuracy. We will see similar methods when we do Cross Validation and Model Selection.
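As a rough idea of what a three-way split could look like on this dataset, here is a sketch assuming the ISLR package is installed. The year boundaries are my own choice for illustration, not something fixed by the method.

```r
library(ISLR)  # provides the Smarket data frame

year <- Smarket$Year

train_set      <- Smarket[year <= 2003, ]  # fit candidate models here
validation_set <- Smarket[year == 2004, ]  # compare candidates and tune choices here
test_set       <- Smarket[year == 2005, ]  # use once, for the final accuracy check

# Number of observations in each piece
sapply(list(train = train_set,
            validation = validation_set,
            test = test_set), nrow)
```

The idea is that choices such as which lag variables to keep are made by looking at the validation set, so the test set stays untouched until the very end.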