This lab on Logistic Regression in R comes from p. 154-161 of "Introduction to Statistical Learning with Applications in R" by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. It was re-implemented in Fall 2016 in `tidyverse`

format by Amelia McNamara and R. Jordan Crouser at Smith College.

Want to follow along on your own machine? Download the rMarkdown or Jupyter Notebook version.

Let's return to the `Smarket`

data from `ISLR`

.

```
library(ISLR)
library(dplyr)
library(tidyr)
names(Smarket)
dim(Smarket)
summary(Smarket)
```

In this lab, we will fit a logistic regression model in order to predict `Direction`

using `Lag1`

through `Lag5`

and `Volume`

. The `glm()`

function fits **generalized linear models**, a class of models that includes logistic regression.

The syntax of the `glm()`

function is similar to that of `lm()`

, except that we must pass in the argument `family=binomial`

in order to tell `R`

to run a logistic regression rather than some other type of generalized linear model.

```
glm_fit = glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,
data = Smarket,
family = binomial)
summary(glm_fit)
```

The smallest p-value here is associated with `Lag1`

. The negative coefficient
for this predictor suggests that if the market had a positive return yesterday,
then it is less likely to go up today. However, at a value of 0.145, the p-value
is still relatively large, and so there is no clear evidence of a real association
between `Lag1`

and `Direction`

.

If you want, you can use helper methods like `coefficients()`

and `residuals()`

in order to access specific parts of this fitted model:

```
coefficients(glm_fit)
residuals(glm_fit)
```

`predict()`

function can be used to predict the probability that the
market will go up, given values of the predictors. The `type="response"`

option tells `R`

to output probabilities of the form `P(Y = 1|X)`

, as opposed
to other information such as the `logit`

. If no data set is supplied to the
`predict()`

function, then the probabilities are computed for the training
data that was used to fit the logistic regression model.

```
glm_probs = data.frame(probs = predict(glm_fit, type="response"))
head(glm_probs)
```

`contrasts()`

function indicates that `R`

has created a dummy variable with
a 1 for `Up`

.

```
contrasts(Smarket$Direction)
```

`Up`

or `Down`

. The following two commands create a vector
of class predictions based on whether the predicted probability of a market
increase is greater than or less than 0.5.

```
glm_pred = glm_probs %>%
mutate(pred = ifelse(probs>.5, "Up", "Down"))
```

`Down`

elements. The second line
transforms to `Up`

all of the elements for which the predicted probability of a
market increase exceeds 0.5. Given these predictions, we can `count()`

how many observations were correctly or incorrectly classified.

```
glm_pred = cbind(Smarket, glm_pred)
glm_pred %>%
count(pred, Direction) %>%
spread(Direction, n, fill = 0)
glm_pred %>%
summarize(score = mean(pred == Direction))
```

The diagonal elements of the confusion matrix indicate correct predictions,
while the off-diagonals represent incorrect predictions. Hence our model
correctly predicted that the market would go up on 507 days and that
it would go down on 145 days, for a total of 507 + 145 = 652 correct
predictions. The `mean()`

function can be used to compute the fraction of
days for which the prediction was correct. In this case, logistic regression
correctly predicted the movement of the market 52.2% of the time.

At first glance, it appears that the logistic regression model is working
a little better than random guessing. But remember, this result is misleading
because we trained and tested the model on the same set of 1,250 observations.
In other words, 100− 52.2 = 47.8% is the **training error rate**. As we
have seen previously, the training error rate is often overly optimistic — it
tends to underestimate the *test* error rate.

In order to better assess the accuracy of the logistic regression model in this setting, we can fit the model using part of the data, and then examine how well it predicts the held out data. This will yield a more realistic error rate, in the sense that in practice we will be interested in our model’s performance not on the data that we used to fit the model, but rather on days in the future for which the market’s movements are unknown.

Like we did with KNN, we will first create a vector corresponding to the observations from 2001 through 2004. We will then use this vector to create a held out data set of observations from 2005.

```
train = Smarket %>%
filter(Year < 2005)
test = Smarket %>%
filter(Year >= 2005)
```

```
glm_fit = glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,
data = train,
family = binomial)
glm_probs = data.frame(probs = predict(glm_fit,
newdata = test,
type="response"))
```

```
glm_pred = glm_probs %>%
mutate(pred = ifelse(probs>.5, "Up", "Down"))
glm_pred = cbind(test, glm_pred)
glm_pred %>%
count(pred, Direction) %>%
spread(Direction, n, fill = 0)
glm_pred %>%
summarize(score = mean(pred == Direction),
recip = mean(pred != Direction))
```

The $!=$ notation means **not equal to**, and so the last command computes
the test set error rate. The results are rather disappointing: the test error
rate is 52%, which is worse than random guessing! Of course this result
is not all that surprising, given that one would not generally expect to be
able to use previous days’ returns to predict future market performance.
(After all, if it were possible to do so, then the authors of this book [along with your professor] would probably
be out striking it rich rather than teaching statistics.)

We recall that the logistic regression model had very underwhelming pvalues
associated with all of the predictors, and that the smallest p-value,
though not very small, corresponded to `Lag1`

. Perhaps by removing the
variables that appear not to be helpful in predicting `Direction`

, we can
obtain a more effective model. After all, using predictors that have no
relationship with the response tends to cause a deterioration in the test
error rate (since such predictors cause an increase in variance without a
corresponding decrease in bias), and so removing such predictors may in
turn yield an improvement.

In the space below, refit a logistic regression using just `Lag1`

and `Lag2`

, which seemed to have the highest predictive power in the original logistic regression model.

```
glm_fit_lag1_lag2 = # Write your code to fit the new model here
# -----------------------------------
# This will test your new model; you
# don't need to change anything below
# -----------------------------------
glm_probs = data.frame(probs = predict(glm_fit_lag1_lag2,
newdata = test,
type = "response"))
glm_pred = glm_probs %>%
mutate(pred = ifelse(probs>.5, "Up", "Down"))
glm_pred = cbind(test, glm_pred)
glm_pred %>%
count(pred, Direction) %>%
spread(Direction, n, fill = 0)
glm_pred %>%
summarize(score = mean(pred == Direction))
```

Now the results appear to be more promising: 56% of the daily movements have been correctly predicted. The confusion matrix suggests that on days when logistic regression predicts that the market will decline, it is only correct 50% of the time. However, on days when it predicts an increase in the market, it has a 58% accuracy rate.

Finally, suppose that we want to predict the returns associated with **particular
values** of `Lag1`

and `Lag2`

. In particular, we want to predict Direction on a
day when `Lag1`

and `Lag2`

equal 1.2 and 1.1, respectively, and on a day when
they equal 1.5 and −0.8. We do this using the `predict()`

function.

```
predict(glm_fit_lag1_lag2,
newdata = data.frame(Lag1 = c(1.2, 1.5),
Lag2 = c(1.1, -0.8)),
type = "response")
```

`Lag1`

and `Lag2`

, and then post to #lab4 about what you found. If you're feeling adventurous, try fitting models with other subsets of variables to see if you can find a better one!