Statistics Tutorial: Logistic regression

Introduction

Logistic regression is a statistical model for the probability of a binary outcome as a function of one or more predictor variables. It is widely used wherever the response is a yes/no event, such as survival, disease status, or presence/absence.

Key Concepts

  • The response is binary (coded 0/1), and the model estimates the probability that it equals 1.
  • It is based on the logistic function, which maps any real-valued number to a value between 0 and 1.
  • The predictor variables in logistic regression can be continuous or categorical.
  • Each coefficient gives the change in the log-odds of the outcome for a one-unit increase in its predictor, holding any other predictors constant.
  • Standard logistic regression handles binary outcomes; extensions such as multinomial logistic regression handle multi-class classification problems.
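
To make the logistic function concrete, here is a minimal sketch in R. The helper name `logistic` and the values in `eta` are illustrative assumptions, not part of the tutorial's analysis; `plogis()` and `qlogis()` are the base R equivalents of the logistic function and its inverse (the logit).

```r
# Logistic (inverse-logit) function written out by hand (hypothetical helper)
logistic <- function(eta) 1 / (1 + exp(-eta))

# Arbitrary values of the linear predictor, chosen only for illustration
eta <- c(-5, -1, 0, 1, 5)
p <- logistic(eta)

# Every value is mapped into the open interval (0, 1); eta = 0 maps to 0.5
round(p, 3)

# Base R provides the same function as plogis() and its inverse as qlogis()
all.equal(p, plogis(eta))
all.equal(qlogis(p), eta)
```

Because the mapping is monotonic, a positive coefficient always raises the predicted probability and a negative coefficient always lowers it.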

Worked example (PPDAC: Problem, Plan, Data, Analysis, Conclusion)

Problem & Plan

  • What factors influence survival in song sparrows?

Data

Data file: songsparrow.csv

The song sparrow population on the island of Mandarte has been studied for many years by Jamie Smith, Peter Arcese, and collaborators. The birds were measured and banded, and their fates on the island have been recorded over many years. Here we will look for evidence of natural selection using the relationship between phenotypes and survival.

The data file songsparrow.csv gives survival of young-of-the-year females over their first winter (1 = survived, 0 = died). The file includes measurements of beak and body dimensions: body mass (g), wing length, tarsus length, beak length, beak depth, and beak width (all in mm), along with year of birth, sex, and survival. These data were analyzed previously by D. Schluter and J. N. M. Smith (1986, Evolution 40: 221-231).

library(ggplot2)
library(visreg)
library(MASS)

x <- read.csv("files/songsparrow.csv")

# Show the header of the data
head(x)
  mass wing tarsus blength bdepth bwidth year sex survival
1 23.7 67.0   17.7     9.1    5.9    6.8 1978   f        1
2 23.1 65.0   19.5     9.5    5.9    7.0 1978   f        0
3 21.8 65.2   19.6     8.7    6.0    6.7 1978   f        0
4 21.7 66.0   18.2     8.4    6.2    6.8 1978   f        1
5 22.5 64.3   19.5     8.5    5.8    6.6 1978   f        1
6 22.9 65.8   19.6     8.9    5.8    6.6 1978   f        1

Plot the data.

# Treat year as a categorical variable
x$year <- as.character(x$year)

# Plot survival against tarsus length
ggplot(x, aes(tarsus, survival)) +
        geom_jitter(color = "blue", 
                    size = 3, height = 0.04, 
                    width = 0, alpha = 0.5) +
        labs(x = "Tarsus length (mm)", y = "Survival") + 
        theme_classic()

Visualize with a trend line.

# add trend line with geom_smooth()
ggplot(x, aes(tarsus, survival)) +
        geom_jitter(color = "blue", size = 3, 
                    height = 0.04, width = 0, alpha = 0.5) +
        geom_smooth(method = "loess", linewidth = 1, 
                    col = "red", lty = 2, se = FALSE) +
        labs(x = "Tarsus length (mm)", y = "Survival") + 
        theme_classic()
`geom_smooth()` using formula = 'y ~ x'

Analysis

Fit the model.

# Fit generalized linear model
z <- glm(formula = survival ~ tarsus, 
          family = binomial(link="logit"), 
          data = x)
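
Written out, the `glm()` call above fits the following model. This is a sketch in standard notation; the symbols \beta_0, \beta_1, and p_i are introduced here for illustration and do not appear in the original analysis.

```latex
\[
\mathrm{logit}(p_i) = \log\frac{p_i}{1 - p_i} = \beta_0 + \beta_1\,\mathrm{tarsus}_i,
\qquad
p_i = \Pr(\mathrm{survival}_i = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1\,\mathrm{tarsus}_i)}}
\]
```

The model is linear on the log-odds scale, which is why the fitted curve on the probability scale is S-shaped rather than straight.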

Visualize the model.

# Visualize model fit (data points added with points() )
visreg(z, xvar = "tarsus", scale = 'response',
        rug = FALSE, ylim = c(-.1, 1.1))

points(jitter(survival, 0.2) ~ tarsus, data = x, 
        pch = 1, col = "blue", cex = 1, lwd = 1.5)

Examine the model results.

# Estimate regression coefficient for tarsus
summary(z)

Call:
glm(formula = survival ~ tarsus, family = binomial(link = "logit"), 
    data = x)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  24.6361     6.7455   3.652 0.000260 ***
tarsus       -1.2578     0.3437  -3.659 0.000253 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 200.95  on 144  degrees of freedom
Residual deviance: 185.04  on 143  degrees of freedom
AIC: 189.04

Number of Fisher Scoring iterations: 4
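
The coefficients are on the log-odds scale, so they are easier to read once back-transformed. The sketch below copies the two estimates from the `summary(z)` output above and converts them by hand; the variable names and the example tarsus lengths (19 and 20 mm) are assumptions chosen for illustration.

```r
# Coefficients copied from the summary(z) output above
b0 <- 24.6361   # intercept (log-odds scale)
b1 <- -1.2578   # slope for tarsus (change in log-odds per mm)

# Odds ratio: multiplicative change in the odds of survival per extra mm of tarsus
odds_ratio <- exp(b1)

# Predicted survival probability at two illustrative tarsus lengths (mm)
tarsus_new <- c(19, 20)
p_hat <- plogis(b0 + b1 * tarsus_new)

# Tarsus length at which the predicted survival probability is 0.5
tarsus_50 <- -b0 / b1

round(c(odds_ratio = odds_ratio, tarsus_50 = tarsus_50), 3)
round(p_hat, 3)
```

The odds ratio is roughly 0.28, meaning each additional millimetre of tarsus length is associated with the odds of survival falling to about 28% of their previous value. With the fitted object available, `exp(coef(z))` and `predict(z, newdata = data.frame(tarsus = c(19, 20)), type = "response")` should give the same quantities without copying numbers by hand.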

# Test null hypothesis of zero slope (likelihood ratio chi-square test)
anova(z, test = "Chisq")
Analysis of Deviance Table

Model: binomial, link: logit

Response: survival

Terms added sequentially (first to last)

       Df Deviance Resid. Df Resid. Dev Pr(>Chi)    
NULL                     144     200.95             
tarsus  1   15.908       143     185.04 6.65e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
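
The test in the analysis of deviance table can be reproduced by hand from the two deviances reported by `summary(z)`. This is a sketch using the rounded values printed above, so the result agrees with the table only to rounding; the variable names are illustrative.

```r
# Deviances copied from the summary(z) output
null_dev  <- 200.95   # intercept-only model, 144 df
resid_dev <- 185.04   # model with tarsus, 143 df

# Likelihood ratio statistic: the drop in deviance from adding tarsus
lr_stat <- null_dev - resid_dev

# One parameter was added, so compare against chi-square with 1 df
p_value <- pchisq(lr_stat, df = 1, lower.tail = FALSE)

lr_stat
p_value
```

The degrees of freedom for the test equal the difference in residual degrees of freedom between the two models (144 - 143 = 1).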

# Calculate Nagelkerke's pseudo-R^2
library(fmsb)
NagelkerkeR2(z)
$N
[1] 145

$R2
[1] 0.1385605
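
For readers who want to see where this number comes from, the sketch below recomputes Nagelkerke's R^2 from the deviances. It assumes an ungrouped 0/1 response, in which case each deviance equals -2 times the corresponding log-likelihood, and it uses the rounded deviances printed by `summary(z)`, so it matches `NagelkerkeR2(z)` only approximately. The variable names are illustrative.

```r
n         <- 145      # number of observations
null_dev  <- 200.95   # deviance of the intercept-only model
resid_dev <- 185.04   # deviance of the model with tarsus

# Cox and Snell R^2, based on the likelihood ratio of the two models
cox_snell <- 1 - exp((resid_dev - null_dev) / n)

# Largest value Cox and Snell R^2 can reach for these data
max_cox_snell <- 1 - exp(-null_dev / n)

# Nagelkerke's R^2 rescales Cox and Snell so that the maximum is 1
nagelkerke <- cox_snell / max_cox_snell
round(nagelkerke, 3)
```

This is a pseudo-R^2: it measures improvement in likelihood over the null model on a 0 to 1 scale and is not a literal proportion of variance explained.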

Conclusion

Report the results.

We found a highly significant negative effect of tarsus length on overwinter survival in female song sparrows (logistic regression, likelihood ratio test: chi-square = 15.9, df = 1, n = 145, p < 0.0001; Nagelkerke’s pseudo-R^2 = 0.139).

References

GLM page for Ed’s c7041

Wolff, A., Gooch, D., Montaner, J.J.C., Rashid, U., Kortuem, G., 2016. Creating an Understanding of Data Literacy for a Data-driven Society. The Journal of Community Informatics 12. https://doi.org/10.15353/joci.v12i3.3275

Schluter, D., Smith, J.N.M., 1986. Natural Selection on Beak and Body Size in the Song Sparrow. Evolution 40, 221–231. https://doi.org/10.1111/j.1558-5646.1986.tb00465.x

Nagelkerke, N., 1991. A note on a general definition of the coefficient of determination. Biometrika 78, 691–692.

Faraway, J.J., 2006. Extending the Linear Model with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models. Chapman and Hall.