Epi Vignettes: Cohort study

Post date: May 19, 2015

A brief synopsis of epidemiologic study design and methods with sample analytic code in R.

This is the first post in a new series I’m beginning on analysis techniques for a given study design. As I have become more exposed to a variety of study designs, the natural question is “How do I analyze them?” I intend to use these posts to accomplish a few things:

  1. Briefly describe the study design.
  2. Qualitatively talk about the analysis strategy.
  3. Quantitatively demonstrate the analysis and provide sample R code.

Too often, epidemiology texts present analysis methods separately from study design, but clearly they go hand in hand. These posts document my analytic strategy for specific research questions that I encounter in real life. That is, they may or may not be appropriate and applicable to your particular study, but I hope they will help in some capacity. At the end of the day, I wrote these to further my own knowledge of these common study types, and by all means, if you don’t agree with (or I’ve erred in) the analytic strategy, please send me an email. There is also a strong caveat I need to mention upfront: the methods below have many underlying assumptions and properties that ultimately allow the inference to be valid; some are enumerated, but many are not. When in doubt, consult an epidemiologist or a (bio)statistician.

The layout of these vignettes will be fairly well structured. I begin this series by considering a cohort study.

  • Study design: Cohort. Sampling in a cohort is done by exposure, with participants followed up over time to determine who develops the outcome. The cohort can be assembled in the present and followed into the future (prospective), or have already occurred (retrospective). Further, the cohort may be a fixed group of people (closed) or comprise a transient population (open). A typical scenario is to recruit participants based on the presence or absence of an exposure, track them over time to see who develops the outcome, and compute incidence measures.
  • Data description: Assume a binary exposure, a binary outcome, and several covariates.
  • Goal of analysis: Describe incidence of outcome within the cohort. Compare outcome between exposure groups, both crude and adjusted for potential confounding.
  • Statistical techniques: Regression techniques are appropriate when adjusting or controlling for potential confounding. Otherwise, a standard contingency table can be used to directly calculate incidence (cumulative or incidence rate) and ratio measures of risk (relative risk or rate ratios).
    • If only crude estimates are needed: The choice of incidence measure can be as arbitrary as personal preference, or driven by pragmatic need.
      • Cumulative incidence: Represents the proportion (or probability) of having the outcome (numerator) over the initial population at risk (denominator). It is most appropriate when follow-up is complete.
      • Incidence rates: Represents a density (or rate) of outcome occurrence (numerator) over the person-time contributed by all study participants (denominator). It is useful when there is loss to follow-up (as is common in a cohort study). Person-time is calculated as time to event or to censoring (loss to follow-up or study conclusion).
    • If the outcome is continuous: Linear regression. Incidence is not directly applicable here; it would need to be defined by a threshold level or other categorization of the outcome. For example, instead of blood pressure as the outcome, dichotomize into hypertensive vs. non-hypertensive, and then compare by exposure group(s).
    • If the outcome is dichotomous (occurred/did not occur):
      • Logistic regression: Cumulative incidence only. Assumes complete follow-up for all study participants (no withdrawals) and that time to event is not important. Estimates are on the log-odds scale (odds ratios, OR) and therefore need to be interpreted in that context (i.e., how well do they approximate the relative risk, RR). In some cases, such as a rare outcome (<10% incidence), little loss to follow-up, and time-independent outcomes, the estimates from the logistic regression will be a robust approximation of the RR: they will be similar to those from the log-binomial and Cox proportional hazards models described below. This technique is often used despite violation of its assumptions because of its ease of use and the interpretability of its results; therefore, if using this approach, the investigator needs to justify why it is appropriate.
      • Log-binomial regression: Same assumptions and limitations as logistic regression, except it uses a log link function instead of a logit (log-odds) link. It directly estimates the RR rather than the OR. Its use is more appropriate when the OR would be a biased approximation of the RR.
      • Cox proportional hazards regression: When the exposure is assumed to immediately increase the risk of the outcome compared to the baseline (unexposed) state, time-to-event (a.k.a. survival) analysis is appropriate. Unlike the logistic regression approach, this analysis takes into account the time at which an event occurred, as well as the time at which participants were censored (where the outcome did not occur). The hazard ratio (HR) approximates the RR. Be sure to check the proportional hazards assumption: the exposure multiplies risk by a constant factor over time compared to the unexposed. If the assumption fails (the effect of the exposure is time dependent), try interacting the exposure with the time variable. This technique is also useful for time-varying exposures/covariates.
      • Poisson (rate) regression: Incidence rate only. When the investigator wishes to measure a true rate (one that incorporates a time element, like person-years), this model produces incidence rate ratio (IRR) estimates that are interpretable like the OR/HR estimates. The outcome becomes a count of the number of events that occurred, and person-time contributed in the study becomes an offset term (specified via the offset option as a log-transformed variable). Be sure to check for overdispersion, which can be spotted when the residual deviance is much greater than the residual degrees of freedom.
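
Before turning to models, the two crude incidence measures above can be computed by hand. A minimal base-R sketch, using a small hypothetical cohort (all numbers are made up for illustration):

```r
# Hypothetical closed cohort: 10 exposed, 10 unexposed, max 5 years follow-up
event   <- c(1,1,1,0,0,0,0,0,0,0,  1,0,0,0,0,0,0,0,0,0)  # 1 = outcome occurred
ptime   <- c(2,3,1,5,5,5,5,5,5,5,  4,5,5,5,5,5,5,5,5,5)  # years to event/censoring
exposed <- rep(c(1, 0), each = 10)                        # 1 = exposed group

# Cumulative incidence: events / initial population, per group
ci_exp   <- sum(event[exposed == 1]) / sum(exposed == 1)  # 3/10 = 0.3
ci_unexp <- sum(event[exposed == 0]) / sum(exposed == 0)  # 1/10 = 0.1
ci_exp / ci_unexp                                         # relative risk = 3

# Incidence rate: events / person-time, per group
ir_exp   <- sum(event[exposed == 1]) / sum(ptime[exposed == 1])
ir_unexp <- sum(event[exposed == 0]) / sum(ptime[exposed == 0])
ir_exp / ir_unexp                                         # rate ratio
```

Note how the two ratio measures differ: the rate ratio credits the exposed group with less person-time (events occurred early), so it exceeds the relative risk here.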

Sample code in R

Cumulative incidence relative risk (package:epitools)
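
A minimal sketch, assuming the epitools package is installed and a 2x2 table of hypothetical counts, with the reference (unexposed) group in the first row and the no-event column first, as riskratio() expects:

```r
library(epitools)

# Hypothetical counts: rows = exposure (reference first), columns = outcome (no event first)
counts <- matrix(c(90, 10,    # unexposed: 10 events out of 100
                   70, 30),   # exposed:   30 events out of 100
                 nrow = 2, byrow = TRUE,
                 dimnames = list(exposure = c("unexposed", "exposed"),
                                 outcome  = c("no", "yes")))
rr <- riskratio(counts) # Wald relative risk by default
rr$measure              # point estimate and confidence interval
```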


Incidence rate ratio (package:epitools)
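
A minimal sketch, again assuming epitools is installed and hypothetical numbers; rateratio() accepts an r x 2 table with event counts in the first column and person-time in the second, reference (unexposed) group first:

```r
library(epitools)

# Hypothetical events and person-years: rows = exposure (reference first)
ratetab <- matrix(c(10, 1000,   # unexposed: 10 events over 1000 person-years
                    30,  950),  # exposed:   30 events over  950 person-years
                  nrow = 2, byrow = TRUE,
                  dimnames = list(exposure = c("unexposed", "exposed"),
                                  outcome  = c("events", "persontime")))
irr <- rateratio(ratetab) # incidence rate ratio with exact confidence interval
irr$measure               # point estimate and confidence interval
```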


Linear regression

model = lm(outcome~as.factor(exposure)+covariates, data=cohort)
summary(model) #summary of the model
round(coef(model),2) #coefficient estimates: change in outcome
round(confint(model),2) #confidence intervals

Logistic regression

model = glm(outcome~as.factor(exposure)+covariates, data=cohort, family=binomial(link="logit"))
summary(model) #summary of the model
round(exp(coef(model)),2) #coefficient estimates: odds ratios
round(exp(confint(model)),2) #confidence intervals

Log-binomial regression

model = glm(outcome~as.factor(exposure)+covariates, data=cohort, family=binomial(link="log"))
summary(model) #summary of the model
round(exp(coef(model)),2) #coefficient estimates: relative risks
round(exp(confint(model)),2) #confidence intervals

Cox proportional hazards regression (package:survival)

model = coxph(Surv(timevar, outcome)~as.factor(exposure)+covariates, data=cohort)
summary(model) #summary of the model
round(exp(coef(model)),2) #coefficient estimates: hazard ratios
round(exp(confint(model)),2) #confidence intervals
cox.zph(model) #check proportional hazard assumption, e.g., violated if p<0.05
plot(survfit(Surv(timevar,outcome)~as.factor(exposure), data=cohort)) #survival plot

Poisson (rate) regression

model = glm(outcome~as.factor(exposure), offset=log(timevar), data=cohort, family=poisson())
summary(model)$deviance/summary(model)$df.residual #overdispersion check: values much greater than 1 suggest overdispersion
model = glm(outcome~as.factor(exposure), offset=log(timevar), data=cohort, family=quasipoisson()) #use if overdispersed
summary(model) #summary of the model
round(exp(coef(model)),2) #coefficient estimates: rate ratios
round(exp(confint(model)),2) #confidence intervals