Neal D. Goldstein, PhD, MBI



Jan 2, 2019

Demonstrating report generation using R and Sweave

Editor's Note: Last month, our guest blogger, Daniel Vader, showed us a beautiful example of a report generated entirely within R using the Sweave function. In this second post, Dan shares the code that created the report.

By way of background for those new to report generation within R, Sweave is a function that allows reports to be written and compiled entirely within R. This is an extremely useful timesaver for those who use R to analyze data and then wish to present the written results to an audience, especially if multiple reports need to be created over time. It is also possible to generate scientific articles this way and, importantly, to add a level of reproducibility and transparency that is not otherwise possible in traditional publishing, where the manuscript is separated from the computing code.

The Sweave function uses the LaTeX (pronounced la-tech) markup language for word processing. Unlike the files produced by a typical WYSIWYG word processor (i.e., formatted on screen, such as Microsoft Word), LaTeX documents are written entirely in plain text, with markup tags that tell the LaTeX compiler how to format the text. Sweave is an intermediary in the process. That is, both the report and the data analysis are written together in R, and Sweave then calls the TeX compiler to create the finished product, for example, a PDF file. As the data change, the report is dynamically updated: no more re-running analyses or regenerating graphics or tables. All one needs to do is recompile the report in R.
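For those working outside of RStudio (where the Compile PDF button handles everything), a minimal sketch of the two-step process from the R console might look like the following, assuming the source file is saved as report.Rnw (a file name used here purely for illustration):

Sweave("report.Rnw") # run the R chunks and weave the results into report.tex
tools::texi2pdf("report.tex") # compile the LaTeX into report.pdf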

As a refresher of the finished product, here is a link to Dan's post from last month. Note that I have presented it on this blog as a webpage; the code shared below generates a PDF when compiled within RStudio.

Again, my humble thanks to Dan for providing the wonderful material and allowing me to share it with the world. Dan can be contacted at dtv28 at drexel dot edu.

\documentclass[10pt]{article}

% Sweave required packages %
\usepackage{booktabs}
\usepackage{longtable}
\usepackage{array}
\usepackage{multirow}
\usepackage[table]{xcolor}
\usepackage{wrapfig}
\usepackage{float}
\usepackage{colortbl}
\usepackage{pdflscape}
\usepackage{tabu}
\usepackage{threeparttable}
\usepackage{threeparttablex}
\usepackage[normalem]{ulem}
\usepackage{makecell}
\usepackage{amsmath}

% Load required R packages
<<echo=FALSE, results=hide>>=
library(knitr)
library(kableExtra)
library(dplyr)
library(ggplot2)
@

\begin{document}
\SweaveOpts{concordance=TRUE}

\title{Digging into relationships between observed outcomes, sensitivity/specificity, and true prevalence: a teaching resource}
\author{Daniel Vader}
\maketitle

\section{Introduction}

\paragraph{} Whenever I introduce the topic of misclassification to a class, I will invariably show students a 2x2 test validity table with the true classification of a condition in the columns and the test-based classification in the rows (Table 1). Since my lesson on misclassification comes after I introduce students to the basics of probability, I present measures of validity -- sensitivity (Se), specificity (Sp), positive predictive value (PPV), and negative predictive value (NPV) -- in terms of conditional probability. We conclude by discussing how wildly positive predictive value can change depending on the true prevalence of the outcome, hammering home some of the epistemological consequences of misclassification. 


<<echo=FALSE, results=tex>>=
# Construct table as data frame
ex.table <- as.data.frame(matrix(c("Test", " ", "$+$", "$-$", "$T_+$", "$F_-$", "$F_+$", "$T_-$"), nrow =2, ncol=4, byrow = F))
names(ex.table) <- c(" ", " ", "$+$", "$-$")

# Convert to latex
kable(ex.table, "latex", caption = "Simple 2x2 validity table", booktabs = T, escape=F) %>%
    add_header_above(c(" " = 2, "Truth" = 2)) %>%
    kable_styling(latex_options="hold_position")
@

\paragraph{} Perhaps this lesson sounds familiar. I have seen variants of it (usually excluding discussion of conditional probability) in courses that I took as a master's student, in graduate and undergraduate courses where I was a teaching assistant, and finally in the materials left to me by the prior instructor of a graduate-level introduction to epidemiology and biostatistics course that I have now taught for three years. Perhaps the lesson is so prevalent in my academic circle because, at its core, it is a good lesson: an easy gateway to the beautifully tangled world of measurement error, where example data are neatly formatted in a small table and probabilities can be described using simple proportions.

\paragraph{} However, while I continue to believe that this approach serves as a good introduction to the basic concepts of misclassification, what may limit more curious students is that my discussion of the \textit{relationship} between measures of validity is fairly superficial. I, or another teacher, may communicate the hard and fast rule that ``When true prevalence decreases, PPV also decreases,'' but we may not directly give students the tools to understand the relationship between parameters more thoroughly (``Here is how we can predict PPV based on sensitivity, specificity, and prevalence'').

\paragraph{} This article is intended to be a bridge between very basic lessons on misclassification and a relational understanding of observed prevalence (P*), positive predictive value, true prevalence (P), sensitivity, and specificity. The article assumes that your students have already had a lesson in basic probability, including the rule of subtraction, $Pr(A)=1-Pr(A')$, and the rule of multiplication, $Pr(A \cap B)=Pr(A)Pr(B|A)$. I understand that, when a particular cohort has a vocal phobia of basic algebra, avoiding discussion of the relationship between PPV or P* and P, Se, and Sp is a tempting shortcut. But we should be prepared to expect more of our students on this topic, or at the very least have materials at hand to give questioning students the tools they need to satisfy their curiosity.

\section{Deriving and describing relationships}

<<echo=FALSE, results=tex>>=
# Set vectors for new row and column
rtot <- c("$\\mathbf{R_+}$", "$\\mathbf{R_-}$")
ctot <- c(" ", "Total", "$\\mathbf{C_+}$", "$\\mathbf{C_-}$", "\\textbf{N}")

# Add new vectors to data frame
ex.table2 <- cbind(ex.table, rtot) 
ex.table2[] <- lapply(ex.table2, as.character)
ex.table2 <- rbind(ex.table2, ctot)
names(ex.table2) <- c(" ", " ", "$+$", "$-$", "Total")

# Latex
kable(ex.table2, "latex", caption = "Validity table with row and column totals", booktabs = T, escape=F) %>%
    add_header_above(c(" " = 2, "Truth" = 2)) %>%
    kable_styling(latex_options="hold_position")
@

\paragraph{} Let us begin by extending Table 1 so that we have the notation to more simply discuss the marginal totals (Table 2). With notation for the row and column totals established, we can now ascribe meaning to each of the cells in the table. You may already be familiar with a similar notation for defining measures of validity, but sometimes a refresher can be useful:
\begin{itemize}
  \item{$T_+ =$ true positives, $F_- =$ false negatives, $T_+ + F_- = C_+ =$ all subjects who actually have the condition}
  \item{$F_+ =$ false positives, $T_- =$ true negatives, $F_+ + T_- = C_- =$ all subjects who actually do not have the condition}
  \item{$R_+$ and $R_-$ represent the total number of subjects who tested positive and tested negative, respectively.}
  \item{$N =$ the total number of subjects.}
\end{itemize}

\paragraph{} Next, let us define prevalence (P), observed prevalence (P*), sensitivity (Se), specificity (Sp), and positive predictive value (PPV) using both proportions and probabilities. If you learned about or best remember these measures using proportions, take a moment to familiarize yourself with the probabilities and what they mean.

\begin{gather*}
  P=Pr(C_+)=\frac{T_+ + F_-}{N}\\
  P^*=Pr(R_+)=\frac{T_+ + F_+}{N}\\
  Se=Pr(R_+ | C_+)=\frac{T_+}{T_+ + F_-}\\
  Sp=Pr(R_- | C_-)=\frac{T_-}{T_- + F_+}\\
  PPV=Pr(C_+ | R_+)=\frac{T_+}{T_+ + F_+}
\end{gather*}

\subsection{Positive predictive value}
\paragraph{} By revisiting the rule of multiplication, $Pr(A \cap B)=Pr(A)Pr(B|A)$, and the rule of subtraction, $Pr(A)=1-Pr(A')$, we can rewrite the component parts of PPV in terms of P, Se, and Sp:

\begin{gather*}
  Pr(T_+) = Pr(C_+ \cap R_+) = Pr(C_+)*Pr(R_+ | C_+) = \mathbf{P*S_e}\\
  Pr(F_+) = Pr(C_- \cap R_+) = Pr(C_-)*Pr(R_+ | C_-) = \mathbf{(1-P)*(1-S_p)}
\end{gather*}
\begin{gather*}
  PPV = \frac{T_+}{T_+ + F_+} = \frac{P*S_e*N}{P*S_e*N + (1-P)*(1-S_p)*N} = \\
  \mathbf{\frac{P*S_e}{(P*S_e) + (1-P)*(1-S_p)}}
\end{gather*}
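\paragraph{} To make the formula concrete, consider one illustrative set of values (chosen here purely for demonstration): $P=0.05$, $S_e=0.99$, and $S_p=0.90$. Then
\begin{gather*}
  PPV = \frac{0.05*0.99}{0.05*0.99 + 0.95*0.10} = \frac{0.0495}{0.1445} \approx 0.34
\end{gather*}
Even with excellent sensitivity and good specificity, only about one in three positive tests reflects a true case when the condition is this rare.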

\paragraph{} With a formula in hand, we can now begin to make observations about
the relationship between PPV and P, Se, and Sp. I find that visualizing this relationship
helps both my students and me internalize its operation. To this end, a straightforward
approach is to write a basic function that calculates PPV using the formula derived above:


<<>>=
ppv.calc <- function(p, se, sp){
  ppv <- (p*se)/(p*se+(1-p)*(1-sp))
  return(ppv)
}
@
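\paragraph{} As a quick check, plugging the same illustrative values into the function reproduces the hand calculation above:

<<>>=
# Illustrative values only: P = 0.05, Se = 0.99, Sp = 0.90
ppv.calc(p = 0.05, se = 0.99, sp = 0.90)
@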

\paragraph{} The \textbf{ppv.calc} function allows us to estimate PPV simply by inputting
values for P, Se, and Sp. While setting these numbers individually may be helpful
in circumstances where we have a very good idea of how to define our parameters, examining
a range of values is more useful for a general description of the relationship between
those parameters and PPV (Figure 1).

\begin{figure}[H]
\caption{Positive predictive value by prevalence given fixed values of sensitivity
and specificity.}
\centering
<<echo=FALSE, fig=TRUE>>=
# Set color and style labels/values
cols <- c(".99"="red", ".90"="blue", ".70"="orange")
lines <- c(".99"="solid", ".90"="dashed", ".70"="dotted")

# engage plotting
ppv.plot1 <- ggplot(data=data.frame(x = 0), mapping = aes(x = x))
ppv.plot1 + 
  stat_function(fun=ppv.calc, aes(colour=".99", linetype=".99"), args = list(se=.99, sp=.99)) + 
  stat_function(fun=ppv.calc, aes(colour=".90", linetype=".99"), args = list(se=.99, sp=.9)) + 
  stat_function(fun=ppv.calc, aes(colour=".70", linetype=".99"), args = list(se=.99, sp=.7)) + 
  stat_function(fun=ppv.calc, aes(colour=".99", linetype=".90"), args = list(se=.9, sp=.99)) + 
  stat_function(fun=ppv.calc, aes(colour=".90", linetype=".90"), args = list(se=.9, sp=.9)) + 
  stat_function(fun=ppv.calc, aes(colour=".70", linetype=".90"), args = list(se=.9, sp=.7)) + 
  stat_function(fun=ppv.calc, aes(colour=".99", linetype=".70"), args = list(se=.7, sp=.99)) + 
  stat_function(fun=ppv.calc, aes(colour=".90", linetype=".70"), args = list(se=.7, sp=.9)) + 
  stat_function(fun=ppv.calc, aes(colour=".70", linetype=".70"), args = list(se=.7, sp=.7)) + 
  xlim(0,1) + 
  labs(x ="Prevalence", y="Positive Predictive Value") + 
  scale_colour_manual(name="Specificity", values=cols) +
  scale_linetype_manual(name="Sensitivity", values=lines)+
  theme_classic()
@
\end{figure}

\paragraph{} In Figure 1 we see much greater variation by Sp (color) than by
Se (line style) across values of prevalence. When Sp is high, the PPV curve 
is more favorable than when Sp is low. On the other hand, the PPV curve is only 
marginally more favorable for higher values of Se. Furthermore, notice how PPV approaches 0 in all scenarios as prevalence approaches 0. Likewise, note how PPV always approaches 1 as prevalence approaches 1.

\paragraph{} The bones of these observations are also evident in the PPV equation that 
we derived. Se is equally weighted by P in both the numerator and denominator of 
the equation, meaning that these two pieces of the fraction tend to balance each 
other out. On the other hand, Sp is weighted by the inverse of P and is only 
present in the denominator. Thus, as P decreases, the weight of Sp in the 
denominator increases while simultaneously the weight of Se decreases. 
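\paragraph{} For students who want the algebra, a small supplementary approximation makes the same point. When P is small enough that $P*S_e$ is negligible next to $(1-S_p)$, the denominator of the PPV equation is approximately $1-S_p$, so
\begin{gather*}
  PPV \approx \frac{P*S_e}{1-S_p}
\end{gather*}
In this regime PPV shrinks roughly in proportion to P, and any shortfall in Sp inflates the denominator directly, while a shortfall in Se merely scales the numerator.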

\subsection{Observed prevalence}

\paragraph{} Using the same substitution approach employed to describe PPV, we can also rewrite the equation for observed prevalence (P*):
\begin{gather*}
  P^* = \frac{T_+ + F_+}{N} = \frac{P*S_e*N + (1-P)*(1-S_p)*N}{N} = \\
  \mathbf{P*S_e + (1-P)*(1-S_p)}
\end{gather*}

\paragraph{} Since Se is no longer balanced between the numerator and denominator of the equation, we know immediately that Se plays a more important role in determining P* than it does in determining PPV. To examine these relationships more closely, let us again store the equation as a function and graph that function.

<<>>=
prev.calc <- function(p, se, sp){
  op <- p*se + (1-p)*(1-sp)
  return(op)
}
@
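\paragraph{} Before graphing, one illustrative evaluation (values again chosen purely for demonstration) hints at what is to come. With $P=0.05$, $S_e=0.99$, and $S_p=0.90$, the function returns $0.05*0.99 + 0.95*0.10 = 0.1445$, nearly three times the true prevalence:

<<>>=
# Illustrative values only: P = 0.05, Se = 0.99, Sp = 0.90
prev.calc(p = 0.05, se = 0.99, sp = 0.90)
@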


\begin{figure}[H]
\caption{Observed prevalence by true prevalence given fixed values of sensitivity
and specificity.}
\centering
<<echo=FALSE, fig=TRUE>>=
linedraw <- function(x){x} # Function for underestimate shading (could be done inline)
fillers <- c("underestimate" = "#eeeeee") # Label/value for shader

# engage plotting
prev.plot1 <- ggplot(data=data.frame(x = 0), mapping = aes(x = x))
prev.plot1 + geom_area(stat="function", fun=linedraw, aes(fill="underestimate")) +
  stat_function(fun=prev.calc, aes(colour=".99", linetype=".99"), args = list(se=.99, sp=.99)) + 
  stat_function(fun=prev.calc, aes(colour=".90", linetype=".99"), args = list(se=.99, sp=.9)) + 
  stat_function(fun=prev.calc, aes(colour=".70", linetype=".99"), args = list(se=.99, sp=.7)) + 
  stat_function(fun=prev.calc, aes(colour=".99", linetype=".90"), args = list(se=.9, sp=.99)) + 
  stat_function(fun=prev.calc, aes(colour=".90", linetype=".90"), args = list(se=.9, sp=.9)) + 
  stat_function(fun=prev.calc, aes(colour=".70", linetype=".90"), args = list(se=.9, sp=.7)) + 
  stat_function(fun=prev.calc, aes(colour=".99", linetype=".70"), args = list(se=.7, sp=.99)) + 
  stat_function(fun=prev.calc, aes(colour=".90", linetype=".70"), args = list(se=.7, sp=.9)) + 
  stat_function(fun=prev.calc, aes(colour=".70", linetype=".70"), args = list(se=.7, sp=.7)) + 
  xlim(0,1) + 
  labs(x ="True Prevalence", y="Observed Prevalence") + 
  scale_colour_manual(name="Specificity", values=cols) +
  scale_linetype_manual(name="Sensitivity", values=lines) +
  scale_fill_manual(name="Bias", values=fillers) +
  theme_classic()
@
\end{figure}

\paragraph{} The pattern in Figure 2 is a little more difficult to see at first glance compared to the pattern in Figure 1, but becomes clear once you spend a little time with it. Observe how lines of the same color converge as P approaches 0 and diverge as P approaches 1. Since line style represents levels of sensitivity and line color represents levels of specificity, we can say that observed prevalence becomes more dependent on sensitivity as true prevalence increases. Likewise, observing that lines of the same style converge as P approaches 1 and diverge as it approaches 0, we know that observed prevalence becomes more dependent on specificity as true prevalence decreases. 

\paragraph{} As with our observations on PPV, the patterns we see in Figure 2 are also seen in the equation that we derived for P*. Se is weighted by P and Sp is weighted by 1-P. Thus, the shifting importance of Se and Sp with P is to be expected.
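\paragraph{} Evaluating the P* equation at its endpoints makes this explicit (a small supplementary check):
\begin{gather*}
  P=0 \implies P^*=1-S_p \qquad\qquad P=1 \implies P^*=S_e
\end{gather*}
At $P=0$, observed prevalence is determined entirely by specificity, which is why lines of the same color converge there in Figure 2; at $P=1$, it is determined entirely by sensitivity, which is why lines of the same style converge there.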

\paragraph{} What is perhaps less intuitive is how the bias associated with P* as an estimator of P changes with different levels of Se, Sp, and P. When Se is very high, P* is more likely to overestimate P, particularly when P is low. This is because few subjects who are positive for the condition of interest are classified as negative when Se is very high, but there is still potential for subjects without the condition to test positive if Sp is not also very high. Likewise, when Sp is very high, P* is more likely to underestimate P, particularly when P is high. When Se and Sp are equal and not near perfect, P becomes the dominant indicator of over- or underestimation.

\paragraph{} For those of us who like to see tables filled with numbers in addition to graphs, Table 3 shows how observed prevalence (P*) changes depending on values of P, Se, and Sp. The P*-P column represents the degree to which observed prevalence over- or underestimates true prevalence in each scenario. Tables like these may be useful when discussing specific examples, while Figures 1 \& 2 may be more useful when discussing the equations and the relationships those equations indicate.


<<echo=FALSE, results=tex>>=
# Function to calculate a series of observed prevalences for different combos of se/sp
prev.iter <- function(p = 0.2) {
  seq05 <- c(.7,.9,.99)
  ppdat <- NULL
  for(sp in seq05){
    for(se in seq05){
      pp <- prev.calc(p=p, se=se, sp=sp)
      datline <- c(sp, se, pp, round(pp-p, digits=2))
      ppdat <- rbind(ppdat, datline, deparse.level = 0)
    }
  }
  ppdat <- as.data.frame(ppdat)
  names(ppdat) <- c("Sp", "Se", "P*", "P*-P")
  return(ppdat)
}

# Run calcs (this could have also been done with a loop)
prev05 <- prev.iter(.05)
prev10 <- prev.iter(.10)
prev25 <- prev.iter(.25)
prev75 <- prev.iter(.75)

# drop unwanted cols and combine
prev.tot <- cbind(prev05, prev10[,c(3,4)], prev25[,c(3,4)], prev75[,c(3,4)])

# convert to latex
kable(prev.tot, "latex", caption = "Observed prevalence (P*) at different values of true prevalence (P), sensitivity (Se), and specificity (Sp).", booktabs = T, 
      linesep =c('', '', '\\addlinespace')) %>%
    add_header_above(c("P =" = 2, "5%" = 2, "10%"=2, "25%"=2, "75%"=2)) %>%
    kable_styling(latex_options="hold_position",  font_size = 10)
@

\section{Conclusion}
\paragraph{} Exploring the validity of screening test results in the classroom gives students an opportunity to engage with a number of important topics that are foundational to epidemiology (including misclassification and probability), and it is one of my favorite lessons to teach. This paper has presented resources for helping your students move beyond a superficial understanding of the relationship between features of the test and the observed outcome. Of particular note is our examination of observed prevalence, which, in my experience, is often completely neglected.

\paragraph{} Feel free to let me know if you have any questions or corrections. I can be contacted at dtv28 at drexel dot edu (please excuse my attempt to evade very simple web crawlers).

\end{document}

Cite: Vader DT. Demonstrating report generation using R and Sweave. Jan 2, 2019. DOI: 10.17918/goldsteinepi.

