# To Bayes or Not To Bayes

Post date: Dec 16, 2014

I’ve recently become involved in Bayesian techniques for misclassification correction, much to my initial chagrin. Had you asked me a year ago whether I would ever use Bayesian inference in my work, the answer would have been a resounding no. Frankly, I just didn’t get it. But as an advisee in the doctoral program, you pretty much work on what your adviser says you will work on, within reason of course. And the flavor this month was misclassification of self-reported sexuality, using Bayes’ theorem to correct for the misclassification. Fortunately, I had a colleague to collaborate with who was well versed in these methods, and a well-written publication to guide me.

The culmination of this was writing several publications and presenting a seminar on the basics of Bayesian applications to the field of epidemiology. I realized that if I could break it down verbally into digestible pieces, I could probably do the same in writing, so this post became my simple intro to Bayesian inference for the epidemiologist: To Bayes or Not To Bayes. Please, hold your applause. I’ll keep the math to an absolute minimum, as I probably don’t understand it anyway. But that’s also part of my point and intention: to make this very approachable.

The easiest place to begin is simply with Bayes’ theorem, which describes how the probability of an outcome can depend (or be conditioned) on something else, and how that probability changes once the something else is known. It’s actually a pretty intuitive concept. Phenomena do not exist in isolated worlds, but rather in a system of interactions with many things around them. Bayes articulated this mathematically.
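For reference, the theorem is usually written in the following textbook form, where A and B are any two events and P(B) is greater than zero. The symbols here are the standard ones, not anything specific to my work.

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
```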

Stated another way, this equation represents a conditional probability of A given our knowledge of B. Ok, great. So how can we use this in epidemiology? Well, the simplest answer may be a straightforward observational study. You’re given a dataset, the exposure, and the outcome, and you’re asked to perform some statistical inference to arrive at the association between said exposure and outcome. Let’s make this more concrete. Going back to the work that started me down this path, the exposure was being a man who has sex with men (MSM), and the outcome was HIV positivity. So you’re given this dataset and asked to estimate the relative risk of HIV given MSM behavior, represented by this diagram.
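If it helps to see the arithmetic, here is a minimal sketch of that relative risk calculation from a 2x2 table. The function name and all of the counts are made up for illustration; they are not from my actual data.

```python
def relative_risk(a, b, c, d):
    """Relative risk from a 2x2 table.

    a: exposed with the outcome      b: exposed without the outcome
    c: unexposed with the outcome    d: unexposed without the outcome
    """
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return risk_exposed / risk_unexposed


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
rr = relative_risk(a=30, b=170, c=20, d=780)
print(f"Relative risk from this dataset alone: {rr:.2f}")
```

Nothing outside the dataset enters this calculation; the estimate is only as good as the counts in the table.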

This is what we would call a naïve analysis. Your measure of association is specific to this dataset. Well, it’s pretty well known that MSM have higher rates of HIV (for example due to more sexual contacts via social networks or riskier unprotected anal sex), so what if you could bring this “prior knowledge” into your world to arrive at a more informed answer?

At its essence, this is what Bayesian inference allows us to do. Notice how I said more “informed” answer. You may wonder, is this a “better” answer? Well, that depends. Now I don’t mean to use that as a cop-out but more of a caveat emptor, let the buyer beware. As the researcher exploring Bayesian techniques, it all depends on: 1) your assumptions, 2) your implementation, and 3) your interpretation. We’ll explore each of these in a moment, but first another example where Bayes’ theorem is used and you may not realize it: screening tests.

Suppose a screening test had a 99% sensitivity (SN) and a 95% specificity (SP). This tells us the true positive rate (sensitivity) and the true negative rate (specificity) of the test. These are properties of the test itself; they are intrinsic to the performance of the test and have nothing to do with, for example, the burden of some disease or condition in the population. But all you may care about at this point is the answer to one or two questions, namely: (1) If you test positive, what is the probability of it being a true positive? Or, (2) If you test negative, what is the probability of it being a true negative? These questions depend on the prevalence of disease in the population. For example, a highly prevalent disease will mean a greater posttest probability of disease, and vice versa, an extremely rare disease will mean a smaller posttest probability. Specifically, we’re talking about the positive predictive value (PPV) of the test for the answer to question (1) above and the negative predictive value (NPV) for the answer to question (2) above.

Notice how I used the word “depend” earlier. This implies some sort of conditional probability, right? Well, as it happens PPV and NPV can be derived via Bayes’ theorem. Just to drive home the point, and switch to the vernacular used in Bayesian inference, prevalence of the disease becomes our “prior” knowledge that we’ve applied to the naïve data (our result on the screening test).
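To make the dependence on prevalence concrete, here is a small sketch that applies Bayes’ theorem to the 99% SN and 95% SP test from above. The function name and the prevalence values are my own choices for illustration.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test via Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)   # P(disease | positive test)
    npv = true_neg / (true_neg + false_neg)   # P(no disease | negative test)
    return ppv, npv


# Illustrative prevalences (the "prior"), from rare to common.
for prevalence in (0.001, 0.01, 0.10, 0.50):
    ppv, npv = predictive_values(0.99, 0.95, prevalence)
    print(f"prevalence={prevalence:.3f}  PPV={ppv:.3f}  NPV={npv:.4f}")
```

With a 1% prevalence, the PPV works out to roughly 17%, even though the test itself looks excellent on paper. Same test, different prior, very different answer.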

At this point, I’m going to switch gears a little and focus on a specific application of Bayesian techniques: misclassification correction. This should also make sense after the next few paragraphs, but I first need to set the stage for how misclassification is occurring and ways to deal with it.

As I started to introduce earlier, my work was focused on misclassification of MSM. Take this scenario: you are a researcher asking someone their sexual identity, a potentially intimate and possibly stigmatizing topic. The truthfulness of the response is likely to vary based on a number of factors. Some of these include the way you ask the question (e.g., in person versus anonymous computer survey), the venue (e.g., at a bar in the gayborhood versus at a church), and the sociodemographics of your respondent (e.g., age or race). There are many more, but my point is that there will be some degree of misclassification here, and this misclassification could bias your analysis.

In general, stigmatizing behaviors may result in “false negative” responses (lower SN), because the respondent may be trying to obscure his or her identity. On the other hand, you would not expect there to be many (or any) false positives (excellent SP), as why would someone admit to a stigmatizing behavior they are not engaging in (granted, there are exceptions). If we tried to depict this mathematically, it might look something like this: SP ~ 1 >> SN.
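Here is a rough sketch of what SP ~ 1 >> SN does to a self-reported proportion. To be clear, this is simple algebra and not the Bayesian correction itself, and the SN, SP, and true proportion below are hypothetical numbers I picked for illustration.

```python
def observed_proportion(true_proportion, sensitivity, specificity):
    """Proportion who self-report the behavior, given imperfect reporting."""
    true_reports = sensitivity * true_proportion
    false_reports = (1 - specificity) * (1 - true_proportion)
    return true_reports + false_reports


def corrected_proportion(observed, sensitivity, specificity):
    """Invert the formula above; assumes sensitivity + specificity > 1."""
    return (observed + specificity - 1) / (sensitivity + specificity - 1)


# Hypothetical values: 10% truly MSM, 70% of them report it, no false positives.
true_p, sn, sp = 0.10, 0.70, 1.00
obs = observed_proportion(true_p, sn, sp)
back = corrected_proportion(obs, sn, sp)
print(f"true={true_p:.2f}  observed={obs:.2f}  corrected={back:.2f}")
```

With perfect SP and imperfect SN, the self-reported proportion can only undercount. The catch is that the correction requires knowing SN and SP, which in practice we rarely know with certainty.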