A Short Primer on Geocoding, Working with Census Data, and Creating Maps for Publication
Spatial analysis of epidemiologic data is quite popular, and when these data come from outbreak investigations, the analysis will often be accompanied by a slick map. A quick search through any public health department will yield many examples (e.g., see the annual vital statistics reports produced by the Philadelphia Department of Health). Taking data that start as case reports and end as usable, geocoded coordinates for plotting may seem like quite an undertaking, requiring expensive commercial software (e.g., Esri's ArcGIS) or a hired GIS expert, but it can in fact be accomplished in a straightforward manner at no cost to the researcher.
This primer is a conversational overview of geocoding and mapping using R, based on the strategies outlined in these two short articles (article #1 - geocoding, article #2 - census mapping). For implementable code, please see the appendices to both articles, or email me. This write-up proceeds in three general steps: 1) Geocoding the addresses, 2) Resolving these geocodes to census-defined regions, and 3) Generating the maps. These steps are independent of one another, so the researcher can perform any or all that are relevant to the task at hand.
As somewhat of a motivating example, let's assume a large batch of addresses that need to be represented in census tracts, the most commonly used census geography, and further linked to median household income within the census tract.
Step 1: Geocoding
The first step in geocoding is perhaps the most laborious and tedious. To arrive at reliable estimates of latitude and longitude, the original addresses need to be as clean as possible, which minimizes errors and maximizes the number of addresses that resolve. I highly advocate the use of Excel for this step: not only does it help to visualize the data, but there are many ways of using Excel to parse or reshape them. If using Excel is not an option for the entire dataset, a random sample can be manually inspected to check for any systematic errors (for example, missing digits on ZIP codes, excessive PO box addresses, and so on). At the end of the data preparation step, you'll want consistently coded addresses that can be automatically fed into the geocoding algorithm.
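The checks described above can also be scripted. The articles' own code is in R; the sketch below uses Python purely to illustrate the logic, and the function name, field layout, and problem labels are my own assumptions rather than anything from the articles. It standardizes spacing and case, and flags PO boxes and ZIP codes with missing digits instead of silently repairing them.

```python
import re

# Matches "PO BOX", "P.O. BOX", "P O BOX", etc.
PO_BOX = re.compile(r"\bP\.?\s*O\.?\s*BOX\b", re.IGNORECASE)

def clean_address(street, city, state, zip_code):
    """Return (single-line address, list of problems found)."""
    problems = []
    # Collapse repeated whitespace and standardize case.
    street = " ".join(street.split()).upper()
    city = " ".join(city.split()).upper()
    state = state.strip().upper()
    if not street:
        problems.append("missing_street")
    if PO_BOX.search(street):
        problems.append("po_box")
    # Keep the 5-digit part of a ZIP+4 and flag anything else.
    zip5 = zip_code.strip().split("-")[0]
    if not (zip5.isdigit() and len(zip5) == 5):
        problems.append("bad_zip")
    return f"{street}, {city}, {state} {zip5}", problems

# Example rows (made up for illustration).
print(clean_address("  1234  market st ", "philadelphia", "pa", "19107-1234"))
print(clean_address("P.O. Box 12", "Trenton", "nj", "8608"))
```

Rows that come back with a non-empty problem list are the ones worth inspecting by hand before geocoding.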
At this point, it may be tempting to throw the entire list of addresses into the "geocoder," grab a snack, and go watch a movie. As someone who has been burned by this approach, I recommend starting small: say, 100 or fewer addresses. This will serve as a proof of concept and ensure: 1) the geocoding algorithm itself is working correctly, 2) there are no issues with your data, and 3) the output is what you expect/need. To check that the geocoding algorithm is working correctly (point #1), I literally mean reverse geocode your latitude and longitude coordinates to see if they resolve to the original addresses. This can be as simple as putting the coordinates into Google Maps and checking that the address is more-or-less where you expected it to be. Depending on the specific geocoding algorithm you use (I use Google Maps in the publication I reference above), you may also have various "return codes" that can aid in identifying the successes and failures. Again, it is a lot easier to do this with a random sample of your data (n~100) than with the entire data set (n~10,000).
Let's assume at this point that your addresses have been resolved to their respective coordinates (i.e., latitude and longitude). These coordinates can now be resolved to the census geographies, as covered in the ensuing section.
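A pilot run like this is easy to script. The following is a minimal Python sketch (the articles themselves use R). It assumes a hypothetical `geocode` callable that returns a dictionary with a `status` return code plus coordinates; `fake_geocode` is a stand-in for illustration, not a real service call.

```python
import random
from collections import Counter

def pilot_geocode(addresses, geocode, n=100, seed=1):
    """Geocode a reproducible random sample and tally the return codes."""
    rng = random.Random(seed)
    sample = rng.sample(addresses, min(n, len(addresses)))
    results = [(address, geocode(address)) for address in sample]
    tally = Counter(result["status"] for _, result in results)
    failures = [a for a, result in results if result["status"] != "OK"]
    return results, tally, failures

def fake_geocode(address):
    # Hypothetical stand-in for a geocoding service; it "fails" on
    # PO boxes only so that the tally has something to show.
    if "PO BOX" in address:
        return {"status": "ZERO_RESULTS", "lat": None, "lng": None}
    return {"status": "OK", "lat": 39.95, "lng": -75.16}

# Made-up addresses for illustration.
addresses = [f"{i} MARKET ST" for i in range(1, 9001)]
addresses += [f"PO BOX {i}" for i in range(1, 1001)]

results, tally, failures = pilot_geocode(addresses, fake_geocode)
print(tally)
print(failures[:5])
```

The failures, along with a handful of the successes, are the addresses to look up by hand before committing to the full batch.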
Step 2: Mapping to Census Data
In short, this next step is not so much a technological hurdle but a conceptual exercise. As the researcher you'll need to identify: 1) what census geographies you're interested in (again, I'm assuming census tracts in this tutorial but it could conceivably be many other options), and 2) what census data you'd like to retrieve for these geographies (e.g., median income, population density, sociodemographic composition, economic indicators, etc., etc.).
The geography that ultimately will be used in your ecological or multilevel analyses will be driven entirely by your research need: have concrete hypotheses to test a priori. In general, though, I would choose the smallest unit that conceptually makes sense. You can always aggregate to larger units later. For a list of geographic units, click this link. Mapping the coordinates to these geographies requires the appropriate Census TIGER product as well as this sample code (provided for convenience as an R implementation). And, if you're further linking to Census data, be sure to keep the "GEOID" variable, which we'll come back to in a moment.
One important caveat to be aware of: census geographies change over time, specifically in line with the decennial censuses (2000 and 2010). Most likely, this will only affect data that span longer periods of time (e.g., 2004 - 2014), but it may affect shorter timeframes if they span a decennial census (e.g., 1999 - 2001). While the easiest thing to do is code all addresses to a single census year (thus avoiding the changing boundaries), that may induce a bias in the data. It's possible to code each record according to the census tract boundaries in effect for its year; however, for analyses that take into account the data as a whole, you will still ultimately need to choose a single census year to allow comparisons.
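Conceptually, resolving a coordinate to a census geography is a point-in-polygon test against each boundary. The Python sketch below illustrates that idea with a standard ray-casting check; it is not the R sample code referenced above. The rectangles are toy shapes rather than real TIGER geometry, and the GEOID strings are placeholders attached to them for illustration.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test; polygon is a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge cross the horizontal ray running east from the point?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

def assign_geoid(lon, lat, tracts):
    """Return the GEOID of the first tract containing the point, else None."""
    for geoid, polygon in tracts.items():
        if point_in_polygon(lon, lat, polygon):
            return geoid
    return None

# Toy rectangles standing in for tract boundaries (not real geometry).
tracts = {
    "42101000100": [(-75.20, 39.90), (-75.15, 39.90),
                    (-75.15, 39.95), (-75.20, 39.95)],
    "42101000200": [(-75.15, 39.90), (-75.10, 39.90),
                    (-75.10, 39.95), (-75.15, 39.95)],
}

print(assign_geoid(-75.17, 39.92, tracts))
print(assign_geoid(-75.00, 39.93, tracts))
```

In practice a GIS library reads the TIGER boundaries and performs this test for you. The point to carry forward is that each record ends up with a GEOID stored as text, or with nothing if it falls outside every boundary.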
Fortunately, the Census Bureau releases relationship files that map between the 2000 and 2010 census boundaries. Working with them can be kind of awkward, but essentially it comes down to one of three possibilities: 1) the census geography is unchanged (the majority of cases I've encountered), 2) the geography was divided into smaller units (such as due to population growth; the second most common), or 3) the geography was aggregated from smaller units (such as due to population decline; the least common).
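The three possibilities can be told apart by counting how many units sit on each side of the mapping. The Python sketch below assumes the relationship file has already been reduced to simple (2000 ID, 2010 ID) pairs, which is a simplification of the real file layout; the IDs are placeholders. It also adds a fourth "complex" label of my own for units that were both divided and recombined.

```python
from collections import defaultdict

def classify_tracts(pairs):
    """Label each 2000 unit as 'unchanged', 'split', 'merged', or 'complex'."""
    to_2010 = defaultdict(set)  # 2000 ID -> 2010 IDs it overlaps
    to_2000 = defaultdict(set)  # 2010 ID -> 2000 IDs it overlaps
    for old, new in pairs:
        to_2010[old].add(new)
        to_2000[new].add(old)

    labels = {}
    for old, news in to_2010.items():
        is_split = len(news) > 1
        is_merged = any(len(to_2000[new]) > 1 for new in news)
        if is_split and is_merged:
            labels[old] = "complex"
        elif is_split:
            labels[old] = "split"
        elif is_merged:
            labels[old] = "merged"
        else:
            labels[old] = "unchanged"
    return labels

# Placeholder IDs: A is unchanged, B was divided, C and D were combined.
pairs = [("A", "A"), ("B", "B1"), ("B", "B2"), ("C", "CD"), ("D", "CD")]
labels = classify_tracts(pairs)
print(labels)
```

Units labeled "unchanged" need no further work. The rest are where a decision about the single reference census year has to be made.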
At this point, let's assume that you know what unit you would like to work with and have mapped the coordinates to that unit (again, see sample code mentioned earlier for an R implementation). Now you're ready to bring in the actual census data. A visit to American FactFinder (since retired by the Census Bureau in favor of data.census.gov) is the next stop, and you'll need to identify a dataset that includes the specific data you require (remember your concrete hypotheses from earlier?). For convenience, here's a rough outline of how to proceed on the FactFinder website:
With these steps completed, it's now a matter of opening up each CSV and linking (by the GEOID) the variables to your data.
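The linkage itself is a simple keyed lookup. In the Python sketch below, the column names (`GEOID`, `median_income`) and all values are made up for illustration; a real export will carry its own headers. The one detail that matters is reading the GEOID as text, since treating it as a number drops the leading zero for states such as Alabama (01).

```python
import csv
import io

def load_census(csv_text, geoid_col, value_col):
    """Map GEOID (kept as text) to the requested census variable."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[geoid_col].strip(): row[value_col] for row in reader}

def link_by_geoid(records, census):
    """Attach the census value to each record; report GEOIDs with no match."""
    linked, unmatched = [], []
    for rec in records:
        value = census.get(rec["GEOID"])
        if value is None:
            unmatched.append(rec["GEOID"])
        linked.append({**rec, "median_income": value})
    return linked, unmatched

# Made-up census extract and case records.
census_csv = "GEOID,median_income\n01001020100,52000\n42101000100,61000\n"
records = [
    {"id": 1, "GEOID": "01001020100"},
    {"id": 2, "GEOID": "42101000100"},
    {"id": 3, "GEOID": "42101999999"},
]

census = load_census(census_csv, "GEOID", "median_income")
linked, unmatched = link_by_geoid(records, census)
print(linked)
print(unmatched)
```

Any GEOIDs in the unmatched list point to either a geocoding problem or a mismatch between the census year of the boundaries and that of the data.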
Step 3: Generating Maps
Although the data are completely usable at this point for analysis (say, a multilevel model), one final step may be to produce a map. I cover this in more detail in the second publication mentioned in the opening. In short, you can use the same TIGER files that define the geographic boundaries to draw maps that can then be annotated as appropriate from the analysis.