Neal D. Goldstein, PhD, MBI



Feb 23, 2015

A Short Primer on Geocoding, Working with Census Data, and Creating Maps for Publication

Spatial analysis of epidemiologic data is quite popular, and when these data include outbreak investigations, the results are often accompanied by a slick map. A quick search through any public health department's website will yield many examples (e.g., see the annual vital statistics reports produced by the Philadelphia Department of Health). Taking data that start as case reports and end as usable, geocoded coordinates for plotting may seem like quite an undertaking, one requiring expensive commercial software (e.g., Esri's ArcGIS) or hiring a GIS expert, but it can in fact be accomplished in a straightforward manner at no cost to the researcher.

This primer is a conversational overview of geocoding and mapping using R, based on the strategies outlined in these two short articles (article #1 - geocoding, article #2 - census mapping). For implementable code, please see the appendices to both articles, or email me. This write-up proceeds in three general steps: 1) Geocoding the addresses, 2) Resolving these geocodes to census-defined regions, and 3) Generating the maps. These steps are independent of one another, so the researcher can perform any or all that are relevant to the task at hand.

As somewhat of a motivating example, let's assume a large batch of addresses that need to be represented in census tracts, the most commonly used census geography, and further linked to median household income within the census tract.

Step 1: Geocoding

The first step in geocoding is perhaps the most laborious and tedious. To arrive at reliable estimates of latitude and longitude, the original addresses need to be as clean as possible, minimizing any errors and maximizing the output. I highly advocate the use of Excel for this step: not only does it help to visualize the data, but there are many ways of using Excel to parse or reshape them. If using Excel is not an option for the entire dataset, a random sample can be manually inspected to check for any systematic errors (for example, missing digits on ZIP codes, excessive PO box addresses, and so on). At the end of the data preparation step, you'll want consistently coded addresses that can be automatically fed into the geocoding algorithm.
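
For example, here is a minimal sketch of that final assembly step in R, assuming the cleaned components sit in hypothetical street, city, state, and zip columns (the cleaning itself can just as easily stay in Excel):

  # A minimal sketch, assuming the cleaned address components live in a data
  # frame with hypothetical columns street, city, state, and zip
  addresses <- data.frame(
    street = c("1600 Pennsylvania Ave NW", " 1 Nassau Hall"),
    city   = c("Washington", "Princeton"),
    state  = c("DC", "NJ"),
    zip    = c("20500", "8540"),   # note the ZIP with a dropped leading zero
    stringsAsFactors = FALSE
  )

  # restore leading zeros on ZIP codes (assumes they are numeric-like)
  addresses$zip <- sprintf("%05d", as.integer(addresses$zip))

  # assemble one consistently formatted string per record for the geocoder
  addresses$full_address <- with(addresses,
    paste(trimws(street), trimws(city), state, zip, sep = ", "))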

At this point, it may be tempting to throw the entire list of addresses into the "geocoder," grab a snack, and go watch a movie. As someone who has been burned by this approach, start small! Say 100 or fewer addresses. This will serve as a proof of concept and ensure: 1) the geocoding algorithm itself is working correctly, 2) there are no issues with your data, and 3) the output is what you expect/need. By verifying the geocoding algorithm is working correctly (point #1), I literally mean reverse geocoding your latitude and longitude coordinates to see if they resolve to the original addresses. This can be as simple as putting the coordinates into Google Maps and checking that the address is more-or-less where you expected it to be, or, depending on the specific geocoding algorithm you use (I use Google Maps in the publication I reference above), you may have various "return codes" that can aid in identifying the successes and failures. Again, it is a lot easier to do this with a random sample of your data (n~100) versus the entire data set (n~10,000).
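
As an illustration, here is a minimal sketch of geocoding and spot-checking a small test batch using the ggmap package, which wraps the Google Maps geocoder used in the article; note that current versions of ggmap require registering an API key, and the test addresses below are placeholders:

  # A minimal sketch using the ggmap package, which wraps the Google Maps
  # geocoder; current versions require an API key via register_google()
  library(ggmap)
  # register_google(key = "YOUR_API_KEY")   # assumption: you have a key

  # start small: a test batch of cleaned addresses (see the data prep above)
  test_batch <- c("1600 Pennsylvania Ave NW, Washington, DC, 20500",
                  "1 Nassau Hall, Princeton, NJ, 08540")

  coords <- geocode(test_batch, output = "latlon", source = "google")

  # spot-check point #1 by reverse geocoding a result back to an address
  revgeocode(c(coords$lon[1], coords$lat[1]))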

If you don't want to do the geocoding entirely within code (or a statistical programming environment), there are a host of free websites out there that you can offload the geocoding to: search for "batch geocoding" on any search engine. It is then only a matter of copying the data from Excel and pasting them into the website. But regardless of your approach, save the output from the geocoder. A side note at this point: be aware of any privacy concerns, particularly if you are using identifiable data. Read the privacy policy of any public geocoding utility to know if and how they store the data, and to ensure that you are transferring the data over a secure connection (https).

Let's assume at this point that your addresses have been resolved to their respective coordinates (i.e., latitude and longitude). From these coordinates, it's now possible to resolve each address to a census geography, covered in the ensuing section.

Step 2: Mapping to Census Data

In short, this next step is not so much a technological hurdle but a conceptual exercise. As the researcher you'll need to identify: 1) what census geographies you're interested in (again, I'm assuming census tracts in this tutorial but it could conceivably be many other options), and 2) what census data you'd like to retrieve for these geographies (e.g., median income, population density, sociodemographic composition, economic indicators, etc., etc.).

The geography that ultimately will be used in your ecological or multilevel analyses will be driven entirely by your research need: have concrete hypotheses to test a priori. In general, though, I would choose the smallest unit that conceptually makes sense. You can always aggregate to larger units later. For a list of geographic units, click this link; mapping the coordinates to these geographies will require the appropriate Census TIGER product as well as this sample code (provided for convenience as an R implementation). And, if you're further linking to Census data, be sure to keep the "GEOID" variable, which we'll come back to in a moment.
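
To make the point-in-polygon step concrete, here is a minimal sketch using the sf package (a newer toolkit than the sp/maptools approach in the sample code); the shapefile name is a placeholder for whichever TIGER tract file you download, and coords is the geocoded data frame from Step 1:

  # A minimal sketch using the sf package; the shapefile name below is a
  # placeholder for the TIGER tract file covering your study area
  library(sf)

  tracts <- st_read("tl_2010_42_tract10.shp")   # e.g., 2010 tracts for one state

  # turn the geocoded points into an sf object (plain lat/long, WGS84) and
  # reproject them to match the tract layer's coordinate reference system
  pts <- st_as_sf(coords, coords = c("lon", "lat"), crs = 4326)
  pts <- st_transform(pts, st_crs(tracts))

  # spatial join: each point picks up the attributes of the containing tract,
  # including the GEOID (named GEOID10 in the 2010 TIGER files)
  pts_tracts <- st_join(pts, tracts, join = st_within)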

One important caveat to be aware of: census geographies change over time, specifically in line with the decennial censuses (2000 and 2010). Most likely, this will only affect data that span longer periods of time (e.g., 2004 - 2014), but it may affect shorter timeframes if they span a decennial census (e.g., 1999 - 2001). While the easiest thing to do is code all addresses to a single census year (thus avoiding the changing boundaries), that may induce a bias in the data. It's possible to code the data according to the census tract boundaries relevant to each year; however, for analyses that take the data as a whole into account, you will still need to ultimately choose a single census year to allow comparisons.

Fortunately, the Census Bureau releases relationship files that map between the 2000 and 2010 census boundaries. Working with them can be a bit awkward, but essentially it comes down to one of three possibilities: 1) the census geography is unchanged (the majority of cases I've encountered), 2) the geography was divided into smaller units (such as due to population growth; the second most common), or 3) the geography was aggregated from smaller units (such as due to population decline; the least common).
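
As a rough illustration, a crosswalk between the two vintages might look like the following sketch; the relationship file name, its GEOID00/GEOID10 column names, and the my_2000_geoids vector are all assumptions to check against the record layout of the file you actually download:

  # A minimal sketch of cross-walking 2000 tract IDs to 2010 tract IDs with a
  # downloaded relationship file; file name and column names are assumptions
  rel <- read.csv("tract_relationship_file.csv",
                  colClasses = "character")   # keep leading zeros in GEOIDs

  # my_2000_geoids is a hypothetical character vector of 2000-vintage GEOIDs;
  # a 2000 tract can map to several 2010 tracts (splits) and vice versa
  # (merges), so this join may yield more than one row per original record
  crosswalk <- merge(data.frame(GEOID00 = my_2000_geoids,
                                stringsAsFactors = FALSE),
                     rel[, c("GEOID00", "GEOID10")],
                     by = "GEOID00", all.x = TRUE)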

At this point, let's assume that you know what unit you would like to work with and have mapped the coordinates to that unit (again, see the sample code mentioned earlier for an R implementation). Now you're ready to bring in the actual census data. A visit to American FactFinder is the next stop, where you'll need to identify a dataset that includes the specific data you require (remember your concrete hypotheses from earlier?). For convenience, here's a rough outline of how to proceed on the FactFinder website:

  1. Select the advanced search.
  2. From among the options on the left-hand side (Topics, Geographies, Race/Ethnic Groups, Industry Codes, Occupation Codes), start first with Geographies. There will be a list of the most common geographic types; for this example, I chose Census Tracts. After the locale is selected, click "Add to Your Selections". As an aside, the results pane will begin to populate, but since we haven't added any additional criteria, there will be a large number of possibilities; you can ignore these for now.
  3. Select Topics. Choose the dataset you're interested in (e.g., American Community Survey 5-year estimates). If unsure, spend some additional time researching the Census datasets that cover the year(s) of your research. There are many options and it's likely more than one will suit your needs.
  4. If you're looking for general indicators, it's likely that one of the first few results contains the information you're after. For example, see results that contain the words economic characteristics, demographic characteristics, and so on. Unless you're looking for something a bit more obscure, the data are generally in one of these broader tables.
  5. Select the individual result and you're taken to a preview screen of the data that includes: the geographies (in a drop-down box), the variables, and their corresponding values. While sometimes all the necessary variables can be found in a single table, often multiple tables will be needed. Therefore, for each table, select "Download" and save the data as a CSV (what I consider the easiest and most universal format).

With these steps completed, it's now a matter of opening up each CSV and linking the variables (by the GEOID) to your data.
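
That merge might look something like the sketch below; the CSV file name and the "GEO.id2" column (which typically holds the full tract FIPS code in FactFinder exports) are assumptions to confirm against your own download, and pts_tracts is the tract-linked data from Step 2:

  # A minimal sketch of the GEOID merge; file name and the GEO.id2 column
  # name are assumptions -- confirm against your CSV's header rows
  income <- read.csv("ACS_economic_characteristics.csv",
                     colClasses = "character")

  # pts_tracts holds the tract-linked records from Step 2; GEOID10 is the
  # 11-digit tract identifier carried over from the TIGER file
  analysis_data <- merge(pts_tracts, income,
                         by.x = "GEOID10", by.y = "GEO.id2", all.x = TRUE)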

Step 3: Generating Maps

Although the data are completely usable at this point for analysis (say, a multilevel model), one final step may be to produce a map. I cover this in more detail in the second publication mentioned in the opening. In short, you can use the same TIGER files that define the geographic boundaries to draw maps that can then be annotated as appropriate based on the analysis.
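
As a closing illustration, here is a minimal sketch of a tract-level choropleth drawn with sf and ggplot2 (a different toolkit than the one used in the publication); tract_income is a hypothetical data frame holding the GEOID and a numeric median household income column from Step 2:

  # A minimal sketch of a tract-level choropleth; the shapefile name and the
  # tract_income data frame (GEOID plus numeric median_income) are assumptions
  library(sf)
  library(ggplot2)

  tracts <- st_read("tl_2010_42_tract10.shp")
  tracts <- merge(tracts, tract_income,
                  by.x = "GEOID10", by.y = "GEOID", all.x = TRUE)

  # shade each tract by median household income; annotate further as needed
  ggplot(tracts) +
    geom_sf(aes(fill = median_income), color = NA) +
    scale_fill_viridis_c(name = "Median household\nincome ($)") +
    theme_void()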


Cite: Goldstein ND. A Short Primer on Geocoding, Working with Census Data, and Creating Maps for Publication. Feb 23, 2015. DOI: 10.17918/goldsteinepi.

