R-lecture-CB03 | Bay Area Demographic

Using tidycensus to access Census Bureau data

As we learned on another page, there are lots of packages that make particular tasks in R easier. One of these, tidycensus, helps us deal with the really big, difficult-to-manage data that come from the Census Bureau. Go ahead and install tidycensus, then run library(tidycensus) to activate it.
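As a minimal sketch, the install-and-activate step might look like the lines below. The API key line is an assumption that is not part of the original instructions: depending on your tidycensus version, data requests may require (or at least work better with) a free Census API key, and "YOUR_KEY_HERE" is only a placeholder.

```r
# Install once per computer (requires an internet connection).
install.packages("tidycensus")

# Activate the package at the start of every R session.
library(tidycensus)

# Optional, and an assumption on our part: register a Census API key.
# "YOUR_KEY_HERE" is a placeholder, so the line is left commented out.
# install = TRUE saves the key so it persists across sessions.
# census_api_key("YOUR_KEY_HERE", install = TRUE)
```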

With tidycensus activated, we can create a dictionary to look up American Community Survey (ACS) variables.
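A sketch of how the dictionary can be built is shown here. It uses the tidycensus function load_variables(); the year and dataset values are chosen to match the 2021 5-year ACS discussed on this page, and cache = TRUE is an optional convenience.

```r
library(tidycensus)

# Download the variable dictionary for the 2021 5-year ACS.
# cache = TRUE keeps a local copy so later calls are faster.
acs_lookup <- load_variables(year = 2021, dataset = "acs5", cache = TRUE)

# Peek at the first few rows; the columns include name, label, and concept.
head(acs_lookup)
```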

Creating the dictionary puts a new object called acs_lookup in our environment (the upper-right pane). If we click on it, it will open as a table in the upper-left pane.

The object acs_lookup lists every one of the 27,886 variables in the 2021 5-year ACS. That's a lot of variables to go through, which is why the viewer has a "filter" button that lets us search this long list. Let's say we wanted to find variables that look at poverty among African Americans. If we type "poverty" into the label search box and "black" into the concept search box, we reduce our list to the 114 variables that look at poverty specifically among African Americans.
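The same search can be done in code instead of with the filter button. The sketch below uses a small helper, search_vars(), which is a hypothetical name made up for this example and not part of tidycensus. It assumes the dictionary has label and concept columns, as the object returned by load_variables() does. The row count may not match the viewer's count exactly if the two match text slightly differently.

```r
# Hypothetical helper: keep rows whose label AND concept both contain
# the search terms, ignoring upper/lower case.
search_vars <- function(lookup, label_term, concept_term) {
  keep <- grepl(label_term, lookup$label, ignore.case = TRUE) &
    grepl(concept_term, lookup$concept, ignore.case = TRUE)
  lookup[keep, ]
}

# Usage, assuming acs_lookup already exists in the environment.
if (exists("acs_lookup")) {
  black_poverty <- search_vars(acs_lookup, "poverty", "black")
  print(nrow(black_poverty))  # how many variables matched
}
```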

Looking at the names of these variables, we find one set that starts with B17001B, another that starts with B17010B, and another that starts with B17020B. These are called variable groups (or tables). Each group contains a number of related variables. The group B17001B gives the poverty status of individuals broken down by age and sex. For example, B17001B_004 is the total number of male children under 5 years old who lived in poverty during the past year. B17001B_005 is the total number of male children age 5 who lived in poverty during the past year. If we wanted to look at childhood poverty in African American communities, the B17001B variables would be good ones to draw on.

We can use tidycensus to pull all the B17001B variables into an R dataset (by the way, datasets in R are called data frames). Go ahead and copy the code below and paste it into the R console.
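A call along these lines would do the job. Treat it as a reconstruction based on the description on this page, not necessarily the exact original code; the year and survey arguments are assumptions chosen to match the 2021 5-year ACS.

```r
library(tidycensus)

# Request every variable in table B17001B for all San Francisco tracts.
aapov <- get_acs(
  geography = "tract",        # one row per Census tract
  table = "B17001B",          # poverty status by sex by age, Black alone
  state = "California",       # a name or a FIPS code ("06") both work
  county = "San Francisco",   # a name or a FIPS code ("075") both work
  year = 2021,                # assumed: end year of the 5-year ACS
  survey = "acs5",            # assumed: the 5-year ACS
  output = "wide"             # one column per variable
)

# Check the number of rows (tracts) and columns (variables).
dim(aapov)
```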

This gives us a new data frame in our environment called aapov. It has 120 variables, all describing African American poverty among the 244 San Francisco Census tracts.

The R code above is pretty intuitive. For geography, we can put down "us" (the whole nation), "state", "county", "tract", or "block group". The ACS does not publish data at the block level; blocks are only available in the decennial census. For table, we put in whatever table we found to serve our purposes. For us, it was B17001B. The state and county can be specified by name (e.g. "California") or by FIPS code. The FIPS code for California is 06 and for San Francisco County it is 075. You can also put in more than one state or county at a time. For example, if we wanted data on both Alameda and San Francisco counties, we would use this code.
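Here is a sketch of a two-county request. As before, it is a reconstruction based on the surrounding description; the object name aapov2 is made up for this example, and the year and survey arguments are assumptions.

```r
library(tidycensus)

# Request table B17001B for tracts in Alameda and San Francisco counties.
aapov2 <- get_acs(
  geography = "tract",
  table = "B17001B",
  state = "CA",
  county = c(1, 75),   # FIPS codes: 1 = Alameda, 75 = San Francisco
  year = 2021,         # assumed: end year of the 5-year ACS
  survey = "acs5",     # assumed: the 5-year ACS
  output = "wide"
)
```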

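The c(1,75) simply tells R to combine the codes for Alameda County (1) and San Francisco County (75) into a single vector, so that both counties are requested. The argument output = 'wide' tells R to put each variable of the table B17001B into a separate column rather than stacking them in rows. In wide format, each estimate column ends in E and each margin-of-error column ends in M, which is why the variable names used from here on end in E.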

Another page explains how to create new variables in more detail, but here's a quick example of what we might do with these data. Let's take B17001B_002E (number of African Americans who were in poverty during the past 12 months) and B17001B_031E (number of African Americans who were not in poverty during the past 12 months). Note that B17001B_032E is only the male subtotal of the not-in-poverty group, so it is not the one we want here. Since everyone is counted in one variable or the other, if we add these two we have the total number of African Americans in each tract for whom poverty status was determined. Therefore, this code will create a variable that gives the percent of African Americans in the tract who are in poverty.
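A minimal sketch of that calculation follows. The helper percent_of_total() is a hypothetical name made up for this example. In the B17001 table layout, the at-or-above-poverty total is column _031E (column _032E is its male subtotal), so the sketch uses _031E; it is worth confirming the labels in acs_lookup before relying on it.

```r
# Hypothetical helper: what percent of (part + rest) is part?
percent_of_total <- function(part, rest) {
  100 * part / (part + rest)
}

# Usage, assuming aapov was downloaded in wide format as described above.
if (exists("aapov")) {
  aapov$pct_poor <- percent_of_total(
    aapov$B17001B_002E,  # below the poverty level
    aapov$B17001B_031E   # at or above the poverty level
  )
}
```

Tracts with no African American residents give 0 / 0, which R reports as NaN, so expect some missing values in pct_poor.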

We can get a really quick sense of this variable by asking for some basic descriptive statistics and graphics.
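One way to ask for these is sketched below, using only base R. The helper name describe_pct() is made up for this example, and the choice of a five-number summary plus a histogram is an assumption about which statistics and graphics are meant.

```r
# Hypothetical helper: print summary statistics and draw a histogram.
describe_pct <- function(x) {
  x <- x[!is.na(x)]  # drop missing values (this also drops NaN)
  print(summary(x))  # min, quartiles, median, mean, max
  hist(x,
       main = "Percent in poverty by tract",
       xlab = "Percent in poverty")
  invisible(x)       # return the cleaned values without printing them
}

# Usage, assuming aapov$pct_poor was created as described above.
if (exists("aapov") && "pct_poor" %in% names(aapov)) {
  describe_pct(aapov$pct_poor)
}
```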

As you can see from this output, pct_poor is strongly unimodal and seriously positively skewed. As a rule of thumb, skew can be detected by the mean differing from the median. The bigger the difference, the more serious the skew. If the mean is higher than the median, it is called positive skew. If the mean is lower than the median, it's called negative skew.

About 20% of the African American population in a typical San Francisco tract is below the poverty line. But there is a lot of variety among neighborhoods. Even the middle 50% of tracts range from less than 2% to more than 45%.
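The mean-versus-median rule of thumb can be checked in code. The sketch below uses a hypothetical helper, skew_direction(), made up for this example; it only applies the rule described above and is not a formal skewness statistic.

```r
# Hypothetical helper: compare the mean with the median.
skew_direction <- function(x) {
  m <- mean(x, na.rm = TRUE)
  med <- median(x, na.rm = TRUE)
  if (m > med) {
    "positive"
  } else if (m < med) {
    "negative"
  } else {
    "none"
  }
}

# Usage, assuming aapov$pct_poor was created as described above.
if (exists("aapov") && "pct_poor" %in% names(aapov)) {
  print(skew_direction(aapov$pct_poor))
  # The middle 50% of tracts lie between the 25th and 75th percentiles.
  print(quantile(aapov$pct_poor, c(0.25, 0.75), na.rm = TRUE))
}
```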