R-lecture-R02 | Bay Area Demographic

Opening a file and playing around with some initial commands

Opening a file that is already in R format is pretty easy. In the upper-right quadrant of RStudio (the "environment"), click the second button from the left and navigate to the file you want to open. In R, this is called "loading" a datafile.

Let's go ahead and load a file called sf21.Rdata, which I've already downloaded and cleaned up for use in this course; it consists of some data from the American Community Survey for Census tracts in San Francisco. When you load a file, you'll notice text like the below appears in the lower-left quadrant of RStudio (the "console").
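The console text itself is not reproduced here, so the following is only a minimal sketch of what a load command looks like. To keep it runnable on any computer, it first saves a tiny made-up stand-in dataset to a temporary file; the three rows of numbers and the temporary location (demo_path) are assumptions for illustration, not the real data or the real path.

```r
# Made-up stand-in for the real data (three tracts, two variables).
sf21 = data.frame(total_race = c(4000, 2500, 0),
                  latinx     = c(1000,  250, 0))

# Save it to a temporary .Rdata file, then remove it from the environment.
demo_path = file.path(tempdir(), "sf21.Rdata")   # hypothetical location
save(sf21, file = demo_path)
rm(sf21)

# This is the step the button performs for you: load() puts the saved
# object back into the environment under its original name.
load(demo_path)
```

With the real file, only the last line is needed, with the path to your own copy of sf21.Rdata inside the quotes.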

You can always enter this load command yourself by typing it into the console (lower-left quadrant) or putting it in a script, rather than using the button. You will also notice that "sf21" now appears in the environment to the upper right. One of the great things about RStudio is that it helps you keep track of the files and objects you create by making them visible (rather than your having to constantly check what you've got open with the ls() command).
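For comparison, here is a small sketch of what checking by hand with ls() looks like. The two objects are made up for illustration; in your own session, ls() lists whatever you have loaded or created.

```r
# Two made-up objects, standing in for whatever you have created so far.
sf21 = data.frame(total_race = c(4000, 2500, 0))
my_note = "hello"

# ls() returns the names of the objects currently in the environment.
ls()
```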

To see what is in this datafile, we can click the blue arrow icon (to get a dropdown menu) or click sf21 to have the file display like a spreadsheet in the upper left quadrant (the file display window).

The sf21 dataset has 244 rows and 22 columns. Each row is a different case (observation). Since each case here is a Census tract, this tells us there were 244 Census tracts in San Francisco in 2021. Each column is a different variable describing the tracts. For example, the first column (labeled "total_race") is the total number of people living in that tract. How do we know this? Well, you really need a codebook to tell you what each of these variables means. Here, I've tried to use intuitive labels so you can guess at what they are about, but in real work you'll want to be more careful than that.
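The same information can also be pulled up from the console. The sketch below uses a made-up three-tract stand-in for sf21 so that it runs on its own; with the real file, dim() would report 244 rows and 22 columns.

```r
# Made-up stand-in for sf21: three tracts and two variables.
sf21 = data.frame(total_race = c(4000, 2500, 0),
                  latinx     = c(1000,  250, 0))

dim(sf21)     # number of rows, then number of columns
names(sf21)   # the column (variable) names
head(sf21)    # the first few rows, like the top of the spreadsheet view
```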

The second column, labeled "latinx", is the total number of people in a tract who identify as Hispanic or Latina/o. If we were to divide latinx by total_race, we'd get the proportion of people in each tract who are Latinx (or we could multiply that by 100 to get the percentage). To do this in R, we type into the console:
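The exact command is not reproduced here. A likely form, sketched on a made-up three-tract stand-in for sf21 so that it runs on its own, is:

```r
# Made-up stand-in for sf21 (the last tract has no residents).
sf21 = data.frame(total_race = c(4000, 2500, 0),
                  latinx     = c(1000,  250, 0))

# Divide the Latinx count by the total population, then multiply by 100.
pct_latinx = (sf21$latinx / sf21$total_race) * 100
```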

This creates a new object called "pct_latinx", which is the percent of each tract's population who are Latinx. The name is arbitrary. I called it this to help me remember what it is, but you can call it anything you want using acceptable characters (no spaces, and no symbols that have mathematical meanings, like * or /). In the formula to the right of the = sign, we have to write "sf21" in front of each variable name in order to tell R which dataset to look in, since you can have multiple datasets open in your environment at once. If we had wanted pct_latinx to be stored in the sf21 dataset rather than floating loose, we could have named it sf21$pct_latinx instead.
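As a sketch of that last option, again on a made-up stand-in for sf21, naming the result sf21$pct_latinx adds it to the dataset as a new column:

```r
# Made-up stand-in for sf21.
sf21 = data.frame(total_race = c(4000, 2500, 0),
                  latinx     = c(1000,  250, 0))

# Putting sf21$ on the left-hand side stores the result inside the dataset.
sf21$pct_latinx = (sf21$latinx / sf21$total_race) * 100

names(sf21)   # pct_latinx is now listed after the original columns
```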

In the console, if we type in pct_latinx, R will display the percent Latinx in every tract in San Francisco.

Looking through these numbers, you can see that some tracts have little Latinx presence while others are majority Latinx. You will also see three tracts with NaN displayed. NaN stands for "not a number" and indicates that the formula we put in for pct_latinx did not yield a number in those cases. This is because three tracts had no residents; in other words, total_race equaled 0. Since zero divided by zero has no defined answer (that's just a rule of mathematics), R puts in NaN instead.
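You can see this behavior directly in the console. The short sketch below uses made-up percentages rather than the real data; is.nan() is the base R function that flags NaN values.

```r
0 / 0    # NaN: zero divided by zero has no defined value
5 / 0    # Inf: R reports a nonzero number divided by zero as infinite

# Made-up percentages, one of them from an empty tract.
pct = c(25, 10, NaN)

is.nan(pct)        # TRUE wherever the value is NaN
sum(is.nan(pct))   # counts how many NaNs there are
```

Run on the real pct_latinx, the last line is a quick way to confirm how many tracts came out as NaN.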

Looking through all these numbers gives you a sense of what San Francisco tracts are like when it comes to Latinx presence, but summary statistics and univariate displays give us a much better sense. We will go over both summary statistics and univariate displays more systematically, but for now, try out a few of these commands by typing them into the console.
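The original list of commands is not reproduced here. Two that fit the description, shown on made-up percentages rather than the real data, are summary() for summary statistics and hist() for a simple univariate display:

```r
# Made-up percentages standing in for pct_latinx (one NaN, as in an empty tract).
pct_latinx = c(25, 10, 62, 8, NaN)

# Minimum, quartiles, median, mean, and maximum, plus a count of missing values.
summary(pct_latinx)

# A histogram of the values (drawn in the plots pane in RStudio).
hist(pct_latinx)
```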

By the way, some commands, like "summary", are built to ignore NaNs. Others, like "mean", have to be told to ignore them by adding ", na.rm=TRUE" (which you will often see shortened to ", na.rm=T").
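A small sketch of the difference, using made-up percentages rather than the real data:

```r
# Made-up percentages with one NaN.
pct_latinx = c(25, 10, 62, 8, NaN)

mean(pct_latinx)                 # NaN, because the NaN is included in the calculation
mean(pct_latinx, na.rm = TRUE)   # 26.25, the mean of the four real numbers
```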