Finding typicality and variety

All quantitative analysis begins with looking at one variable at a time, and we use univariate statistics and graphics to describe each variable on its own.  Let me be clear: univariate statistics are sort of boring and not very sociological, because sociology is all about the relationships between variables.  Knowing the mean percent of people who are in poverty among Census tracts in the country isn't nearly as interesting as knowing that this percentage varies greatly by race or gender or . . .  Relationships between variables are simply more interesting because they hint at how social life might be structured.

However (and this is a very important "however"), in order to look at relationships in a way that maximizes our chances of discerning reality rather than being misled, we need to choose from among an enormous number of statistical tools.  Choosing a tool to look at a given relationship is entirely dependent on understanding what each variable in a relationship looks like.  We need univariate statistics if we want to examine relationships well.  What we need to know about each variable is:

  1. the level at which the information is measured (nominal, ordinal, interval, ratio),

  2. what a typical case looks like,

  3. how much variety there is among cases,

  4. what the distribution of cases looks like, and

  5. how many "missing" cases exist (missing cases are those without any information on that variable).

Level of measurement

The level of measurement is something we actually figure out from the codebook, not from univariate statistics.  Knowing the level of measurement should come first, because it determines what univariate statistics and displays we will use to help us understand our data.

Nominal 

An example of a variable measured at the nominal level is majority group in tract.  We could go through the sf21 dataset and code every neighborhood as being majority White, majority Asian, majority Latinx, majority African American, or having no majority population.  Here's the code that will create this variable.
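A sketch of what that code might look like -- the count columns nlwhite, asian, latinx, black, and total_pop are assumed names, so substitute whatever the sf21 codebook actually uses:

sf21$majority <- "no majority"
sf21$majority[sf21$nlwhite / sf21$total_pop > .5] <- "majority White"
sf21$majority[sf21$asian / sf21$total_pop > .5] <- "majority Asian"
sf21$majority[sf21$latinx / sf21$total_pop > .5] <- "majority Latinx"
sf21$majority[sf21$black / sf21$total_pop > .5] <- "majority African American"
sf21$majority <- factor(sf21$majority, levels = c("majority White", "majority Asian",
    "majority Latinx", "majority African American", "no majority"))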

If you are curious, here's an explanation of the code above.  The first line tells R to create a variable in sf21 called majority and to make this variable say "no majority" for every case in the dataframe.  The next line tells R to change that value to "majority White" if more than half of the people in the tract identify as non-Latinx White.  The stuff in the brackets [ ] is a condition that has to be met in order to change the value.  In this case, the condition is that the majority of people in the tract identify as White.  The next three lines repeat this logic but with different groups.  The last line simply puts the categories into an order.  We don't really have to do this since nominal variables don't have ordered categories.  But I want the largest groups to be listed first on a table (this is convention), so I put them in that order.

With the nominal level variable, we can easily create a frequency table, a pie chart, or a bar chart to look at these data. Here's some quick code to get us started.
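A sketch along these lines works (base R functions; table() builds the frequency table, and pie() and barplot() turn that table into the two charts):

table(sf21$majority)
pie(table(sf21$majority))
barplot(table(sf21$majority))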

This gives us:

           majority White            majority Asian           majority Latinx 
                       80                        45                         1 
majority African American               no majority 
                        2                       116

[Pie chart of the majority variable]
[Bar chart of the majority variable]

This quick output is good enough for a superficial overview of our data.  If we wanted to share these things with an audience, or if something caught our eye that we'd want to explore with a better table or graphic, we would have to improve these considerably.  You can learn how to make better frequency tables, pie charts, and bar charts on other pages.

It's clear from the frequency table, pie chart, and bar chart that "no majority" is the plurality among San Francisco tracts.  It's easy to see that 116 is the largest number in the table and the biggest pie slice and the highest bar.  The table is nice in that it gives us an exact number.  The pie chart is nice because you can see that "no majority" is just slightly under 50%.  And the bar chart is nice because it's easier to see the relative size of the five different categories.

Stepping away from statistics and thinking about these data sociologically, I can't help but be struck by a few things.  First, San Francisco tracts actually do seem like they are racially mixed places.  Second, with Whites and Asians representing roughly the same proportion in the city (39% and 34% respectively), why are there so many more majority White tracts?  Third, with Latinx representing 15% of the city and African Americans representing 5%, why are there two majority African American areas and only one majority Latinx tract?  How is it even possible that a group representing just 5% of the city can find themselves so concentrated that two separate tracts are majority African American?  Exploring possible answers to these questions entails using bivariate statistics to see how "majority" relates to other variables.  Because we know "majority" is nominal level data, things like contingency table analysis come to mind as potential bivariate tools.

Ordinal

Variables with ordinal level data have values that are still just categories rather than counts.  But unlike nominal level data, ordinal level categories are ordered.  If we think about poverty rates, the federal government designates any area where more than 20% of the population is living under the poverty line as a "high poverty area". Any area where more than 40% of residents are in poverty is designated an "extreme poverty area".  Every tract, therefore, can be classified as "low", "high", or "extreme" in level of poverty.  These are ordered categories.  We can use the data in sf21 to categorize all tracts as either "extreme", "high", or "low" in poverty.  Here's the code that makes a variable (pov_level) with these categories.
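A sketch of what that code might look like, assuming the tract poverty rate is stored in a column called pct_poverty (an assumed name -- check the codebook for the real one):

sf21$pov_level <- "low"
sf21$pov_level[sf21$pct_poverty > 20] <- "high"
sf21$pov_level[sf21$pct_poverty > 40] <- "extreme"
sf21$pov_level <- factor(sf21$pov_level, levels = c("low", "high", "extreme"))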

The same sorts of displays that were good for nominal level data are also good for ordinal level data.  Therefore, let's just repeat what we used above.
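The same three commands as before, just pointed at pov_level:

table(sf21$pov_level)
pie(table(sf21$pov_level))
barplot(table(sf21$pov_level))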

This gives us:

    low    high extreme 
    215      25       4

[Pie chart of the pov_level variable]
[Bar chart of the pov_level variable]

It's obvious that the vast majority of tracts are not in high or extreme poverty.  You would hope there wouldn't be many areas of concentrated poverty given that San Francisco has a tax base of around $14B spread over fewer than a million people. The fact that there are 29 areas where poverty is high or extreme is depressing and begs explanation.  In particular, the four areas of extreme poverty stand out as bizarre anomalies in a city of wealth.  Where do you think these might be located?  How do you think the variable "majority" might relate to "pov_level"?  Again, even simple univariate analysis gets us thinking about bivariate analysis and informs us about what techniques will work best.  

Measurement 

Measurement data naturally occur in numeric form.  For example, if I ask you your age, you will generally respond with a number.  Any sort of data where we are counting things -- the number of college educated adults in a tract, the number of children a household has, your total units of college classes taken, etc. -- are measurement data.  In the sf21 dataframe, we can calculate the percent of individuals in a tract who identify as Asian (pct_asian) as a piece of measurement level data.  This is measurement level data because it occurs naturally in numeric form.  6% is exactly one more percent than 5%.  It's double 3%.  We can play these sorts of simple mathematical games with measurement level data.  We can't do this with the categorical level data described above.

 

To get pct_asian, we use this code:
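A minimal version of that calculation, assuming sf21 has a count of Asian-identifying residents (asian) and a total population count (total_pop) -- both assumed names:

sf21$pct_asian <- 100 * sf21$asian / sf21$total_pop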

There is a great range of univariate statistics to describe measurement data.  Try out the following code and then let's walk through the output.
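The block below is a sketch consistent with the walkthrough that follows: the mean on the first line, the standard deviation on the second, summary on the fifth, and the two graphics at the end.  The median, IQR, and quantile lines (and the particular percentiles requested) are plausible fill-ins rather than a quote of the original code.  It also assumes sf21 has been attached with attach(sf21) so pct_asian can be referenced directly; otherwise, write sf21$pct_asian in each command.

mean(pct_asian, na.rm=T)
sd(pct_asian, na.rm=T)
median(pct_asian, na.rm=T)
IQR(pct_asian, na.rm=T)
summary(pct_asian)
quantile(pct_asian, probs=c(.1, .25, .5, .75, .9), na.rm=T)
hist(pct_asian)
boxplot(pct_asian)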

Going back to the list at the top of this page, we know that one thing we want to get a sense of is what a typical case looks like.  What is a typical "percent Asian" among San Francisco tracts?  The first line, mean(pct_asian,na.rm=T), simply asks R for the mean.  This is our preferred measure of central tendency (what is typical).  The "na.rm=T" part tells R to ignore cases that have a value of NA.  If we don't include this, R will just return NA because it doesn't know what to do with the missing cases.  Other commands, like "summary", handle the NAs without being told.  The mean you should have gotten is 33.2.  This means that in a typical tract in San Francisco, around a third of residents identify as Asian.

We also want to get a sense of how much variety there is among cases.  Measures of dispersion do this.  And every measure of dispersion is built on the logic of a particular measure of central tendency.  The standard deviation is used with the mean because it is built on the mean.  The second line of code is how we get the standard deviation in R.  In this case, the figure is 19.6.  Most cases (a majority) will be within one standard deviation of the mean.  In other words, most tracts have somewhere between 13.6% and 52.8% of residents identifying as Asian.  

The mean and standard deviation are our preferred measures.  However, they are easily thrown off by something called skew.  When we have a lot of skew, the mean and standard deviation become unreliable.  In such cases, we use measures based on rank order because rank is resistant to skew.  Imagine a set of ten homes with the following prices:

$1,105,000

  $975,000

  $985,000

$1,005,000

  $778,000

$1,200,000

$1,150,000

$1,100,000

  $990,000

$1,000,000

These cases aren't very skewed.  The mean ($1,028,800) is almost identical to the median ($1,002,500).  But if we add one extremely expensive house to this list, it will change the mean more than the median.

$1,105,000

  $975,000

  $985,000

$1,005,000

  $778,000

$1,200,000

$1,150,000

$1,100,000

  $990,000

$1,000,000

$4,750,000

With this one additional case, the mean ratchets up to $1,367,091 but the median hardly shifts; it's now $1,005,000.  The mean becomes a really bad descriptor of what is typical.  Every house on the list costs less than the mean except for one.  That's a bad description.

With heavily skewed measurement data, we rely on the median as a measure of central tendency.  For dispersion, we rely on something called the interquartile range (IQR).  The IQR is the difference between the 75th percentile and the 25th.  In other words, it's how widely the middle 50% of cases are spread out.  The higher the IQR, the more variety there is among our cases.
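If you want to verify these numbers yourself, here's a quick check in R using the eleven prices above:

prices <- c(1105000, 975000, 985000, 1005000, 778000, 1200000,
            1150000, 1100000, 990000, 1000000, 4750000)
mean(prices)     # 1367091 -- dragged upward by the one expensive house
median(prices)   # 1005000 -- barely budges
IQR(prices)      # spread of the middle 50%, also resistant to that one house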

The "summary" command in the fifth line of the pct_asian code above gets R to produce the minimum, maximum, mean, median, and quartiles (the 25th and 75th percentiles).  These six numbers, what's often called the "five number summary" plus the mean, are a great place to start when you are trying to get a sense of how a variable is distributed.

I put down the "quantile" command just in case you need to produce different percentiles.  This command isn't usually where we start when we are trying to get a feel for our data, but it is often useful as we refine that feel.

 

The last two lines of code produce univariate graphics.  The "hist" command produces the following histogram.

[Histogram of pct_asian]

The numbers at the bottom represent the percent of a tract's population who identify as Asian.  For example, the first bar includes all tracts where 0 to 10 percent of residents identify as Asian.  The height of this bar (around 22 I think) represents the number of tracts with that percent of residents identifying as Asian.  Histograms are great for getting a sense of where cases are concentrated and how cases are spread out; statisticians call this the shape of the distribution.  As you can see, there is a wide variety of percentages -- literally from 0% to 100%.  There's sort of a concentration in the 10% to 50% range, but this (1) is still a broad range of percentages and (2) doesn't have bars that are that much higher than elsewhere on the histogram.  We could call this distribution weakly unimodal and roughly symmetric.  Being roughly symmetric is why the mean and median are pretty much in agreement.  Being weakly unimodal is why the standard deviation is so large (the weaker the concentration of cases, the more spread out they are).

The second graphic, produced by the last line of code, is called a boxplot.

[Boxplot of pct_asian]

This boxplot shows pct_asian along the left side.  The box of the boxplot represents the middle 50% of cases.  The bottom of the box is the 25th percentile.  The top of the box is the 75th percentile.  And the dark line near the middle is the median.  The lowest 25% of tracts are below the box and the highest 25% are above the box.  The dashed lines are drawn out to the highest and lowest cases that are NOT extreme cases.  Extreme cases, called "outliers", are marked with circles.  Because they make it so easy to detect outliers, boxplots are really useful in describing skew and in identifying odd cases.  

By the way, an outlier is defined as any case more than 1.5 IQRs below the box or above the box.  We see three outliers here, all of which are on the high end.  If we had a case that was even higher (a logical impossibility here since that would be more than 100%), it could be an "extreme outlier".  Extreme outliers are more than 3 IQRs away from the first or third quartile.  Outliers are always worth worrying about because enough of them will make the mean unreliable.  With extreme outliers, you only need a handful to throw off the mean.
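If you want to find those outliers yourself, here's one way to compute the 1.5 IQR fences (again assuming pct_asian lives in sf21):

q1 <- quantile(sf21$pct_asian, .25, na.rm=T)
q3 <- quantile(sf21$pct_asian, .75, na.rm=T)
iqr <- q3 - q1
lower_fence <- q1 - 1.5 * iqr   # anything below this counts as an outlier
upper_fence <- q3 + 1.5 * iqr   # anything above this counts as an outlier
sf21$pct_asian[which(sf21$pct_asian < lower_fence | sf21$pct_asian > upper_fence)]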
