This lesson assumes you have current versions of the following installed on your computer:
Your working directory should be the folder that this Rmd file is saved in. This tells R where in our file system to look when we come to reference and manipulate files later on.
# this is the command to change the working directory
setwd("/Users/George/OneDrive/Projects/SES-Obesity-Status")
# IF YOU USE WINDOWS
#setwd("C:/Users/George/OneDrive/Projects/SES-Obesity-Status")
This dataset is the 2018 Behavioral Risk Factor Surveillance System from the CDC. The data comes from a series of “national health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services. Established in 1984 with 15 states, BRFSS now collects data in all 50 states as well as the District of Columbia and three U.S. territories. BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world.” The survey collects a huge amount of data from individuals regarding their health - if you’re interested in looking at everything the survey asks about, you can view the codebook here.
In our case, we’re interested in two specific variables: income of the responder and their “Obesity Status,” as classified by BMI. We want to determine if there is a significant relationship between these two variables; in essence, we want to see if BMI is independent from income level in the state of Ohio. Because both of these variables are categorical variables, we’re going to use the Chi-Squared Test of Independence. This is because we have categorical variables, independent observations we can put in a table, and non-sparse data (no low counts in any group). The BRFSS is a complex survey. It has a complex sampling design (survey observations do not have an equal probability of being selected), so analyzing this data requires considerable care. Complex surveys are used to save money, represent small sub-populations, and deal with naturally structured data. If you’re interested in learning more, you can take the DataCCamp course “Analyzing Survey Data in R.” Alternatively, many CDC datasets are taken from the BRFSS and have been pre-weighted for you (for example). In this case, we’ve already created our dataset by extracting it from the BRFSS and weighing it for you. An actual manuscript using the BRFSS would take the additional step of using survey aware statistical tests.
We first need to figure out what data from this file/dataset we want exactly. If we go into the codebook for the BRFSS dataset (see link above), we can find BMI values collected under the label of “Computed Body Mass Index.” The variable associated with it is x.bmicat. There are four categories, and each range of BMIs has a classification.
| Coding | BMI | Classification |
|---|---|---|
| 1 | < 18.50 | Underweight |
| 2 | 18.50 - 25.00 | Normal Weight |
| 3 | 25.00 - 30.00 | Overweight |
| 4 | 30.00 - 99.99 | Obese |
We also need to find data on income level, so if we do some more digging into the codebook, we’ll find what we want under the label “Income Level." The variable associated with it is income2. There are 11 categories here (note the don’t know/refused/not asked categories):
| Coding | Income Level |
|---|---|
| 1 | Less than $10,000 |
| 2 | $10,000 - $15,000 |
| 3 | $15,000 - $20,000 |
| 4 | $20,000 - $25,000 |
| 5 | $25,000 - $35,000 |
| 6 | $35,000 - $50,000 |
| 7 | $50,000 - $75,000 |
| 8 | $75,000+ |
| 77 | Don’t know / Not sure |
| 99 | Refused |
Finally, we want to know which state (by FIPS code) each respondent lives in so that we can perform our analysis on Ohio residents. We’ll find what we want under the label “State.” The variable associated with it is x.state.
Like I mentioned, we did some of the pre-work and cleaning of this data for you - because it’s a complex survey, the data needs to take into account survey weighting among some other things. We’ve also taken only the variables we need to perform our specific analysis. The code used to do so is beyond the scope of the course at the moment, but if you’re curious in doing something similar for your project, feel free to reach out and we can help you out.
We first need to install and load some packages that’ll help us out. If you want to learn more about them, just do a quick Google search of “R package packagename.” Install dplyr and readxl copying and pasting install.packages(c(“readxl”, “tidyverse”, “kableExtra”)) into your console. If you put all three into a vector using c(), we can do this in one command. readxl lets us read in excel files. tidyverse is a package that will also help us out with data science. kableExtra allows us to make pretty tables.
# Load dplyr, readxl, and tidyverse, kableExtra
# We can't use the same trick for loading multiple packages at once, unfortunately
# We will need all of these later
library("readxl")
library("kableExtra")
library("tidyverse")
As you will see in a moment, 90% of the hard work for this example comes from actually cleaning our data. Running the statistical test is a breeze in comparison - despite the difficulty of the former, it’s important to know to clean and recode data because you will almost never receive data organized exactly the way you want.
Make sure you download the ‘brfss_untidy.csv’ from the Github repository where this is being hosted. When we look at this file, we see that our data is arranged in a table that looks like this:
| x.bmicat | income2 | x.state | Freq |
|---|---|---|---|
| 1 | 1 | 1 | 14041 |
| 2 | 1 | 1 | 50283 |
| 3 | 1 | 1 | 42382 |
| 4 | 1 | 1 | 83175 |
| 1 | 2 | 1 | 4781 |
| … | … | … | … |
We want to make this easier for us to work with and understand. As a result, we’re going to recode the data so that it makes more sense to us at a glance. For our BMI Data, we’ll replace 1-3 with non-obese, 4 with obese, and set any BLANK data as “missing”. We’ll also condense our income levels; we can make 1-6 a single category of under $50,000, 5-8 another category of $50,000+, and 77 and 99 we will both assign as “missing.” Hopefully this makes sense - if not, scroll up and take another look at the tables above. Finally, we have another issue - our “state” column is not text, but rather state FIPS codes, which are federal numerical identifiers for states.
Next, we’ll read this file in:
# We will read the data in our CSV into a dataframe.
obeseDataFrame <- read_csv('brfss_untidy.csv', col_types='fffi')
head(obeseDataFrame)
## # A tibble: 6 x 4
## x.bmi5cat income2 x.state Freq
## <fct> <fct> <fct> <int>
## 1 1 1 1 14041
## 2 2 1 1 50283
## 3 3 1 1 42382
## 4 4 1 1 83175
## 5 1 2 1 4781
## 6 2 2 1 36478
We can quickly look at some basic aggregate information about our dataset before we move on:
summary(obeseDataFrame)
## x.bmi5cat income2 x.state Freq
## 1:530 1 :212 1 : 40 Min. : 0
## 2:530 2 :212 2 : 40 1st Qu.: 9760
## 3:530 3 :212 4 : 40 Median : 42124
## 4:530 4 :212 5 : 40 Mean : 110425
## 5 :212 6 : 40 3rd Qu.: 119614
## 6 :212 8 : 40 Max. :3884361
## (Other):848 (Other):1880
We want to be able to convert our FIPS codes to actual state names. We can download FIPS codes from here as an XLSX file. Put it in the same folder as everything else. We want to read this in as a dataframe:
fipsDataFrame <- read_excel('statefips.xlsx', skip = 6, col_names = c('fips', 'state'), col_types=c('skip', 'skip', 'text', 'text')) %>% # we're importing the xlsx, skipping the first 6 rows (which only have labels/nonsense),
# and skipping the first two columns as we don't need those either. we're naming the two columns we
# do need fips and state, and THEN:
mutate(fips=as.character(as.integer(fips))) # a bit confusing but basically we're mutating the fips variable
# so we can convert our fips codes first into integers and then those integers into characters.
head(fipsDataFrame)
## # A tibble: 6 x 2
## fips state
## <chr> <chr>
## 1 0 Northeast Region
## 2 0 New England Division
## 3 9 Connecticut
## 4 23 Maine
## 5 25 Massachusetts
## 6 33 New Hampshire
We’ll now recode our data. This is a long command - I will try to explain what is happening on each line using comments:
cleanedObeseDF <- obeseDataFrame %>% # we're taking our obeseDataFrame and THEN (%>%)
mutate(bmi=fct_recode(x.bmi5cat, # we're going to rename our bmi variable just "bmi," recoding the original "x.bmi5cat".
not_obese = '1', # set 1s to "not obese"
not_obese = '2', # set 2s to "not obese"
not_obese = '3', # set 3s to "not obese"
obese = '4'), # set 4s to "obese"
income=fct_recode(income2,
under_50k='1', # set 1-6 to "under_50k"
under_50k='2',
under_50k='3',
under_50k='4',
under_50k='5',
under_50k='6',
over_50k='7', # set 7-8 to "over_50k"
over_50k='8',
missing='77', # set 77 and 99 to "missing"
missing='99'),
fips=as.character(x.state), # convert each x.state fips code to a character, rather than integer, rename to fips
count=Freq) %>% #rename Freq variable to count, and THEN
filter(fips != '66', fips != '72') %>% # we're filtering out FIPS codes 66/72, which are Guam and PR.
# Our XLSX file with FIPS code did not include these, so we won't either, and THEN:
left_join(fipsDataFrame, by='fips') %>% # a left join will merge two data frames:
# each row will be merged only if a variable (FIPS code in our case) exists in both data frames
# and THEN
group_by(state, bmi, income) %>%
summarize(count=sum(Freq)) # these last two lines group our data so it's more organized:
# we want to group first by state, then by bmi, then by income, and then sum these up by category
# So, for example, we have multiple row for Alabama with a count for people who are underweight
# and who make under 25,000 dollars - we want to sum this up so that we only have row per state/bmi/income set.
cleanedObeseDF %>% write_csv('brfss_tidy.csv') # put this dataframe into a CSV in case we need it later
head(cleanedObeseDF)
## # A tibble: 6 x 4
## # Groups: state, bmi [2]
## state bmi income count
## <chr> <fct> <fct> <int>
## 1 Alabama not_obese under_50k 978348
## 2 Alabama not_obese over_50k 814149
## 3 Alabama not_obese missing 459511
## 4 Alabama obese under_50k 618779
## 5 Alabama obese over_50k 429904
## 6 Alabama obese missing 231091
summary(cleanedObeseDF)
## state bmi income count
## Length:306 not_obese:153 under_50k:102 Min. : 12015
## Class :character obese :153 over_50k :102 1st Qu.: 155117
## Mode :character missing :102 Median : 399548
## Mean : 756237
## 3rd Qu.: 885189
## Max. :10093259
That was a lot, I know. Almost there, though - stick with it! Next we want to 1) select only rows for the state of Ohio, and we’ll want to exclude rows where data is missing. Remember to always consider how many missing values there are compared to non-missing values and also ensure missing values are distributed across other variables in a relatively organized way before removing them.
ohioDF <- cleanedObeseDF %>% filter(state == 'Ohio') # select only rows where the state is Ohio
ohioDF <- ohioDF %>% filter(income != 'missing') # select all rows except where income equals missing
print(ohioDF)
## # A tibble: 4 x 4
## # Groups: state, bmi [2]
## state bmi income count
## <chr> <fct> <fct> <int>
## 1 Ohio not_obese under_50k 2429052
## 2 Ohio not_obese over_50k 2350285
## 3 Ohio obese under_50k 1363563
## 4 Ohio obese over_50k 1228532
That’s it! We now have all the data we need to run our chi-squared test in our finalDF dataframe.
First, though, we’re going to use kableExtra (another package) to make a nice looking table. We already loaded it earlier in this example.
kable(ohioDF) # make a "kable" out of our ohioDF!
| state | bmi | income | count |
|---|---|---|---|
| Ohio | not_obese | under_50k | 2429052 |
| Ohio | not_obese | over_50k | 2350285 |
| Ohio | obese | under_50k | 1363563 |
| Ohio | obese | over_50k | 1228532 |
This looks pretty ugly, so let’s make it look a little nicer. We can apply a “theme” to it, so that it looks more presentable. We’ll use the Twitter Bootstrap theme, a very popular HTML/CSS styling theme (these are the languages used to create webpages).
ohioDF %>% # take our ohioDF, then:
kable() %>% # make a kable with it, then:
# apply kable styling, which by default will apply the twitter bootstrap theme.
# We've also included a vector as an argument: c("striped", "hover"). This is not necessary, but it will add
# stripes to our rows and add a slight tint to single rows when we hover on them.
# Without these, we would just write "kable_styling()" instead.
kable_styling(c("striped", "hover"))
| state | bmi | income | count |
|---|---|---|---|
| Ohio | not_obese | under_50k | 2429052 |
| Ohio | not_obese | over_50k | 2350285 |
| Ohio | obese | under_50k | 1363563 |
| Ohio | obese | over_50k | 1228532 |
For any hypothesis test, we need a null hypothesis and an alternative hypothesis:
\(H_0\): the two variables are independent
\(H_A\): the two variables are not independent (ie: they are dependent)
To test the hypothesis we will perform a chi-squared (\(\chi^2\)) independence test for the data. This will give us a p-value that we can then interpret. I will not go into more detail than this - the first week of this course has a video on chi-squared and how to properly interpret it.
First, we want to put our data into a 3x4 table. Then we’ll run our chi-squared test.
obeseTable = xtabs(count ~ bmi + income, ohioDF) # creates a contingency table with BMI and Income as row/column,
# using count as the value for each pairing.
obeseTable = obeseTable[, -3] # remove the "missing" column (which is the 4th column) which persists, oddly
addmargins(obeseTable) # adds margins when printing the table
## income
## bmi under_50k over_50k Sum
## not_obese 2429052 2350285 4779337
## obese 1363563 1228532 2592095
## Sum 3792615 3578817 7371432
chisq <- chisq.test(obeseTable) # run chi-squared independence test on our data!
print(chisq)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: obeseTable
## X-squared = 2133.2, df = 1, p-value < 2.2e-16
We get a p-value of 2.2e-16, which is considerably lower than the 0.05 significance level, so we reject the null hypothesis that “obesity status” is independent of income level in the state of Ohio. It is important to note that a chi-squared test cannot indicate the directionality of a relationship; for example, it is impossible for us to tell if obesity “increases”" as income decreases, or if income decreases as the obesity “increases.”’ This is true for any chi-squared test.
To get an effect size and directionality, we can run this:
#just for fun, we'll also make a prop table (this will show us proportions, rather than counts):
addmargins(round(100*prop.table(obeseTable, 2), 1), 1) %>% #calculate proportions again, but
#this time *within* columns. We also add a row of margin totals to let us see the total of
#these columns to make sure we did it correctly. This gives us both effect size
#and proportionality for our sample.
kable() %>% # make a kable with it, then:
kable_styling(c("striped", "hover")) # add kable styling
| under_50k | over_50k | |
|---|---|---|
| not_obese | 64 | 65.7 |
| obese | 36 | 34.3 |
| Sum | 100 | 100.0 |
The p value is statistically significant in part because we have a huge sample size. Huge sample sizes will find statistical significance even when there is little clinical significance. So we should always check the effect size. In our sample, 36.0% of people with income < $50k were obese vs 34.3% of people with income >= $50k. Is this a big enough difference to worry about epidemiologically? Decide for yourself, but some disciplines would argue that even the tiniest excess risk is too much. In this case, they would probably say that the difference isn’t even that tiny and it is an epidemiologically important finding.
That’s it! If you have any questions, feel free to post them in the discussion board on Canvas for this example.