George Saieed & Dustin DeMeo

4/22/2020

Downloading R and RStudio

This lesson assumes you have current versions of the following installed on your computer:

The R software itself, and RStudio Desktop

Set Your Working Directory

Your working directory should be the folder that this Rmd file is saved in. This tells R where in your file system to look when we come to reference and manipulate files later on.

# this is the command to change the working directory
setwd("/Users/George/OneDrive/Projects/SES-Obesity-Status")

# IF YOU USE WINDOWS
#setwd("C:/Users/George/OneDrive/Projects/SES-Obesity-Status")

The Data We’ll Be Using + Choice of Test

This dataset is the 2018 Behavioral Risk Factor Surveillance System from the CDC. The data comes from a series of “national health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services. Established in 1984 with 15 states, BRFSS now collects data in all 50 states as well as the District of Columbia and three U.S. territories. BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world.” The survey collects a huge amount of data from individuals regarding their health - if you’re interested in looking at everything the survey asks about, you can view the codebook here.

In our case, we’re interested in two specific variables: the respondent’s income and their “obesity status,” as classified by BMI. We want to determine whether there is a significant relationship between these two variables; in essence, we want to see if BMI is independent of income level in the state of Ohio. Because both of these variables are categorical, we’re going to use the chi-squared test of independence: we have categorical variables, independent observations we can put in a table, and non-sparse data (no low counts in any group). One caveat: the BRFSS is a complex survey. It has a complex sampling design (survey observations do not have an equal probability of being selected), so analyzing this data properly requires considerable care. Complex surveys are used to save money, represent small sub-populations, and deal with naturally structured data. If you’re interested in learning more, you can take the DataCamp course “Analyzing Survey Data in R.” Alternatively, many CDC datasets are derived from the BRFSS and have been pre-weighted for you (for example). In this case, we’ve already created our dataset by extracting it from the BRFSS and weighting it for you. An actual manuscript using the BRFSS would take the additional step of using survey-aware statistical tests.
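For the curious, a survey-aware version of this kind of test might look something like the sketch below, using the survey package. Note that brfss_raw and the design columns (x.psu, x.ststr, x.llcpwt) are assumed names for a raw BRFSS import, not objects we create in this lesson - verify them against your own copy of the data before trying anything like this.

# A minimal sketch of a survey-aware chi-squared test with the 'survey' package.
# ASSUMPTION: brfss_raw is a raw BRFSS import whose design variables are named
# x.psu (primary sampling unit), x.ststr (stratum), and x.llcpwt (final weight).
library(survey)

brfss_design <- svydesign(ids = ~x.psu,
                          strata = ~x.ststr,
                          weights = ~x.llcpwt,
                          nest = TRUE,
                          data = brfss_raw)

# A design-based (Rao-Scott) test instead of the ordinary chi-squared test:
svychisq(~x.bmi5cat + income2, design = brfss_design)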

Looking at our Data and Determining What Exactly We Need

We first need to figure out exactly what data we want from this file/dataset. If we go into the codebook for the BRFSS dataset (see link above), we can find BMI values collected under the label of “Computed Body Mass Index.” The variable associated with it is x.bmi5cat. There are four categories, and each range of BMIs has a classification.

Coding  BMI Range       Classification
1       < 18.50         Underweight
2       18.50 - 25.00   Normal Weight
3       25.00 - 30.00   Overweight
4       30.00 - 99.99   Obese

We also need to find data on income level, so if we do some more digging into the codebook, we’ll find what we want under the label “Income Level.” The variable associated with it is income2. There are 11 categories, including a BLANK code for respondents who were not asked or whose answers are missing (note also the don’t know and refused categories):

Coding  Income Level
1       Less than $10,000
2       $10,000 - $15,000
3       $15,000 - $20,000
4       $20,000 - $25,000
5       $25,000 - $35,000
6       $35,000 - $50,000
7       $50,000 - $75,000
8       $75,000+
77      Don’t know / Not sure
99      Refused
BLANK   Not asked or Missing

Finally, we want to know which state (by FIPS code) each respondent lives in so that we can perform our analysis on Ohio residents. We’ll find what we want under the label “State.” The variable associated with it is x.state.

Putting our Data into a CSV

Like I mentioned, we did some of the pre-work and cleaning of this data for you - because it’s a complex survey, the data needs to take into account survey weighting, among other things. We’ve also kept only the variables we need to perform our specific analysis. The code used to do so is beyond the scope of the course at the moment, but if you’re curious about doing something similar for your project, feel free to reach out and we can help you out.

Loading Necessary Packages

We first need to install and load some packages that’ll help us out. If you want to learn more about them, just do a quick Google search of “R package packagename.” Install the packages by copying and pasting install.packages(c("readxl", "tidyverse", "kableExtra")) into your console. Because we put all three package names into a vector using c(), we can do this in one command. readxl lets us read in Excel files. tidyverse is a collection of packages that will help us out with data science (it includes dplyr, which we’ll use for data manipulation). kableExtra allows us to make pretty tables.
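For convenience, here is that command as a runnable chunk (you only need to run it once):

# Install the packages we need for this lesson (run once):
install.packages(c("readxl", "tidyverse", "kableExtra"))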

# Load readxl, kableExtra, and tidyverse (which includes dplyr)
# We can't use the same trick for loading multiple packages at once, unfortunately
# We will need all of these later

library("readxl")
library("kableExtra")
library("tidyverse")

Tidying up our data

As you will see in a moment, 90% of the hard work in this example comes from actually cleaning our data; running the statistical test is a breeze in comparison. Despite the difficulty of the former, it’s important to know how to clean and recode data, because you will almost never receive data organized exactly the way you want.

Make sure you download ‘brfss_untidy.csv’ from the GitHub repository where this is being hosted. When we look at this file, we see that our data is arranged in a table that looks like this:

x.bmi5cat  income2  x.state   Freq
1          1        1        14041
2          1        1        50283
3          1        1        42382
4          1        1        83175
1          2        1         4781

We want to make this easier for us to work with and understand. As a result, we’re going to recode the data so that it makes more sense to us at a glance. For our BMI data, we’ll replace 1-3 with “not obese,” 4 with “obese,” and set any BLANK data as “missing.” We’ll also condense our income levels: we can make 1-6 a single category of under $50,000, 7-8 another category of $50,000+, and 77 and 99 we will both assign as “missing.” Hopefully this makes sense - if not, scroll up and take another look at the tables above. Finally, we have another issue: our “state” column is not text, but rather state FIPS codes, which are federal numerical identifiers for states.
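As an aside, the forcats package (loaded with the tidyverse) has a function called fct_collapse() that can express this kind of many-to-one recode more compactly than repeated fct_recode() lines. Here is a minimal sketch of just the BMI recode using it; the full pipeline we actually use appears a little further below:

# Minimal sketch: collapse the four BMI codes into two labels with fct_collapse.
library(forcats)
bmi_codes <- factor(c('1', '2', '3', '4'))
fct_collapse(bmi_codes,
             not_obese = c('1', '2', '3'), # codes 1-3 -> not_obese
             obese = '4')                  # code 4   -> obese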

Next, we’ll read this file in:

# Read the data in our CSV into a dataframe. col_types='fffi' tells read_csv
# to parse the first three columns as factors and the last one (Freq) as an integer.
obeseDataFrame <- read_csv('brfss_untidy.csv', col_types='fffi')
head(obeseDataFrame)
## # A tibble: 6 x 4
##   x.bmi5cat income2 x.state  Freq
##   <fct>     <fct>   <fct>   <int>
## 1 1         1       1       14041
## 2 2         1       1       50283
## 3 3         1       1       42382
## 4 4         1       1       83175
## 5 1         2       1        4781
## 6 2         2       1       36478

We can quickly look at some basic aggregate information about our dataset before we move on:

summary(obeseDataFrame)
##  x.bmi5cat    income2       x.state          Freq        
##  1:530     1      :212   1      :  40   Min.   :      0  
##  2:530     2      :212   2      :  40   1st Qu.:   9760  
##  3:530     3      :212   4      :  40   Median :  42124  
##  4:530     4      :212   5      :  40   Mean   : 110425  
##            5      :212   6      :  40   3rd Qu.: 119614  
##            6      :212   8      :  40   Max.   :3884361  
##            (Other):848   (Other):1880

We want to be able to convert our FIPS codes to actual state names. We can download FIPS codes from here as an XLSX file. Put it in the same folder as everything else. We want to read this in as a dataframe:

fipsDataFrame <- read_excel('statefips.xlsx', skip = 6, col_names = c('fips', 'state'), col_types=c('skip', 'skip', 'text', 'text')) %>% # we're importing the xlsx, skipping the first 6 rows (which only have labels/nonsense),
  # and skipping the first two columns as we don't need those either. we're naming the two columns we 
  # do need fips and state, and THEN:
  
  mutate(fips=as.character(as.integer(fips))) # a bit confusing, but we're converting the fips codes
  # first into integers (which strips any leading zeros, e.g. '09' becomes 9) and then back into
  # characters, so they match the x.state values we'll join on later.
head(fipsDataFrame)
## # A tibble: 6 x 2
##   fips  state               
##   <chr> <chr>               
## 1 0     Northeast Region    
## 2 0     New England Division
## 3 9     Connecticut         
## 4 23    Maine               
## 5 25    Massachusetts       
## 6 33    New Hampshire

We’ll now recode our data. This is a long command - I will try to explain what is happening on each line using comments:

cleanedObeseDF <- obeseDataFrame %>% # we're taking our obeseDataFrame and THEN (%>%)
  mutate(bmi=fct_recode(x.bmi5cat, # create a new variable called "bmi" by recoding the original "x.bmi5cat":
                        not_obese = '1', # set 1s to "not obese"
                        not_obese = '2', # set 2s to "not obese"
                        not_obese = '3', # set 3s to "not obese"
                        obese = '4'), # set 4s to "obese"
         income=fct_recode(income2,
                           under_50k='1', # set 1-6 to "under_50k"
                           under_50k='2',
                           under_50k='3',
                           under_50k='4',
                           under_50k='5', 
                           under_50k='6',
                           over_50k='7', # set 7-8 to "over_50k"
                           over_50k='8', 
                           missing='77', # set 77 and 99 to "missing"
                           missing='99'),
         fips=as.character(x.state), # convert each x.state FIPS code to a character string, stored as fips
         count=Freq) %>% # copy Freq into a new variable called count, and THEN
  
  filter(fips != '66', fips != '72') %>% # we're filtering out FIPS codes 66/72, which are Guam and PR.
  # Our XLSX file with FIPS code did not include these, so we won't either, and THEN:
  
  left_join(fipsDataFrame, by='fips') %>% # a left join will merge two data frames:
  # each row will be merged only if a variable (FIPS code in our case) exists in both data frames
  # and THEN
  
  group_by(state, bmi, income) %>% 
  summarize(count=sum(Freq)) # these last two lines group our data so it's more organized:
  # we group first by state, then by bmi, then by income, and sum the counts within each group.
  # For example, there are multiple rows for Alabama counting people who are not obese
  # and who make under $50,000 - we sum these up so that we have only one row per state/bmi/income combination.

cleanedObeseDF %>% write_csv('brfss_tidy.csv') # put this dataframe into a CSV in case we need it later
head(cleanedObeseDF)
## # A tibble: 6 x 4
## # Groups:   state, bmi [2]
##   state   bmi       income     count
##   <chr>   <fct>     <fct>      <int>
## 1 Alabama not_obese under_50k 978348
## 2 Alabama not_obese over_50k  814149
## 3 Alabama not_obese missing   459511
## 4 Alabama obese     under_50k 618779
## 5 Alabama obese     over_50k  429904
## 6 Alabama obese     missing   231091
summary(cleanedObeseDF)
##     state                  bmi            income        count         
##  Length:306         not_obese:153   under_50k:102   Min.   :   12015  
##  Class :character   obese    :153   over_50k :102   1st Qu.:  155117  
##  Mode  :character                   missing  :102   Median :  399548  
##                                                     Mean   :  756237  
##                                                     3rd Qu.:  885189  
##                                                     Max.   :10093259

Isolating the Exact Data We Want

That was a lot, I know. Almost there, though - stick with it! Next we want to 1) select only the rows for the state of Ohio, and 2) exclude rows where income data is missing. Before removing missing values, always consider how many there are compared to non-missing values, and check that missingness is distributed across the other variables in a relatively even, non-systematic way.
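As a quick, informal version of that check using the dataframe we just built, we can look at what share of each BMI group falls into the “missing” income category (a sketch - for a real project you would examine missingness more carefully):

# What fraction of Ohio respondents have missing income, and is that
# fraction similar for the obese and non-obese groups?
cleanedObeseDF %>%
  filter(state == 'Ohio') %>%
  group_by(bmi) %>%
  mutate(prop = count / sum(count)) %>% # share of each income level within each bmi group
  filter(income == 'missing')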

ohioDF <- cleanedObeseDF %>% filter(state == 'Ohio') # select only rows where the state is Ohio
ohioDF <- ohioDF %>% filter(income != 'missing') # select all rows except where income equals missing
print(ohioDF)
## # A tibble: 4 x 4
## # Groups:   state, bmi [2]
##   state bmi       income      count
##   <chr> <fct>     <fct>       <int>
## 1 Ohio  not_obese under_50k 2429052
## 2 Ohio  not_obese over_50k  2350285
## 3 Ohio  obese     under_50k 1363563
## 4 Ohio  obese     over_50k  1228532

That’s it! We now have all the data we need to run our chi-squared test in our ohioDF dataframe.

First, though, we’re going to use kableExtra to make a nice-looking table. We already loaded this package earlier in the example.

kable(ohioDF) # make a "kable" out of our ohioDF!
state  bmi        income     count
Ohio   not_obese  under_50k  2429052
Ohio   not_obese  over_50k   2350285
Ohio   obese      under_50k  1363563
Ohio   obese      over_50k   1228532


This looks pretty plain, so let’s make it more presentable by applying a “theme” to it. We’ll use the Twitter Bootstrap theme, a very popular HTML/CSS styling framework (HTML and CSS are the languages used to create webpages).

ohioDF %>% # take our ohioDF, then:
  kable() %>% # make a kable with it, then:
  
  # apply kable styling, which by default will apply the twitter bootstrap theme.
  # We've also included a vector as an argument: c("striped", "hover"). This is not necessary, but it will add
  # stripes to our rows and add a slight tint to single rows when we hover on them.
  # Without these, we would just write "kable_styling()" instead.
  kable_styling(c("striped", "hover")) 
state  bmi        income     count
Ohio   not_obese  under_50k  2429052
Ohio   not_obese  over_50k   2350285
Ohio   obese      under_50k  1363563
Ohio   obese      over_50k   1228532

What is the Chi-Squared Independence Test?

For any hypothesis test, we need a null hypothesis and an alternative hypothesis:

\(H_0\): the two variables are independent

\(H_A\): the two variables are not independent (i.e., they are dependent)

To test this hypothesis, we will perform a chi-squared (\(\chi^2\)) test of independence on the data. This will give us a p-value that we can then interpret. I won’t go into more detail than this; the first week of this course has a video on the chi-squared test and how to properly interpret it.
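For reference, the test statistic compares each observed cell count \(O_{ij}\) with the count we would expect if the two variables were truly independent, \(E_{ij}\):

\[\chi^2 = \sum_{i}\sum_{j}\frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{\text{grand total}}\]

The larger \(\chi^2\) is, the further the observed counts are from what independence predicts, and the smaller the p-value.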

Performing the Actual Statistical Analysis

First, we want to put our data into a 2x2 contingency table (xtabs will initially include an empty “missing” column, which we’ll drop). Then we’ll run our chi-squared test.

obeseTable = xtabs(count ~ bmi + income, ohioDF) # creates a contingency table with BMI and Income as row/column,
# using count as the value for each pairing.
obeseTable = obeseTable[, -3] # remove the empty "missing" column (the third column), which persists because the factor level still exists
addmargins(obeseTable) # adds margins when printing the table
##            income
## bmi         under_50k over_50k     Sum
##   not_obese   2429052  2350285 4779337
##   obese       1363563  1228532 2592095
##   Sum         3792615  3578817 7371432
chisq <- chisq.test(obeseTable) # run chi-squared independence test on our data!
print(chisq)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  obeseTable
## X-squared = 2133.2, df = 1, p-value < 2.2e-16

We get a p-value of less than 2.2e-16, which is considerably lower than the 0.05 significance level, so we reject the null hypothesis that “obesity status” is independent of income level in the state of Ohio. It is important to note that a chi-squared test cannot indicate the directionality of a relationship; for example, it is impossible for us to tell whether obesity “increases” as income decreases, or income decreases as obesity “increases.” This is true for any chi-squared test.

To get an effect size and directionality, we can run this:

#just for fun, we'll also make a prop table (this will show us proportions, rather than counts):
addmargins(round(100*prop.table(obeseTable, 2), 1), 1) %>% # calculate percentages *within* columns
  # (prop.table's second argument, 2, means column-wise), rounded to one decimal place.
  # We also add a row of column totals to check that each column sums to 100.
  # This gives us both effect size and directionality for our sample.
  kable() %>% # make a kable with it, then:
  kable_styling(c("striped", "hover")) # add kable styling
            under_50k  over_50k
not_obese        64.0      65.7
obese            36.0      34.3
Sum             100.0     100.0

The p-value is statistically significant in part because we have a huge sample size; huge samples will find statistical significance even when there is little clinical significance, so we should always check the effect size. In our sample, 36.0% of people with income under $50k were obese vs. 34.3% of people with income of $50k or more. Is this a big enough difference to worry about epidemiologically? Decide for yourself, but some disciplines would argue that even the tiniest excess risk is too much. In this case, they would probably say that the difference isn’t even that tiny, and that it is an epidemiologically important finding.
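If you would like a single standardized effect-size number, one common choice for contingency tables is Cramér’s V. Here is a sketch; note that V is conventionally computed from the uncorrected chi-squared statistic, so we re-run the test with correct = FALSE:

# Sketch: Cramér's V, a standardized effect size for contingency tables.
# Values near 0 indicate a weak association; values near 1, a strong one.
chisq_uncorrected <- chisq.test(obeseTable, correct = FALSE)
n <- sum(obeseTable)      # total number of observations
k <- min(dim(obeseTable)) # smaller dimension of the table (2 here)
sqrt(unname(chisq_uncorrected$statistic) / (n * (k - 1)))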

That’s it! If you have any questions, feel free to post them in the discussion board on Canvas for this example.