Meet data

In this exercise, we build on the tools developed in the previous exercise to work with actual data related to a question of substantive interest to us: does democracy influence health?

Your first step is to create a new markdown file. Follow the same steps you did in the previous lab. Once the file is created, save it, and note the folder in which it is saved. Your next step is to get hold of a data set and put it in the right place. You can find the data for the lab at this URL: https://www.dropbox.com/s/eclwj2137icxpu7/lab2Data.csv?raw=1. After downloading, save the data in the same folder where you saved your markdown file. Missing this critical step will make completing the exercise impossible.

With the data downloaded and in the proper place, we can now bring the data set into R for analysis. There are many different ways to read data into R, and covering them all is beyond the scope of our exercises. We will focus on one simple method, which relies on the read.csv() function. read.csv() takes as its argument a specially formatted file, and generates as its value a data frame that can be used for analysis. A key step here is that we want to capture the output of read.csv() so it can be used. We do that via assignment (do you remember assignment from the previous exercise?). Here’s what the code looks like:

dat <- read.csv('https://www.dropbox.com/s/eclwj2137icxpu7/lab2Data.csv?raw=1')
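The same function also reads a file saved on your own computer. Here is a minimal sketch of that pattern. To keep it runnable on its own, it first writes a tiny made-up CSV to a temporary file; the names `toy`, `csv_path`, and `toy_dat` and all of the values are purely illustrative and are not taken from the lab data.

```r
# Hypothetical toy data, written to a temporary CSV so the example is self-contained
toy <- data.frame(country = c("A", "B"), year = c(1980, 1981), score = c(3, -7))
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)

# read.csv() takes a file path (or URL) and returns a data frame;
# assignment with <- captures that data frame so we can use it later
toy_dat <- read.csv(csv_path)

class(toy_dat)  # confirms that the result is a data frame
```

With the real file saved in the same folder as your markdown file, the analogous call would be read.csv('lab2Data.csv'), assuming you kept that file name when you saved it.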

Do you remember how to insert a code snippet? If not, go back to the previous lab to refresh your memory.

Some useful functions to explore a data set

A few functions are especially useful for exploring a data set. The first of these is head(), which shows the first few lines of the data set. Most of the time, it’s preferable to look at the result of head() at the console, rather than in your markdown file. Do you remember where the console is located in RStudio? Run head(dat) at the console now.

  1. What is the value of the female.deaths variable in Albania in 1980?

Other useful functions are nrow() (gives the number of rows in the data set), ncol() (gives the number of columns), summary() (produces summary statistics for all of the variables), and names() (lists the names of the variables). Run all four of these commands in a code snippet.

  1. How many rows are in our data set? How many columns? What is the mean of the female.deaths variable?
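If you want to see what these functions return before trying them on the lab data, here is a small sketch on a made-up data frame. The object `toy` and its columns are hypothetical and exist only for illustration.

```r
# Hypothetical toy data frame standing in for a real data set
toy <- data.frame(
  country = c("A", "A", "B", "B"),
  year    = c(1980, 1981, 1980, 1981),
  score   = c(2, 4, 6, 8)
)

head(toy, 2)     # first two rows (head() shows six rows by default)
nrow(toy)        # number of rows
ncol(toy)        # number of columns
names(toy)       # column names
summary(toy)     # summary statistics for every column
mean(toy$score)  # mean of a single column
```

Note that summary() reports the mean of each numeric column, so you can read a variable's mean from its output without calling mean() separately.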

Understanding your data

There are two core questions you must answer in order to have a basic grasp of your data. They are:

  • What does each line of the data set represent?
  • What does each column of the data set represent?

Understanding the first requires some logical thinking and some exploration.

  1. What does each line of our current data set represent?

Answering the second question usually depends on having a codebook. The codebook for the Polity data is linked on my website. Use the codebook to answer the following question:

  1. What concepts are measured by each of the following columns of the data set: {“fragment”, “democ”, “autoc”, “polity”, “polity2”, “durable”, “xrreg”, “xrcomp”, “xropen”, “xconst”, “parreg”, “parcomp”, “exrec”, “exconst”, “polcomp”}?

Occasionally, a codebook may be non-existent or incomplete. In our case, the health variable we are using is taken from the Global Burden of Disease data, but I haven’t provided you with a codebook. I’ll simply tell you the concept captured by that variable here:

  • ‘female.deaths’ estimates the number of female deaths due to all causes per 100,000 people (where? when? See your answer above about what each line of the data set represents.)

Now that you know what the variables represent, we can begin with the analysis.

Analyzing the data

Let’s begin by looking at the data for a particular year. We’ll start with 1980, the earliest year in the data set. To do this, we’ll want to extract those rows of the data set for which the year variable is equal to 1980. We can assign this subset of the data to a new object for use in the future. Here’s the code:

dat1980 <- dat[dat$year == 1980,]

Do you see how we’re combining two pieces of syntax we learned in the previous lab? First, we’re using square brackets to index the rows of the data frame, keeping only those rows for which year is equal to 1980 (leaving the slot after the comma empty keeps every column). As before, we use == to test for equality. Second, we’re using $ to point R to a particular column of a data frame, year in this case.
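It can help to split the subsetting into its two steps. The sketch below does this on a made-up data frame; `toy`, `keep`, and `toy1980` are hypothetical names used only for illustration.

```r
# Hypothetical toy data frame; the values are made up for illustration
toy <- data.frame(
  country = c("A", "A", "B", "B"),
  year    = c(1980, 1981, 1980, 1981),
  score   = c(2, 4, 6, 8)
)

# Step 1: the comparison yields one TRUE/FALSE value per row
keep <- toy$year == 1980
keep

# Step 2: rows where keep is TRUE are retained;
# the empty slot after the comma keeps all columns
toy1980 <- toy[keep, ]
toy1980
```

Writing toy[toy$year == 1980, ] does both steps at once, which is the pattern used for dat1980.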

Now that we have created this subset of the data, let’s look at the main independent and dependent variables in our analysis. A general measure of the level of democracy (is this the independent or the dependent variable?) is given by polity2. Let’s examine that variable using a histogram, which we’ll generate using the hist() function. You can consult the Wikipedia definition of a histogram here: https://en.wikipedia.org/wiki/Histogram. Here’s the code to generate the histogram:

hist(dat1980$polity2)

We may want a nicer looking title and axis label for this plot. Here’s the code for that:

hist(dat1980$polity2, main = 'Democracy in 1980', xlab = 'Polity 2 score')
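hist() chooses the bar boundaries for you, but you can also set them with the breaks argument. The sketch below assumes integer scores running from -10 to 10 and uses made-up values in `toy_scores` (not real Polity data); placing the breaks at the half-integers gives each integer score its own bar.

```r
# Made-up integer scores on a -10 to 10 scale (not real Polity data)
toy_scores <- c(-10, -7, -7, 0, 5, 10, 10, 10)

# Breaks at -10.5, -9.5, ..., 10.5 give each integer score its own bar
h <- hist(toy_scores,
          breaks = seq(-10.5, 10.5, by = 1),
          main = 'Toy scores',
          xlab = 'Score')

h$counts  # number of observations falling in each bar
```

This is optional: the default breaks are fine for a first look, and whether one bar per score is the clearer display is a judgment call.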

Now let’s examine the female.deaths variable. Of course this is one of many possible measures of health we could examine.

  1. Generate a histogram of the rate of female deaths in 1980. Give the plot a nice title and label the x axis appropriately. Your plot should look similar to this:

Now we can put these two variables together to see whether there seems to be an association. A very simple way to do this is to plot the values of each variable against the other. This is achieved using the plot() function. Here’s the basic code:

plot(dat1980$polity2, dat1980$female.deaths)
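As with hist(), the main, xlab, and ylab arguments give the scatterplot a title and axis labels. Here is a minimal sketch on made-up numbers; the data frame `toy` and its values are hypothetical and say nothing about the pattern in the real data.

```r
# Made-up values for illustration only (not taken from the lab data)
toy <- data.frame(
  democracy = c(-9, -5, 0, 4, 8, 10),
  deaths    = c(700, 450, 900, 500, 850, 650)
)

# The first argument goes on the x axis, the second on the y axis
plot(toy$democracy, toy$deaths,
     main = 'Toy example: democracy and deaths',
     xlab = 'Democracy score',
     ylab = 'Deaths per 100,000')
```

By convention, the independent variable goes on the x axis and the dependent variable on the y axis.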