Tidy Data with tidyr

When working with a data set, one of the first things you need to do is organize it into what is commonly referred to as “tidy data”. Data sets often contain errors or need some reshaping before they can be analyzed. For instance, values may be stored as column names, you may need to combine variables, or you may need to separate one variable into two.

In my Data Transformation with dplyr post, I covered how you can transform data with dplyr functions. In this post, I will introduce some functions from tidyr, another part of the tidyverse, that can make tidying data much easier!

Gathering

Sometimes you may need to gather the values from two columns in order to create new variables. Let’s say you were working with a flu data set that had a column for country names and two columns that were the years ‘2002’ and ‘2003’. The values in the latter two columns have the number of cases of the flu for each year across the different countries. Wouldn’t it be more convenient to have one variable for the year, and another variable for the number of flu cases?

This can easily be solved with the gather() function. First, name the ‘2002’ and ‘2003’ columns so that R knows which columns you are trying to gather. Then set key = “year”, the name of the new variable that will hold the old column names, and value = “flu_cases”, the name of the new variable that will hold the values from those columns.

Let’s take a look at what that code might look like:

table %>%
  # Column names that start with a digit are quoted with backticks
  gather(`2002`, `2003`, key = "year", value = "flu_cases")

The above code will create one column for the “year” and another for the number of “flu_cases” recorded in each year.
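To make the gathering step concrete, here is a minimal sketch you can run yourself. The flu_wide table, its country names, and its case counts are made up for illustration and are not from a real data set.

```r
library(tidyr)
library(tibble)

# Hypothetical flu data: one row per country, one column per year
flu_wide <- tibble(
  country = c("A", "B"),
  `2002`  = c(120, 340),
  `2003`  = c(150, 310)
)

# Gather the two year columns into a "year" column and a "flu_cases" column
flu_long <- flu_wide %>%
  gather(`2002`, `2003`, key = "year", value = "flu_cases")

print(flu_long)
```

The result has one row per country and year, so two countries and two years give four rows. If you are on tidyr 1.0.0 or later, note that the package documentation recommends pivot_longer() for new code, although gather() still works.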

Spreading

Spreading is the opposite of gathering. You use it when an observation is scattered across multiple rows and you want each kind of value to become its own variable.

For example, sticking with the flu data, let’s say that the data set has one column named “type” with the values ‘flu_cases’ and ‘population’, and another column called “count” right next to it. Your goal is to instead have one variable for “flu_cases” and another for “population”, with the counts as the values of each variable.

You can spread these observations to their own columns by setting key = type and value = count:

spread(table, key = type, value = count)

The key argument finds all the distinct values of “type” and makes each one a new variable, while the values of each new variable are taken from the corresponding rows of the “count” variable.
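Here is a small runnable sketch of spreading. The flu_long table and all of its numbers are invented for illustration.

```r
library(tidyr)
library(tibble)

# Hypothetical flu data: each country takes up two rows,
# one for flu_cases and one for population
flu_long <- tibble(
  country = c("A", "A", "B", "B"),
  type    = c("flu_cases", "population", "flu_cases", "population"),
  count   = c(120, 50000, 340, 80000)
)

# Spread the values of "type" into their own columns, filled from "count"
flu_wide <- spread(flu_long, key = type, value = count)

print(flu_wide)
```

Each country now sits on a single row with its own “flu_cases” and “population” columns. As with gather(), the tidyr documentation for version 1.0.0 and later recommends pivot_wider() for new code, but spread() still works.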

Separating

What if you had a variable that contained two values in each observation, separated by some character? An example of this could be a rate.

Returning to our imaginary flu data set, let’s say there was a “rate” variable stored as “flu_cases”/“population”, and you wanted each of these values as its own variable. These values can be pulled apart with the separate() function by naming the new variables in the order that the values appear in the “rate” variable and indicating what the separator character is:

table %>%
  separate(rate, into = c("flu_cases", "population"), sep = "/")

The above code will separate the “rate” variable by what’s on either side of the “/” and assign those values into their own variables.
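A quick sketch of separate() on made-up data is below. The flu table and its values are hypothetical, and convert = TRUE is an optional extra that I am adding here so the new columns come back as numbers instead of text.

```r
library(tidyr)
library(tibble)

# Hypothetical flu data with a combined "cases/population" rate column
flu <- tibble(
  country = c("A", "B"),
  rate    = c("120/50000", "340/80000")
)

# Split "rate" at the "/" into two new columns.
# convert = TRUE asks tidyr to turn the resulting strings into numbers.
flu_split <- flu %>%
  separate(rate, into = c("flu_cases", "population"), sep = "/", convert = TRUE)

print(flu_split)
```

Without convert = TRUE, the new “flu_cases” and “population” columns would be character columns, which is easy to miss until you try to do arithmetic on them.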

Uniting

You probably guessed it! The unite() function does the exact opposite of the separate() function. It can combine multiple columns into one single column. You can use the unite() function to rejoin the “flu_cases” and “population” variables that were created above:

table %>%
  unite(rate, flu_cases, population, sep = "/")
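To see unite() in action, here is a minimal sketch on invented data. The flu table and its numbers are hypothetical.

```r
library(tidyr)
library(tibble)

# Hypothetical flu data with separate case and population columns
flu <- tibble(
  country    = c("A", "B"),
  flu_cases  = c(120L, 340L),
  population = c(50000L, 80000L)
)

# Paste the two columns together into a single "rate" column, joined by "/"
flu_rate <- flu %>%
  unite(rate, flu_cases, population, sep = "/")

print(flu_rate)
```

The new “rate” column is a character column, and by default the two original columns are dropped. If you leave out sep, unite() joins the values with an underscore instead.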

Hopefully these tidyr functions give you more clarity and flexibility when trying to tidy your data. Combining these tidyr functions with dplyr functions can save time when organizing your data, allowing you to spend more time on the actual data analysis!