I have a long table with multiple variables (CPI - Workers, CPI - Consumers, (Seas) Unemployment Level (thous) and many more, but am truncating the data set to 3 variables and 6 time periods for brevity. I want to create a new variable, which is a combination of the first two. Let's call it CPI - Average which of course is simply the average of the first two or (CPI - Workers + CPI - Consumers) / 2. This is a simple calculation in a wide table, however, in order to satisfy ggplot, I have stored my data in long form.
Note, I am storing all of my variables in one long table. When I need to visualize the trend, I filter to the desired variable or variables in my ggplot command.
My question is how do I create the new variable without first converting the data to wide format?
First, here is my dataset:
DT_long <- as.data.table(read.table(header=TRUE, text='year period periodName value variable_name date
1994 M01 January 143.8 "CPI - Workers" 1994-01-01
1994 M02 February 144.0 "CPI - Workers" 1994-02-01
1994 M03 March 144.3 "CPI - Workers" 1994-03-01
1994 M04 April 144.5 "CPI - Workers" 1994-04-01
1994 M05 May 144.8 "CPI - Workers" 1994-05-01
1994 M06 June 145.3 "CPI - Workers" 1994-06-01
1994 M01 January 146.3 "CPI - Consumers" 1994-01-01
1994 M02 February 146.7 "CPI - Consumers" 1994-02-01
1994 M03 March 147.1 "CPI - Consumers" 1994-03-01
1994 M04 April 147.2 "CPI - Consumers" 1994-04-01
1994 M05 May 147.5 "CPI - Consumers" 1994-05-01
1994 M06 June 147.9 "CPI - Consumers" 1994-06-01
1994 M01 January 8630 "(Seas) Unemployment Level (thous)" 1994-01-01
1994 M02 February 8583 "(Seas) Unemployment Level (thous)" 1994-02-01
1994 M03 March 8470 "(Seas) Unemployment Level (thous)" 1994-03-01
1994 M04 April 8331 "(Seas) Unemployment Level (thous)" 1994-04-01
1994 M05 May 7915 "(Seas) Unemployment Level (thous)" 1994-05-01
1994 M06 June 7927 "(Seas) Unemployment Level (thous)" 1994-06-01
'))
Second, here is what the output of the calculation should look like:
DT_long <- as.data.table(read.table(header=TRUE, text='year period periodName value variable_name date
1994 M01 January 143.8 "CPI - Workers" 1994-01-01
1994 M02 February 144.0 "CPI - Workers" 1994-02-01
1994 M03 March 144.3 "CPI - Workers" 1994-03-01
1994 M04 April 144.5 "CPI - Workers" 1994-04-01
1994 M05 May 144.8 "CPI - Workers" 1994-05-01
1994 M06 June 145.3 "CPI - Workers" 1994-06-01
1994 M01 January 146.3 "CPI - Consumers" 1994-01-01
1994 M02 February 146.7 "CPI - Consumers" 1994-02-01
1994 M03 March 147.1 "CPI - Consumers" 1994-03-01
1994 M04 April 147.2 "CPI - Consumers" 1994-04-01
1994 M05 May 147.5 "CPI - Consumers" 1994-05-01
1994 M06 June 147.9 "CPI - Consumers" 1994-06-01
1994 M01 January 8630 "(Seas) Unemployment Level (thous)" 1994-01-01
1994 M02 February 8583 "(Seas) Unemployment Level (thous)" 1994-02-01
1994 M03 March 8470 "(Seas) Unemployment Level (thous)" 1994-03-01
1994 M04 April 8331 "(Seas) Unemployment Level (thous)" 1994-04-01
1994 M05 May 7915 "(Seas) Unemployment Level (thous)" 1994-05-01
1994 M06 June 7927 "(Seas) Unemployment Level (thous)" 1994-06-01
1994 M01 January 145.05 "CPI - Average" 1994-01-01
1994 M02 February 145.35 "CPI - Average" 1994-02-01
1994 M03 March 145.70 "CPI - Average" 1994-03-01
1994 M04 April 148.85 "CPI - Average" 1994-04-01
1994 M05 May 146.15 "CPI - Average" 1994-05-01
1994 M06 June 146.60 "CPI - Average" 1994-06-01
'))
The fourth variable (CPI - Average) takes the average of the first two for each date. Please ignore the fact that this average makes no sense economically, I just wanted a simple calculation for the example.
Such calculations are quite straight forward in wide format. So let's first transform the data to wide and then make the calculation.
DT_wide <- DT_long %>% pivot_wider(names_from = variable_name, values_from = value)
DT_wide_with_average <- DT_wide %>% mutate(`CPI - Average` = (`CPI - Workers` + `CPI - Consumers`) / 2)
This takes the wide table and adds a new column with the calculated results:
DT_wide_with_average <- as.data.table(read.table(header=TRUE, text='year period periodName date `CPI - Workers` `CPI - Consumers` `(Seas) Unemployment Level (thous)` `CPI - Average`
1994 M01 January 1994-01-01 144. 146. 8630 145.
1994 M02 February 1994-02-01 144 147. 8583 145.
1994 M03 March 1994-03-01 144. 147. 8470 146.
1994 M04 April 1994-04-01 144. 147. 8331 146.
1994 M05 May 1994-05-01 145. 148. 7915 146.
1994 M06 June 1994-06-01 145. 148. 7927 147.
'))
Please ignore the fact that the decimals have been truncated by pivot_wider.
Working in wide mode, creating variables, analyzing them, revising the calculations, reordering the column orders, deleting unneeded columns is the way we mere humans think when analyzing simple data tables.
Unfortunately, ggplot requires the long format, considered "tidy" by the gods of R, but is quite messy in the eyes of us mere mortals. Sorry for the dig, but if I were to stack my couch, table, chairs, lamp and rug in one corner of the room, it would be quite messy, while if I left them as I normally keep my room, they would be quite tidy. In the real world, I might stack the furniture in one corner in order to paint the room or sand the floors. This would be useful for the task at hand, but it would be considered messy and not useful for ordinary living. So, considering long tables as tidy and wide tables as messy is counterintuitive. It took me a long time to figure out this counterintuitive logic when I was first introduced to tidyverse. Sorry for the rant, but hopefully it is useful customer feedback to the gods of R. At a minimum, it would be helpful to R learners if the gods would admit to the counterintuitive nomenclature. If I am warned, before entering the bathroom, that the faucet handle with an "H" is the cold water and the one with a "C" is the hot water, I am less likely to scald my hands!
Data analysis is iterative. I don't want to have to take the following steps for each iteration:
I would rather:
In short, I want to focus on my economic analysis, rather than on unnecessary R programming.
So, how can I select a subset of variables from my long format table, use them in a calculation to create a new variable and ensure that the new variable is rbind-ed to the end of my long table...without having to convert to wide format?
Thanks for your help!
How about this?
bind_rows(
DT_long,
DT_long %>%
filter(variable_name %>% str_detect("CPI")) %>%
group_by(year, period, periodName, date) %>%
summarize(value = mean(value)) %>%
mutate(variable_name = "CPI - Average")
)
In this case the math can be accomplished by a mean across the group, but this assumes that both Workers / Consumer CPI are present, and each only once within each group, and you want them evenly weighted. It could get more convoluted, and in many cases you're absolutely right that many calculations that involve relationships between variables are much more straightforward in wide format.
(Especially in a case like this where it's a gray area about whether these different data points are truly different observations or different dimensions of the same "economic snapshot" observation, so arguably your wide version is already "tidy.")
Jon Spring's answer works perfectly for the case that I originally described, but what is really needed is a more generalized approach to allow arbitrary calculations. As mere mortals, we conceptualize data in rows and columns, so the trick is to take our wide table concept of a calculation and translate into a long table implementation.
Based on his solution, we can generalize this to the case of arbitrary calculations across columns, but within a row (think of the wide table framework or a spreadsheet formula that only refers to cells in the same row). The calculations are usually mathematical, but could be string manipulation.
First we need to dissect the calculation. Let's take an example of an arbitrary calculation that requires specific reference to each variable, unlike the example I gave previously that Jon responded to. ((VarA * 6) / VarB) / (VarB) / (VarA * 6)), which of course by definition is always 1 unless either VarA or VarB is 0. If we get all 1's with our test data, then we know our solution works since there are no 0's.
Second, we select our variables. In our test data we will use CPI - Consumers and CPI - Workers and not (Seas) Unemployment Level (thous). We do this through Jon's filter command or DT_long[variable_name %in% c("CPI - Workers", "CPI - Consumers") in data.table parlance. Note, I use a list to ensure unique selection of variables.
Third, we need to ensure that the calculation is restricted to the row (thinking in the wide table format). That is the group_by command which restricts the calculation to the date. That would be a unique row in the wide table.
Fourth we need a way to distinguish among the chosen variables. In the original example that was not necessary, but in the generalized case (and our new calculation) it is. This can be accomplished through keyby = .(variable_name) in data.table parlance, which puts the variables in alphabetical order. So now we can refer to CPI Consumers as value[1] and CPI Workers as value[2] because in long table our (wide table) columns of data became rows and by restricting our calculations to unique dates, we know that for each calculation there will only be two values, ordered by their respective variable_name. So our calculation becomes summarize( value = ((value[1] * 6) / value[2]) * (value[2] / (value[1] * 6)) ).
Fifth, we give our newly calculated value a variable name with Jon's mutate command.
Sixth, we append the new data to our long table with the bind_rows command.
Putting this all together, we have:
bind_rows(
DT_long,
DT_long[variable_name %in% c("CPI - Workers", "CPI - Consumers"), .SD, keyby = .(variable_name)] %>%
group_by(year, period, periodName, date) %>%
summarize( value = ((value[1] * 6) / value[2]) * (value[2] / (value[1] * 6)) ) %>%
mutate(variable_name = "CPI - Average3")
)
This works perfectly with all 1's.
Now we have generalized steps to create arbitrary calculations across what would be the same row in wide table, but implemented on a long table.
Economists often think about change across time. How much have prices increased each year? Is the rate of inflation increasing over the years or diminishing? We cannot see that from the CPI (Consumer Price Index), but can calculate it. Thinking wide table, this problem is not a calculation across columns (variables) within the same row (within the same time period). It is a calculation on a single variable across time or single column across rows.
Here is an attempt:
bind_rows(
DT_long,
DT_long[variable_name %in% c("CPI - Workers"),] %>%
summarize( for(i in 1:6) {value = (((value[i+1] - value[i]) / value[i]) * 100)}) %>%
mutate(variable_name = "CPI_growth")
)
Alas this fails.
But here is a solution for a common calculation for economists involving one variable, with calculation over time. This is the Year over Year growth calculation or more generally period over period growth calculation. The CPI is an index of prices, which starts with a particular base year as 100. (Actually the base is 1982 to 1984, see the link below.) If during the next year prices increase 10 percent, the index for that year is 110. If it continues growing at 10 percent, the index goes to 121 in the second year. Looking at this number we immediately know that prices have grown 21 percent since the base year when the index was 100. But how much prices grew within this second year is not intuitive. What we need is to calculate the growth rate of prices for each year. If CPI were reported yearly, this would be ((CPIt - CPIt-1) / CPIt-1) * 100, but of course it is reported monthly, so the t-1 becomes t-12. however, Sometimes we want monthly inflation rates, so we would use t-1.
Gross Domestic Product (GDP) is reported quarterly, so for yearly growth we want to calculate the growth over the last 4 quarters, ((GDPt - GDPt-4) / GDPt-4) * 100.
How do we make this calculation, easily adjusting for periodicity when our data is stored in a long table?
We start with a growth rate function. Note that the growth.rate in the tis package is not very flexible and forces the calculation to always be yearly. Note that this solution assumes that you have your data in ascending date order.
gr.rate <- function(x, l=1){
(x - lag(x, l)) / lag(x, l) * 100
}
x is the column of figures that we want to calculate growth on and l is the number of lags, i.e., 12 to to from monthly data to year over year growth.
Now we need to apply this to our example long data table DT_long. We do this with the following function.
gr.rate.long <- function(x, var_title, var_name, val_title, new_var_name, lag_periods){
temp <-x
names(temp)[grep(val_title, colnames(x))] <- "value"
names(temp)[grep(var_title, colnames(x))] <- "variable_name"
temp <- temp[variable_name == var_name]
temp$value <- gr.rate(temp[, .(value)], lag_periods)
temp$variable_name <- new_var_name
names(temp)[grep("value", colnames(x))] <- val_title
names(temp)[grep("variable_name", colnames(x))] <- var_title
return(bind_rows(x,temp))
}
Next we call it with the following arguments:
Note that in our long table example, the column of variable names is called "variable_name" and the column of values is called "value," however, your long table may have other names for those columns. Specify those names with the respective arguments and the function will find and use those columns.
So, using our test long table, called "DT_long" we can calculate the monthly rate of inflation with the following call to this function:
gr.rate.long(DT_long, "variable_name", "CPI - Workers", "value", "CPI-W-growth rate", 1)
There are other reasons to calculate across time for a single variable. For example, if we knew the price of eggs over time in dollars and cents and wanted to convert these into an index like the CPI, we could call it the EPI. Or perhaps we want to change the base year of the CPI from the current 1982 to 1984* period to 2020.
To adjust the function for these, we would need to swap the temp$value <- gr.rate(temp[, .(value)], lag_periods) line to the appropriate calculation. That might take some experimentation. Better yet, if we could parameterize that line, it would be even better.
I tend to get a bit wordy, but I like to contextualize coding in real world scenarios. I hope this discussion has been useful to others. Please leave a comment if you found it useful.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With