Transform row data into column by certain row name in R

Hey, I'm pretty new to R and only familiar with a few functions. I have raw data of around 2,000,000 rows.

The raw data looks like this: each item has up to four kinds of tariff (AHS, BND, MFN, PRF). Some items have PRF and some don't. The goal is to turn each item's tariffs into columns, one per tariff type.

AHS      3.00 
BND      3.80
MFN      4.00
PRF      2.00
AHS      4.00
BND      3.80
MFN      4.00

How can I transform the raw data into something like this:

AHS   BND   MFN   PRF
3.00  3.80  4.00  2.00
4.00  3.80  4.00  NA

I tried rbind, but for items that don't have PRF, R fills PRF with the AHS value.

Can anyone tell me how to do this transformation? Thanks a lot!

StatCC asked Oct 03 '14 23:10


2 Answers

You can use ave in base R or a comparable approach in a package to create the "id" variable. Since some "PRF" values are missing, you probably also need to use cummax during the id creation stage.
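To see why cummax matters, here is a minimal sketch using a hypothetical vector `types` (not the question's data) in which the first item lacks "PRF" and the second has it. Counting occurrences per tariff type alone would attach that "PRF" to item 1; taking the running maximum moves it to item 2. The sketch assumes that each item's rows are contiguous, that every item starts with a tariff type present in all items, and that no type repeats within an item.

```r
# Hypothetical input: item 1 has no PRF, item 2 does.
types <- c("AHS", "BND", "MFN", "AHS", "BND", "MFN", "PRF")

# Occurrence counter within each tariff type.
naive_id <- ave(seq_along(types), types, FUN = seq_along)

# Running maximum so a late-appearing type stays with the current item.
id <- cummax(naive_id)

print(naive_id)  # the PRF row gets 1 here, which is the wrong item
print(id)        # the PRF row gets 2 after cummax
```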

Here are some alternatives, all using @G.Grothendieck's sample data. My vote would go for the "data.table" approach.

DF <- data.frame(
  V1 = c("AHS", "BND", "MFN", "PRF", "AHS", "BND", "MFN"), 
  V2 = c(3, 3.8, 4, 2, 4, 3.8, 4), 
  stringsAsFactors = FALSE)

Base R: reshape

Notorious for its syntax... and probably not recommended for working with 2M rows....

reshape(within(DF, {
  id <- cummax(ave(V1, V1, FUN = seq_along))
}), direction = "wide", idvar = "id", timevar = "V1")

Base R: xtabs

Easier to remember syntax, but less flexible. Also, returns a matrix, so you'll need to use as.data.frame.matrix if you want to get a data.frame. Fills missing values with "0", which may not be desirable.

xtabs(V2 ~ id + V1, within(DF, {
  id <- cummax(ave(V1, V1, FUN = seq_along))
}))
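As a sketch of the conversion mentioned above, the xtabs result can be passed to as.data.frame.matrix to get a data.frame. The names `tab` and `wide` are only illustrative, and the id is built from row positions rather than from V1 itself so that it stays an integer. Note that the missing "PRF" for the second item shows up as 0, not NA, so this is only safe if a genuine tariff of 0 cannot occur in the data.

```r
DF <- data.frame(
  V1 = c("AHS", "BND", "MFN", "PRF", "AHS", "BND", "MFN"),
  V2 = c(3, 3.8, 4, 2, 4, 3.8, 4),
  stringsAsFactors = FALSE)

# Item id: occurrence counter per tariff type, then running maximum.
DF$id <- cummax(ave(seq_along(DF$V1), DF$V1, FUN = seq_along))

# Cross-tabulate the tariff values by item and tariff type.
tab <- xtabs(V2 ~ id + V1, data = DF)

# Convert the table to a data.frame with one column per tariff type.
wide <- as.data.frame.matrix(tab)
print(wide)
```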

"data.table"

Fast. Predictable behavior from dcast.data.table following behavior long established by dcast from "reshape2".

library(data.table)
dcast.data.table(
  as.data.table(DF)[, id := sequence(.N), by = V1][, id := cummax(id)], 
                 id ~ V1, value.var = "V2")
#    id AHS BND MFN PRF
# 1:  1   3 3.8   4   2
# 2:  2   4 3.8   4  NA

A5C1D2H2I1M1N2O1R2T1 answered Oct 11 '22 10:10


Create a grp variable which is 1 for the first group, 2 for the second, etc. (this assumes every item starts with an "AHS" row). Then use tapply:

grp <- cumsum(DF$V1 == "AHS")
tapply(DF$V2, list(grp, DF$V1), sum)

giving:

  AHS BND MFN PRF
1   3 3.8   4   2
2   4 3.8   4  NA
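tapply returns a matrix here, so if a data.frame is needed, something along these lines should work. This is only a sketch; `wide_matrix` and `wide` are names made up for the example. Unlike xtabs, the empty "PRF" cell stays NA.

```r
DF <- data.frame(V1 = c("AHS", "BND", "MFN", "PRF", "AHS", "BND", "MFN"),
                 V2 = c(3, 3.8, 4, 2, 4, 3.8, 4), stringsAsFactors = FALSE)

# A new group starts at every "AHS" row.
grp <- cumsum(DF$V1 == "AHS")

# One row per group, one column per tariff type; empty cells become NA.
wide_matrix <- tapply(DF$V2, list(grp, DF$V1), sum)

# Convert the matrix to a data.frame, keeping the column names.
wide <- as.data.frame(wide_matrix)
print(wide)
```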

We used this as the data:

DF <- data.frame(V1 = c("AHS", "BND", "MFN", "PRF", "AHS", "BND", "MFN"), 
                 V2 = c(3, 3.8, 4, 2, 4, 3.8, 4), stringsAsFactors = FALSE)

G. Grothendieck answered Oct 11 '22 10:10