Hey, so I'm pretty new to R and only familiar with some functions. I have raw data of around 2,000,000 rows.
The raw data looks like this: each item has up to four kinds of tariff (AHS, BND, MFN, PRF). Some items have PRF and some don't. The goal is to turn each item's tariffs into one row, with a column for each type of tariff.
AHS 3.00
BND 3.80
MFN 4.00
PRF 2.00
AHS 4.00
BND 3.80
MFN 4.00
How can I transform the raw data into something like this:
AHS BND MFN PRF
3.00 3.80 4.00 2.00
4.00 3.80 4.00 NA
I tried rbind, but for items that don't have PRF, R recycles the AHS value into the PRF column.
Can anyone tell me how to do this transformation? Thanks a lot!
You can use ave in base R or a comparable approach in a package to create the "id" variable. Since some "PRF" values are missing, you probably also need to use cummax during the id creation stage.
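To see why cummax matters, here is a minimal sketch of the id step on its own. The vector below is made-up illustration data (not from the question) in which the first item has no PRF, and I pass seq_along(V1) to ave so the counter stays an integer:

```r
# Hypothetical input: the FIRST item lacks PRF, the second one has it
V1 <- c("AHS", "BND", "MFN", "AHS", "BND", "MFN", "PRF")

# Count how many times each tariff type has been seen so far
occurrence <- ave(seq_along(V1), V1, FUN = seq_along)
occurrence
# [1] 1 1 1 2 2 2 1   <- the only PRF would wrongly land in item 1

# cummax stops the counter from ever going backwards
id <- cummax(occurrence)
id
# [1] 1 1 1 2 2 2 2   <- PRF now belongs to item 2
```

This relies on the rows of each item being contiguous and in order, which is how the question describes the data.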
Here are some alternatives, all using @G.Grothendieck's sample data. My vote would go for the "data.table" approach.
DF <- data.frame(
V1 = c("AHS", "BND", "MFN", "PRF", "AHS", "BND", "MFN"),
V2 = c(3, 3.8, 4, 2, 4, 3.8, 4),
stringsAsFactors = FALSE)
reshape
Notorious for its syntax... and probably not recommended for working with 2M rows....
reshape(within(DF, {
id <- cummax(ave(V1, V1, FUN = seq_along))
}), direction = "wide", idvar = "id", timevar = "V1")
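reshape names the new columns V2.AHS, V2.BND, and so on. If you want the plain tariff names from the question, something along these lines should work. Treat it as a sketch: the name wide and the sub() cleanup are my own additions, and the id is built from seq_along(V1) so that it is stored as an integer.

```r
DF <- data.frame(
  V1 = c("AHS", "BND", "MFN", "PRF", "AHS", "BND", "MFN"),
  V2 = c(3, 3.8, 4, 2, 4, 3.8, 4),
  stringsAsFactors = FALSE)

# Same id logic as above, kept as an integer column
DF$id <- cummax(ave(seq_along(DF$V1), DF$V1, FUN = seq_along))

wide <- reshape(DF, direction = "wide", idvar = "id", timevar = "V1")

# Drop the "V2." prefix that reshape adds, and reset the row names
names(wide) <- sub("^V2\\.", "", names(wide))
rownames(wide) <- NULL
wide
```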
xtabs
Easier to remember syntax, but less flexible. Also, returns a matrix, so you'll need to use as.data.frame.matrix if you want to get a data.frame. Fills missing values with "0", which may not be desirable.
xtabs(V2 ~ id + V1, within(DF, {
id <- cummax(ave(V1, V1, FUN = seq_along))
}))
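If the zeros are a problem, one possible workaround is to convert the table and then blank out the cells that had no observation. This is only a sketch: the names tab, wide, and missing are mine, and it assumes a real tariff of exactly 0 should stay 0, which is why the empty cells are found with table() counts rather than by testing for 0.

```r
DF <- data.frame(
  V1 = c("AHS", "BND", "MFN", "PRF", "AHS", "BND", "MFN"),
  V2 = c(3, 3.8, 4, 2, 4, 3.8, 4),
  stringsAsFactors = FALSE)
DF$id <- cummax(ave(seq_along(DF$V1), DF$V1, FUN = seq_along))

tab  <- xtabs(V2 ~ id + V1, DF)
wide <- as.data.frame.matrix(tab)   # PRF for id 2 is 0 at this point

# Cells with zero observations are the ones that should be NA
missing <- unclass(table(DF$id, DF$V1)) == 0
wide[missing] <- NA
wide
```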
data.table
Fast. Predictable behavior from dcast.data.table following behavior long established by dcast from "reshape2".
library(data.table)
dcast.data.table(
as.data.table(DF)[, id := sequence(.N), by = V1][, id := cummax(id)],
id ~ V1, value.var = "V2")
# id AHS BND MFN PRF
# 1: 1 3 3.8 4 2
# 2: 2 4 3.8 4 NA
Create a grp variable which is 1 for the first group, 2 for the second, etc. Then use tapply:
grp <- cumsum(DF$V1 == "AHS")
tapply(DF$V2, list(grp, DF$V1), sum)
giving:
AHS BND MFN PRF
1 3 3.8 4 2
2 4 3.8 4 NA
We used this as the data:
DF <- data.frame(V1 = c("AHS", "BND", "MFN", "PRF", "AHS", "BND", "MFN"),
V2 = c(3, 3.8, 4, 2, 4, 3.8, 4), stringsAsFactors = FALSE)
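For completeness, here is a small sketch of what grp looks like and how the tapply result could be turned into a data.frame. The name wide_df and the as.data.frame step are my additions, and the grouping assumes every item starts with an AHS row, as in the sample data.

```r
DF <- data.frame(V1 = c("AHS", "BND", "MFN", "PRF", "AHS", "BND", "MFN"),
  V2 = c(3, 3.8, 4, 2, 4, 3.8, 4), stringsAsFactors = FALSE)

# A new group starts every time an AHS row appears
grp <- cumsum(DF$V1 == "AHS")
grp
# [1] 1 1 1 1 2 2 2

# tapply returns a matrix; combinations with no data come back as NA
wide <- tapply(DF$V2, list(grp, DF$V1), sum)

# Convert to a data.frame if you need one for further work
wide_df <- as.data.frame(wide)
wide_df
```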