Let's say that I want to generate a large data frame from scratch.
I would generally create data frames with the data.frame function; however, building one like the following is extremely error prone and inefficient.
Is there a more efficient way of creating the following data frame?
df <- data.frame(GOOGLE_CAMPAIGN=c(rep("Google - Medicare - US", 928), rep("MedicareBranded", 2983),
rep("Medigap", 805), rep("Medigap Branded", 1914),
rep("Medicare Typos", 1353), rep("Medigap Typos", 635),
rep("Phone - MedicareGeneral", 585),
rep("Phone - MedicareBranded", 2967),
rep("Phone-Medigap", 812),
rep("Auto Broad Match", 27),
rep("Auto Exact Match", 80),
rep("Auto Exact Match", 875)),
GOOGLE_AD_GROUP=c(rep("Medicare", 928), rep("MedicareBranded", 2983),
rep("Medigap", 805), rep("Medigap Branded", 1914),
rep("Medicare Typos", 1353), rep("Medigap Typos", 635),
rep("Phone ads 1-Medicare Terms",585),
rep("Ad Group #1", 2967), rep("Medigap-phone", 812),
rep("Auto Insurance", 27),
rep("Auto General", 80),
rep("Auto Brand", 875)))
Yikes, that is some 'bad' code. How can I generate this 'large' data frame in a more efficient manner?
If your only source for that information is a piece of paper, then you probably won't get much better than that, but you can at least consolidate all that into a single rep call for each column:
# I'm going to cheat and not type out all those strings by hand
x <- unique(df[, 1])  # the 11 unique campaign names, in order of first appearance
y <- unique(df[, 2])  # the 12 unique ad group names
# Vectors of the number of times each value repeats; the last campaign,
# "Auto Exact Match", spans two ad groups, so its count is 80 + 875 = 955
x1 <- c(928, 2983, 805, 1914, 1353, 635, 585, 2967, 812, 27, 955)
y1 <- c(x1[-11], 80, 875)  # same counts, with the 955 split back into 80 and 875
dd <- data.frame(GOOGLE_CAMPAIGN = rep(x, times = x1),
                 GOOGLE_AD_GROUP = rep(y, times = y1))
which should be the same:
> all.equal(dd,df)
[1] TRUE
But if this information is already in a data structure in R somehow and you just need to transform it, that could possibly be even easier, but we'd need to know what that structure is.
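For instance, here is a minimal sketch of that case, assuming the counts already sit in a small summary table (counts is a hypothetical data frame with one row per campaign/ad-group pair and an n column of repeat counts, not something from the question):

# Hypothetical summary table; only the first two pairs are shown
counts <- data.frame(
  GOOGLE_CAMPAIGN = c("Google - Medicare - US", "MedicareBranded"),
  GOOGLE_AD_GROUP = c("Medicare", "MedicareBranded"),
  n = c(928, 2983)
)
# Repeat each row's index n times, then keep only the two label columns
big <- counts[rep(seq_len(nrow(counts)), counts$n),
              c("GOOGLE_CAMPAIGN", "GOOGLE_AD_GROUP")]
rownames(big) <- NULL
nrow(big)  # 3911 = 928 + 2983

In that situation the whole data frame is just an index-expansion of the summary table, with no hand-typed rep calls at all.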
Another approach: (1) create this data frame of the unique row combinations:
> dfu <- unique(df)
> rownames(dfu) <- NULL
> dfu
           GOOGLE_CAMPAIGN            GOOGLE_AD_GROUP
 1  Google - Medicare - US                   Medicare
 2         MedicareBranded            MedicareBranded
 3                 Medigap                    Medigap
 4         Medigap Branded            Medigap Branded
 5          Medicare Typos             Medicare Typos
 6           Medigap Typos              Medigap Typos
 7 Phone - MedicareGeneral Phone ads 1-Medicare Terms
 8 Phone - MedicareBranded                Ad Group #1
 9           Phone-Medigap              Medigap-phone
10        Auto Broad Match             Auto Insurance
11        Auto Exact Match               Auto General
12        Auto Exact Match                 Auto Brand
and (2) this vector of run lengths (interaction() collapses each pair of column values into a single factor level, and rle() then counts how many consecutive rows share each pair):
> lens <- rle(as.numeric(interaction(df[[1]], df[[2]])))$lengths
> lens
[1] 928 2983 805 1914 1353 635 585 2967 812 27 80 875
From these two inputs (dfu and lens) we can reconstruct df (here called df2):
> df2 <- dfu[rep(seq_along(lens), lens), ]
> rownames(df2) <- NULL
> identical(df, df2)
[1] TRUE
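If this expand-by-run-lengths step is needed more than once, the two lines above can be wrapped in a small helper; a sketch (the name expand_runs is mine, not part of the original answer):

expand_runs <- function(unique_rows, lengths) {
  # Repeat the i-th unique row lengths[i] times, preserving column types
  out <- unique_rows[rep(seq_along(lengths), lengths), , drop = FALSE]
  rownames(out) <- NULL
  out
}

df2 <- expand_runs(dfu, lens)
identical(df, df2)  # should again be TRUE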