Creating Large Data Frames

Tags: dataframe, r

Let's say that I want to generate a large data frame from scratch.

I would normally create data frames with the data.frame function. However, calls like the following are extremely error-prone and tedious to write.

Is there a more efficient way of creating the following data frame?

df <- data.frame(GOOGLE_CAMPAIGN=c(rep("Google - Medicare - US", 928), rep("MedicareBranded", 2983),
                                   rep("Medigap", 805), rep("Medigap Branded", 1914),
                                   rep("Medicare Typos", 1353), rep("Medigap Typos", 635),
                                   rep("Phone - MedicareGeneral", 585),
                                   rep("Phone - MedicareBranded", 2967),
                                   rep("Phone-Medigap", 812),
                                   rep("Auto Broad Match", 27),
                                   rep("Auto Exact Match", 80),
                                   rep("Auto Exact Match", 875)),                   
                 GOOGLE_AD_GROUP=c(rep("Medicare", 928), rep("MedicareBranded", 2983),
                                   rep("Medigap", 805), rep("Medigap Branded", 1914),
                                   rep("Medicare Typos", 1353), rep("Medigap Typos", 635),
                                   rep("Phone ads 1-Medicare Terms",585),
                                   rep("Ad Group #1", 2967), rep("Medigap-phone", 812),
                                   rep("Auto Insurance", 27),
                                   rep("Auto General", 80),
                                   rep("Auto Brand", 875)))

Yikes, that is some 'bad' code. How can I generate this 'large' data frame in a more efficient manner?

ATMathew, asked Aug 26 '11 22:08


2 Answers

If your only source for that information is a piece of paper, then you probably won't do much better than typing it in. You can at least consolidate everything into a single rep call for each column:

#I'm going to cheat and not type out all those strings by hand
x <- unique(df[,1])
y <- unique(df[,2])

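# Repeat counts for each unique value, in order of first appearance.
# The campaign column has 11 unique values: the two consecutive
# "Auto Exact Match" runs (80 and 875) merge into a single run of 955.
x1 <- c(928,2983,805,1914,1353,635,585,2967,812,27,955)
# The ad group column has 12 unique values, so the last run splits back into 80 and 875.
y1 <- c(x1[-11],80,875)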

dd <- data.frame(GOOGLE_CAMPAIGN = rep(x, times = x1), 
                 GOOGLE_AD_GROUP = rep(y, times = y1))

which should be the same:

> all.equal(dd,df)
[1] TRUE

If this information already lives in some data structure in R and you just need to transform it, the task could be even easier, but we'd need to know what that structure is.
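For reference, here is a minimal sketch of the same one-rep-call-per-column idea that does not rely on an existing df. It assumes the labels and counts are typed in once, in the order given in the question; the names campaigns, ad_groups, counts and dd2 are illustrative only and not part of the original answer.

```r
# One label per run, in the order the runs appear in the question.
# "Auto Exact Match" is listed twice on purpose: it pairs with two ad groups.
campaigns <- c("Google - Medicare - US", "MedicareBranded", "Medigap",
               "Medigap Branded", "Medicare Typos", "Medigap Typos",
               "Phone - MedicareGeneral", "Phone - MedicareBranded",
               "Phone-Medigap", "Auto Broad Match", "Auto Exact Match",
               "Auto Exact Match")
ad_groups <- c("Medicare", "MedicareBranded", "Medigap", "Medigap Branded",
               "Medicare Typos", "Medigap Typos", "Phone ads 1-Medicare Terms",
               "Ad Group #1", "Medigap-phone", "Auto Insurance",
               "Auto General", "Auto Brand")

# Number of rows for each campaign/ad group pair
counts <- c(928, 2983, 805, 1914, 1353, 635, 585, 2967, 812, 27, 80, 875)

# rep() with a vector `times` repeats element i of the labels counts[i] times
dd2 <- data.frame(GOOGLE_CAMPAIGN = rep(campaigns, times = counts),
                  GOOGLE_AD_GROUP = rep(ad_groups, times = counts))
```

Keeping both label vectors at length 12 means a single counts vector serves both columns, so each label and each count is typed exactly once.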

joran, answered Sep 30 '22 12:09


Manually, (1) create this data frame of the unique campaign/ad group pairs:

> dfu <- unique(df)
> rownames(dfu) <- NULL
> dfu
           GOOGLE_CAMPAIGN            GOOGLE_AD_GROUP
1   Google - Medicare - US                   Medicare
2          MedicareBranded            MedicareBranded
3                  Medigap                    Medigap
4          Medigap Branded            Medigap Branded
5           Medicare Typos             Medicare Typos
6            Medigap Typos              Medigap Typos
7  Phone - MedicareGeneral Phone ads 1-Medicare Terms
8  Phone - MedicareBranded                Ad Group #1
9            Phone-Medigap              Medigap-phone
10        Auto Broad Match             Auto Insurance
11        Auto Exact Match               Auto General
12        Auto Exact Match                 Auto Brand

and (2) this vector of lengths:

> lens <- rle(as.numeric(interaction(df[[1]], df[[2]])))$lengths
> lens
 [1]  928 2983  805 1914 1353  635  585 2967  812   27   80  875
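As a small illustration of what rle contributes here, the sketch below runs it on a made-up toy vector (v is hypothetical, not data from the question). rle collapses consecutive repeats into run values and run lengths, so it recovers the counts above only because identical rows of df sit next to each other.

```r
# Toy vector: "a" appears in two separate runs
v <- c("a", "a", "b", "b", "b", "a")

runs <- rle(v)
runs$lengths  # 2 3 1
runs$values   # "a" "b" "a"
```

Note that the trailing "a" is counted as its own run rather than being merged with the first two.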

From these two inputs (dfu and lens) we can reconstruct df (here called df2):

> df2 <- dfu[rep(seq_along(lens), lens), ]
> rownames(df2) <- NULL
> identical(df, df2)
[1] TRUE
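
The reconstruction step above can be wrapped in a small helper. This is only a sketch: expand_rows is a hypothetical name, and the toy lookup table is made up for illustration.

```r
# Hypothetical helper: repeat row i of `lookup` counts[i] times
expand_rows <- function(lookup, counts) {
  stopifnot(nrow(lookup) == length(counts))
  # Index rows by position; drop = FALSE keeps a data frame even with one column
  out <- lookup[rep(seq_len(nrow(lookup)), times = counts), , drop = FALSE]
  rownames(out) <- NULL  # reset row names to 1..n
  out
}

# Toy lookup table of unique pairs and how often each should appear
toy <- data.frame(campaign = c("A", "B"), ad_group = c("a1", "b1"))
expanded <- expand_rows(toy, c(2, 3))
```

With the objects from this answer, expand_rows(dfu, lens) would be expected to give the same result as df2.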
G. Grothendieck, answered Sep 30 '22 13:09