Let's say that I want to generate a large data frame from scratch.
I would generally create data frames with the data.frame function; however, building one like the following is extremely error prone and inefficient.
Is there a more efficient way of creating the following data frame?
df <- data.frame(GOOGLE_CAMPAIGN=c(rep("Google - Medicare - US", 928), rep("MedicareBranded", 2983),
rep("Medigap", 805), rep("Medigap Branded", 1914),
rep("Medicare Typos", 1353), rep("Medigap Typos", 635),
rep("Phone - MedicareGeneral", 585),
rep("Phone - MedicareBranded", 2967),
rep("Phone-Medigap", 812),
rep("Auto Broad Match", 27),
rep("Auto Exact Match", 80),
rep("Auto Exact Match", 875)),
GOOGLE_AD_GROUP=c(rep("Medicare", 928), rep("MedicareBranded", 2983),
rep("Medigap", 805), rep("Medigap Branded", 1914),
rep("Medicare Typos", 1353), rep("Medigap Typos", 635),
rep("Phone ads 1-Medicare Terms",585),
rep("Ad Group #1", 2967), rep("Medigap-phone", 812),
rep("Auto Insurance", 27),
rep("Auto General", 80),
rep("Auto Brand", 875)))
Yikes, that is some 'bad' code. How can I generate this 'large' data frame in a more efficient manner?
If your only source for that information is a piece of paper, then you probably won't get much better than that, but you can at least consolidate all that into a single rep call for each column:
# I'm going to cheat and not type out all those strings by hand
x <- unique(df[, 1])  # the 11 unique campaign names, in order of first appearance
y <- unique(df[, 2])  # the 12 unique ad group names
# Vectors of the number of times each value repeats; the last campaign,
# "Auto Exact Match", spans two ad groups, so its count is 80 + 875 = 955
x1 <- c(928, 2983, 805, 1914, 1353, 635, 585, 2967, 812, 27, 955)
y1 <- c(x1[-11], 80, 875)  # same counts, with the 955 split back into 80 and 875
dd <- data.frame(GOOGLE_CAMPAIGN = rep(x, times = x1),
                 GOOGLE_AD_GROUP = rep(y, times = y1))
which should be the same:
> all.equal(dd,df)
[1] TRUE
But if this information is already in a data structure in R somehow and you just need to transform it, that could possibly be even easier, but we'd need to know what that structure is.
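For instance, here is a minimal sketch of that case, assuming the counts already sit in a small summary table (counts is a hypothetical data frame with one row per campaign/ad-group pair and an n column of repeat counts, not something from the question):

# Hypothetical summary table; only the first two pairs are shown
counts <- data.frame(
  GOOGLE_CAMPAIGN = c("Google - Medicare - US", "MedicareBranded"),
  GOOGLE_AD_GROUP = c("Medicare", "MedicareBranded"),
  n = c(928, 2983)
)
# Repeat each row's index n times, then keep only the two label columns
big <- counts[rep(seq_len(nrow(counts)), counts$n),
              c("GOOGLE_CAMPAIGN", "GOOGLE_AD_GROUP")]
rownames(big) <- NULL
nrow(big)  # 3911 = 928 + 2983

In that situation the whole data frame is just an index-expansion of the summary table, with no hand-typed rep calls at all.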
Another approach: (1) create this data frame of the unique row combinations:
> dfu <- unique(df)
> rownames(dfu) <- NULL
> dfu
           GOOGLE_CAMPAIGN            GOOGLE_AD_GROUP
 1  Google - Medicare - US                   Medicare
 2         MedicareBranded            MedicareBranded
 3                 Medigap                    Medigap
 4         Medigap Branded            Medigap Branded
 5          Medicare Typos             Medicare Typos
 6           Medigap Typos              Medigap Typos
 7 Phone - MedicareGeneral Phone ads 1-Medicare Terms
 8 Phone - MedicareBranded                Ad Group #1
 9           Phone-Medigap              Medigap-phone
10        Auto Broad Match             Auto Insurance
11        Auto Exact Match               Auto General
12        Auto Exact Match                 Auto Brand
and (2) this vector of run lengths (interaction() collapses each pair of column values into a single factor level, and rle() then counts how many consecutive rows share each pair):
> lens <- rle(as.numeric(interaction(df[[1]], df[[2]])))$lengths
> lens
[1] 928 2983 805 1914 1353 635 585 2967 812 27 80 875
From these two inputs (dfu and lens) we can reconstruct df (here called df2):
> df2 <- dfu[rep(seq_along(lens), lens), ]
> rownames(df2) <- NULL
> identical(df, df2)
[1] TRUE
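If this expand-by-run-lengths step is needed more than once, the two lines above can be wrapped in a small helper; a sketch (the name expand_runs is mine, not part of the original answer):

expand_runs <- function(unique_rows, lengths) {
  # Repeat the i-th unique row lengths[i] times, preserving column types
  out <- unique_rows[rep(seq_along(lengths), lengths), , drop = FALSE]
  rownames(out) <- NULL
  out
}

df2 <- expand_runs(dfu, lens)
identical(df, df2)  # should again be TRUE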