I want to split this DataFrame into a predefined number of chunks of approximately the same size:
import pandas as pd
df = pd.DataFrame({
"user": ["A", "A", "B", "C", "C", "C"],
"value": [0.3, 0.4, 0.5, 0.6, 0.7, 0.8]})
# user value
# 0 A 0.3
# 1 A 0.4
# 2 B 0.5
# 3 C 0.6
# 4 C 0.7
# 5 C 0.8
The DataFrame is large (millions of rows), so the code should be reasonably efficient. The constraint is that each user must be present in only one of the chunks.
E.g. if the number of chunks is 3 then:
# Chunk #1 (DataFrame):
# 0 A 0.3
# 1 A 0.4
# Chunk #2 (DataFrame):
# 2 B 0.5
# Chunk #3 (DataFrame):
# 3 C 0.6
# 4 C 0.7
# 5 C 0.8
The following chunking into 3 pieces would be incorrect, since user C would be present in 2 chunks:
# Chunk #1 (DataFrame):
# 0 A 0.3
# 1 A 0.4
# Chunk #2 (DataFrame):
# 2 B 0.5
# 3 C 0.6
# Chunk #3 (DataFrame):
# 4 C 0.7
# 5 C 0.8
I think a solution that does a groupby on user first and then splits the resulting DataFrameGroupBy object into chunks should suffice.
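That idea can be sketched as follows. `split_by_user` is a hypothetical helper (not from any library) that assigns each user's whole group to the currently smallest chunk, a simple greedy bin-packing heuristic:

```python
import pandas as pd

def split_by_user(df, n_chunks):
    # Hypothetical helper: each user's whole group goes to the currently
    # smallest chunk, so chunk sizes stay roughly balanced.
    sizes = [0] * n_chunks
    parts = [[] for _ in range(n_chunks)]
    # Sort groups largest-first so the greedy packing balances better.
    groups = sorted(df.groupby("user"), key=lambda kv: len(kv[1]), reverse=True)
    for _, group in groups:
        i = sizes.index(min(sizes))
        parts[i].append(group)
        sizes[i] += len(group)
    return [pd.concat(p) for p in parts if p]

df = pd.DataFrame({
    "user": ["A", "A", "B", "C", "C", "C"],
    "value": [0.3, 0.4, 0.5, 0.6, 0.7, 0.8]})
chunks = split_by_user(df, 3)
```

A single groupby pass dominates the cost, so this stays workable for millions of rows as long as the number of distinct users is manageable.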
You could turn your user column into a categorical one and use pd.qcut for uniform-height binning. Unfortunately, qcut fails to find unique bin edges for discontinuous distributions, so you may run into issues if one user is over-represented. You could pass duplicates="drop", but then you won't always get the number of bins you requested, as some will be clumped together.
I guess you will have to write your own rebinning algorithm for that case; I couldn't find anything ready out of the box.
Here's an example with pd.qcut.
Let's build a dummy dataset:
import numpy as np
import pandas as pd

user = np.random.choice(["A", "B", "C", "D", "E", "F", "G", "H"], 10000)
value = np.random.random(size=user.shape)
df = pd.DataFrame({"user": user, "value": value})
print(df.user.value_counts())
E 1329
C 1281
G 1277
F 1260
H 1231
D 1223
A 1205
B 1194
Name: user, dtype: int64
Assign an integer code to each unique user and rebin with qcut:
codes = df.user.astype("category").cat.codes
nbins = 3
df["bin"] = pd.qcut(codes, nbins, labels=False)
Let's check the results:
print(df.bin.value_counts())
1 3788
0 3629
2 2583
Name: bin, dtype: int64
print(df.groupby("user").bin.value_counts())
user bin
A 0 1266
B 0 1158
C 0 1205
D 1 1255
E 1 1246
F 1 1287
G 2 1274
H 2 1309
Name: bin, dtype: int64
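Once each row carries a bin label, the chunks themselves can be materialized by grouping on that label (a small follow-up step, not shown above). Because every row of a given user shares the same integer code, a user can never straddle two bins:

```python
import numpy as np
import pandas as pd

# Rebuild the dummy dataset (seeded here for reproducibility).
rng = np.random.default_rng(0)
user = rng.choice(list("ABCDEFGH"), 10000)
df = pd.DataFrame({"user": user, "value": rng.random(user.shape)})

# Integer code per user, then quantile-based binning into 3 chunks.
codes = df.user.astype("category").cat.codes
df["bin"] = pd.qcut(codes, 3, labels=False)

# One DataFrame per bin; each user falls entirely into one chunk.
chunks = [g.drop(columns="bin") for _, g in df.groupby("bin")]
```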
Will this suffice?
df_grouped = df.groupby('user')
df_list = [group for user, group in df_grouped]
Out[1352]:
[ user value
0 A 0.3
1 A 0.4,
user value
2 B 0.5,
user value
3 C 0.6
4 C 0.7
5 C 0.8]
This runs relatively fast on my machine:
>>> df.shape
(7200000, 2)
>>> print(end - start)
0.532534122467041
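Note that this gives one DataFrame per user rather than the requested number of chunks. If a fixed chunk count is needed, the per-user groups could then be distributed, for example by splitting the group list's index range with np.array_split (an extra step assumed here, not part of the answer above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "user": ["A", "A", "B", "C", "C", "C"],
    "value": [0.3, 0.4, 0.5, 0.6, 0.7, 0.8]})

# One group per user, as in the list comprehension above.
df_list = [group for user, group in df.groupby("user")]

# Distribute whole-user groups over n chunks by splitting the index range,
# then concatenate the groups that land in each chunk.
n = 3
chunks = [pd.concat([df_list[i] for i in idx])
          for idx in np.array_split(np.arange(len(df_list)), n)
          if len(idx)]
```

This keeps every user intact but only balances the number of users per chunk, not the number of rows; for row-count balancing, a greedy assignment of the largest groups first would be needed.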