I want to split this DataFrame into a predefined number of chunks of approximately the same size:
import pandas as pd
df = pd.DataFrame({
"user": ["A", "A", "B", "C", "C", "C"],
"value": [0.3, 0.4, 0.5, 0.6, 0.7, 0.8]})
# user value
# 0 A 0.3
# 1 A 0.4
# 2 B 0.5
# 3 C 0.6
# 4 C 0.7
# 5 C 0.8
The DataFrame is large (millions of rows), so the code should be reasonably efficient. The constraint is that each user must be present in only one of the chunks.
E.g. if the number of chunks is 3 then:
# Chunk #1 (DataFrame):
# 0 A 0.3
# 1 A 0.4
# Chunk #2 (DataFrame):
# 2 B 0.5
# Chunk #3 (DataFrame):
# 3 C 0.6
# 4 C 0.7
# 5 C 0.8
The following chunking into 3 pieces would be incorrect, since user C would be present in 2 chunks:
# Chunk #1 (DataFrame):
# 0 A 0.3
# 1 A 0.4
# Chunk #2 (DataFrame):
# 2 B 0.5
# 3 C 0.6
# Chunk #3 (DataFrame):
# 4 C 0.7
# 5 C 0.8
I think a solution that does a groupby on user first and then splits the resulting DataFrameGroupBy object into chunks should suffice.
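That idea can be sketched as follows. `split_by_user` is a hypothetical helper (not from any library) that assigns each user's whole group to the currently smallest chunk, a simple greedy bin-packing heuristic:

```python
import pandas as pd

def split_by_user(df, n_chunks):
    # Hypothetical helper: each user's whole group goes to the currently
    # smallest chunk, so chunk sizes stay roughly balanced.
    sizes = [0] * n_chunks
    parts = [[] for _ in range(n_chunks)]
    # Sort groups largest-first so the greedy packing balances better.
    groups = sorted(df.groupby("user"), key=lambda kv: len(kv[1]), reverse=True)
    for _, group in groups:
        i = sizes.index(min(sizes))
        parts[i].append(group)
        sizes[i] += len(group)
    return [pd.concat(p) for p in parts if p]

df = pd.DataFrame({
    "user": ["A", "A", "B", "C", "C", "C"],
    "value": [0.3, 0.4, 0.5, 0.6, 0.7, 0.8]})
chunks = split_by_user(df, 3)
```

A single groupby pass dominates the cost, so this stays workable for millions of rows as long as the number of distinct users is manageable.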
You could turn your user column into a categorical one and use pd.qcut for uniform-height binning. Unfortunately, qcut fails to find unique bin edges for discontinuous distributions, so you may run into issues if one user is over-represented. You could pass duplicates="drop", but then you won't always get the number of bins you requested, as some will be clumped together.
I guess you will have to write your own rebinning algorithm for that case; I couldn't find anything ready out of the box.
Here's an example with pd.qcut.
Let's build a dummy dataset:
import numpy as np
import pandas as pd

user = np.random.choice(["A", "B", "C", "D", "E", "F", "G", "H"], 10000)
value = np.random.random(size=user.shape)
df = pd.DataFrame({"user": user, "value": value})
print(df.user.value_counts())
E 1329
C 1281
G 1277
F 1260
H 1231
D 1223
A 1205
B 1194
Name: user, dtype: int64
Assign an integer code to each unique user and rebin with qcut:
codes = df.user.astype("category").cat.codes
nbins = 3
df["bin"] = pd.qcut(codes, nbins, labels=False)
Let's check the results:
print(df.bin.value_counts())
1 3788
0 3629
2 2583
Name: bin, dtype: int64
print(df.groupby("user").bin.value_counts())
user bin
A 0 1266
B 0 1158
C 0 1205
D 1 1255
E 1 1246
F 1 1287
G 2 1274
H 2 1309
Name: bin, dtype: int64
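Once each row carries a bin label, the chunks themselves can be materialized by grouping on that label (a small follow-up step, not shown above). Because every row of a given user shares the same integer code, a user can never straddle two bins:

```python
import numpy as np
import pandas as pd

# Rebuild the dummy dataset (seeded here for reproducibility).
rng = np.random.default_rng(0)
user = rng.choice(list("ABCDEFGH"), 10000)
df = pd.DataFrame({"user": user, "value": rng.random(user.shape)})

# Integer code per user, then quantile-based binning into 3 chunks.
codes = df.user.astype("category").cat.codes
df["bin"] = pd.qcut(codes, 3, labels=False)

# One DataFrame per bin; each user falls entirely into one chunk.
chunks = [g.drop(columns="bin") for _, g in df.groupby("bin")]
```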
Will this suffice?
df_grouped = df.groupby('user')
df_list = [group for user, group in df_grouped]
Out[1352]:
[ user value
0 A 0.3
1 A 0.4,
user value
2 B 0.5,
user value
3 C 0.6
4 C 0.7
5 C 0.8]
This runs relatively fast on my machine:
>>> df.shape
(7200000, 2)
>>> print(end - start)
0.532534122467041
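Note that this gives one DataFrame per user rather than the requested number of chunks. If a fixed chunk count is needed, the per-user groups could then be distributed, for example by splitting the group list's index range with np.array_split (an extra step assumed here, not part of the answer above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "user": ["A", "A", "B", "C", "C", "C"],
    "value": [0.3, 0.4, 0.5, 0.6, 0.7, 0.8]})

# One group per user, as in the list comprehension above.
df_list = [group for user, group in df.groupby("user")]

# Distribute whole-user groups over n chunks by splitting the index range,
# then concatenate the groups that land in each chunk.
n = 3
chunks = [pd.concat([df_list[i] for i in idx])
          for idx in np.array_split(np.arange(len(df_list)), n)
          if len(idx)]
```

This keeps every user intact but only balances the number of users per chunk, not the number of rows; for row-count balancing, a greedy assignment of the largest groups first would be needed.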