I have to process a dataframe with a few hundred thousand rows, but I can simplify it as below:
import pandas as pd

df = pd.DataFrame([
    ('a', 1, 1),
    ('a', 0, 0),
    ('a', 0, 1),
    ('b', 0, 0),
    ('b', 1, 0),
    ('b', 0, 1),
    ('c', 1, 1),
    ('c', 1, 0),
    ('c', 1, 0)
], columns=['A', 'B', 'C'])
print(df)
A B C
0 a 1 1
1 a 0 0
2 a 0 1
3 b 0 0
4 b 1 0
5 b 0 1
6 c 1 1
7 c 1 0
8 c 1 0
My goal is to flatten columns "B" and "C" into one row per label in column "A":
A B_1 B_2 B_3 C_1 C_2 C_3
0 a 1 0 0 1 0 1
3 b 0 1 0 0 0 1
6 c 1 1 1 1 0 0
The code I wrote gives the result I want, but it is pretty slow because it uses a plain for loop over the unique labels. I suspect the fix is some vectorized function that replaces the loop. Does anyone have an idea? My code is below.
added_col = ['B_1', 'B_2', 'B_3', 'C_1', 'C_2', 'C_3']
# keep one row per label and make room for the flattened columns
new_df = df.drop(['B', 'C'], axis=1).copy()
new_df = new_df.iloc[[x for x in range(0, len(df), 3)], :]
new_df = pd.concat([new_df, pd.DataFrame(columns=added_col)], sort=False)
# fill the new columns one label at a time (this loop is the bottleneck)
for e, elem in new_df['A'].items():
    new_df.loc[e, added_col] = df[df['A'] == elem].loc[:, ['B', 'C']].T.values.flatten()
Here is one way:
# create a row number by group
df['rn'] = df.groupby('A').cumcount() + 1
# pivot the table
new_df = df.set_index(['A', 'rn']).unstack()
# rename columns
new_df.columns = [x + '_' + str(y) for (x, y) in new_df.columns]
new_df.reset_index()
# A B_1 B_2 B_3 C_1 C_2 C_3
#0 a 1 0 0 1 0 1
#1 b 0 1 0 0 0 1
#2 c 1 1 1 1 0 0
In an effort to improve performance, I've used Numba with NumPy array assignment:
import numpy as np
from numba import njit

@njit
def f(i, vals, n, m, k):
    # n groups, k value columns, up to m rows per group
    out = np.empty((n, k, m), vals.dtype)
    out.fill(0)
    c = np.zeros(n, np.int64)          # per-group row counter
    for j in range(len(i)):
        x = i[j]
        out[x, :, c[x]] = vals[j]      # place row j in its group's next free slot
        c[x] += 1
    return out.reshape(n, m * k)
d0 = df.drop(columns='A')
cols = [*d0]                       # value columns: ['B', 'C']
i, r = pd.factorize(df.A)          # group codes and unique labels
n = len(r)                         # number of groups
m = np.bincount(i).max()           # max rows per group
k = len(cols)                      # number of value columns
vals = d0.values

pd.DataFrame(
    f(i, vals, n, m, k),
    pd.Index(r, name='A'),
    [f"{c}_{j}" for c in cols for j in range(1, m + 1)]
).reset_index()
A B_1 B_2 B_3 C_1 C_2 C_3
0 a 1 0 0 1 0 1
1 b 0 1 0 0 0 1
2 c 1 1 1 1 0 0
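To see which approach wins at your scale, a rough timing harness like the sketch below can help. The frame here is synthetic (100k labels with 3 rows each, made-up 0/1 values), and timings will vary by machine:

import time

import numpy as np
import pandas as pd

# synthetic frame roughly matching the question's shape
big = pd.DataFrame({
    'A': np.repeat(np.arange(100_000), 3),
    'B': np.random.randint(0, 2, 300_000),
    'C': np.random.randint(0, 2, 300_000),
})

t0 = time.perf_counter()
out = big.assign(rn=big.groupby('A').cumcount() + 1).set_index(['A', 'rn']).unstack()
print(f"unstack: {time.perf_counter() - t0:.3f}s")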