
Pandas Flatten a dataframe to a single column

I have a dataset in the following format:

import pandas as pd
df = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30], 'v1': [3, 2, 3], 'v2': [13, 25, 31]})

>> v1 v2  x   y
   3  13  1  10
   2  25  2  20
   3  31  3  30

With x as the index column, I want to flatten the data by combining v1 and v2 into a single column V. The expected output looks like this:

>> x   y   V
   1  10   3
   1  10   13
   2  20   2
   2  20   25
   3  30   3
   3  30   31

I would then like to convert the result back to the original format of df. I tried reshaping with stack and unstack, but I couldn't get the result I was expecting.

Many Thanks!

NMSD asked Jul 27 '16 11:07


2 Answers

pd.lreshape can reformat wide data to long format:

In [55]: pd.lreshape(df, {'V':['v1', 'v2']})
Out[55]: 
   x   y   V
0  1  10   3
1  2  20   2
2  3  30   3
3  1  10  13
4  2  20  25
5  3  30  31

lreshape is described as an "experimental" feature and was undocumented when this answer was written. To learn more about lreshape, see help(pd.lreshape).


If you need reversible operations, use jezrael's pd.melt solution to go from wide to long format, and use pivot_table to go from long to wide format:

In [72]: melted = pd.melt(df, id_vars=['x', 'y'], value_name='V'); melted
Out[72]: 
   x   y variable   V
0  1  10       v1   3
1  2  20       v1   2
2  3  30       v1   3
3  1  10       v2  13
4  2  20       v2  25
5  3  30       v2  31

In [74]: df2 = melted.pivot_table(index=['x','y'], columns=['variable'], values='V').reset_index(); df2
Out[74]: 
variable  x   y  v1  v2
0         1  10   3  13
1         2  20   2  25
2         3  30   3  31

Notice that you must hang on to the variable column if you wish to return to df2. Also keep in mind that it is more efficient to simply retain a reference to df than to recompute it using melted and pivot_table.
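
As a rough check of the round trip described above, here is a minimal sketch. It swaps pivot_table for DataFrame.pivot (my substitution, not part of the original answer), since pivot does not aggregate and so avoids any averaging of V. It assumes a pandas version where pivot accepts a list for index, and that each (x, y) pair appears only once in df. The names restored and round_trip_ok are purely illustrative.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30],
                   'v1': [3, 2, 3], 'v2': [13, 25, 31]})

# Wide -> long: keep the 'variable' column so the reshape can be undone.
melted = pd.melt(df, id_vars=['x', 'y'], value_name='V')

# Long -> wide: pivot does not aggregate; it raises if an
# (x, y, variable) combination occurs more than once.
restored = (melted.pivot(index=['x', 'y'], columns='variable', values='V')
                  .reset_index()
                  .rename_axis(columns=None))  # drop the leftover 'variable' axis name

# Align the column order with the original frame before comparing.
round_trip_ok = restored[df.columns].equals(df)
print(round_trip_ok)
```

If the (x, y) pairs are not unique, pivot will raise instead of silently aggregating, which is usually what you want when the goal is an exact inverse.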

unutbu answered Oct 28 '22 17:10


You can use stack with set_index. Finally, drop the level_2 column:

print (df.set_index(['x','y']).stack().reset_index(name='V').drop('level_2', axis=1))
   x   y   V
0  1  10   3
1  1  10  13
2  2  20   2
3  2  20  25
4  3  30   3
5  3  30  31
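
Since level_2 is dropped here, the question's second part (getting back to the original df) needs the column labels to be re-derived. One possible sketch, under the assumptions that each (x, y) pair is unique in df and that the rows inside each pair are still in v1, v2 order: number the rows per group with groupby().cumcount() and pivot on that. The names flat, position and wide are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30],
                   'v1': [3, 2, 3], 'v2': [13, 25, 31]})

# Flatten as in the answer above.
flat = (df.set_index(['x', 'y']).stack()
          .reset_index(name='V').drop('level_2', axis=1))

# Number the rows inside each (x, y) group: 0 for the first value, 1 for the second.
position = flat.groupby(['x', 'y']).cumcount()

# Rebuild the wide frame. The labels 'v1', 'v2' are re-derived from row
# position, which is an assumption about how 'flat' was produced.
wide = (flat.assign(variable='v' + (position + 1).astype(str))
            .pivot(index=['x', 'y'], columns='variable', values='V')
            .reset_index()
            .rename_axis(columns=None))

print(wide)
```

If the flattened rows may have been reordered in between, it is safer to keep level_2 (or rename it) instead of dropping it.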

Another solution with melt and sort_values:

print (pd.melt(df, id_vars=['x','y'], value_name='V')
         .drop('variable', axis=1)
         .sort_values('x'))

   x   y   V
0  1  10   3
3  1  10  13
1  2  20   2
4  2  20  25
2  3  30   3
5  3  30  31
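
One caveat, as far as I know: sort_values uses quicksort by default, which is not guaranteed to keep tied rows in their original order, so on larger frames the v1 value may not always come before the v2 value for the same x. A hedged variant that asks for a stable sort and also resets the index (the name long_sorted is just for illustration):

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30],
                   'v1': [3, 2, 3], 'v2': [13, 25, 31]})

long_sorted = (pd.melt(df, id_vars=['x', 'y'], value_name='V')
                 .drop('variable', axis=1)
                 # mergesort is stable: rows with equal x keep melt order (v1 before v2)
                 .sort_values('x', kind='mergesort')
                 .reset_index(drop=True))

print(long_sorted)
```
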
jezrael answered Oct 28 '22 18:10