 

Is there a reset_index for columns or a way to move column headers to an inner index leaving their index positions as the outer index?

Tags:

python

pandas

Sample DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 10, size=(10, 4)), columns=list('ABCD'))

Is there a way to reset the index for columns, or to easily insert a row holding the column index positions? I'd prefer the index positions to be the outermost level, with the column headers left as the innermost level.

asked Apr 27 '17 18:04 by Yale Newman



2 Answers

a.1) Drop column names

df.columns = pd.RangeIndex(df.columns.size)
df

Output:

    0   1   2   3
#---------------#
0   0   1   3   3
1   2   2   0   2
2   2   1   3   1
3   2   1   0   0

a.2) Drop column names (one-liner)
May have performance issues and side effects; see the discussion below.

df.T.reset_index(drop=True).T 

Output:

    0   1   2   3
#---------------#
0   0   1   3   3
1   2   2   0   2
2   2   1   3   1
3   2   1   0   0

b.1) Move column names into a row (one-liner)
Same issues, see discussion below.

df.T.reset_index().T

Output:

        0   1   2   3
#-------------------#
index   A   B   C   D
   0    0   1   3   3
   1    2   2   0   2
   2    2   1   3   1
   3    2   1   0   0

b.2) Move column names into a row
A more efficient way, without transposing.

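# build a heterogeneous DataFrame: three integer columns and one string column
df = pd.DataFrame(np.random.randint(0, 4, size=(4, 3)), columns=list('789')).join(
    pd.DataFrame(list('bcde'), columns=['A']))
df.index.name = '4'

# turn the column labels into a one-row frame and append it as the last row
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used here)
header_row = pd.Series(df.columns, index=df.columns, name=df.index.name).to_frame().T
header_row.index.name = df.index.name
df = pd.concat([df, header_row])

# replace the column labels with their positions
df.columns = pd.RangeIndex(df.columns.size)
print(df)
df.info()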

Output (note that every column is upcast to object; preventing that takes extra effort):

   0  1  2  3
#-----------#
4            
0  2  3  2  b
1  1  0  2  c
2  3  1  3  d
3  3  3  2  e
4  7  8  9  A

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, 0 to 4
Data columns (total 4 columns):
0    5 non-null object
1    5 non-null object
2    5 non-null object
3    5 non-null object
dtypes: object(4)
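
One way to sidestep the upcasting entirely, if the labels only need to be recoverable rather than stored inside the data, is to keep them in a separate Series keyed by position. This is an assumption about the use case, not part of the approach above; `labels` is a hypothetical name and the frame is a small hand-made example.

```python
import pandas as pd

df = pd.DataFrame({'7': [2, 1], '8': [3, 0], 'A': ['b', 'c']})
before = df.dtypes.tolist()

# position -> original label, kept outside the data
labels = pd.Series(df.columns, index=pd.RangeIndex(df.columns.size))
df.columns = labels.index

print(df.dtypes.tolist() == before)  # True: no column was upcast
print(labels[2])                     # A
```

The original headers can be put back later with `df.columns = labels.values`.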

c) Add secondary column index (one-liner)
May have performance issues and side effects; see the discussion below.

df.T.set_index(pd.RangeIndex(df.columns.size),append=True).T

Output:

    A   B   C   D
    0   1   2   3
#---------------#
0   0   1   3   3
1   2   2   0   2
2   2   1   3   1
3   2   1   0   0

Criticism of the one-line approach

Performance issues:
On huge datasets the cost of the double transpose can be unacceptable, but in simple cases a one-liner that returns a copy of the DataFrame may be useful. See the test results:

In [294]: for i in range (3,7):
     ...:     df = pd.DataFrame(np.random.randint(0,9,size=(10**i, 10**3)))
     ...:     print ('shape:',df.shape)
     ...:     %timeit df.T.reset_index(drop=True)
     ...: 
shape: (1000, 1000)
100 loops, best of 3: 3.2 ms per loop
shape: (10000, 1000)
10 loops, best of 3: 29.3 ms per loop
shape: (100000, 1000)
1 loop, best of 3: 546 ms per loop
shape: (1000000, 1000)
1 loop, best of 3: 9.9 s per loop

In [295]: %timeit df.columns = pd.RangeIndex(df.columns.size)
The slowest run took 28.60 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 7.74 µs per loop
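
Outside IPython, a comparable measurement can be made with the standard timeit module. This is a rough sketch, not a rigorous benchmark: the helper names `via_transpose` and `via_labels`, the frame size, and the repeat count are all arbitrary choices for illustration, and the numbers will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 9, size=(1000, 100)),
                  columns=[f'c{i}' for i in range(100)])

def via_transpose(frame):
    # two transposes plus a reset_index: the data is copied
    return frame.T.reset_index(drop=True).T

def via_labels(frame):
    # only swaps the column Index on a shallow copy
    out = frame.copy(deep=False)
    out.columns = pd.RangeIndex(out.columns.size)
    return out

for fn in (via_transpose, via_labels):
    seconds = timeit.timeit(lambda: fn(df), number=20)
    print(f'{fn.__name__}: {seconds / 20 * 1e3:.3f} ms per call')
```

Both helpers return a frame with positional column labels and leave `df` itself unchanged, so the comparison is like for like.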
  

Side effect (upcasting):
Heterogeneous DataFrames will be upcast to a common dtype.

In [352]: df = pd.DataFrame(np.random.randint(0,4,size=(4, 3)), columns=list('789')).join(
     ...:          pd.DataFrame(list('bcde'),columns=['A']))

In [353]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
7    4 non-null int64
8    4 non-null int64
9    4 non-null int64
A    4 non-null object
dtypes: int64(3), object(1)
memory usage: 208.0+ bytes

Upcasting after .T.T:

In [354]: df.T.T.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
7    4 non-null object
8    4 non-null object
9    4 non-null object
A    4 non-null object
dtypes: object(4)
memory usage: 208.0+ bytes
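
If a transposing one-liner has already been applied, the lost dtypes can often be recovered afterwards with infer_objects(). This is a small sketch on a hand-made frame (illustrative data, not the random frame above); it still pays the cost of the transposes.

```python
import pandas as pd

df = pd.DataFrame({'7': [0, 1, 2], 'A': ['b', 'c', 'd']})

roundtrip = df.T.T                    # the integer column becomes object
restored = roundtrip.infer_objects()  # re-infers a better dtype per column

print(roundtrip['7'].dtype)  # object
print(restored['7'].dtype)   # int64
```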

answered Oct 21 '22 07:10 by ilia timofeev


I think you can use numpy.arange or range:

np.random.seed(10)
df = pd.DataFrame(np.random.randint(0,10,size=(10, 4)), columns=list('ABCD'))

df.columns = np.arange(len(df.columns))
#alternatively
#df.columns = range(len(df.columns))
print (df)
   0  1  2  3
0  9  4  0  1
1  9  0  1  8
2  9  0  8  6
3  4  3  0  4
4  6  8  1  8
5  4  1  3  6
6  5  3  9  6
7  9  1  9  4
8  2  6  7  8
9  8  9  2  0

But this loses the original column labels.

If you need a MultiIndex without level names:

df.columns = [np.arange(len(df.columns)), df.columns]
print (df)
   0  1  2  3
   A  B  C  D
0  9  4  0  1
1  9  0  1  8
2  9  0  8  6
3  4  3  0  4
4  6  8  1  8
5  4  1  3  6
6  5  3  9  6
7  9  1  9  4
8  2  6  7  8
9  8  9  2  0

To name the levels, use MultiIndex.from_arrays (starting again from the original single-level columns):

names = ['a','b']
df.columns = pd.MultiIndex.from_arrays([np.arange(len(df.columns)), df.columns], names=names)
print (df)
a  0  1  2  3
b  A  B  C  D
0  9  4  0  1
1  9  0  1  8
2  9  0  8  6
3  4  3  0  4
4  6  8  1  8
5  4  1  3  6
6  5  3  9  6
7  9  1  9  4
8  2  6  7  8
9  8  9  2  0
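
With the positions as the outer level and the headers as the inner level, columns can be selected either way. A short sketch on a small illustrative frame, reusing the level names 'a' and 'b' from above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(8).reshape(2, 4), columns=list('ABCD'))
df.columns = pd.MultiIndex.from_arrays(
    [np.arange(len(df.columns)), df.columns], names=['a', 'b'])

by_position = df[1]                       # outer level: position
by_label = df.xs('C', axis=1, level='b')  # inner level: original label

print(by_position.columns.tolist())  # ['B']
print(by_label.columns.tolist())     # [2]
```

Each selection drops the level that was matched and keeps the other one as the column label.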

answered Oct 21 '22 06:10 by jezrael