
Difference between df.reindex() and df.set_index() methods in pandas

I was confused by this, which is very simple but I didn't immediately find the answer on StackOverflow:

  • df.set_index('xcol') makes the column 'xcol' become the index (when it is a column of df).

  • df.reindex(myList), however, takes indexes from outside the dataframe, for example, from a list named myList that we defined somewhere else.

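However, df.reindex(myList) does not simply relabel the rows: it aligns the existing rows to the new labels, so any label in myList that is not already in the index gets a row of NaN. If you only want to overwrite the labels, a simple alternative is df.index = myList (myList must have exactly one entry per row).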
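
To make the NaN behaviour concrete, here is a minimal sketch. The frame and the list my_list are made up for illustration and are not from the original question.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({'xcol': ['p', 'q', 'r'], 'val': [10, 20, 30]})
my_list = ['x', 'y', 'z']  # none of these labels exist in df's index (0, 1, 2)

# reindex looks up each new label in the existing index;
# none of them match, so every row comes back as NaN.
reindexed = df.reindex(my_list)

# Assigning to .index relabels the rows by position and keeps the values.
# The list must have exactly as many entries as df has rows.
relabeled = df.copy()
relabeled.index = my_list

print(reindexed)
print(relabeled)
```

If my_list had contained some of the existing labels (0, 1 or 2), reindex would have kept the values for those rows and filled only the unknown labels with NaN.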

I hope this post clarifies it! Additions to this post are also welcome!

Ricardo Guerreiro asked Jun 07 '18 12:06




2 Answers

You can see the difference with a simple example. Let's consider this dataframe:

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
print(df)
   a  b
0  1  3
1  2  4

The index labels are then 0 and 1.

If you use set_index with the column 'a' then the indexes are 1 and 2. If you do df.set_index('a').loc[1,'b'], you will get 3.

Now if you use reindex with the same labels 1 and 2, as in df.reindex([1,2]), then df.reindex([1,2]).loc[1,'b'] gives 4.0.
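
The two lookups above can be checked side by side. This is just a sketch that reuses the same two-row frame; the variable names are mine.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# set_index: label 1 now points at the first row (a == 1), whose b is 3.
via_set_index = df.set_index('a').loc[1, 'b']

# reindex: label 1 still means the original second row, whose b is 4.
# Label 2 does not exist in the original index, so that row is filled
# with NaN, which is also why column b is upcast to float.
reindexed = df.reindex([1, 2])
via_reindex = reindexed.loc[1, 'b']

print(via_set_index, via_reindex)  # 3 4.0
```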

What happened is that set_index replaced the previous index labels (0, 1) with (1, 2) (the values from column 'a') without touching the order of the values in column 'b':

df.set_index('a')
   b
a   
1  3
2  4

while reindex changes the index labels but keeps the values in column 'b' attached to their labels in the original df:

df.reindex(df.a.values).drop(columns='a')  # here the same as df.reindex([1, 2]).drop(columns='a')
     b
1  4.0
2  NaN
# drop(columns='a') is only there to leave column a out of this example

In summary, reindex changes which index labels appear, and in what order, without changing the row values attached to each existing label (labels that did not exist before get NaN), while set_index replaces the index labels with the values of a column without touching the order of the other values in the dataframe.
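
As a small sketch of that summary, reindexing with labels that already exist only reorders rows, whereas set_index keeps the row order and swaps the labels. The variable names are mine.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Both labels already exist, so reindex only reorders the rows;
# each label keeps its own values and no NaN appears.
reordered = df.reindex([1, 0])

# set_index leaves the rows where they are and only replaces the labels.
relabeled = df.set_index('a')

print(reordered)
print(relabeled)
```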

Ben.T answered Oct 10 '22 22:10


Just to add: the way to undo set_index is the reset_index method (more or less):

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
print(df)

df.set_index('a', inplace=True)
print(df)

df.reset_index(inplace=True, drop=False)
print(df)

   a  b
0  1  3
1  2  4
   b
a   
1  3
2  4
   a  b
0  1  3
1  2  4
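
One reason for the "more or less": reset_index moves the index back into a column and installs a fresh 0..n-1 index, so it does not bring back a custom index that existed before set_index. A minimal sketch, using a hypothetical frame with string labels:

```python
import pandas as pd

# Hypothetical frame with a custom starting index, for illustration only.
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=['r1', 'r2'])

# set_index discards the old labels ('r1', 'r2');
# reset_index then restores column a but with a default integer index.
round_trip = df.set_index('a').reset_index()

print(round_trip)
```

The column values survive the round trip, but the original labels 'r1' and 'r2' are gone.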
prosti answered Oct 10 '22 21:10