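I was confused by this. It is very simple, but I didn't immediately find the answer on StackOverflow:
df.set_index('xcol') makes the column 'xcol' become the index (when it is a column of df).
df.reindex(myList), however, takes its index labels from outside the dataframe, for example from a list named myList that we defined somewhere else.
However, df.reindex(myList) does not relabel the existing rows: it keeps the rows whose current label appears in myList and fills the row for any label that is not in the current index with NaN. A simple alternative, if you only want to overwrite the labels, is: df.index = myList (myList must have the same length as df).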
I hope this post clarifies it! Additions to this post are also welcome!
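To make the three cases concrete, here is a minimal sketch. The dataframe contents and the values in myList are made up for illustration; they are not from the post above.

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({'xcol': ['p', 'q', 'r'], 'val': [10, 20, 30]})
myList = [2, 1, 5]  # 5 is not a label in df's current index (0, 1, 2)

# set_index: an existing column becomes the index
by_col = df.set_index('xcol')

# reindex: rows are looked up by label; the unknown label 5 gives a NaN row
reindexed = df.reindex(myList)

# Direct assignment: labels are overwritten, row values stay where they are
relabelled = df.copy()
relabelled.index = myList

print(by_col)
print(reindexed)
print(relabelled)
```

Note that reindex looks rows up by their current label, while assigning to df.index just attaches new labels by position.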
The set_index() function is used to set the DataFrame index using existing columns. Set the DataFrame index (row labels) using one or more existing columns or arrays of the correct length. The index can replace the existing index or expand on it.
Pandas set_index() is a method that sets one or more existing columns, or an array-like such as a Series, as the index of a DataFrame. The index can also be set when the DataFrame is created. When a DataFrame is built from two or more DataFrames, however, the index may need to be changed afterwards, and this method does that.
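As a rough sketch of those variants (the column names and values below are hypothetical): a single column, several columns, an external Series of the right length, and append=True to extend the existing index instead of replacing it.

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({'year': [2020, 2020, 2021],
                   'city': ['A', 'B', 'A'],
                   'n': [1, 2, 3]})

# One existing column becomes the index
single = df.set_index('year')

# A list of column names gives a MultiIndex
multi = df.set_index(['year', 'city'])

# An external Series of the correct length can be used as well
# (a plain list of strings would be read as column names instead)
labels = pd.Series(['x', 'y', 'z'])
external = df.set_index(labels)

# append=True keeps the existing index and adds 'city' as a second level
appended = df.set_index('city', append=True)

print(multi)
print(appended)
```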
The loc method is used for label-based indexing. The iloc method is used for position-based indexing.
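A small sketch of the difference, using a made-up dataframe whose labels (10, 20, 30) deliberately differ from the positions (0, 1, 2):

```python
import pandas as pd

# Hypothetical data: labels differ from positions on purpose
df = pd.DataFrame({'b': [3, 4, 5]}, index=[10, 20, 30])

by_label = df.loc[20, 'b']     # the row labelled 20
by_position = df.iloc[0, 0]    # first row, first column

# Label slices include both endpoints; position slices exclude the stop
label_slice = df.loc[10:20]    # rows labelled 10 and 20
position_slice = df.iloc[0:1]  # only the first row

print(by_label, by_position)
print(label_slice)
print(position_slice)
```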
Reindexing the rows: one can reindex a single row or multiple rows with the reindex() method. By default, labels in the new index that are not present in the dataframe are filled with NaN.
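For instance, in this sketch (hypothetical labels 'x', 'y', 'z'), the label 'z' does not exist in the original dataframe. The fill_value parameter of reindex is one way to avoid the NaN:

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({'b': [3, 4]}, index=['x', 'y'])

# 'z' is not in the original index, so its row is NaN
# (and column b becomes float to hold the NaN)
with_nan = df.reindex(['y', 'z'])

# fill_value replaces the default NaN for missing labels
with_fill = df.reindex(['y', 'z'], fill_value=0)

print(with_nan)
print(with_fill)
```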
You can see the difference on a simple example. Let's consider this dataframe:
import pandas as pd
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
print(df)
   a  b
0  1  3
1  2  4
The index labels are then 0 and 1.
If you use set_index with the column 'a', the index labels become 1 and 2. If you do df.set_index('a').loc[1,'b'], you will get 3.
Now if you use reindex with the same labels 1 and 2, as in df.reindex([1,2]), you will get 4.0 when you do df.reindex([1,2]).loc[1,'b'].
What happened is that set_index has replaced the previous index labels (0, 1) with (1, 2) (the values from column 'a') without touching the order of the values in column 'b':
df.set_index('a')
   b
a
1  3
2  4
while reindex changes the index labels but keeps the values in column 'b' associated with the labels of the original df:
df.reindex(df.a.values).drop(columns='a')  # equivalent to df.reindex([1, 2]).drop(columns='a')
     b
1  4.0
2  NaN
# drop(columns='a') is only there so that column a can be ignored in my example
Finally, reindex changes the order of the index labels without changing the values of the row associated with each label, while set_index replaces the index labels with the values of a column, without touching the order of the other values in the dataframe.
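A quick way to check this summary is to reindex with labels that already exist, so that no NaN appears and only the row order changes. This is just a sketch on the same two-row dataframe as above:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# reindex with existing labels only reorders: each row travels with its label
swapped = df.reindex([1, 0])

# set_index attaches new labels to the rows without moving them
relabelled = df.set_index('a')

print(swapped)
print(relabelled)
```

In swapped, label 1 still points at the row with b equal to 4; in relabelled, label 1 points at the first row, where b is 3.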
Just to add, the way to undo set_index would be the reset_index method (more or less):
df = pd.DataFrame({'a': [1, 2],'b': [3, 4]})
print (df)
df.set_index('a', inplace=True)
print(df)
df.reset_index(inplace=True, drop=False)
print(df)
   a  b
0  1  3
1  2  4
   b
a
1  3
2  4
   a  b
0  1  3
1  2  4
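As for the "more or less": the round trip is not always exact. In this sketch (the row labels 'r1' and 'r2' are made up), the original labels are lost and the column order changes, and with drop=True the index column is discarded altogether:

```python
import pandas as pd

# Hypothetical data with a non-default index, for illustration only
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=['r1', 'r2'])

# set_index discards the old labels; reset_index puts b back as the
# first column and installs a fresh 0..n-1 index
round_trip = df.set_index('b').reset_index()

# drop=True throws the index away instead of restoring it as a column
dropped = df.set_index('b').reset_index(drop=True)

print(round_trip)
print(dropped)
```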