
Error when using pandas dataframe map function in ipython notebook

I'm just starting out with Python and getting stuck on something while playing with the Kaggle Titanic data. https://www.kaggle.com/c/titanic/data

Here's what I am typing in an ipython notebook (train.csv comes from the titanic data from the kaggle link above):

import pandas as pd
df = pd.read_csv("C:/fakepath/titanic/data/train.csv")

I then continue with this to check if there's any bad data in the 'Sex' column:

df['Sex'].value_counts()

Which returns:

male      577
female    314
dtype: int64

Next, I map the 'Sex' values to integers in a new 'Gender' column:

df['Gender'] = df['Sex'].map( {'male': 1, 'female': 0} ).astype(int)

This doesn't produce any errors. To verify that it created a new column called 'Gender' with integer values, I display the dataframe:

df

which returns:

   PassengerId  Survived  Pclass  Name                                               Sex     Age  SibSp  Parch  Ticket            Fare     Cabin  Embarked  Gender
0  1            0         3       Braund, Mr. Owen Harris                            male    22   1      0      A/5 21171         7.2500   NaN    S         1
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38   1      0      PC 17599          71.2833  C85    C         0
2  3            1         3       Heikkinen, Miss. Laina                             female  26   0      0      STON/O2. 3101282  7.9250   NaN    S         0
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35   1      0      113803            53.1000  C123   S         0

...success, the Gender column is appended to the end and is 0 for female, 1 for male. Now, I create a new pandas dataframe which is a subset of the df dataframe.

df2 = df[ ['Survived', 'Pclass', 'Age', 'Gender', 'Embarked'] ]
df2

which returns:

   Survived  Pclass  Age  Gender  Embarked
0  0         3       22   1       S
1  1         1       38   0       C
2  1         3       26   0       S
3  1         1       35   0       S
4  0         3       35   1       S
5  0         3       NaN  1       Q

df2['Embarked'].value_counts()

...shows that there are 3 unique values (S, C, Q):

S    644
C    168
Q     77
dtype: int64

However, when I try to execute what I think is the same type of operation as when I converted male/female to 1/0, I get an error:

df2['Embarked_int'] = df2['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2}).astype(int)

returns:

ValueError                                Traceback (most recent call last)
<ipython-input-29-294c08f2fc80> in <module>()
----> 1 df2['Embarked_int'] = df2['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2}).astype(int)

C:\Anaconda\lib\site-packages\pandas\core\generic.pyc in astype(self, dtype, copy, raise_on_error)
   2212 
   2213         mgr = self._data.astype(
-> 2214             dtype=dtype, copy=copy, raise_on_error=raise_on_error)
   2215         return self._constructor(mgr).__finalize__(self)
   2216 

C:\Anaconda\lib\site-packages\pandas\core\internals.pyc in astype(self, dtype, **kwargs)
   2500 
   2501     def astype(self, dtype, **kwargs):
-> 2502         return self.apply('astype', dtype=dtype, **kwargs)
   2503 
   2504     def convert(self, **kwargs):

C:\Anaconda\lib\site-packages\pandas\core\internals.pyc in apply(self, f, axes, filter, do_integrity_check, **kwargs)
   2455                                                  copy=align_copy)
   2456 
-> 2457             applied = getattr(b, f)(**kwargs)
   2458 
   2459             if isinstance(applied, list):

C:\Anaconda\lib\site-packages\pandas\core\internals.pyc in astype(self, dtype, copy, raise_on_error, values)
    369     def astype(self, dtype, copy=False, raise_on_error=True, values=None):
    370         return self._astype(dtype, copy=copy, raise_on_error=raise_on_error,
--> 371                             values=values)
    372 
    373     def _astype(self, dtype, copy=False, raise_on_error=True, values=None,

C:\Anaconda\lib\site-packages\pandas\core\internals.pyc in _astype(self, dtype, copy, raise_on_error, values, klass)
    399             if values is None:
    400                 # _astype_nansafe works fine with 1-d only
--> 401                 values = com._astype_nansafe(self.values.ravel(), dtype, copy=True)
    402                 values = values.reshape(self.values.shape)
    403             newb = make_block(values,

C:\Anaconda\lib\site-packages\pandas\core\common.pyc in _astype_nansafe(arr, dtype, copy)
   2616 
   2617         if np.isnan(arr).any():
-> 2618             raise ValueError('Cannot convert NA to integer')
   2619     elif arr.dtype == np.object_ and np.issubdtype(dtype.type, np.integer):
   2620         # work around NumPy brokenness, #1987

ValueError: Cannot convert NA to integer

Any idea why I get this error on the second use of map but not the first? According to value_counts(), there are no NaN values in the Embarked column. I'm guessing it's a noob problem :)

Adrien asked Dec 12 '25 22:12



1 Answer

By default, value_counts does not count NaN values. You can change this with df['Embarked'].value_counts(dropna=False).
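As a minimal sketch of that difference, the snippet below uses a small made-up Series in place of the real df['Embarked'] column (the values and variable names are assumptions for illustration, not the Titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df['Embarked']: five ports plus two missing entries.
embarked = pd.Series(['S', 'C', 'Q', 'S', np.nan, 'S', np.nan])

# Default behaviour: rows holding NaN are left out of the tally.
counts_default = embarked.value_counts()

# With dropna=False, NaN gets its own row in the result.
counts_with_nan = embarked.value_counts(dropna=False)

print(counts_default)
print(counts_with_nan)
```

The second result has one extra row for NaN, which is the row the default call hides.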

I compared your value_counts totals for the 'Sex' column (577 + 314 = 891) and the 'Embarked' column (644 + 168 + 77 = 889). They differ by 2, which means 'Embarked' must contain 2 NaN values.
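The same arithmetic can be checked in code rather than by hand. This is only a sketch on a tiny made-up frame (the rows below are assumptions, not the real train.csv), comparing the row count against the value_counts total and against isnull():

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Titanic frame with two missing ports.
df = pd.DataFrame({
    'Sex': ['male', 'female', 'male', 'female', 'male'],
    'Embarked': ['S', np.nan, 'C', 'Q', np.nan],
})

total_rows = len(df)                            # every row, missing or not
counted = df['Embarked'].value_counts().sum()   # only non-missing values
missing = df['Embarked'].isnull().sum()         # direct count of NaN entries

print(total_rows - counted, missing)
```

If the two printed numbers agree and are non-zero, the column has that many missing values.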

So you can either drop them first (using dropna) or fill them with a value of your choice (using fillna).
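Both options might look roughly like this. The frame is a made-up stand-in for df2, and filling with 'S' is an arbitrary choice for illustration; which fill value makes sense depends on your analysis:

```python
import numpy as np
import pandas as pd

mapping = {'S': 0, 'C': 1, 'Q': 2}

# Hypothetical stand-in for df2 with one missing 'Embarked' value.
df2 = pd.DataFrame({
    'Survived': [0, 1, 1, 0],
    'Embarked': ['S', np.nan, 'C', 'Q'],
})

# Option 1: drop the rows where 'Embarked' is missing, then map.
dropped = df2.dropna(subset=['Embarked']).copy()
dropped['Embarked_int'] = dropped['Embarked'].map(mapping).astype(int)

# Option 2: fill the missing values first ('S' is an assumed placeholder), then map.
filled = df2.copy()
filled['Embarked_int'] = filled['Embarked'].fillna('S').map(mapping).astype(int)

print(dropped)
print(filled)
```

Working on explicit copies also keeps the new column from being assigned onto a slice of another dataframe.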

Also, the astype(int) is redundant when every value is mapped, since map then already returns integers.
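To see what map returns in each case, you can inspect the dtype of the result. This sketch uses two short made-up Series (assumed values, not the real columns):

```python
import numpy as np
import pandas as pd

# Every value has a key in the dict, so the result is already an integer Series.
sex = pd.Series(['male', 'female', 'male'])
gender = sex.map({'male': 1, 'female': 0})

# A missing value stays NaN after mapping, which forces a float result.
embarked = pd.Series(['S', np.nan, 'Q'])
embarked_int = embarked.map({'S': 0, 'C': 1, 'Q': 2})

print(gender.dtype)
print(embarked_int.dtype)
```

The float result in the second case is why the later astype(int) call has an NaN to trip over.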

EdChum answered Dec 15 '25 12:12



