I have to clean an input data file in Python. Because of typos, a data field may contain a string instead of a number. I would like to identify all fields that contain a string and fill them with NaN using pandas. I would also like to log the index of those fields.
One of the crudest ways is to loop through every field and check whether it is a number, but this takes a lot of time if the data is big.
My CSV file contains data similar to the following table:
Country  Count  Sales
USA         1   65000
UK          3    4000
IND         8       g
SPA         3    9000
NTH         5   80000
.... Assume that I have 60,000 such rows in the data.
Ideally I would like to identify that the row IND has an invalid value under the Sales column. Any suggestions on how to do this efficiently?
There is a na_values argument to read_csv:
na_values: list-like or dict, default None
Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values
In [1]: df = pd.read_csv('city.csv', sep=r'\s+', na_values=['g'])
In [2]: df
Out[2]:
  Country  Count  Sales
0     USA      1  65000
1      UK      3   4000
2     IND      8    NaN
3     SPA      3   9000
4     NTH      5  80000
Using pandas.isnull, you can select only those rows with NaN in the 'Sales' column, or just the 'Country' values for those rows:
In [3]: df[pd.isnull(df['Sales'])]
Out[3]: 
  Country  Count  Sales
2     IND      8    NaN
In [4]: df[pd.isnull(df['Sales'])]['Country']
Out[4]: 
2    IND
Name: Country
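As a side note, na_values also accepts a dict, which limits the substitution to one column, and the index of the affected rows can be collected for logging. A minimal sketch, assuming the same data held in an in-memory buffer (the io.StringIO stand-in and the names raw, df_dict and bad_rows are illustrative and not part of the original post):

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for the CSV file from the question.
raw = io.StringIO(
    "Country Count Sales\n"
    "USA 1 65000\n"
    "UK 3 4000\n"
    "IND 8 g\n"
    "SPA 3 9000\n"
    "NTH 5 80000\n"
)

# Treat 'g' as missing in the Sales column only; other columns are unaffected.
df_dict = pd.read_csv(raw, sep=r"\s+", na_values={"Sales": ["g"]})

# Index labels of the rows whose Sales value was read as NaN.
bad_rows = df_dict.index[df_dict["Sales"].isna()].tolist()
print(bad_rows)
```

This only helps when the invalid strings are known in advance, since every one of them has to be listed in na_values.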
If the data is already in a DataFrame, you could use apply to convert the strings that are numbers into integers (using str.isdigit) and replace everything else with NaN:
df = pd.DataFrame({'Count': {0: 1, 1: 3, 2: 8, 3: 3, 4: 5}, 'Country': {0: 'USA', 1: 'UK', 2: 'IND', 3: 'SPA', 4: 'NTH'}, 'Sales': {0: '65000', 1: '4000', 2: 'g', 3: '9000', 4: '80000'}})
In [12]: df
Out[12]: 
  Country  Count  Sales
0     USA      1  65000
1      UK      3   4000
2     IND      8      g
3     SPA      3   9000
4     NTH      5  80000
In [13]: df['Sales'] = df['Sales'].apply(lambda x: int(x) 
                                                  if str.isdigit(x)
                                                  else np.nan)
In [14]: df
Out[14]: 
  Country  Count  Sales
0     USA      1  65000
1      UK      3   4000
2     IND      8    NaN
3     SPA      3   9000
4     NTH      5  80000
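Note that str.isdigit also rejects negative numbers and decimals such as '-5' or '1.5'. A different technique, not the one used above, is pd.to_numeric with errors='coerce', which turns anything unparseable into NaN and avoids the Python-level lambda. A rough sketch, assuming a pandas version that provides pd.to_numeric (the names df_demo, converted and invalid_index are illustrative):

```python
import pandas as pd

# Same sample data as above, with Sales stored as strings.
df_demo = pd.DataFrame({
    "Country": ["USA", "UK", "IND", "SPA", "NTH"],
    "Count": [1, 3, 8, 3, 5],
    "Sales": ["65000", "4000", "g", "9000", "80000"],
})

# Unparseable strings become NaN instead of raising an error.
converted = pd.to_numeric(df_demo["Sales"], errors="coerce")

# A value that was present before but is NaN after conversion was invalid text.
invalid_mask = converted.isna() & df_demo["Sales"].notna()
invalid_index = df_demo.index[invalid_mask].tolist()

df_demo["Sales"] = converted
print(invalid_index)
```

The invalid_index list is what would be written to a log; comparing against the original column keeps values that were already missing from being reported as typos.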
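import os
import numpy as np
import pandas as pd

# Path to the tab-delimited data file
filename = os.path.expanduser('~/tmp/data.csv')
# genfromtxt stores values it cannot parse as float (such as 'g') as NaN
df = pd.DataFrame(
    np.genfromtxt(
        filename, delimiter='\t', names=True, dtype='|O4,<i4,<f8'))
print(df)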
yields
  Country  Count  Sales
0     USA      1  65000
1      UK      3   4000
2     IND      8    NaN
3     SPA      3   9000
4     NTH      5  80000
and to find the country with NaN sales, you could compute
print(df['Country'][np.isnan(df['Sales'])])
which yields the pandas.Series:
2    IND
Name: Country