 

Cleaning big data using Python

Tags:

python

pandas

I have to clean an input data file in Python. Due to typos, a data field may contain a string instead of a number. I would like to identify all fields that hold a string, fill them with NaN using pandas, and also log the indices of those fields.

One of the crudest ways is to loop through every field and check whether it is a number, but this takes a lot of time if the data is big.

My csv file contains data similar to the following table:

Country  Count  Sales
USA         1   65000
UK          3    4000
IND         8       g
SPA         3    9000
NTH         5   80000

.... Assume that I have 60,000 such rows in the data.

Ideally I would like to identify that the row for IND has an invalid value in the Sales column. Any suggestions on how to do this efficiently?
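For concreteness, the crude per-field loop described above might look like the following sketch (using a toy frame with the same columns; note that str.isdigit rejects negative numbers and decimals, so it only works for plain non-negative integers):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Country': ['USA', 'UK', 'IND', 'SPA', 'NTH'],
                   'Count': [1, 3, 8, 3, 5],
                   'Sales': ['65000', '4000', 'g', '9000', '80000']})

bad = []
for i, val in enumerate(df['Sales']):
    if not str(val).isdigit():        # not a plain non-negative integer
        bad.append(i)                 # log the offending row index
        df.loc[i, 'Sales'] = np.nan   # replace the bad value with NaN
print(bad)  # [2]
```

This is exactly the O(n) per-field scan the question wants to avoid for large files; the answers below do the same coercion in vectorized pandas/NumPy calls instead.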

Kathirmani Sukumar asked Dec 13 '12

2 Answers

There is a na_values argument to read_csv:

na_values : list-like or dict, default None
       Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values

df = pd.read_csv('city.csv', sep='\s+', na_values=['g'])

In [2]: df
Out[2]:
  Country  Count  Sales
0     USA      1  65000
1      UK      3   4000
2     IND      8    NaN
3     SPA      3   9000
4     NTH      5  80000

Using pandas.isnull, you can select only those rows with NaN in the 'Sales' column, or the 'Country' series:

In [3]: df[pd.isnull(df['Sales'])]
Out[3]: 
  Country  Count  Sales
2     IND      8    NaN

In [4]: df[pd.isnull(df['Sales'])]['Country']
Out[4]: 
2    IND
Name: Country

If the data is already in a DataFrame, you could use apply to convert those strings which are numbers into integers (using str.isdigit), replacing everything else with NaN:

df = pd.DataFrame({'Count': {0: 1, 1: 3, 2: 8, 3: 3, 4: 5}, 'Country': {0: 'USA', 1: 'UK', 2: 'IND', 3: 'SPA', 4: 'NTH'}, 'Sales': {0: '65000', 1: '4000', 2: 'g', 3: '9000', 4: '80000'}})

In [12]: df
Out[12]: 
  Country  Count  Sales
0     USA      1  65000
1      UK      3   4000
2     IND      8      g
3     SPA      3   9000
4     NTH      5  80000

In [13]: df['Sales'] = df['Sales'].apply(lambda x: int(x) 
                                                  if str.isdigit(x)
                                                  else np.nan)

In [14]: df
Out[14]: 
  Country  Count  Sales
0     USA      1  65000
1      UK      3   4000
2     IND      8    NaN
3     SPA      3   9000
4     NTH      5  80000
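For reference, later pandas versions (0.17+, well after this answer was written) added pd.to_numeric, which performs the same coercion in a single vectorized call and also handles floats and negative numbers, unlike str.isdigit:

```python
import pandas as pd

df = pd.DataFrame({'Country': ['USA', 'UK', 'IND', 'SPA', 'NTH'],
                   'Count': [1, 3, 8, 3, 5],
                   'Sales': ['65000', '4000', 'g', '9000', '80000']})

# errors='coerce' turns anything that cannot be parsed as a number into NaN
df['Sales'] = pd.to_numeric(df['Sales'], errors='coerce')
print(df['Sales'].isnull().sum())  # 1
```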

Andy Hayden answered Sep 21 '22

import os
import numpy as np
import pandas as pd

filename = os.path.expanduser('~/tmp/data.csv')
df = pd.DataFrame(
        np.genfromtxt(
            filename, delimiter='\t', names=True, dtype='|O4,<i4,<f8'))
print(df)

yields

  Country  Count  Sales
0     USA      1  65000
1      UK      3   4000
2     IND      8    NaN
3     SPA      3   9000
4     NTH      5  80000

and to find the country with NaN sales, you could compute

print(df['Country'][np.isnan(df['Sales'])])

which yields the pandas.Series:

2    IND
Name: Country
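The same index logging works on the plain NumPy side, since genfromtxt has already coerced the invalid entry to NaN for the float field (a sketch with the Sales values hard-coded rather than re-read from the csv):

```python
import numpy as np

# the Sales field as genfromtxt parses it: 'g' became NaN
sales = np.array([65000., 4000., np.nan, 9000., 80000.])

# row indices of the invalid values
bad_rows = np.flatnonzero(np.isnan(sales))
print(bad_rows)  # [2]
```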

unutbu answered Sep 22 '22