 

Ignoring rows with mismatched dtypes in pandas

Tags: python, pandas, csv

I'm specifying dtypes while reading a huge CSV in pandas:

pd.read_csv('29_2016/data.csv', 
            error_bad_lines=False, 
            encoding='utf-8', 
            dtype={'a': str, 'b': np.float64, 'c':np.float64}, 
            na_values=['na'], quotechar='"')

but there are some "bad" rows in my data that don't match the declared dtypes because of data preparation problems. For example, I might have strings in column b or column c.

Is there any way to ignore these bad rows while reading the CSV using pandas read_csv?

Error:

pandas/parser.pyx in pandas.parser.TextReader.read (pandas/parser.c:7988)()

pandas/parser.pyx in pandas.parser.TextReader._read_low_memory (pandas/parser.c:8244)()

pandas/parser.pyx in pandas.parser.TextReader._read_rows (pandas/parser.c:9261)()

pandas/parser.pyx in pandas.parser.TextReader._convert_column_data (pandas/parser.c:10654)()

pandas/parser.pyx in pandas.parser.TextReader._convert_tokens (pandas/parser.c:11891)()

ValueError: could not convert string to float: some string
asked Jul 25 '16 07:07 by Abhishek Thakur


1 Answer

Based on the discussion in https://github.com/pandas-dev/pandas/issues/2570, it would seem that the intended way of dealing with malformed numeric values is to load the data without forcing the dtypes and then call pd.to_numeric() (or the other functions of the pd.to_XXXXX() family, as needed) on the affected columns. (The discussion mentioned above actually suggested convert_objects(), which has since been deprecated.)

So, in the case at issue here, I would do the following:

data = pd.read_csv('29_2016/data.csv', 
            error_bad_lines=False, 
            encoding='utf-8', 
            na_values=['na'], quotechar='"')

and then

data.b = pd.to_numeric(data.b, errors='coerce')  # 'coerce' results in NaN for entries that can't be converted
data.c = pd.to_numeric(data.c, errors='coerce')

data = data[data.b.notnull() & data.c.notnull()]  # filter out rows with NaN
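To see the whole flow end to end, here is a minimal, self-contained sketch of the same approach; the inline CSV sample is hypothetical and just mimics the kind of "bad" rows described in the question:

```python
import pandas as pd
from io import StringIO

# Hypothetical sample: column b has a stray string, column c has an 'na'.
# Forcing dtype={'b': float} on this would raise ValueError at read time.
csv_data = StringIO("a,b,c\nx,1.5,2.0\ny,oops,3.0\nz,4.0,na\n")

# Step 1: read without forcing numeric dtypes, so parsing cannot fail.
data = pd.read_csv(csv_data, na_values=['na'])

# Step 2: coerce; unparseable entries become NaN instead of raising.
data['b'] = pd.to_numeric(data['b'], errors='coerce')
data['c'] = pd.to_numeric(data['c'], errors='coerce')

# Step 3: keep only rows where both columns converted cleanly.
clean = data[data['b'].notnull() & data['c'].notnull()]
print(clean)  # only the 'x' row survives
```

An equivalent final step would be `data.dropna(subset=['b', 'c'])`, which filters on the same NaN markers produced by the coercion.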
answered Oct 13 '22 18:10 by P.Toccaceli