I'm specifying dtypes while reading a huge CSV in pandas:
import pandas as pd
import numpy as np

pd.read_csv('29_2016/data.csv',
            error_bad_lines=False,
            encoding='utf-8',
            dtype={'a': str, 'b': np.float64, 'c': np.float64},
            na_values=['na'], quotechar='"')
but there are some "bad" rows in my data that don't match these dtypes because of data preparation problems. For example, I might have strings in column b or column c.
Is there any way to ignore these bad rows while reading the CSV with pandas read_csv?
Error:
pandas/parser.pyx in pandas.parser.TextReader.read (pandas/parser.c:7988)()
pandas/parser.pyx in pandas.parser.TextReader._read_low_memory (pandas/parser.c:8244)()
pandas/parser.pyx in pandas.parser.TextReader._read_rows (pandas/parser.c:9261)()
pandas/parser.pyx in pandas.parser.TextReader._convert_column_data (pandas/parser.c:10654)()
pandas/parser.pyx in pandas.parser.TextReader._convert_tokens (pandas/parser.c:11891)()
ValueError: could not convert string to float: some string "\
Based on the discussion in https://github.com/pandas-dev/pandas/issues/2570, the intended way of dealing with malformed numeric values is to load the data without forcing the dtypes and then call pd.to_numeric() (or the other functions in the pd.to_XXXXX() family, as needed) on the affected columns. (The discussion linked above actually suggested convert_objects(), which has since been deprecated.)
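As a minimal sketch with made-up values (not from the question's data), errors='coerce' turns anything that cannot be parsed into NaN:
import pandas as pd

s = pd.Series(['1.5', 'some string', '3'])
pd.to_numeric(s, errors='coerce')
# 0    1.5
# 1    NaN
# 2    3.0
# dtype: float64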
So, in the case at issue here, I would do the following:
data = pd.read_csv('29_2016/data.csv',
                   error_bad_lines=False,
                   encoding='utf-8',
                   na_values=['na'], quotechar='"')
and then
data.b = pd.to_numeric(data.b, errors='coerce')  # 'coerce' turns entries that can't be converted into NaN
data.c = pd.to_numeric(data.c, errors='coerce')
data = data[data.b.notnull() & data.c.notnull()]  # filter out rows with NaN in b or c
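Equivalently, the final filtering step can be written with dropna, which drops the same rows (any row where b or c ended up NaN):
data = data.dropna(subset=['b', 'c'])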