The pandas read_csv() function interprets 'NA' as nan (not a number) instead of a valid string.
In the simple case below, note that the output in row 1, column 2 (zero-based count) is nan instead of 'NA'.
sample.tsv (tab delimited)
PDB CHAIN SP_PRIMARY RES_BEG RES_END PDB_BEG PDB_END SP_BEG SP_END
5d8b N P60490 1 146 1 146 1 146
5d8b NA P80377 1 126 1 126 1 126
5d8b O P60491 1 118 1 118 1 118
read_sample.py
import pandas as pd

df = pd.read_csv(
    'sample.tsv',
    sep='\t',
    encoding='utf-8',
)

for df_tuples in df.itertuples(index=True):
    print(df_tuples)
output
(0, u'5d8b', u'N', u'P60490', 1, 146, 1, 146, 1, 146)
(1, u'5d8b', nan, u'P80377', 1, 126, 1, 126, 1, 126)
(2, u'5d8b', u'O', u'P60491', 1, 118, 1, 118, 1, 118)
Re-writing the file with quotes around the data in the 'CHAIN' column and then using the quotechar parameter quotechar='\'' gives the same result. Passing a dictionary of types via the dtype parameter dtype=dict(valid_cols) does not change the result either.
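For reference, a minimal sketch of those two attempts (the file name sample_quoted.tsv and the exact valid_cols mapping are assumptions, not taken from the original files):

import pandas as pd

# Attempt 1: CHAIN values quoted in the file, quotechar passed explicitly --
# 'NA' is still read as nan
df = pd.read_csv('sample_quoted.tsv', sep='\t', quotechar="'")

# Attempt 2: force column dtypes explicitly -- NA detection happens before
# dtype conversion, so 'NA' is still read as nan
valid_cols = {'PDB': str, 'CHAIN': str, 'SP_PRIMARY': str,
              'RES_BEG': int, 'RES_END': int, 'PDB_BEG': int,
              'PDB_END': int, 'SP_BEG': int, 'SP_END': int}
df = pd.read_csv('sample.tsv', sep='\t', dtype=valid_cols)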
An old answer to Prevent pandas from automatically inferring type in read_csv suggests first using a numpy record array to parse the file, but given the ability to now specify column dtypes, this shouldn't be necessary.
Note that itertuples() is used to preserve dtypes, as described in the iterrows documentation: "To preserve dtypes while iterating over the rows, it is better to use itertuples() which returns tuples of the values and which is generally faster than iterrows."
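As a quick illustration of that dtype difference (a toy frame, not the question's data):

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [0.5, 1.5]})

# iterrows() yields each row as a Series, so the int column is upcast to the
# row's common dtype (float64 here)
for _, row in df.iterrows():
    print(type(row['a']))   # <class 'numpy.float64'>

# itertuples() yields namedtuples and keeps each column's dtype
for row in df.itertuples(index=False):
    print(type(row.a))      # <class 'numpy.int64'>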
The example was tested on Python 2 and 3 with pandas versions 0.16.2, 0.17.0, and 0.17.1.
Is there a way to capture a valid string 'NA' instead of it being converted to nan?
You could use the parameters keep_default_na and na_values to set all the NA values by hand (docs):
import pandas as pd
from io import StringIO

data = """PDB CHAIN SP_PRIMARY RES_BEG RES_END PDB_BEG PDB_END SP_BEG SP_END
5d8b N P60490 1 146 1 146 1 146
5d8b NA P80377 _ 126 1 126 1 126
5d8b O P60491 1 118 1 118 1 118"""

df = pd.read_csv(StringIO(data), sep=' ', keep_default_na=False, na_values=['_'])

In [130]: df
Out[130]:
    PDB CHAIN SP_PRIMARY RES_BEG  RES_END  PDB_BEG  PDB_END  SP_BEG  SP_END
0  5d8b     N     P60490       1      146        1      146       1     146
1  5d8b    NA     P80377     NaN      126        1      126       1     126
2  5d8b     O     P60491       1      118        1      118       1     118

In [144]: df.CHAIN.apply(type)
Out[144]:
0    <class 'str'>
1    <class 'str'>
2    <class 'str'>
Name: CHAIN, dtype: object
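Applied to the original tab-delimited sample.tsv from the question, the same idea would look roughly like this (a sketch; it assumes no other markers in the file should be treated as missing):

import pandas as pd

df = pd.read_csv(
    'sample.tsv',
    sep='\t',
    keep_default_na=False,  # stop treating 'NA', 'N/A', etc. as missing
    na_values=[],           # add markers here (e.g. '_') if some values really are missing
)

for df_tuples in df.itertuples(index=True):
    print(df_tuples)        # row 1 now keeps the string 'NA' in the CHAIN column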
EDIT
All default NA values from na_values (as of pandas 1.0.0):
The default NaN recognized values are ['-1.#IND', '1.#QNAN', '1.#IND', '-1.#QNAN', '#N/A N/A', '#N/A', 'N/A', 'n/a', 'NA', '<NA>', '#NA', 'NULL', 'null', 'NaN', '-NaN', 'nan', '-nan', ''].
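If you want the other default markers to still count as missing while only the literal string 'NA' is kept, one option (a sketch; the list is simply the default list above with 'NA' removed) is:

import pandas as pd

# All default NA markers except 'NA', so 'NA' survives as a real string
na_markers = ['-1.#IND', '1.#QNAN', '1.#IND', '-1.#QNAN', '#N/A N/A', '#N/A',
              'N/A', 'n/a', '<NA>', '#NA', 'NULL', 'null', 'NaN', '-NaN',
              'nan', '-nan', '']

df = pd.read_csv('sample.tsv', sep='\t',
                 keep_default_na=False, na_values=na_markers)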
For me, the solution came from using the parameter na_filter=False:
df = pd.read_csv(file_, header=0, dtype=object, na_filter=False)
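Note that na_filter=False switches off missing-value detection entirely, so every field (including genuinely empty ones) comes back as a string. A quick check against the question's file might look like this (a sketch):

import pandas as pd

df = pd.read_csv('sample.tsv', sep='\t', header=0, dtype=object, na_filter=False)
print(df.loc[1, 'CHAIN'])      # 'NA'
print(df.CHAIN.apply(type))    # every value is <class 'str'>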