I have a CSV file with a column of strings that I want to read with pandas. In this file the string null
occurs as an actual value and should not be treated as a missing value.
Example:
import pandas as pd
from io import StringIO
data = u'strings,numbers\nfoo,1\nbar,2\nnull,3'
print(pd.read_csv(StringIO(data)))
This gives the following output:
  strings  numbers
0     foo        1
1     bar        2
2     NaN        3
What can I do to get the value null into the DataFrame as-is (and not as NaN)? The file can be assumed not to contain any actually missing values.
You can specify a converters argument for the strings column:
pd.read_csv(StringIO(data), converters={'strings' : str})
  strings  numbers
0     foo        1
1     bar        2
2    null        3
This will bypass pandas' automatic NA parsing for that column.
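To see that converters really is surgical, here is a small sketch (the column names a and b and the sample data are made up for illustration): the converted column keeps 'null' as a string, while an unconverted column containing the same token still becomes NaN.

```python
import pandas as pd
from io import StringIO

# 'null' appears in both string columns; a converter is set on 'a' only
data = u'a,b,n\nnull,null,1\nfoo,bar,2'
df = pd.read_csv(StringIO(data), converters={'a': str})

# column 'a' keeps the literal string; column 'b' is parsed to NaN
print(df)
```

The converter receives each raw field as a string before NA detection runs, which is why str acts as an identity that preserves 'null'.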
Another option is setting na_filter=False:
pd.read_csv(StringIO(data), na_filter=False)
  strings  numbers
0     foo        1
1     bar        2
2    null        3
This disables NA filtering for the entire DataFrame, so use it with caution. I recommend the first option if you want to apply this surgically to select columns.
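One side effect worth knowing about (a minimal sketch with made-up sample data): with na_filter=False, genuinely empty cells are no longer turned into NaN either, so they come through as empty strings.

```python
import pandas as pd
from io import StringIO

# second row has an empty strings field
data = 'strings,numbers\nfoo,1\n,2\nnull,3'
df = pd.read_csv(StringIO(data), na_filter=False)

# the empty cell is read as '' (empty string), not NaN;
# numeric columns are still type-inferred as usual
print(df)
```

This is why the caution above matters: if your file does contain truly missing values, na_filter=False hides them from isnull()/notnull().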
The reason this happens is that the string 'null' is treated as NaN during parsing. You can turn this off by passing keep_default_na=False, in addition to @coldspeed's answer:
In [49]: import io
    ...: import pandas as pd
    ...: data = u'strings,numbers\nfoo,1\nbar,2\nnull,3'
    ...: df = pd.read_csv(io.StringIO(data), keep_default_na=False)
    ...: df
Out[49]:
  strings  numbers
0     foo        1
1     bar        2
2    null        3
The full list is:
na_values : scalar, str, list-like, or dict, default None
Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’.
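If you still want some of those tokens to count as missing, keep_default_na=False combines with na_values: pass only the strings you do want recognized as NA. A sketch (the sample data is made up for illustration):

```python
import pandas as pd
from io import StringIO

# keep 'null' literal, but still treat 'NA' as missing
data = 'strings,numbers\nfoo,1\nNA,2\nnull,3'
df = pd.read_csv(StringIO(data), keep_default_na=False, na_values=['NA'])

print(df)
```

With keep_default_na=False the built-in list above is discarded entirely, and na_values becomes the complete set of NA markers.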
UPDATE: 2020-03-23 for Pandas 1+:
many thanks to @aiguofer for the adapted solution:
import io
import pandas as pd
na_vals = pd.io.parsers.STR_NA_VALUES.difference({'NULL','null'})
df = pd.read_csv(io.StringIO(data), na_values=na_vals, keep_default_na=False)
Old answer:
we can dynamically exclude 'NULL' and 'null' from the set of default _NA_VALUES:
In [4]: na_vals = pd.io.common._NA_VALUES.difference({'NULL','null'})
In [5]: na_vals
Out[5]:
{'',
'#N/A',
'#N/A N/A',
'#NA',
'-1.#IND',
'-1.#QNAN',
'-NaN',
'-nan',
'1.#IND',
'1.#QNAN',
'N/A',
'NA',
'NaN',
'n/a',
'nan'}
and use it in read_csv():
df = pd.read_csv(io.StringIO(data), na_values=na_vals)
Other answers are better for reading in a CSV without "null" being interpreted as NaN, but if you have a DataFrame that you want "fixed" after the fact, this code will do so: df = df.fillna('null')