
How to treat NULL as a normal string with pandas?

I have a CSV file with a column of strings that I want to read with pandas. In this file the string null occurs as an actual value and should not be treated as a missing value.

Example:

import pandas as pd
from io import StringIO

data = u'strings,numbers\nfoo,1\nbar,2\nnull,3'
print(pd.read_csv(StringIO(data)))

This gives the following output:

  strings  numbers
0     foo        1
1     bar        2
2     NaN        3

What can I do to get the value null as it is (and not as NaN) into the DataFrame? The file can be assumed to not contain any actually missing values.

piripiri asked Jun 04 '18 15:06




4 Answers

You can specify a converters argument for the string column.

pd.read_csv(StringIO(data), converters={'strings' : str})

  strings  numbers
0     foo        1
1     bar        2
2    null        3

This will bypass pandas' automatic parsing for that column.


Another option is setting na_filter=False:

pd.read_csv(StringIO(data), na_filter=False)

  strings  numbers
0     foo        1
1     bar        2
2    null        3

This applies to the entire DataFrame, so use it with caution. I recommend the first option if you want to apply this surgically to selected columns instead.
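To illustrate the difference in scope between the two options, here is a minimal sketch. The empty field in the numbers column is an assumption added for illustration only; the question's file contains no missing values.

```python
import pandas as pd
from io import StringIO

# Hypothetical variant of the question's data: the second data row
# has an empty 'numbers' field (a genuinely missing value).
data = 'strings,numbers\nfoo,1\nnull,\nbar,3'

# Option 1: only the 'strings' column bypasses NA detection.
df_conv = pd.read_csv(StringIO(data), converters={'strings': str})

# Option 2: NA detection is switched off for every column.
df_nofilter = pd.read_csv(StringIO(data), na_filter=False)

print(df_conv['strings'].tolist())          # 'null' is kept as a string
print(df_conv['numbers'].isna().sum())      # the empty field is still NaN
print(df_nofilter['numbers'].isna().sum())  # no NaN is detected at all
```

With converters, missing values in the other columns are still recognized; with na_filter=False they are not, which is why the second option needs more care.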

cs95 answered Oct 22 '22 22:10


This happens because the string 'null' is treated as NaN during parsing. In addition to @coldspeed's answer, you can turn this off by passing keep_default_na=False:

In[49]:
import io
import pandas as pd

data = u'strings,numbers\nfoo,1\nbar,2\nnull,3'
df = pd.read_csv(io.StringIO(data), keep_default_na=False)
df

Out[49]: 
  strings  numbers
0     foo        1
1     bar        2
2    null        3

The full list of strings treated as NaN by default is given in the read_csv documentation for na_values:

na_values : scalar, str, list-like, or dict, default None

Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’.
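The documentation excerpt mentions that a dict gives per-column NA values. A minimal sketch of how that could be combined with keep_default_na=False follows; the sample data with an empty numbers field is an assumption made for illustration, not part of the question.

```python
import pandas as pd
from io import StringIO

# Hypothetical data: 'null' is a real value, the empty 'numbers' field is missing.
data = 'strings,numbers\nfoo,1\nnull,\nbar,3'

df = pd.read_csv(
    StringIO(data),
    keep_default_na=False,          # drop the built-in NA strings everywhere
    na_values={'numbers': ['']},    # but treat empty fields in 'numbers' as NaN
)

print(df)
print(df['numbers'].isna().tolist())
```

Here the strings column has no NA strings at all, so 'null' survives, while the numbers column still flags empty fields as missing.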

EdChum answered Oct 22 '22 22:10


UPDATE 2020-03-23, for pandas 1+:

Many thanks to @aiguofer for the adapted solution:

na_vals = pd.io.parsers.STR_NA_VALUES.difference({'NULL','null'})
df = pd.read_csv(io.StringIO(data), na_values=na_vals, keep_default_na=False)
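Since the location of this internal set appears to have moved between pandas versions, one possible alternative is to spell the list out by hand. The sketch below copies the default strings from the documentation excerpt quoted in another answer, minus 'NULL' and 'null'; the list may differ in other pandas versions, and the 'NA' row in the sample data is an assumption added to show that other markers still work.

```python
import pandas as pd
from io import StringIO

# Default NA strings as quoted from the read_csv docs, without 'NULL' and 'null'.
# This hand-written list is an assumption; check it against your pandas version.
na_vals = ['', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan',
           '1.#IND', '1.#QNAN', 'N/A', 'NA', 'NaN', 'n/a', 'nan']

# Hypothetical data: 'NA' should become NaN, 'null' should stay a string.
data = 'strings,numbers\nfoo,1\nNA,2\nnull,3'

df = pd.read_csv(StringIO(data), na_values=na_vals, keep_default_na=False)
print(df)
```

This avoids relying on a private attribute, at the cost of maintaining the list yourself.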

Old answer:

We can dynamically exclude 'NULL' and 'null' from the set of default _NA_VALUES:

In [4]: na_vals = pd.io.common._NA_VALUES.difference({'NULL','null'})

In [5]: na_vals
Out[5]:
{'',
 '#N/A',
 '#N/A N/A',
 '#NA',
 '-1.#IND',
 '-1.#QNAN',
 '-NaN',
 '-nan',
 '1.#IND',
 '1.#QNAN',
 'N/A',
 'NA',
 'NaN',
 'n/a',
 'nan'}

and use it in read_csv():

df = pd.read_csv(io.StringIO(data), na_values=na_vals, keep_default_na=False)

MaxU - stop WAR against UA answered Oct 22 '22 21:10


Other answers are better for reading in a CSV without "null" being interpreted as NaN, but if you already have a DataFrame that you want to fix, this code will do so: df = df.fillna('null'). Note that this replaces every NaN in the DataFrame, in all columns.
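If only one column should be repaired, a narrower variant is to call fillna on that column alone. This is a sketch under the assumption that the DataFrame also contains a genuinely missing number, which is not the case in the question's file.

```python
import pandas as pd
from io import StringIO

# Hypothetical data: 'null' was parsed as NaN, and one number is truly missing.
data = 'strings,numbers\nfoo,1\nbar,\nnull,3'
df = pd.read_csv(StringIO(data))

# Restore the string only in the 'strings' column; 'numbers' keeps its NaN.
df['strings'] = df['strings'].fillna('null')

print(df)
```

Unlike df.fillna('null') on the whole DataFrame, this leaves missing values in the numeric column untouched.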

Acccumulation answered Oct 22 '22 20:10