I have a CSV file with a column of strings that I want to read with pandas. In this file the string null
occurs as an actual value and should not be treated as a missing value.
Example:
import pandas as pd
from io import StringIO
data = u'strings,numbers\nfoo,1\nbar,2\nnull,3'
print(pd.read_csv(StringIO(data)))
This gives the following output:
  strings  numbers
0     foo        1
1     bar        2
2     NaN        3
What can I do to get the value null into the DataFrame as-is (and not as NaN)? The file can be assumed not to contain any actually missing values.
You can specify a converters argument for the strings column:
pd.read_csv(StringIO(data), converters={'strings' : str})
  strings  numbers
0     foo        1
1     bar        2
2    null        3
This will bypass pandas' automatic NA parsing for that column.
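To see that converters really is surgical, here is a small sketch (the column names a and b and the sample data are made up for illustration): the converted column keeps 'null' as a string, while an unconverted column containing the same token still becomes NaN.

```python
import pandas as pd
from io import StringIO

# 'null' appears in both string columns; a converter is set on 'a' only
data = u'a,b,n\nnull,null,1\nfoo,bar,2'
df = pd.read_csv(StringIO(data), converters={'a': str})

# column 'a' keeps the literal string; column 'b' is parsed to NaN
print(df)
```

The converter receives each raw field as a string before NA detection runs, which is why str acts as an identity that preserves 'null'.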
Another option is setting na_filter=False:
pd.read_csv(StringIO(data), na_filter=False)
  strings  numbers
0     foo        1
1     bar        2
2    null        3
This disables NA filtering for the entire DataFrame, so use it with caution. I recommend the first option if you want to apply this surgically to select columns.
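One side effect worth knowing about (a minimal sketch with made-up sample data): with na_filter=False, genuinely empty cells are no longer turned into NaN either, so they come through as empty strings.

```python
import pandas as pd
from io import StringIO

# second row has an empty strings field
data = 'strings,numbers\nfoo,1\n,2\nnull,3'
df = pd.read_csv(StringIO(data), na_filter=False)

# the empty cell is read as '' (empty string), not NaN;
# numeric columns are still type-inferred as usual
print(df)
```

This is why the caution above matters: if your file does contain truly missing values, na_filter=False hides them from isnull()/notnull().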
The reason this happens is that the string 'null' is treated as NaN during parsing. You can turn this off by passing keep_default_na=False, in addition to @coldspeed's answer:
In [49]: import io
    ...: import pandas as pd
    ...: data = u'strings,numbers\nfoo,1\nbar,2\nnull,3'
    ...: df = pd.read_csv(io.StringIO(data), keep_default_na=False)
    ...: df
Out[49]:
  strings  numbers
0     foo        1
1     bar        2
2    null        3
The full list is:
na_values : scalar, str, list-like, or dict, default None
Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’.
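If you still want some of those tokens to count as missing, keep_default_na=False combines with na_values: pass only the strings you do want recognized as NA. A sketch (the sample data is made up for illustration):

```python
import pandas as pd
from io import StringIO

# keep 'null' literal, but still treat 'NA' as missing
data = 'strings,numbers\nfoo,1\nNA,2\nnull,3'
df = pd.read_csv(StringIO(data), keep_default_na=False, na_values=['NA'])

print(df)
```

With keep_default_na=False the built-in list above is discarded entirely, and na_values becomes the complete set of NA markers.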
UPDATE: 2020-03-23 for Pandas 1+:
many thanks to @aiguofer for the adapted solution:
import io
import pandas as pd
na_vals = pd.io.parsers.STR_NA_VALUES.difference({'NULL','null'})
df = pd.read_csv(io.StringIO(data), na_values=na_vals, keep_default_na=False)
Old answer:
we can dynamically exclude 'NULL' and 'null' from the set of default _NA_VALUES:
In [4]: na_vals = pd.io.common._NA_VALUES.difference({'NULL','null'})
In [5]: na_vals
Out[5]:
{'',
'#N/A',
'#N/A N/A',
'#NA',
'-1.#IND',
'-1.#QNAN',
'-NaN',
'-nan',
'1.#IND',
'1.#QNAN',
'N/A',
'NA',
'NaN',
'n/a',
'nan'}
and use it in read_csv():
df = pd.read_csv(io.StringIO(data), na_values=na_vals)
Other answers are better for reading in a CSV without "null" being interpreted as NaN, but if you have a DataFrame that you want "fixed" after the fact, this code will do so: df = df.fillna('null')