Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Get pandas.read_csv to read empty fields as NaN, and empty strings as empty strings

Tags:

python

pandas

csv

This is quite the opposite of Get pandas.read_csv to read empty values as empty string instead of nan

Given the following CSV file:

col,val
"hi
there",1
,2
\f\,3
"",4
"""hi""",5

I want it to be read as:

         col  val
0  hi\nthere    1
1        NaN    2
2        \f\    3
3               4
4       "hi"    5

That is, reading the empty field (val 2) as NaN, while keeping the empty string (val 4) as empty string.

Currently pd.read_csv converts val 2 and val 4 both as NaN, or if I use na_filter=False both are kept as empty string.

I'm assuming these two representations mean different things in CSV (empty fields vs empty string), so I'm assuming pandas should be able to distinguish this too.

Is there a way to make pandas to distinguish these two cases? Or is my assumption wrong, that the two representations are actually the same? (please point me to a CSV standard if the second one is the case)

More information, I got the CSV by exporting BigQuery table (with the intended meaning, val 2 is null and val 4 is empty string) into CSV. And I want to get the exact same table back. So this example is not just a contrived example, but is actually used by BigQuery when exporting to CSV.

EDIT: further search reveals a Github issue 4 years ago that discusses similar point (see this comment, for example), and one of the commenter mentions that there is some coercion (I'm not sure what they refer to, but I understand it as coercion between empty field and empty string). Is this still happening?

like image 511
justhalf Avatar asked Nov 29 '19 07:11

justhalf


1 Answers

Another option would be do disable the quoting to get the fields where an empty string is present and where nothing is present. The problem in this case are the entries which include new line characters in the test. We would need to remove these chars first and merge the lines to create a new data file.

When reading the new data file with quoting off, empty values are NaN and empty strings are two quotes. This dataframe then can be used to set the NaN in the original dataframe to set the real NaNs.

import numpy as np
import pandas as pd

with open('./data.csv') as f:
    lines = f.readlines()

# merge lines where the comma is missing
it = iter(lines)
lines2 = [x if ',' in x else x + next(it) for x in it]
# replace \n which are not at the end of the line
lines3 = [l.replace('\n','') + '\n' for l in lines2]
# write new file with merged lines
with open('./data_merged.csv', 'w+') as f:
    f.writelines(lines3)


# read original data
df = pd.read_csv('./data.csv', na_filter=False)
# read merged lines data with quoting off
df_merged = pd.read_csv('./data_merged.csv', quoting=3)

# in df_merged dataframe if is NaN it is a real NaN
# set lines in original df to NaN when in df_merged is NaN
df.loc[df_merged.col.isna(), 'col'] = np.NaN
like image 125
mjspier Avatar answered Oct 06 '22 02:10

mjspier