
Pandas read_csv failing on columns with null characters

Tags:

python

pandas

Column y below should be ['Reg', 'Reg', 'Swp', 'Swp']

In [1]: ! cat /tmp/test3.csv
x,y
 ^@^@^@,Reg
 ^@^@^@,Reg
I,Swp
I,Swp

In [2]: pd.read_csv('/tmp/test3.csv')
Out[2]:
     x    y
0
1  NaN  NaN
2    I  Swp
3    I  Swp

In [3]: f = open('/tmp/test3.csv', 'rb'); print(repr(f.read()))  
'x,y\n \x00\x00\x00,Reg\n \x00\x00\x00,Reg\nI,Swp\nI,Swp\n'
user1827356 asked Jan 23 '13 20:01



1 Answer

Yes, I can reproduce the problem, but I don't know how to fix it with pd.read_csv. Here is a workaround using np.genfromtxt:

In [46]: import numpy as np
In [47]: arr = np.genfromtxt('/tmp/test3.csv', delimiter = ',', 
                             dtype = None, names = True)

In [48]: df = pd.DataFrame(arr)

In [49]: df
Out[49]: 
   x    y
0     Reg
1     Reg
2  I  Swp
3  I  Swp

Note that with names = True, the first valid line of the CSV is interpreted as column names, so it does not affect the dtype inferred for the values on the subsequent lines. Thus, if the CSV file contains numerical data such as

In [22]: with open('/tmp/test.csv','r') as f:
   ....:     print(repr(f.read()))
   ....:     
'x,y,z\n \x00\x00\x00,Reg,1\n \x00\x00\x00,Reg,2\nI,Swp,3\nI,Swp,4\n'

then genfromtxt will assign a numerical dtype to the third column (<i4 in this case):

In [19]: arr = np.genfromtxt('/tmp/test.csv', delimiter = ',', dtype = None, names = True)

In [20]: arr
Out[20]: 
array([('', 'Reg', 1), ('', 'Reg', 2), ('I', 'Swp', 3), ('I', 'Swp', 4)], 
      dtype=[('x', '|S3'), ('y', '|S3'), ('z', '<i4')])
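
As a rough mental model of the inference seen in that output, the sketch below tries progressively wider types on a column's text values and falls back to str when any value fails to parse. It is a simplified standard-library illustration, not genfromtxt's actual implementation; the helper name infer_column_type and the sample lists are made up for this example.

```python
def infer_column_type(values):
    """Return int, float, or str: the first type that parses every value.

    Hypothetical helper for illustration only; NumPy's real inference
    logic is more involved than this.
    """
    for candidate in (int, float):
        try:
            for value in values:
                candidate(value)
        except ValueError:
            # At least one value did not parse; try the next, wider type.
            continue
        return candidate
    # Nothing numeric worked, so treat the column as text.
    return str


clean_column = ["1", "2", "3", "4"]
dirty_column = ["1", "2\x00", "3", "4"]  # one value carries a NUL character

print(infer_column_type(clean_column).__name__)  # int
print(infer_column_type(dirty_column).__name__)  # str
```

The second call returns str because both int("2\x00") and float("2\x00") raise ValueError.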

However, if the numerical data is intermingled with bytes such as '\x00', genfromtxt will be unable to recognize the column as numerical and will fall back to a string dtype. Nevertheless, you can force the dtype of the columns by setting the dtype parameter manually. For example,

In [11]: arr = np.genfromtxt('/tmp/test.csv', delimiter = ',', dtype = [('x', '|i4'), ('y', '|S3')], names = True)

sets the first column x to have dtype |i4 (4-byte integers) and the second column y to have dtype |S3 (3-byte string). See the NumPy documentation on data type objects (dtypes) for more information on available dtypes.
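
Another option, not part of the workaround above, is to remove the null bytes before any parser sees the data. The sketch below is a minimal standard-library illustration: it uses the exact bytes shown in the question, and the helper name strip_nul_bytes is hypothetical. It assumes the null bytes carry no meaning and can be dropped safely.

```python
import csv
import io


def strip_nul_bytes(raw: bytes) -> bytes:
    """Return a copy of raw with every NUL (0x00) byte removed."""
    return raw.replace(b"\x00", b"")


# The file contents as printed by repr() in the question.
raw = b"x,y\n \x00\x00\x00,Reg\n \x00\x00\x00,Reg\nI,Swp\nI,Swp\n"

cleaned = strip_nul_bytes(raw)

# Parse the cleaned text with the csv module and pull out column y.
rows = list(csv.DictReader(io.StringIO(cleaned.decode("ascii"))))
y_column = [row["y"] for row in rows]

print(y_column)  # ['Reg', 'Reg', 'Swp', 'Swp']
```

Since pd.read_csv accepts file-like objects, the cleaned bytes could presumably be wrapped in io.BytesIO(cleaned) and passed to it directly; that step is not tested here and may behave differently across pandas versions.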

unutbu answered Nov 23 '22 02:11