Pandas read_csv failing on columns with null characters

1 Answer

Yes, I could reproduce the problem, but I don't know how to fix it with pd.read_csv. Here is a workaround using np.genfromtxt:

In [45]: import pandas as pd
In [46]: import numpy as np
In [47]: arr = np.genfromtxt('test3.csv', delimiter = ',', 
                             dtype = None, names = True)

In [48]: df = pd.DataFrame(arr)

In [49]: df
Out[49]: 
   x    y
0     Reg
1     Reg
2  I  Swp
3  I  Swp

Note that with names = True, the first valid line of the CSV is interpreted as column names (and therefore does not affect the dtype of the values on the subsequent lines). Thus, if the CSV file contains numerical data, such as

In [22]: with open('/tmp/test.csv','r') as f:
   ....:     print(repr(f.read()))
   ....:     
'x,y,z\n \x00\x00\x00,Reg,1\n \x00\x00\x00,Reg,2\nI,Swp,3\nI,Swp,4\n'

then genfromtxt will assign a numerical dtype to the third column (<i4 in this case):

In [19]: arr = np.genfromtxt('/tmp/test.csv', delimiter = ',', dtype = None, names = True)

In [20]: arr
Out[20]: 
array([('', 'Reg', 1), ('', 'Reg', 2), ('I', 'Swp', 3), ('I', 'Swp', 4)], 
      dtype=[('x', '|S3'), ('y', '|S3'), ('z', '<i4')])
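The same behaviour can be checked without a file on disk. The following is only a minimal sketch, not part of the session above: the in-memory csv_text is a hypothetical stand-in for the CSV (without the null bytes), and encoding="utf-8" is an assumption made so that string columns come back as text rather than bytes on Python 3.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical in-memory stand-in for a small CSV file with a header row.
csv_text = "x,y,z\nA,Reg,1\nB,Reg,2\nI,Swp,3\nI,Swp,4\n"

# names=True takes the field names from the header line, so the header text
# plays no part in dtype inference; dtype=None asks genfromtxt to infer a
# dtype for each column from the remaining lines.
# encoding="utf-8" is an assumption (string columns are returned as text).
arr = np.genfromtxt(io.StringIO(csv_text), delimiter=",", dtype=None,
                    names=True, encoding="utf-8")

# A structured array converts to a DataFrame with one column per field.
df = pd.DataFrame(arr)

print(arr.dtype)
print(df)
```

The third column should come back with an integer dtype, although its exact width can depend on the platform and NumPy version.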

However, if the numerical data is intermingled with bytes such as '\x00', genfromtxt will be unable to recognize the column as numerical and will fall back to a string dtype. Nevertheless, you can force the dtypes of the columns by passing them explicitly through the dtype parameter. For example,

In [11]: arr = np.genfromtxt('/tmp/test.csv', delimiter = ',', dtype = [('x', '|i4'), ('y', '|S3')], names = True)

This sets the first column, x, to dtype |i4 (4-byte integer) and the second column, y, to dtype |S3 (3-byte string). See the NumPy documentation on data type objects (dtypes) for more information on available dtypes.
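As a self-contained illustration of passing dtype explicitly, here is a hedged sketch on clean data. The csv_text and the forced list are hypothetical; "U3" is used as the text counterpart of the S3 type mentioned above, and encoding="utf-8" is again an assumption for Python 3.

```python
import io

import numpy as np

# Hypothetical in-memory stand-in for a three-column CSV with a header row.
csv_text = "x,y,z\nA,Reg,1\nB,Reg,2\nI,Swp,3\nI,Swp,4\n"

# One (name, type) pair per column instead of dtype=None.
# "U3" is a 3-character text field; "<i4" is a 4-byte little-endian integer.
forced = [("x", "U3"), ("y", "U3"), ("z", "<i4")]

arr = np.genfromtxt(io.StringIO(csv_text), delimiter=",", dtype=forced,
                    names=True, encoding="utf-8")

print(arr.dtype)
print(arr["z"])
```

With an explicit dtype, genfromtxt does not infer the column types, so it is worth checking the resulting values whenever the input may contain fields that cannot be parsed as the requested type.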

121

answered Nov 23 '22 02:11

unutbu

Tags: python, pandas

Asked by user1827356
