NumPy: mismatch in size of old and new data-descriptor

I ran into the following problem with NumPy 1.10.2 when reading a CSV file. I cannot figure out how to give explicit datatypes to genfromtxt.

Here is the CSV, minimal.csv:

x,y
1,hello
2,hello
3,jello
4,jelly
5,belly

Here I try to read it with genfromtxt:

import numpy
numpy.genfromtxt('minimal.csv', dtype=(int, str))

I also tried:

import numpy
numpy.genfromtxt('minimal.csv', names=True, dtype=(int, str))

Anyway, I get the error:

Traceback (most recent call last):
  File "visualize_numpy.py", line 39, in <module>
    numpy.genfromtxt('minimal.csv', dtype=(int, str))
  File "/Users/xeli/workspace/myproj/env/lib/python3.5/site-packages/numpy/lib/npyio.py", line 1518, in genfromtxt
    replace_space=replace_space)
  File "/Users/xeli/workspace/myproj/env/lib/python3.5/site-packages/numpy/lib/_iotools.py", line 881, in easy_dtype
    ndtype = np.dtype(ndtype)
ValueError: mismatch in size of old and new data-descriptor

Alternatively, I tried:

import numpy
numpy.genfromtxt('minimal.csv', dtype=[('x', int), ('y', str)])

Which throws:

Traceback (most recent call last):
  File "visualize_numpy.py", line 39, in <module>
    numpy.genfromtxt('minimal.csv', dtype=[('x', int), ('y', str)])
  File "/Users/xeli/workspace/myproj/env/lib/python3.5/site-packages/numpy/lib/npyio.py", line 1834, in genfromtxt
    rows = np.array(data, dtype=[('', _) for _ in dtype_flat])
ValueError: size of tuple must match number of fields.

I know that dtype=None makes NumPy guess the correct types, and that usually works well. However, the documentation says this is much slower than specifying the types explicitly. Computational efficiency matters in my case, so dtype=None is not an option.

Is there something terribly wrong with my approach, or with NumPy?

Akseli Palén asked Dec 15 '15 16:12


1 Answer

This works well, and preserves your header information:

df = numpy.genfromtxt('minimal.csv',
                      names=True,
                      dtype=None,
                      delimiter=',')

dtype=None makes genfromtxt infer the type of each column, which is generally what you want. The file is comma-separated, so delimiter=',' has to be passed as well; by default genfromtxt splits on whitespace. Finally, names=True takes the field names from the header row.
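To see why the delimiter matters, here is a small sketch that reads the same rows with and without delimiter=','. It is only an illustration: it assumes NumPy 1.14 or later (for the encoding argument), reads from an in-memory io.StringIO copy instead of minimal.csv, and CSV_TEXT is a name made up for this example.

```python
import io
import numpy as np

# Same contents as minimal.csv, kept in memory so the sketch is self-contained.
# CSV_TEXT is a hypothetical name, not part of the original question.
CSV_TEXT = "x,y\n1,hello\n2,hello\n3,jello\n4,jelly\n5,belly\n"

# Default delimiter (None) splits on whitespace, so "1,hello" is ONE field.
no_delim = np.genfromtxt(io.StringIO(CSV_TEXT), dtype=None, encoding="utf-8")

# With delimiter=',' every row yields two fields, named from the header row.
with_delim = np.genfromtxt(io.StringIO(CSV_TEXT), names=True, dtype=None,
                           delimiter=",", encoding="utf-8")

print(no_delim.dtype.names)    # no named fields: a plain array of strings
print(with_delim.dtype.names)  # field names taken from the header
```

Without the delimiter each line collapses into a single field, so a two-field dtype cannot be matched against it. That is consistent with the "size of tuple must match number of fields" error in the question.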

Access a column by its field name, as you would with any structured array:

>>> print(df['x'])
[1 2 3 4 5]
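Beyond a single column, the result can be indexed by row or filtered with a mask. A minimal sketch, assuming NumPy 1.14 or later and an in-memory copy of the file (CSV_TEXT is a made-up name for this example):

```python
import io
import numpy as np

# In-memory stand-in for minimal.csv (hypothetical name CSV_TEXT).
CSV_TEXT = "x,y\n1,hello\n2,hello\n3,jello\n4,jelly\n5,belly\n"

df = np.genfromtxt(io.StringIO(CSV_TEXT),
                   names=True,
                   dtype=None,
                   delimiter=",",
                   encoding="utf-8")

print(df.dtype)         # the per-column types that genfromtxt inferred
print(df["y"])          # one column, selected by field name
print(df[0])            # one row, as a record
print(df[df["x"] > 3])  # boolean masks work as on any ndarray
```

Note that df is a one-dimensional structured array with one record per data row, not a two-dimensional table.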

Edit: as per your comment below, you could provide the dtype explicitly, like so:

df = numpy.genfromtxt('minimal.csv',
                      names=True,
                      # 'S5' is a byte string; assumes each string has length <= 5
                      dtype=[('x', int), ('y', 'S5')],
                      delimiter=',')
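
On Python 3, 'S5' yields byte strings (b'hello'). If text strings are wanted, a unicode dtype such as 'U5' is one option. The sketch below assumes NumPy 1.14 or later and an in-memory copy of the file (CSV_TEXT is a made-up name). It also shows that a bare str carries no length, so the width has to be spelled out.

```python
import io
import numpy as np

# In-memory stand-in for minimal.csv (hypothetical name CSV_TEXT).
CSV_TEXT = "x,y\n1,hello\n2,hello\n3,jello\n4,jelly\n5,belly\n"

# A bare str dtype has no fixed width, so the field width must be explicit.
print(np.dtype(str).itemsize)   # 0

typed = np.genfromtxt(io.StringIO(CSV_TEXT),
                      names=True,
                      dtype=[("x", int), ("y", "U5")],  # text of at most 5 characters
                      delimiter=",",
                      encoding="utf-8")

print(typed["y"])
```

The width 5 is an assumption taken from this sample file; strings longer than the declared width would not fit, so pick it from the longest value you expect.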
Niels Wouda answered Nov 02 '22 04:11