NumPy: mismatch in size of old and new data-descriptor

Question

I ran into the following problem with NumPy 1.10.2 when reading a CSV file. I cannot figure out how to give explicit datatypes to genfromtxt.

Here is the CSV, minimal.csv:

x,y
1,hello
2,hello
3,jello
4,jelly
5,belly

Here I try to read it with genfromtxt:

import numpy
numpy.genfromtxt('minimal.csv', dtype=(int, str))

I also tried:

import numpy
numpy.genfromtxt('minimal.csv', names=True, dtype=(int, str))

Anyway, I get the error:

Traceback (most recent call last):
  File "visualize_numpy.py", line 39, in <module>
    numpy.genfromtxt('minimal.csv', dtype=(int, str))
  File "/Users/xeli/workspace/myproj/env/lib/python3.5/site-packages/numpy/lib/npyio.py", line 1518, in genfromtxt
    replace_space=replace_space)
  File "/Users/xeli/workspace/myproj/env/lib/python3.5/site-packages/numpy/lib/_iotools.py", line 881, in easy_dtype
    ndtype = np.dtype(ndtype)
ValueError: mismatch in size of old and new data-descriptor

Alternatively, I tried:

import numpy
numpy.genfromtxt('minimal.csv', dtype=[('x', int), ('y', str)])

Which throws:

Traceback (most recent call last):
  File "visualize_numpy.py", line 39, in <module>
    numpy.genfromtxt('minimal.csv', dtype=[('x', int), ('y', str)])
  File "/Users/xeli/workspace/myproj/env/lib/python3.5/site-packages/numpy/lib/npyio.py", line 1834, in genfromtxt
    rows = np.array(data, dtype=[('', _) for _ in dtype_flat])
ValueError: size of tuple must match number of fields.

I know that dtype=None makes NumPy guess the types, and that usually works well. However, the documentation says this is much slower than giving explicit types. In my case computational efficiency matters, so dtype=None is not an option.

Is there something terribly wrong with my approach or NumPy?

Niels Wouda · Accepted Answer

This works well, and preserves your header information:

df = numpy.genfromtxt('minimal.csv',
                      names=True,
                      dtype=None,
                      delimiter=',')

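With dtype=None, genfromtxt guesses the dtype of each column, which is generally what you want. The file is comma-separated, so delimiter=',' has to be passed as well; without it genfromtxt splits on whitespace and sees each line as a single column, which is why the number of fields did not match your dtype. Finally, names=True takes the field names from the header row.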

The result is a structured array, so you can access a column by its field name:

>>> print(df['x'])
[1 2 3 4 5]
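
As a minimal sketch of that field access, the snippet below reads the same data from an in-memory string instead of minimal.csv. The csv_text variable and the use of io.StringIO are stand-ins for the file, and encoding='utf-8' assumes a NumPy version recent enough to accept that argument (it was added after the 1.10.2 mentioned in the question).

```python
import io
import numpy as np

# Stand-in for minimal.csv from the question, kept in memory so the sketch runs anywhere.
csv_text = "x,y\n1,hello\n2,hello\n3,jello\n4,jelly\n5,belly\n"

data = np.genfromtxt(io.StringIO(csv_text),
                     names=True,        # take field names from the header row
                     dtype=None,        # let genfromtxt guess each column's type
                     delimiter=',',
                     encoding='utf-8')  # assumption: NumPy new enough to accept `encoding`

print(data.dtype.names)  # field names read from the header
print(data['x'])         # one column, selected by field name
print(data[2])           # one row, returned as a record
print(data[2]['y'])      # a single value: third row, field 'y'
```

Indexing with a field name gives a column, indexing with an integer gives a row, and the two can be combined to reach a single value.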

Edit: as per your comment below, you could provide the dtype explicitly, like so:

df = numpy.genfromtxt('minimal.csv',
                      names=True,
                      dtype=[('x', int), ('y', 'S5')],  # assumes each string is at most 5 characters long
                      delimiter=',')
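
One caveat on Python 3, which the traceback in the question shows is in use: an 'S5' field stores bytes, not str. The sketch below is a variation that is not part of the original answer. It swaps in 'U5' to get ordinary strings, reads from an in-memory stand-in for minimal.csv, and again assumes a NumPy version that accepts the encoding argument.

```python
import io
import numpy as np

# Stand-in for minimal.csv from the question.
csv_text = "x,y\n1,hello\n2,hello\n3,jello\n4,jelly\n5,belly\n"

explicit = np.genfromtxt(io.StringIO(csv_text),
                         names=True,
                         # 'U5' = unicode string of up to 5 characters (assumed max length)
                         dtype=[('x', int), ('y', 'U5')],
                         delimiter=',',
                         encoding='utf-8')  # assumption: NumPy new enough to accept `encoding`

print(explicit.dtype)  # the explicit field types, no guessing involved
print(explicit['y'])   # values come back as str rather than bytes
```

Because the types are given up front, genfromtxt does not have to infer them, which is what the question asked for.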

Tags: python, csv, numpy, genfromtxt

Akseli Palén
