I ran into the following problem with NumPy 1.10.2 when reading a CSV file. I cannot figure out how to give explicit datatypes to genfromtxt.
Here is the CSV, minimal.csv:
x,y
1,hello
2,hello
3,jello
4,jelly
5,belly
Here I try to read it with genfromtxt:
import numpy
numpy.genfromtxt('minimal.csv', dtype=(int, str))
I also tried:
import numpy
numpy.genfromtxt('minimal.csv', names=True, dtype=(int, str))
Either way, I get the error:
Traceback (most recent call last):
File "visualize_numpy.py", line 39, in <module>
numpy.genfromtxt('minimal.csv', dtype=(int, str))
File "/Users/xeli/workspace/myproj/env/lib/python3.5/site-packages/numpy/lib/npyio.py", line 1518, in genfromtxt
replace_space=replace_space)
File "/Users/xeli/workspace/myproj/env/lib/python3.5/site-packages/numpy/lib/_iotools.py", line 881, in easy_dtype
ndtype = np.dtype(ndtype)
ValueError: mismatch in size of old and new data-descriptor
Alternatively, I tried:
import numpy
numpy.genfromtxt('minimal.csv', dtype=[('x', int), ('y', str)])
Which throws:
Traceback (most recent call last):
File "visualize_numpy.py", line 39, in <module>
numpy.genfromtxt('minimal.csv', dtype=[('x', int), ('y', str)])
File "/Users/xeli/workspace/myproj/env/lib/python3.5/site-packages/numpy/lib/npyio.py", line 1834, in genfromtxt
rows = np.array(data, dtype=[('', _) for _ in dtype_flat])
ValueError: size of tuple must match number of fields.
I know dtype=None makes NumPy try to guess the correct types, and it usually works well. However, the documentation says it is much slower than giving explicit types. In my case computational efficiency is required, so dtype=None is not an option.
Is there something terribly wrong with my approach or NumPy?
This works well, and preserves your header information:
df = numpy.genfromtxt('minimal.csv',
                      names=True,
                      dtype=None,
                      delimiter=',')
This makes genfromtxt guess the dtype, which is generally what you want. The delimiter is a comma, so we should pass that argument as well. Finally, names=True preserves the header information.
Simply access your data by column name, as you would with a data frame:
>>> print(df['x'])
[1 2 3 4 5]
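For anyone who wants to try this without creating a file, here is a minimal sketch that reads the same rows from an in-memory buffer. It assumes NumPy 1.14 or newer, because it passes the encoding argument (which the 1.10.2 release in the question does not have); the names csv_text and guessed are only illustrative.

```python
import io

import numpy as np

# In-memory stand-in for minimal.csv, so the example is self-contained.
csv_text = "x,y\n1,hello\n2,hello\n3,jello\n4,jelly\n5,belly\n"

# dtype=None lets genfromtxt guess each column's type;
# names=True takes the field names from the header row.
# encoding= is an assumption: it requires NumPy >= 1.14.
guessed = np.genfromtxt(io.StringIO(csv_text),
                        names=True,
                        dtype=None,
                        delimiter=',',
                        encoding='utf-8')

print(guessed.dtype.names)  # ('x', 'y')
print(guessed['x'])         # [1 2 3 4 5]
print(guessed['y'])
```

The result is a structured array, so each column is reached by its field name rather than by a column index.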
Edit: as per your comment below, you could provide the dtype explicitly, like so:
df = numpy.genfromtxt('minimal.csv',
                      names=True,
                      dtype=[('x', int), ('y', 'S5')],  # assuming each string has length <= 5
                      delimiter=',')
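The explicit-dtype version can be checked the same way. The sketch below is an assumption-laden variant rather than the exact call above: it uses 'U5' instead of 'S5' so that Python 3 returns str values rather than bytes, and it passes encoding, which needs NumPy 1.14 or newer. The names csv_text and typed are illustrative.

```python
import io

import numpy as np

# In-memory stand-in for minimal.csv, so the example is self-contained.
csv_text = "x,y\n1,hello\n2,hello\n3,jello\n4,jelly\n5,belly\n"

# Explicit dtype: an integer column and a fixed-width text column.
# 'U5' holds at most 5 characters per value.
# encoding= is an assumption: it requires NumPy >= 1.14.
typed = np.genfromtxt(io.StringIO(csv_text),
                      names=True,
                      dtype=[('x', int), ('y', 'U5')],
                      delimiter=',',
                      encoding='utf-8')

print(typed['x'])        # [1 2 3 4 5]
print(typed['y'].dtype)  # <U5
```

Note that the delimiter argument matters here as much as the dtype: without it each line is treated as a single whitespace-separated field, which does not match a two-field dtype.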