My environment:
OS: Windows 11
Python version: 3.13.2
NumPy version: 2.1.3
According to the NumPy Fundamentals guide describing how to use the numpy.genfromtxt function:
The optional argument comments is used to define a character string that marks the beginning of a comment. By default, genfromtxt assumes comments='#'. The comment marker may occur anywhere on the line. Any character present after the comment marker(s) is simply ignored.
Note: There is one notable exception to this behavior: if the optional argument names=True, the first commented line will be examined for names.
To test the note above, I created the following data file, with the header line written as a commented line:
C:\tmp\data.txt
#firstName|LastName
Anthony|Quinn
Harry|POTTER
George|WASHINGTON
And the following program to read and print the content of the file:
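import numpy as np

# Read the pipe-delimited file; the commented first line should supply the field names.
with open("C:/tmp/data.txt", "r", encoding="UTF-8") as fd:
    result = np.genfromtxt(fd,
                           comments="#",
                           delimiter="|",
                           dtype=str,
                           names=True,
                           skip_header=0)
print(f"result = {result}")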
But the result is not what I expected:
result = [('', '') ('', '') ('', '')]
I cannot figure out where the error in my code is, and I don't understand why the content of my data file, in particular the header line after the comment indicator #, is not interpreted correctly.
I'd appreciate it if you could clarify this.
The magic happens in this line in genfromtxt:
rows = np.array(data, dtype=[('', _) for _ in dtype_flat])
The inputs are
data = [('Anthony', 'Quinn'), ('Harry', 'POTTER'), ('George', 'WASHINGTON')]
dtype_flat = [dtype('<U'), dtype('<U')]
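The effect can be reproduced outside genfromtxt. This is a minimal sketch that feeds the same inputs to np.array directly; np.dtype("U") is my stand-in for the unsized entries of dtype_flat, and on the asker's NumPy version I would expect it to print the same empty tuples as in the question.

```python
import numpy as np

data = [("Anthony", "Quinn"), ("Harry", "POTTER"), ("George", "WASHINGTON")]

# Stand-in for dtype_flat: two unicode dtypes with no length attached.
dtype_flat = [np.dtype("U"), np.dtype("U")]

# Same construction as the genfromtxt line quoted above.
rows = np.array(data, dtype=[("", dt) for dt in dtype_flat])
print(rows)
```

The fields have no room for any characters, so every value is stored as an empty string.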
This is not too surprising: your strings have variable lengths, numpy is designed for homogeneous, fixed-size data types, and the '<U' dtype here carries no length, so each string is truncated to zero characters. You should have a couple of workarounds available, but only one seems to work.
If you set dtype=object, you get
result = [(b'Anthony', b'Quinn') (b'Harry', b'POTTER') (b'George', b'WASHINGTON')]
You would also expect that specifying a string size explicitly, instead of dtype=str, should work. However, using something like dtype='<U10' does not work and produces the same empty result as before.
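Two other variants that may be worth trying (these are my assumption, not something taken from the linked issue): give genfromtxt a structured dtype with an explicit width per field, so that names=True only renames the fields, or pass dtype=None and let the column types and string widths be inferred. The sketch below uses io.StringIO in place of C:/tmp/data.txt, and the width 10 is chosen only because it fits this sample data.

```python
import io

import numpy as np

# In-memory copy of the sample file from the question.
text = "#firstName|LastName\nAnthony|Quinn\nHarry|POTTER\nGeorge|WASHINGTON\n"

# Variant 1: one sized unicode field per column.
sized = np.genfromtxt(io.StringIO(text), comments="#", delimiter="|",
                      dtype="U10,U10", names=True, encoding="utf-8")

# Variant 2: let genfromtxt infer the column types and string widths.
inferred = np.genfromtxt(io.StringIO(text), comments="#", delimiter="|",
                         dtype=None, names=True, encoding="utf-8")

print(sized.dtype.names, sized["LastName"])
print(inferred.dtype.names, inferred["firstName"])
```

In both variants the field names should come from the commented header line, as the note in the guide describes.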
There appears to be an issue open for this, or at least a similar issue: https://github.com/numpy/numpy/issues/9644