Update: In the latest version of NumPy (e.g., v1.8.1), this is no longer an issue. All the methods mentioned here now work as expected.
Original question: Using object dtype to store a string array is sometimes convenient, especially when one needs to modify the contents of a large array without knowing the maximum string length in advance, e.g.,
>>> import numpy as np
>>> a = np.array([u'abc', u'12345'], dtype=object)
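To illustrate why object dtype is convenient here, the following is a minimal sketch (written for Python 3 and a recent NumPy, with the hypothetical names `fixed` and `flexible`) comparing assignment into a fixed-width unicode array with assignment into an object array. It is only meant to show the behaviour being described, not a recommended pattern.

```python
import numpy as np

# Fixed-width unicode array: the width (3 characters) is set at creation
fixed = np.array(['abc', 'xyz'])
fixed[0] = '12345'  # longer than 3 characters, so NumPy cuts it short

# Object array: each element is an ordinary Python string of any length
flexible = np.array(['abc', 'xyz'], dtype=object)
flexible[0] = '12345'  # stored in full

print(fixed[0], flexible[0])
```

The fixed-width array silently drops the extra characters, while the object array keeps the whole string.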
At some point, one might want to convert the dtype back to unicode or str. However, a simple conversion truncates the strings to length 4 or 1 (why?), e.g.,
>>> b = np.array(a, dtype=unicode)
>>> b
array([u'abc', u'1234'], dtype='<U4')
>>> c = a.astype(unicode)
>>> c
array([u'a', u'1'], dtype='<U1')
Of course, one can always iterate over the entire array explicitly to determine the max length,
>>> d = np.array(a, dtype='<U{0}'.format(np.max([len(x) for x in a])))
>>> d
array([u'abc', u'12345'], dtype='<U5')
Yet, this is a little bit awkward in my opinion. Is there a better way to do this?
Edit to add: According to this closely related question,
>>> len(max(a, key=len))
is another way to find out the longest string length, and this step seems to be unavoidable...
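The explicit approach can be wrapped in a small helper. This is only a sketch: the function name `to_fixed_width_unicode` is made up for illustration, it assumes every element is a string, and it is written for Python 3, where `unicode` no longer exists and `str` is already unicode.

```python
import numpy as np

def to_fixed_width_unicode(arr):
    """Convert an object array of strings to a fixed-width unicode array."""
    # The longest string sets the item width; use 1 if the array is empty
    width = max((len(s) for s in arr.ravel()), default=1)
    return arr.astype('U{0}'.format(width))

a = np.array(['abc', '12345'], dtype=object)
d = to_fixed_width_unicode(a)
print(d.dtype)
```

The length scan is still a full pass over the array; the helper only hides it behind one call.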
I know this is an old question, but in case anyone comes across it looking for an answer, try
c = a.astype('U')
and you should get the result you expect:
c = array([u'abc', u'12345'], dtype='<U5')
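As a quick check on a recent NumPy under Python 3 (an assumption; the original question used Python 2 and an older NumPy), the sketch below confirms that passing `'U'` or `str` without a length lets NumPy pick a width that fits the longest element. The variable names are illustrative only.

```python
import numpy as np

a = np.array(['abc', '12345'], dtype=object)

# No explicit length: NumPy infers the width from the data
c_from_code = a.astype('U')
c_from_type = a.astype(str)

# Each unicode character takes 4 bytes, so itemsize // 4 is the width
width = c_from_code.dtype.itemsize // 4
print(c_from_code.dtype, c_from_type.dtype, width)
```

Both conversions should give the same dtype, with no element truncated.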
At least in Python 3.5 with Jupyter 4, I can use:
a = np.array([u'12345', u'abc'], dtype=object)
b = a.astype(str)
b
This works just fine for me and returns:
array(['12345', 'abc'], dtype='<U5')