numpy recarray strings of variable length

Is it possible to initialise a numpy recarray that will hold strings, without knowing the length of the strings beforehand?

As a (contrived) example:

mydf = np.empty( (numrows,), dtype=[ ('file_name','STRING'), ('file_size_MB',float) ] )

The problem is that I'm constructing my recarray in advance of populating it with information, and I don't necessarily know the maximum length of file_name in advance.

All my attempts result in the string field being truncated:

>>> mydf = np.empty( (2,), dtype=[('file_name',str),('file_size_mb',float)] )
>>> mydf['file_name'][0]='foobarasdf.tif'
>>> mydf['file_name'][1]='arghtidlsarbda.jpg'
>>> mydf
array([('', 6.9164002347457e-310), ('', 9.9413127e-317)], 
      dtype=[('file_name', 'S'), ('file_size_mb', '<f8')])
>>> mydf['file_name']
array(['f', 'a'], 
      dtype='|S1')

(As an aside, why does mydf['file_name'] show 'f' and 'a' whilst mydf shows '' and ''?)

Similarly, if I initialise with type (say) |S10 for file_name then things get truncated at length 10.

The only similar question I could find is this one, but this calculates the appropriate string length a priori and hence is not quite the same as mine (as I know nothing in advance).

Is there any alternative other than initialising file_name with (e.g.) |S9999999999999 (i.e. some ridiculous upper limit)?

asked Feb 02 '12 07:02 by mathematical.coffee


1 Answer

Instead of a string dtype, you can use object as the dtype for that field. Any Python object can then be assigned to an element, including strings of arbitrary length. For example:

>>> import numpy as np
>>> mydf = np.empty( (2,), dtype=[('file_name',object),('file_size_mb',float)] )
>>> mydf['file_name'][0]='foobarasdf.tif'
>>> mydf['file_name'][1]='arghtidlsarbda.jpg'
>>> mydf
array([('foobarasdf.tif', 0.0), ('arghtidlsarbda.jpg', 0.0)], 
      dtype=[('file_name', '|O8'), ('file_size_mb', '<f8')])

Variable-length elements go against the spirit of the array concept, but this is as close as one can get. The idea of an array is that elements are stored in memory at well-defined, regularly spaced addresses, which rules out variable-length elements. Storing pointers to the strings in the array, rather than the strings themselves, circumvents this limitation. (This is basically what the above example does.)
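To make the pointer idea concrete, here is a minimal sketch that reuses the field names from the example above; the second file name is made up for illustration, and np.zeros is used instead of np.empty only so the float field starts at a known value. It checks that each record keeps a fixed size however long the strings are, and shows one possible follow-up (an assumption, not part of the original answer): converting to a fixed-width string array once all the names are known.

```python
import numpy as np

# Same layout as the example above: an object field plus a float field.
dt = np.dtype([('file_name', object), ('file_size_mb', float)])
mydf = np.zeros((2,), dtype=dt)

mydf['file_name'][0] = 'foobarasdf.tif'
# Hypothetical long name, chosen only to show that nothing is truncated.
mydf['file_name'][1] = 'a_much_longer_file_name_than_the_first_one.jpg'

# Each record occupies one pointer-sized slot plus one float64,
# independent of the length of the strings it refers to.
record_size = mydf.dtype.itemsize

# The array holds references to ordinary Python str objects.
names = list(mydf['file_name'])

# Optional: once every name is known, build a fixed-width unicode array.
max_len = max(len(n) for n in names)
fixed = np.array(names, dtype='U%d' % max_len)
```

The trade-off is that vectorised string operations and compact binary storage generally assume fixed-width fields, so an object field is most useful while the array is being filled.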

answered Sep 23 '22 14:09 by Toon Verstraelen