I'm working on converting some old text logs to a usable format in Python.
The files are huge, so I'm writing my own C extensions to run through the files as quickly as possible and parse out the relevant fields with regular expressions. My ultimate goal is to export these fields into NumPy arrays of strings. I know it's possible to create the NumPy array as a PyObject in C and then call SetItem on each element, but I'm looking to optimize as much as I can.
Can I use something like memcpy or PyBuffer_FromMemory to read the C strings into a NumPy string array directly? I understand that NumPy arrays are internally similar to C arrays, but do I have to ensure the NumPy array is contiguously allocated?
I intend to use the NumPy arrays to build columns in Pandas for statistical analysis. As I understand it, Pandas uses NumPy arrays to store the columns of a DataFrame, so I shouldn't have a large overhead going from NumPy into Pandas. I'd like to avoid Cython if possible.
To give a sense of how an array of strings is stored, I'll make one, and view it in several ways:
In [654]: np.array(['one','two','three','four'],dtype='S5')
Out[654]:
array([b'one', b'two', b'three', b'four'],
dtype='|S5')
In [655]: x=np.array(['one','two','three','four'],dtype='S5')
In [656]: x.tostring()
Out[656]: b'one\x00\x00two\x00\x00threefour\x00'
In [657]: x.view(np.uint8)
Out[657]:
array([111, 110, 101, 0, 0, 116, 119, 111, 0, 0, 116, 104, 114,
101, 101, 102, 111, 117, 114, 0], dtype=uint8)
So its databuffer consists of 20 bytes (4 * 5 for S5). For strings shorter than 5, it puts (or leaves) 0 in the trailing bytes.
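Those numbers can be read off directly, as a quick check:

import numpy as np

x = np.array(['one', 'two', 'three', 'four'], dtype='S5')
print(x.itemsize, x.size, x.nbytes)   # 5 4 20: itemsize * size = nbytes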
Yes, there are C functions for creating new arrays of a given size and dtype, and functions for copying blocks of data into those arrays. Look at the C side of the NumPy documentation, or at some of the NumPy code in its GitHub repository.
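Here's a minimal sketch of that pattern from the Python side, assuming the parsed fields are already padded to the element width. On the C side the rough equivalents would be PyArray_New (which takes an explicit itemsize for string dtypes) to allocate and memcpy into PyArray_DATA to fill; verify the exact calls against the C-API docs.

import numpy as np

# 4 fields, already padded to 5 bytes each, as a C parser might produce them
raw = b'one\x00\x00two\x00\x00threefour\x00'

a = np.empty(4, dtype='S5')                          # preallocate; contiguous by default
a.view(np.uint8)[:] = np.frombuffer(raw, np.uint8)   # one block copy into the databuffer
print(a)                                             # [b'one' b'two' b'three' b'four']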
Regarding the pandas transfer, beware that pandas readily changes the dtype of its columns. For example, if you put None or nan in a column, it is likely to change it to object dtype.
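A quick illustration (exact casting rules vary with the pandas version):

import numpy as np
import pandas as pd

# a fixed-width S5 column is not preserved; pandas stores it as object
s = pd.Series(np.array(['one', 'two', 'three', 'four'], dtype='S5'))
print(s.dtype)      # object

# inserting a missing value into an integer column forces an upcast
t = pd.Series([1, 2, 3])
t[1] = None
print(t.dtype)      # float64: the None became NaN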
Object arrays store pointers in the databuffer.
In [658]: y=np.array(['one','two','three','four'],dtype=object)
In [659]: y
Out[659]: array(['one', 'two', 'three', 'four'], dtype=object)
In [660]: y.tostring()
Out[660]: b'\xe0\x0f\xc5\xb5\xa0\xfah\xb5\x80\x0b\x8c\xb4\xc09\x8b\xb4'
If I interpret that right, the databuffer has 16 bytes: four 4-byte pointers (this was a 32-bit build). The strings themselves are stored elsewhere in memory as regular Python strings (in this case unicode strings, since this is Py3).
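The pointer size is easy to check, since itemsize for an object array is the size of one pointer:

import numpy as np

y = np.array(['one', 'two', 'three', 'four'], dtype=object)
print(y.itemsize)   # 4 on a 32-bit build, 8 on 64-bit
print(type(y[0]))   # <class 'str'>: the elements live elsewhere as Python objects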
=================
fromstring and frombuffer let me recreate an array from a buffer:
In [696]: x=np.array(['one','two','three','four'],dtype='S5')
In [697]: xs=x.tostring()
In [698]: np.fromstring(xs,'S5')
Out[698]:
array([b'one', b'two', b'three', b'four'],
dtype='|S5')
In [700]: np.frombuffer(xs,'S5')
Out[700]:
array([b'one', b'two', b'three', b'four'],
dtype='|S5')
The frombuffer version works without copying the buffer; fromstring makes its own copy.
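That can be verified from the array's flags and base:

import numpy as np

x = np.array(['one', 'two', 'three', 'four'], dtype='S5')
xs = x.tostring()
z = np.frombuffer(xs, 'S5')
print(z.flags['OWNDATA'])   # False: z does not own its memory
print(z.base is xs)         # True: z is a view onto the bytes object
print(z.flags['WRITEABLE']) # False: bytes are immutable, so z is read-only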
However, if there are multiple strings in different parts of memory, building an array from them will require copying them into one contiguous buffer.
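For instance, one way to make that single copy is to pad each piece to the element width and join, then wrap the result without any further copying:

import numpy as np

parts = [b'one', b'two', b'three', b'four']          # strings scattered in memory
buf = b''.join(p.ljust(5, b'\x00') for p in parts)   # the one unavoidable copy
arr = np.frombuffer(buf, 'S5')                       # wraps buf; no second copy
print(arr)                                           # [b'one' b'two' b'three' b'four']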