High-speed alternatives to replace byte array processing bottlenecks

>> See EDIT below <<

I am working on processing data from a special pixelated CCD camera over serial, using FTDI D2xx drivers via pyUSB.

The camera can operate at high bandwidth to the PC, up to 80 frames/sec. I would love that speed, but I know it isn't feasible with Python, since it is an interpreted language. Still, I'd like to know how close I can get, whether through optimizations I missed in my code, threading, or some other approach. My first thought is to break out the most time-consuming loops and rewrite them in C, but I don't have much experience with C and am not sure of the best way to get Python to interact with it inline, if that's even possible. My analysis algorithms are heavily developed in Python with SciPy/NumPy, are already optimized, and have acceptable performance, so I would only need a way to speed up the acquisition of the data that feeds them, if that's the best approach.

The difficulty, and the reason I chose Python over another language, is the need to run the code easily cross-platform (I develop on Windows, but am deploying the code to an embedded Linux board as a stand-alone system). If you suggest another language, like C, how would I keep it cross-platform? I have never compiled a lower-level language like C for both Windows and Linux, so I want to be sure of that process. I would have to compile it for each system, right? What do you suggest?
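(For context on the cross-platform C question: one common pattern is to compile the C code into a shared library per platform and load it from Python with the standard-library `ctypes` module. This is only a sketch; the library name `fastproc` is a made-up placeholder, not anything from my project.)

```python
import sys

def shared_lib_name(base):
    """Return the platform-specific filename for a compiled shared library.

    The same C source is compiled separately on each OS (e.g. with MSVC or
    MinGW on Windows, gcc on Linux), producing a .dll or .so; Python then
    loads whichever exists via ctypes.CDLL(shared_lib_name("fastproc")).
    """
    if sys.platform == "win32":
        return base + ".dll"
    if sys.platform == "darwin":
        return "lib" + base + ".dylib"
    return "lib" + base + ".so"
```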


Here are my functions, with current execution times:

ReadStream: 'RXcount' is 114733 for a device read; converts the raw string into its byte equivalents

Returns a list of bytes (0-255), representing binary values

Current execution time: 0.037 sec

def ReadStream(RXcount):
    global ftdi
    RXdata = ftdi.read(RXcount)
    RXdata = list(struct.unpack(str(len(RXdata)) + 'B', RXdata))
    return RXdata


ProcessRawData: To reshape the byte list into an array that matches the pixel orientations

Results in a 3584x32 array, after trimming off some un-needed bytes.

Data is unique in that every block of 14 rows represents the 14 bits of one row of pixels on the device (32 bytes across @ 8 bits/byte = 256 bits across), for a 256x256-pixel sensor. The processed array has 32 columns of bytes because each byte, in binary, represents 8 pixels (32 bytes * 8 bits = 256 pixels). I'm still working out how to do that one; I have already posted a question about it previously.
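(To make the byte-to-pixel expansion concrete, here is a tiny sketch, not part of my actual code, showing how NumPy can expand packed bytes into individual pixel bits, most-significant bit first:)

```python
import numpy as np

# Two packed bytes stand in for one short stretch of a sensor row.
row_bytes = np.array([0b10100000, 0b00000001], dtype=np.uint8)

# np.unpackbits expands each byte into 8 bits, MSB first,
# so 2 bytes become 16 pixel values of 0 or 1.
pixels = np.unpackbits(row_bytes)
```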

Current execution time: 0.01 sec ... not bad, it's just Numpy

def ProcessRawData(RawData):
    if len(RawData) == 114733:
        ProcessedMatrix = np.ndarray((1, 114733), dtype=int)
        np.copyto(ProcessedMatrix, RawData)
        ProcessedMatrix = ProcessedMatrix[:, 1:-44]
        ProcessedMatrix = np.reshape(ProcessedMatrix, (-1, 32))
        return ProcessedMatrix
    else:
        return None


Finally,

GetFrame: The device has a mode where it outputs only whether each pixel detected anything, using the lowest bit of each 14-bit value (every 14th row of the array). Extract that data and convert it to an int for each pixel.

Results in a 256x256 array, after processing every 14th row, which holds bytes to be read as binary (32 bytes across * 8 bits = 256 pixels across)

Current execution time: 0.04 sec

def GetFrame(ProcessedMatrix):
    if np.shape(ProcessedMatrix) == (3584, 32):
        FrameArray = np.zeros((256, 256), dtype='B')
        DataRows = ProcessedMatrix[13::14]
        for i in range(256):
            RowData = ""
            for j in range(32):
                RowData = RowData + "{:08b}".format(DataRows[i, j])
            FrameArray[i] = [int(RowData[b:b+1], 2) for b in range(256)]
        return FrameArray
    else:
        return False


Goal:

I would like to target a total execution time of ~0.02 secs/frame with whatever approach you suggest (currently it's 0.25 secs/frame, with GetFrame being the weakest). The device I/O is not the limiting factor, as it outputs a data packet every 0.0125 secs. Once I get the execution time down, can I just run the acquisition and processing in parallel with some threading?

Let me know what you suggest as the best path forward - Thank you for the help!


EDIT, thanks to @Jaime:

Functions are now:

def ReadStream(RXcount):
    global ftdi
    return np.frombuffer(ftdi.read(RXcount), dtype=np.uint8)

... time 0.013 sec

def ProcessRawData(RawData):
    if len(RawData) == 114733:
        return RawData[1:-44].reshape(-1, 32)
    return None

... time 0.000007 sec!

def GetFrame(ProcessedMatrix):
    if ProcessedMatrix.shape == (3584, 32):
        return np.unpackbits(ProcessedMatrix[13::14]).reshape(256, 256)
    return False

... time 0.00006 sec!

So, with pure Python, I am now able to acquire the data at the desired frame rate! After a few tweaks to the D2xx USB buffers and latency timing, I just clocked it at 47.6 FPS!

The last step is finding a way to run this in parallel with my processing algorithms. I need some way to pass the result of GetFrame to another loop running in parallel.
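(One common shape for this is a producer/consumer pipeline with the standard-library `queue` and `threading` modules; a sketch follows. The name `read_frame` is a placeholder standing in for the real `ReadStream -> ProcessRawData -> GetFrame` chain, and the bounded queue size is an arbitrary choice:)

```python
import queue
import threading

import numpy as np

# Bounded queue: acquisition blocks if processing falls too far behind,
# instead of buffering frames without limit.
frame_queue = queue.Queue(maxsize=8)

def read_frame():
    # Placeholder for ReadStream -> ProcessRawData -> GetFrame.
    return np.zeros((256, 256), dtype=np.uint8)

def acquire(n_frames):
    # Producer thread: push frames, then a None sentinel to signal "done".
    for _ in range(n_frames):
        frame_queue.put(read_frame())
    frame_queue.put(None)

producer = threading.Thread(target=acquire, args=(3,))
producer.start()

processed = 0
while True:
    frame = frame_queue.get()
    if frame is None:
        break
    # ... run the existing SciPy/NumPy algorithms on `frame` here ...
    processed += 1
producer.join()
```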

asked May 07 '14 by Chemik

1 Answer

There are several places where you can speed things up significantly. Perhaps the most obvious is rewriting GetFrame:

def GetFrame(ProcessedMatrix):
    if ProcessedMatrix.shape == (3584, 32):
        return np.unpackbits(ProcessedMatrix[13::14]).reshape(256, 256)
    return False

This requires that ProcessedMatrix be an ndarray of type np.uint8, but other than that, on my system it runs 1000x faster.
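(To see why this is equivalent to the original string-formatting loop, here is a small check, not from the original answer, comparing `np.unpackbits` against the per-byte `"{:08b}"` formatting on a few sample bytes:)

```python
import numpy as np

sample = np.array([0, 1, 170, 255], dtype=np.uint8)

# Vectorized: each byte expands to 8 bits, MSB first.
bits_fast = np.unpackbits(sample)

# Original approach: format each byte as an 8-char binary string
# and convert each character back to an int.
bits_slow = [int(b) for byte in sample for b in "{:08b}".format(byte)]
```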

With your other two functions, I think that in ReadStream you should do something like:

def ReadStream(RXcount):
    global ftdi
    return np.frombuffer(ftdi.read(RXcount), dtype=np.uint8)

Even if it doesn't speed up that function much, because it is the reading taking up most of the time, it will already give you a numpy array of bytes to work on. With that, you can then go on to ProcessRawData and try:

def ProcessRawData(RawData):
    if len(RawData) == 114733:
        return RawData[1:-44].reshape(-1, 32)
    return None

This is about 10x faster than your version.
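(The speed comes from the fact that basic slicing and reshaping in NumPy return views into the same buffer, so no bytes are copied; a quick check, using a dummy array in place of the real packet:)

```python
import numpy as np

# Dummy data the same length as one device packet.
raw = np.arange(114733, dtype=np.uint8)

# Trim the leading byte and trailing 44 bytes, then reshape:
# 114733 - 45 = 114688 = 3584 * 32, and both operations are views.
out = raw[1:-44].reshape(-1, 32)
```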

answered Oct 02 '22 by Jaime