Looking for info to improve code speed

I have some code that streams video from a camera at 720p and 24 fps. I am trying to capture this stream in code and eventually create a video from it by assembling compressed JPEGs into MJPEG or the like. The issue I'm having is that the overall code is not fast enough to produce frames at 24 fps, i.e. roughly .04 seconds per image.

Using Stopwatch, I found that the interior for loop takes .000000000022 seconds per iteration.

The exterior for loop takes .0000077 seconds per iteration.

The entire function, from start to image save, takes .21 seconds per run.

Calculation from the interior loop timing to complete an image:

.000000000022 x 640 = .00000001408 seconds
.00000001408 x 360 = .0000050688 seconds

Calculation from the exterior loop timing to complete an image:

.0000077 x 360 = .002772 seconds

If I could create an image in anything close to those times I would be set, but the function as a whole takes .21 seconds to run:

temp_byte1 = main_byte1;
temp_byte2 = main_byte2;

timer1.Reset();
timer1.Start();

Bitmap mybmp = new Bitmap(1280, 720);
BitmapData BPD = mybmp.LockBits(new Rectangle(0, 0, 1280, 720), ImageLockMode.WriteOnly, mybmp.PixelFormat);
IntPtr xptr = BPD.Scan0;
IntPtr yptr = BPD.Scan0;
yptr = new IntPtr( yptr.ToInt64() + (1280 * 720 * 2));
int bytes = Math.Abs(BPD.Stride);
byte[][] rgb = new byte[720][];
int Y1, Y2, Y3, Y4, Y5, Y6, Y7, Y8;
int U1, U2, V1, V2, U3, U4, V3, V4;
for (int one = 0; one < 360; one++)
{
    timer2.Reset();
    timer2.Start();
    rgb[one] = new byte[bytes];
    rgb[360 + one] = new byte[bytes];
    for (int two = 0; two < 640; two++)
    {
        timer3.Reset();
        timer3.Start();
        U1 = temp_byte1[one * 2560 + 4 * two + 0];
        Y1 = temp_byte1[one * 2560 + 4 * two + 1];
        V1 = temp_byte1[one * 2560 + 4 * two + 2];
        Y2 = temp_byte1[one * 2560 + 4 * two + 3];

        U2 = temp_byte2[one * 2560 + 4 * two + 0];
        Y3 = temp_byte2[one * 2560 + 4 * two + 1];
        V2 = temp_byte2[one * 2560 + 4 * two + 2];
        Y4 = temp_byte2[one * 2560 + 4 * two + 3];

        RGB_Conversion(Y1, U1, V1, two * 8 + 0, rgb[one]);
        RGB_Conversion(Y2, U1, V1, two * 8 + 4, rgb[one]);

        RGB_Conversion(Y3, U2, V2, two * 8 + 0, rgb[(360 + one)]);
        RGB_Conversion(Y4, U2, V2, two * 8 + 4, rgb[(360 + one)]);

        timer3.Stop();
        timer3_[two] = timer3.Elapsed;
    }
    Marshal.Copy(rgb[one], 0, xptr, 5120);
    xptr = new IntPtr(xptr.ToInt64() + 5120);
    Marshal.Copy(rgb[(360 + one)], 0, yptr, 5120);
    yptr = new IntPtr(yptr.ToInt64() + 5120);
    timer2.Stop();
    timer2_[one] = timer2.Elapsed;
}
mybmp.UnlockBits(BPD);
mybmp.Save(GetDateTimeString("IP Pictures") + ".jpg", ImageFormat.Jpeg);

The code works: it converts an incoming YUV422 byte array into a full-size JPEG. But I can't understand why there is such a discrepancy between the speed of the for loops and the speed of the entire function.

I moved the

byte[][] rgb = new byte[720][];
rgb[x] = new byte[bytes];

allocations to a global that is initialized at program startup instead of on each function call, with no measurable increase in speed.

UPDATE

RGB_Conversion takes in YUV values, converts them to RGB, and writes the result into the array passed in (the global array holding the values):

public void RGB_Conversion(int Y, int U, int V, int MULT, byte[] rgb)
{

    int C,D,E;
    int R,G,B;

    // create the params for rgb conversion
    C = Y - 16;
    D = U - 128;
    E = V - 128;

    //R = clamp((298 x C + 409 x E + 128)>>8)
    //G = clamp((298 x C - 100 x D - 208 x E + 128)>>8)
    //B = clamp((298 x C + 516 x D + 128)>>8)

    R = (298 * C + 409 * E + 128)/256;
    G = (298 * C - 100 * D - 208 * E + 128)/256;
    B = (298 * C + 516 * D + 128)/256;

    if (R > 255)
        R = 255;
    if (R < 0)
        R = 0;
    if (G > 255)
        G = 255;
    if (G < 0)
        G = 0;
    if (B > 255)
        B = 255;
    if (B < 0)
        B = 0;

    rgb[MULT + 3] = 255;
    rgb[MULT + 0] = (byte)B;
    rgb[MULT + 1] = (byte)G;
    rgb[MULT + 2] = (byte)R;
}
Grant asked Aug 26 '11 22:08


3 Answers

Firstly

You need to remove the Start/Stop and Stopwatch business from the inside of the loop.

Resetting the stopwatch 640 times in a tight loop is going to skew the figures. Better to use a profiler or measure coarse-grained performance.

Also, the presence of these statements might prevent compiler optimizations. Loop tiling and loop unrolling look to be very good candidates here, but the JITter might not be able to use them, as the registers get clobbered by the calls to the stopwatch functions.
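As a rough illustration of coarse-grained measurement, the sketch below times many whole calls with a single Stopwatch and reports the average. The class and method names (CoarseTiming, AverageMilliseconds) are made up for this example; the idea is that the function under test, for instance the whole frame conversion, is passed in as a delegate.

```csharp
using System;
using System.Diagnostics;

public static class CoarseTiming
{
    // Runs `work` `iterations` times under one Stopwatch and returns
    // the average time per call in milliseconds.
    public static double AverageMilliseconds(Action work, int iterations)
    {
        if (work == null) throw new ArgumentNullException(nameof(work));
        if (iterations <= 0) throw new ArgumentOutOfRangeException(nameof(iterations));

        work(); // warm-up call so JIT compilation is not part of the measurement

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            work();
        }
        sw.Stop();

        return sw.Elapsed.TotalMilliseconds / iterations;
    }
}
```

Timing the conversion and the Save call separately this way should show which of the two accounts for most of the .21 seconds.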

Data structures:

I have a feeling that you should be able to use a 'flat' data structure instead of newing up all the jagged arrays there. That said, I don't know what API you are feeding it into, and I haven't concentrated a lot on it.

I do feel that making RGB_Conversion 'just' return the RGB parts instead of letting it write into an array might really give the compiler an edge to optimize things.
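A minimal sketch of the flat-buffer idea, assuming the source bytes are packed as U, Y1, V, Y2 per pixel pair (as the indexing in the question suggests) and the destination is 32-bit BGRA with no row padding. YuvConverter, UyvyToBgra and WritePixel are names invented for this example; the arithmetic is the same integer formula as RGB_Conversion.

```csharp
using System;

public static class YuvConverter
{
    // Clamp an intermediate value to the 0..255 byte range.
    private static byte Clamp(int v)
    {
        return (byte)(v < 0 ? 0 : (v > 255 ? 255 : v));
    }

    // Writes one BGRA pixel at dst[i..i+3]; d = U - 128, e = V - 128.
    private static void WritePixel(int y, int d, int e, byte[] dst, int i)
    {
        int c = 298 * (y - 16) + 128;
        dst[i + 0] = Clamp((c + 516 * d) / 256);           // B
        dst[i + 1] = Clamp((c - 100 * d - 208 * e) / 256); // G
        dst[i + 2] = Clamp((c + 409 * e) / 256);           // R
        dst[i + 3] = 255;                                  // A
    }

    // Converts width x height pixels from src into dst, starting at dstOffset.
    // width must be even because two pixels share one U/V pair.
    public static void UyvyToBgra(byte[] src, byte[] dst, int dstOffset, int width, int height)
    {
        int pairs = width * height / 2;
        int s = 0;
        int o = dstOffset;
        for (int p = 0; p < pairs; p++)
        {
            int d = src[s] - 128;
            int y1 = src[s + 1];
            int e = src[s + 2] - 128;
            int y2 = src[s + 3];

            WritePixel(y1, d, e, dst, o);
            WritePixel(y2, d, e, dst, o + 4);

            s += 4; // four source bytes per pixel pair
            o += 8; // eight destination bytes per pixel pair
        }
    }
}
```

With the question's layout this would be called twice on one preallocated 1280 * 720 * 4 byte array, once per source buffer (offsets 0 and 1280 * 360 * 4), followed by a single Marshal.Copy into Scan0 instead of 720 row copies. Whether that helps in practice would need measuring.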

Other thoughts:

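  • Look into RGB_Conversion (where/how is it defined?). Perhaps you can pull it inline.

  • Use an unchecked block so the array index arithmetic is not checked for overflow. Note that this only has an effect when overflow checking is enabled for the build; it does not remove array bounds checks.

  • Consider using unsafe code (compiled with /unsafe) to avoid bounds checking.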

sehe answered Nov 07 '22 18:11


There's a ton of stuff you can do:

  1. Remove the 'new' allocation from the outer loop.
  2. Preallocate and pin all buffers
  3. Get rid of the Marshal.Copy and replace it with either an unsafe dword copy or Win32 RtlCopyMemory
  4. Inline RGB_Conversion
  5. Don't call new IntPtr in the outer loop, instead just increment a pointer to a pinned buffer.

I'm sure there's more, but that's what I saw at first glance. I think you'd be better off refactoring or rewriting the entire routine, perhaps even rewriting it in a C++/CLI DLL, or at least using unsafe code in the current version to avoid the overhead of .NET.
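Points 1, 2 and 5 can be sketched without unsafe code by allocating one buffer up front and pinning it with GCHandle. PinnedFrameBuffer is a hypothetical helper written for this example, not part of any library; it assumes rows are tightly packed (stride = width * bytesPerPixel).

```csharp
using System;
using System.Runtime.InteropServices;

// Allocates a frame buffer once, pins it, and hands out row pointers,
// so nothing is allocated inside the per-frame loops.
public sealed class PinnedFrameBuffer : IDisposable
{
    private GCHandle handle;

    public byte[] Data { get; }
    public IntPtr Pointer { get; }
    public int Stride { get; }

    public PinnedFrameBuffer(int width, int height, int bytesPerPixel)
    {
        Stride = width * bytesPerPixel;
        Data = new byte[Stride * height];

        // Pinning keeps the garbage collector from moving the array,
        // so Pointer stays valid until Dispose is called.
        handle = GCHandle.Alloc(Data, GCHandleType.Pinned);
        Pointer = handle.AddrOfPinnedObject();
    }

    // Address of the first byte of the given row.
    public IntPtr RowPointer(int row)
    {
        return IntPtr.Add(Pointer, row * Stride);
    }

    public void Dispose()
    {
        if (handle.IsAllocated)
        {
            handle.Free();
        }
    }
}
```

One possible use: create a single PinnedFrameBuffer(1280, 720, 4) at startup, write converted pixels into Data for each frame, and construct the Bitmap over the pinned memory with the Bitmap(width, height, stride, format, scan0) constructor so that no per-row copy is needed. The buffer has to stay pinned for as long as such a Bitmap is alive.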

GLD answered Nov 07 '22 19:11


One, I'd make sure you're not running this in the debugger, otherwise optimizations are completely turned off, and lots of NOP opcodes are inserted to give the debugger anchor points for braces etc.

Two, you're doing disk writing. That'll be fast sometimes if it gets buffered, and very, very slow other times if the write triggers a flush. It's probably not CPU usage that's killing you here. Could you confirm by running Task Manager and telling us what your CPU usage is?

If you still want to write intermediate JPGs to disk, what I'd recommend is setting up two threads with a thread-safe circular queue between them. Thread one runs the code you have above that does all of the processing; once it is done with a frame, it puts the BMP object on the queue and immediately moves to the next iteration. Thread two reads the BMP objects out of the queue and writes them to disk.

I'd recommend using a blocking queue (or making your own from Queue, with a counting semaphore) if the writes end up taking longer than the frames.
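A minimal sketch of the two-thread arrangement using BlockingCollection from System.Collections.Concurrent (available from .NET 4; Task.Run needs .NET 4.5). FramePipeline and its parameters are names chosen for this example: `produce` stands in for the frame conversion and `consume` for the disk write. Error handling is deliberately minimal.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public static class FramePipeline
{
    // Produces frameCount items on the calling thread while a second task
    // drains the bounded queue and passes each item to `consume`.
    public static void Run<T>(int frameCount, int capacity, Func<int, T> produce, Action<T> consume)
    {
        using (var queue = new BlockingCollection<T>(capacity))
        {
            Task writer = Task.Run(() =>
            {
                // Blocks while the queue is empty; ends after CompleteAdding
                // has been called and the queue has been drained.
                foreach (T item in queue.GetConsumingEnumerable())
                {
                    consume(item);
                }
            });

            try
            {
                for (int i = 0; i < frameCount; i++)
                {
                    queue.Add(produce(i)); // blocks while the queue is full
                }
            }
            finally
            {
                queue.CompleteAdding();
                writer.Wait();
            }
        }
    }
}
```

The bounded capacity is what keeps memory in check when writes take longer than frames: the producer simply stalls instead of piling up bitmaps.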

Three, do you have a multicore machine? You could further batch up computation. Below is a rough example, as there are a lot of considerations you'll want to make when taking an approach like this: it involves a lot more locking, finding a good reader-writer circular queue implementation, dealing with out-of-order processing, and dealing with larger jitter in the speed at which JPGs are generated, which gives the overall stream more lag but better throughput.

Thread A: Reads YUV frames as arrays, from video source, assigns serial number to array, stuffs array + sn into queue A.

Thread B, C, D: read objects from queue A, compute the BMP object, and stuff the BMP with the same serial number into queue B. Queue B will have BMP objects in arbitrary order, e.g. 0, 5, 6, 2, 3, 9, 4, ..., because you have more than one thread writing to it, but since they are labeled with serial numbers, you can reorder them later.

Thread E: Reads from Queue B, reorders frames, writes to disk.

All queues of course need to be thread-safe.
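The reordering step in Thread E could look roughly like the sketch below. ReorderBuffer is a made-up helper for this example: it holds frames that arrive early and releases them strictly in serial-number order. It is not thread-safe on its own and assumes a single thread calls Add.

```csharp
using System;
using System.Collections.Generic;

// Buffers out-of-order items and emits them in serial-number order.
public sealed class ReorderBuffer<T>
{
    private readonly Dictionary<long, T> pending = new Dictionary<long, T>();
    private readonly Action<T> emit;
    private long next;

    public ReorderBuffer(Action<T> emit, long firstSerial = 0)
    {
        if (emit == null) throw new ArgumentNullException(nameof(emit));
        this.emit = emit;
        this.next = firstSerial;
    }

    // Number of items still waiting for an earlier serial number.
    public int PendingCount
    {
        get { return pending.Count; }
    }

    // Stores the item, then emits every consecutive item that is now ready.
    public void Add(long serial, T item)
    {
        pending[serial] = item;

        T ready;
        while (pending.TryGetValue(next, out ready))
        {
            pending.Remove(next);
            next++;
            emit(ready);
        }
    }
}
```

If a frame is ever dropped upstream, everything after it would wait forever, so a real implementation would probably need a timeout or a cap on PendingCount.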

Going a step further yet, why not get rid of the intermediate JPG files? It's a lot of extra work to write those to disk just to read them back out in some other program or at some later step, and is probably a huge performance bottleneck. Why not generate the movie stream entirely in memory?

Other performance considerations: Are you reading across your arrays in the 'right' way? This is the CPU cache problem. Simple answer: try swapping which for loop is the inner one to see if you get better performance.

Long answer: CPU caching of your data works much better if you read bytes in linear order. Let's take an example. You've got a rectangular 1000x1000 array, and it's laid out in memory linearly by row: row zero is the first 1000 bytes, row one is the next 1000, etc. If you read the array column-wise then row-wise, you'd read bytes in this order: 0, 1000, 2000, ..., 999000, 1, 1001, 2001, ..., 999001, and so on. CPUs won't like that because each read lands on a different cache line every single time, which means more cache misses. You'd be candy-striping across memory instead of reading linearly.
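The two access patterns described above can be written out as follows. Traversal and its method names are invented for this example; both methods compute the same sum over a row-major byte array and differ only in which loop is the inner one.

```csharp
using System;

public static class Traversal
{
    // Inner loop walks along a row: the index grows by 1 on every step,
    // so consecutive reads fall on the same or the next cache line.
    public static long SumRowMajor(byte[] data, int width, int height)
    {
        long sum = 0;
        for (int row = 0; row < height; row++)
        {
            for (int col = 0; col < width; col++)
            {
                sum += data[row * width + col];
            }
        }
        return sum;
    }

    // Inner loop walks down a column: the index jumps by `width` on every
    // step, which is the "candy-striping" pattern.
    public static long SumColumnMajor(byte[] data, int width, int height)
    {
        long sum = 0;
        for (int col = 0; col < width; col++)
        {
            for (int row = 0; row < height; row++)
            {
                sum += data[row * width + col];
            }
        }
        return sum;
    }
}
```

Timing both over a large array (the 1000x1000 case above, or bigger) is a quick way to see how much the access order matters on a given machine; the size of the gap depends on the hardware.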

antiduh answered Nov 07 '22 20:11