Fast RGB => YUV conversion in OpenCL

Tags:

I know the following formula can be used to convert RGB images to YUV images. In the following formula, R, G, B, Y, U, V are all 8-bit unsigned integers, and intermediate values are 16-bit unsigned integers.

Y = ( (  66 * R + 129 * G +  25 * B + 128) >> 8) +  16  
U = ( ( -38 * R -  74 * G + 112 * B + 128) >> 8) + 128  
V = ( ( 112 * R -  94 * G -  18 * B + 128) >> 8) + 128

But when the formula is used in OpenCL it's a different story.
1. 8-bit memory write access is an optional extension, which means some OpenCL implementations may not support it.
2. even the above extension is supported, it's deadly slow compared with 32-bit write access.

In order to get better performance, every 4 pixels will be processed at the same time, so the input is 12 8-bit integers and the output is 3 32-bit unsigned integers(the first one stands for 4 Y samples, the second one stands for 4 U samples, the last one stands for 4 V samples).

My question is how to get these 3 32-bit integers directly from the 12 8-bit integers? Is there a formula to get these 3 32-bit integers, or I just need to use the old formula to get 12 8-bit integer results(4 Y, 4 U, 4 V) and construct the 3 32-bit integers with bit-wise operation?

697

asked Feb 12 '11 17:02

user416983

2 Answers

Even though this question was asked 2 years ago, i think some working code would help here. In terms of the initial concerns about bad performance when directly accessing 8-bit values, it's better to perform 32-bit direct access when possible.

Some time ago I've developed and used the following OpenCL kernel to convert ARGB (typical windows bitmap pixel layout) to the y-plane (full sized), u/v-half-plane (quarter sized) memory layout as input for libx264 encoding.

__kernel void ARGB2YUV ( 
                            __global  unsigned int * sourceImage,
                            __global unsigned int * destImage,
            unsigned int srcHeight,
            unsigned int srcWidth,
            unsigned int yuvStride // must be srcWidth/4 since we pack 4 pixels into 1 Y-unit (with 4 y-pixels)
            )
{
    int i,j;
    unsigned int RGBs [ 4 ];
    unsigned int posSrc, RGB, Value4 = 0, Value, yuvStrideHalf, srcHeightHalf, yPlaneOffset, posOffset;
    unsigned char red, green, blue;

    unsigned int posX = get_global_id(0);
    unsigned int posY = get_global_id(1);

    if ( posX < yuvStride ) {
        // Y plane - pack 4 y's within each work item
        if ( posY >= srcHeight )
            return;

        posSrc = (posY * srcWidth) + (posX * 4);

        RGBs [ 0 ] = sourceImage [ posSrc ];
        RGBs [ 1 ] = sourceImage [ posSrc + 1 ];
        RGBs [ 2 ] = sourceImage [ posSrc + 2 ];
        RGBs [ 3 ] = sourceImage [ posSrc + 3 ];

        for ( i=0; i<4; i++ ) {
            RGB = RGBs [ i ];

            blue = RGB & 0xff; green = (RGB >> 8) & 0xff; red = (RGB >> 16) & 0xff;

            Value = ( ( 66 * red + 129 * green + 25 * blue ) >> 8 ) + 16;
            Value4 |= (Value << (i * 8));
        }

        destImage [ (posY * yuvStride) + posX ] = Value4;
        return;
    }

    posX -= yuvStride;
    yuvStrideHalf = yuvStride >> 1;

    // U plane - pack 4 u's within each work item
    if ( posX >= yuvStrideHalf )
        return;

    srcHeightHalf = srcHeight >> 1; 
    if ( posY < srcHeightHalf ) {
        posSrc = ((posY * 2) * srcWidth) + (posX * 8);

        RGBs [ 0 ] = sourceImage [ posSrc ];
        RGBs [ 1 ] = sourceImage [ posSrc + 2 ];
        RGBs [ 2 ] = sourceImage [ posSrc + 4 ];
        RGBs [ 3 ] = sourceImage [ posSrc + 6 ];

        for ( i=0; i<4; i++ ) {
            RGB = RGBs [ i ];

            blue = RGB & 0xff; green = (RGB >> 8) & 0xff; red = (RGB >> 16) & 0xff;
            Value = ( ( -38 * red + -74 * green + 112 * blue ) >> 8 ) + 128;
            Value4 |= (Value << (i * 8));
        }
        yPlaneOffset = yuvStride * srcHeight;
        posOffset = (posY * yuvStrideHalf) + posX;
        destImage [ yPlaneOffset + posOffset ] = Value4;
        return;
    }

    posY -= srcHeightHalf;
    if ( posY >= srcHeightHalf )
        return;

    // V plane - pack 4 v's within each work item
    posSrc = ((posY * 2) * srcWidth) + (posX * 8);

    RGBs [ 0 ] = sourceImage [ posSrc ];
    RGBs [ 1 ] = sourceImage [ posSrc + 2 ];
    RGBs [ 2 ] = sourceImage [ posSrc + 4 ];
    RGBs [ 3 ] = sourceImage [ posSrc + 6 ];

    for ( i=0; i<4; i++ ) {
        RGB = RGBs [ i ];

        blue = RGB & 0xff; green = (RGB >> 8) & 0xff; red = (RGB >> 16) & 0xff;

        Value = ( ( 112 * red + -94 * green + -18 * blue ) >> 8 ) + 128;
        Value4 |= (Value << (i * 8));
    }

    yPlaneOffset = yuvStride * srcHeight;
    posOffset = (posY * yuvStrideHalf) + posX;

    destImage [ yPlaneOffset + (yPlaneOffset >> 2) + posOffset ] = Value4;
    return;
}

This code performs only global 32-bit memory access while 8-bit processing happens within each work item.

Oh.. and the proper code to invoke the kernel

unsigned int width = 1024;
unsigned int height = 768;

unsigned int frameSize = width * height;
const unsigned int argbSize = frameSize * 4; // ARGB pixels

const unsigned int yuvSize = frameSize + (frameSize >> 1); // Y,U,V planes

const unsigned int yuvStride = width >> 2; // since we pack 4 RGBs into "one" YYYY

// Allocates ARGB buffer
ocl_rgb_buffer = clCreateBuffer ( context, CL_MEM_READ_WRITE, argbSize, 0, &error );
// ... error handling ...

ocl_yuv_buffer = clCreateBuffer ( context, CL_MEM_READ_WRITE, yuvSize, 0, &error );
// ... error handling ...

error = clSetKernelArg  ( kernel, 0, sizeof(cl_mem), &ocl_rgb_buffer );
error |= clSetKernelArg ( kernel, 1, sizeof(cl_mem), &ocl_yuv_buffer );

error |= clSetKernelArg ( kernel, 2, sizeof(unsigned int), &height);
error |= clSetKernelArg ( kernel, 3, sizeof(unsigned int), &width);

error |= clSetKernelArg ( kernel, 4, sizeof(unsigned int), &yuvStride);
// ... error handling ...

const size_t local_ws[] = { 16, 16 };
const size_t global_ws[] = { yuvStride + (yuvStride >> 1), height };

error = clEnqueueNDRangeKernel ( queue, kernel, 2, NULL, global_ws, local_ws, 0, NULL, NULL );
// ... error handling ...

Note: have a look at the work item calculations. Some additional code needs to be added (e.g. using mod so as to add sufficient spare items) to make sure that work item sizes fit to local work sizes.

188

answered Sep 23 '22 01:09

Chi-Tai D.

Like this? Use int4 unless your platform can use int3. Also you can pack 5 pixels into an int16 so you are wasting 1/16 instead of 1/4 of the memory bandwidth.

__kernel void rgb2yuv( __global int3* input, __global int3* output){


rgb = input[get_global_id(0)];
R = rgb.x;
G = rgb.y;
B = rgb.z;    

yuv.x = ( (  66 * R + 129 * G +  25 * B + 128) >> 8) +  16; 
yuv.y = ( ( -38 * R -  74 * G + 112 * B + 128) >> 8) + 128; 
yuv.z = ( ( 112 * R -  94 * G -  18 * B + 128) >> 8) + 128;

output[get_global_id(0)] = yuv;
}

answered Sep 25 '22 01:09

Chad Brewbaker

Related questions
                            
                                Can you explicitly set a structure layout/alignment in C++ as you can in C#?
                            
                                TStatusBar with bottom aligned panel
                            
                                How to position an <input> in the center of the <div>
                            
                                Text alignment: extracting matching sequence using python
                            
                                How to left align a Gtk Label when its width is set by GtkSizeGroup
                            
                                Alignment of characters in Notepad++
                            
                                WPF itemscontrol child alignment
                            
                                Aligning Nodes to the right side of a Vbox in javafx
                            
                                CSS have each grid element a different height
                            
                                Placing a UITextField in the center of the view
                            
                                Vertical alignment of submit button using CSS HTML
                            
                                Center align anchor within a `li` element?
                            
                                Align View to bottom and top in RelativeLayout XamarinForms
                            
                                How to right align text and justify it? [Latex]
                            
                                Buttons with multiline text sink under alignment line, how to fix it?
                            
                                How to align text in a WPF menu item?
                            
                                Showing <h1> <h5> text inline using css
                            
                                Text direction and alignment in email sent via email() using PHP
                            
                                Align text and icon differently in JLabel
                            
                                vim how to comma align

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Fast RGB => YUV conversion in OpenCL

Tags:

alignment

rgb

opencl

yuv

user416983

People also ask

2 Answers

Chi-Tai D.

Chad Brewbaker

Recent Activity

Donate For Us