I want to implement a really (really) fast Sobel operator for a ray-tracer a friend of mine and I wrote (sources can be found here). What follows is what I have figured out so far...
First, let's assume the image is a grey-scale picture stored line by line in an array of 8-bit unsigned integers.
To write a real Sobel filter, I need to compute Gx and Gy for each pixel. Each of these numbers is computed from the 6 pixels around the origin, but SIMD instructions let me deal with 16 or even 32 (AVX) pixels at once. Fortunately, the kernel of the operator has a nice property (it is separable), so I can compute Gy in two passes, as sketched after the notes below.
I would do the same (but transposed) to compute Gx, then add the two pictures.
Some notes:
(uint8_t >> 2 - uint8_t >> 2) = int7_t // really stored as int8_t
int7_t + uint8_t << 1 >> 2 + int7_t = uint8_t
//some precision is lost but I don't care
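To make the separable scheme above concrete, here is a minimal scalar sketch (the function and buffer names are mine, and it keeps intermediates in int16_t instead of the 8-bit >> 2 trick from the notes; border pixels are left untouched):

#include <stdint.h>
#include <stdlib.h>

/* Separable Gy: vertical difference [1 0 -1]^T first, then horizontal
 * smoothing [1 2 1]. img is the grey-scale input, w x h, row stride w. */
static void sobel_gy_separable(const uint8_t *img, int16_t *gy, int w, int h)
{
    int16_t *d = malloc(sizeof(int16_t) * (size_t)w * h);

    for (int y = 1; y < h - 1; ++y)            /* vertical pass: row above minus row below */
        for (int x = 0; x < w; ++x)
            d[y*w + x] = (int16_t)(img[(y-1)*w + x] - img[(y+1)*w + x]);

    for (int y = 1; y < h - 1; ++y)            /* horizontal pass: [1 2 1] on the differences */
        for (int x = 1; x < w - 1; ++x)
            gy[y*w + x] = (int16_t)(d[y*w + x-1] + 2*d[y*w + x] + d[y*w + x+1]);

    free(d);
}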
The real problem I'm facing is going from the rows to the columns, since I can't load the picture into the SIMD registers otherwise. I must flip the image at least three times, mustn't I?
Once for the original picture; then I can compute the first step for Gx and Gy, and then flip the resulting pictures to compute the second step.
So, here are my questions:
The Sobel operator (or Sobel filter), presented by Sobel and Feldman as an "Isotropic 3 × 3 Image Gradient Operator" at a talk at SAIL in 1968, is a gradient-based edge-detection method. Technically it is a discrete differentiation operator: it estimates the first derivative of the image intensity function by convolving the image with a pair of 3 × 3 kernels, one estimating the gradient in the x-direction and the other in the y-direction, which together give the direction of the largest increase from light to dark and the rate of change in that direction.
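For reference, these are the two 3 × 3 masks, written here with the same sign convention as the FPUSobel()/SseSobel() code further down (some texts flip the signs, which only changes the sign of the gradient, not its magnitude):

static const int sobel_gx[3][3] = {
    { +1, 0, -1 },
    { +2, 0, -2 },
    { +1, 0, -1 },
};

static const int sobel_gy[3][3] = {
    { +1, +2, +1 },
    {  0,  0,  0 },
    { -1, -2, -1 },
};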
I think a transpose/2-pass approach is not a good way to optimize Sobel operator code. The Sobel operator is not a computation-heavy function, so wasting memory accesses on a transpose/2-pass is not a good idea for this case. I wrote some Sobel operator test code to see how fast SSE can get. This code does not handle the first and last edge pixels, and uses the FPU to calculate the sqrt() value.
The Sobel operator needs 14 multiplies, 1 square root, 11 additions, 2 min/max operations, 12 reads and 1 write per component. This means you can process a component in 20~30 cycles if you optimize the code well.
The FPUSobel() function took 2113044 CPU cycles to process a 256*256 image, i.e. 32.76 cycles/component. I'll convert this sample code to SSE.
void FPUSobel()
{
    BYTE* image_0 = g_image + g_image_width * 0;
    BYTE* image_1 = g_image + g_image_width * 1;
    BYTE* image_2 = g_image + g_image_width * 2;
    DWORD* screen = g_screen + g_screen_width * 1;

    for(int y = 1; y < g_image_height - 1; ++y)
    {
        for(int x = 1; x < g_image_width - 1; ++x)
        {
            // 3x3 Sobel kernels around (x, y): image_0/1/2 are the rows
            // above, at and below the current output row.
            float gx = image_0[x-1] * (+1.0f) +
                       image_0[x+1] * (-1.0f) +
                       image_1[x-1] * (+2.0f) +
                       image_1[x+1] * (-2.0f) +
                       image_2[x-1] * (+1.0f) +
                       image_2[x+1] * (-1.0f);

            float gy = image_0[x-1] * (+1.0f) +
                       image_0[x+0] * (+2.0f) +
                       image_0[x+1] * (+1.0f) +
                       image_2[x-1] * (-1.0f) +
                       image_2[x+0] * (-2.0f) +
                       image_2[x+1] * (-1.0f);

            // gradient magnitude, clamped to [0, 255]
            int result = (int)min(255.0f, max(0.0f, sqrtf(gx * gx + gy * gy)));

            // replicate the grey value into all four bytes of the output DWORD
            screen[x] = 0x01010101 * result;
        }
        image_0 += g_image_width;
        image_1 += g_image_width;
        image_2 += g_image_width;
        screen += g_screen_width;
    }
}
The SseSobel() function took 613220 CPU cycles to process the same 256*256 image. That is 9.51 cycles/component, 3.4 times faster than FPUSobel(). There is some room left to optimize, but it will never be more than 4 times faster, because it uses 4-way SIMD.
This function uses an SoA approach to process 4 pixels at once. SoA is better than AoS for most array or image data, because AoS forces you to transpose/shuffle, and SoA makes it far easier to turn plain C code into SSE code.
void SseSobel()
{
    BYTE* image_0 = g_image + g_image_width * 0;
    BYTE* image_1 = g_image + g_image_width * 1;
    BYTE* image_2 = g_image + g_image_width * 2;
    DWORD* screen = g_screen + g_screen_width * 1;

    __m128 const_p_one = _mm_set1_ps(+1.0f);
    __m128 const_p_two = _mm_set1_ps(+2.0f);
    __m128 const_n_one = _mm_set1_ps(-1.0f);
    __m128 const_n_two = _mm_set1_ps(-2.0f);

    for(int y = 1; y < g_image_height - 1; ++y)
    {
        for(int x = 1; x < g_image_width - 1; x += 4)
        {
            // load 16 components. (0~6 will be used)
            __m128i current_0 = _mm_unpacklo_epi8(_mm_loadu_si128((__m128i*)(image_0+x-1)), _mm_setzero_si128());
            __m128i current_1 = _mm_unpacklo_epi8(_mm_loadu_si128((__m128i*)(image_1+x-1)), _mm_setzero_si128());
            __m128i current_2 = _mm_unpacklo_epi8(_mm_loadu_si128((__m128i*)(image_2+x-1)), _mm_setzero_si128());

            // image_00 = { image_0[x-1], image_0[x+0], image_0[x+1], image_0[x+2] }
            __m128 image_00 = _mm_cvtepi32_ps(_mm_unpacklo_epi16(current_0, _mm_setzero_si128()));
            // image_01 = { image_0[x+0], image_0[x+1], image_0[x+2], image_0[x+3] }
            __m128 image_01 = _mm_cvtepi32_ps(_mm_unpacklo_epi16(_mm_srli_si128(current_0, 2), _mm_setzero_si128()));
            // image_02 = { image_0[x+1], image_0[x+2], image_0[x+3], image_0[x+4] }
            __m128 image_02 = _mm_cvtepi32_ps(_mm_unpacklo_epi16(_mm_srli_si128(current_0, 4), _mm_setzero_si128()));
            __m128 image_10 = _mm_cvtepi32_ps(_mm_unpacklo_epi16(current_1, _mm_setzero_si128()));
            __m128 image_12 = _mm_cvtepi32_ps(_mm_unpacklo_epi16(_mm_srli_si128(current_1, 4), _mm_setzero_si128()));
            __m128 image_20 = _mm_cvtepi32_ps(_mm_unpacklo_epi16(current_2, _mm_setzero_si128()));
            __m128 image_21 = _mm_cvtepi32_ps(_mm_unpacklo_epi16(_mm_srli_si128(current_2, 2), _mm_setzero_si128()));
            __m128 image_22 = _mm_cvtepi32_ps(_mm_unpacklo_epi16(_mm_srli_si128(current_2, 4), _mm_setzero_si128()));

            __m128 gx = _mm_add_ps( _mm_mul_ps(image_00, const_p_one),
                        _mm_add_ps( _mm_mul_ps(image_02, const_n_one),
                        _mm_add_ps( _mm_mul_ps(image_10, const_p_two),
                        _mm_add_ps( _mm_mul_ps(image_12, const_n_two),
                        _mm_add_ps( _mm_mul_ps(image_20, const_p_one),
                                    _mm_mul_ps(image_22, const_n_one))))));

            __m128 gy = _mm_add_ps( _mm_mul_ps(image_00, const_p_one),
                        _mm_add_ps( _mm_mul_ps(image_01, const_p_two),
                        _mm_add_ps( _mm_mul_ps(image_02, const_p_one),
                        _mm_add_ps( _mm_mul_ps(image_20, const_n_one),
                        _mm_add_ps( _mm_mul_ps(image_21, const_n_two),
                                    _mm_mul_ps(image_22, const_n_one))))));

            __m128 result = _mm_min_ps( _mm_set1_ps(255.0f),
                            _mm_max_ps( _mm_set1_ps(0.0f),
                                        _mm_sqrt_ps(_mm_add_ps(_mm_mul_ps(gx, gx), _mm_mul_ps(gy, gy))) ));

            __m128i pack_32  = _mm_cvtps_epi32(result);               //R32,G32,B32,A32
            __m128i pack_16  = _mm_packs_epi32(pack_32, pack_32);     //R16,G16,B16,A16,R16,G16,B16,A16
            __m128i pack_8   = _mm_packus_epi16(pack_16, pack_16);    //RGBA,RGBA,RGBA,RGBA
            __m128i unpack_2 = _mm_unpacklo_epi8(pack_8, pack_8);     //RRGG,BBAA,RRGG,BBAA
            __m128i unpack_4 = _mm_unpacklo_epi8(unpack_2, unpack_2); //RRRR,GGGG,BBBB,AAAA

            _mm_storeu_si128((__m128i*)(screen+x), unpack_4);
        }
        image_0 += g_image_width;
        image_1 += g_image_width;
        image_2 += g_image_width;
        screen += g_screen_width;
    }
}
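For completeness, a hypothetical harness for the two functions above; the BYTE/DWORD typedefs and the g_* globals are my guesses at the declarations missing from the snippet (in a real build they would have to appear before FPUSobel() and SseSobel()):

#include <stdint.h>
#include <stdlib.h>

typedef uint8_t  BYTE;          // assumed Win32-style typedefs
typedef uint32_t DWORD;

BYTE*  g_image;                 // grey-scale input, g_image_width * g_image_height bytes
DWORD* g_screen;                // 32-bit output, one DWORD per pixel
int    g_image_width  = 256;
int    g_image_height = 256;
int    g_screen_width = 256;

int main(void)
{
    g_image  = calloc((size_t)g_image_width * g_image_height, sizeof(BYTE));
    g_screen = calloc((size_t)g_screen_width * g_image_height, sizeof(DWORD));
    // ... fill g_image with the picture ...
    SseSobel();                 // or FPUSobel()
    free(g_image);
    free(g_screen);
    return 0;
}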
For the code in @zupet's answer:
Instead of multiplying by one (const_p_one), I would do .... nothing. Compilers might not optimize that away.
Instead of multiplying by two, I would add the value to itself; that's faster than a multiply with integer arithmetic. With FP, it mostly just avoids needing another vector constant. Haswell has worse FP add throughput than FP mul, but Skylake and Zen are balanced.
Instead of multiplying by -1.0, negate with _mm_xor_ps with -0.0 to flip the sign bit.
I would compute the positive and negative terms independently, side by side in parallel rather than one after the other (for better pipelining), with the same arithmetic, and subtract only at the end. Etc., etc. ... there are still a lot of improvements pending.
With AVX+FMA available, _mm_fmadd_ps can be much faster.
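A rough sketch of what those suggestions look like for the gy sum in SseSobel() above (the helper and its parameter names are mine; the arguments are the already-converted float rows image_00..image_02 and image_20..image_22):

#include <xmmintrin.h>  // SSE

// No multiplies: the *1.0f terms are simply added, *2.0f becomes an
// add-to-self, and the bottom row is negated by XORing its sign bits with
// -0.0f. The positive (top) and negative (bottom) halves are built
// independently and only combined at the end.
static inline __m128 gy_no_mul(__m128 r0_left, __m128 r0_mid, __m128 r0_right,
                               __m128 r2_left, __m128 r2_mid, __m128 r2_right)
{
    __m128 top    = _mm_add_ps(_mm_add_ps(r0_left, r0_right),
                               _mm_add_ps(r0_mid, r0_mid));    // [1 2 1] on the row above
    __m128 bottom = _mm_add_ps(_mm_add_ps(r2_left, r2_right),
                               _mm_add_ps(r2_mid, r2_mid));    // [1 2 1] on the row below
    // top - bottom, written with the sign-bit flip; a plain
    // _mm_sub_ps(top, bottom) is equivalent here.
    return _mm_add_ps(top, _mm_xor_ps(bottom, _mm_set1_ps(-0.0f)));
}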