Using Multiply Accumulate Instruction Inline Assembly in C++

Tags: c++, assembly, filtering, arm

I am implementing a FIR filter on an ARM9 processor and am trying to use the SMLAL instruction.

Initially I had the following filter implemented, and it worked perfectly, but it uses too much processing power for our application.

uint32_t DDPDataAcq::filterSample_8k(uint32_t sample)
 {
    // This routine is based on the fir_double_z routine outlined by Grant R Griffin
    // - www.dspguru.com/sw/opendsp/alglib.htm 
    int i = 0; 
    int64_t accum = 0; 
    const int32_t *p_h = hCoeff_8K; 
    const int32_t *p_z = zOut_8K + filterState_8K;


    /* Cast the sample to a signed 32-bit int.
     * We need to preserve the signedness of the number, so if the 24-bit
     * sample is negative we need to extend the sign bit up to the MSB, padding the number
     * with 1's to preserve two's complement.
     */
    int32_t s = sample; 
    if (s & 0x800000)
        s |= ~0xffffff;

    // store the input sample at the current delay-line position and again NTAPS_8K entries later
    zOut_8K[filterState_8K] = zOut_8K[filterState_8K+NTAPS_8K] = s;

    for (i =0; i<NTAPS_8K; ++i)
    {
        accum += (int64_t)(*p_h++) * (int64_t)(*p_z++);
    }

    //convert the 64 bit accumulator back down to 32 bits
    int32_t a = (int32_t)(accum >> 9);


    // decrement state, wrapping if below zero
    if ( --filterState_8K < 0 )
        filterState_8K += NTAPS_8K;

    return a; 
}
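
The sign-extension step in the routine above can be looked at in isolation. The following is a minimal sketch, not the poster's code: `sign_extend_24` is a hypothetical helper name, and unlike the original it first masks off the upper 8 bits, on the assumption that they carry no data.

```cpp
#include <cstdint>

// Hypothetical helper (not from the post): sign-extend a 24-bit
// two's-complement sample stored in the low bits of a 32-bit word.
constexpr int32_t sign_extend_24(uint32_t sample)
{
    uint32_t s = sample & 0x00FFFFFFu;   // assumption: only the low 24 bits carry data
    if (s & 0x00800000u)                 // bit 23 is the sign bit of the 24-bit value
        s |= 0xFF000000u;                // fill bits 24..31 with ones
    return static_cast<int32_t>(s);      // reinterpret the bit pattern as signed
}

// Compile-time checks of the boundary cases.
static_assert(sign_extend_24(0x000001u) == 1, "small positive value is unchanged");
static_assert(sign_extend_24(0x7FFFFFu) == 8388607, "largest positive 24-bit value");
static_assert(sign_extend_24(0x800000u) == -8388608, "most negative 24-bit value");
static_assert(sign_extend_24(0xFFFFFFu) == -1, "all ones is -1");
```

Doing the OR on an unsigned value and converting once at the end keeps the bit manipulation away from signed arithmetic.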

I have been attempting to replace the multiply accumulate with inline assembly since GCC is not using a MAC instruction even with optimization turned on. I replaced the for loop with the following:

uint32_t accum_low = 0; 
int32_t accum_high = 0; 

for (i =0; i<NTAPS_4K; ++i)
{
    __asm__ __volatile__("smlal %0,%1,%2,%3;"
        :"+r"(accum_low),"+r"(accum_high)
        :"r"(*p_h++),"r"(*p_z++)); 
} 

accum = (int64_t)accum_high << 32 | (accum_low);

The output I now get using the SMLAL instruction is not the filtered data I was expecting. I have been getting random values that seem to have no pattern or connection to the original signal or the data I am expecting.

I have a feeling I am doing something wrong with splitting the 64-bit accumulator into the high and low registers for the instruction, or I am putting them back together wrong. Either way, I am not sure why I cannot get the correct output by swapping the C code for the inline assembly.
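
One way to test the split and recombination separately from the assembly is to model the instruction in portable C++ and compare it with the plain 64-bit accumulation. The sketch below is an assumption-laden illustration, not a diagnosis of the problem: `smlal_model`, `combine`, and `matches_reference` are hypothetical names, and the model only reflects the documented effect of SMLAL, which adds the signed 64-bit product to the RdHi:RdLo pair. It does the recombination through unsigned types, because shifting a negative signed value left is not well defined before C++20.

```cpp
#include <cstdint>

// Hypothetical portable model of SMLAL: {hi:lo} += (int64_t)a * (int64_t)b.
inline void smlal_model(uint32_t &lo, int32_t &hi, int32_t a, int32_t b)
{
    // Rebuild the 64-bit accumulator from its two halves (unsigned, so no UB).
    uint64_t acc = (static_cast<uint64_t>(static_cast<uint32_t>(hi)) << 32) | lo;
    // A 32x32 signed product always fits in 64 bits; add it modulo 2^64.
    acc += static_cast<uint64_t>(static_cast<int64_t>(a) * static_cast<int64_t>(b));
    lo = static_cast<uint32_t>(acc);                              // low word
    hi = static_cast<int32_t>(static_cast<uint32_t>(acc >> 32));  // high word
}

// Hypothetical helper: join the two halves back into one signed 64-bit value.
inline int64_t combine(uint32_t lo, int32_t hi)
{
    return static_cast<int64_t>(
        (static_cast<uint64_t>(static_cast<uint32_t>(hi)) << 32) | lo);
}

// Run both accumulation styles over the same data and compare the results.
inline bool matches_reference(const int32_t *h, const int32_t *z, int n)
{
    int64_t ref = 0;
    uint32_t lo = 0;
    int32_t hi = 0;
    for (int i = 0; i < n; ++i) {
        ref += static_cast<int64_t>(h[i]) * static_cast<int64_t>(z[i]);
        smlal_model(lo, hi, h[i], z[i]);
    }
    return combine(lo, hi) == ref;
}
```

If the model and the reference agree on the real coefficients and delay-line contents, the high/low handling itself is probably not at fault, and the loop bound, operand constraints, or register allocation of the inline assembly would be the next things to inspect.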

asked Aug 23 '10 17:08

John C

1 Answer

Which compiler version are you using? I compiled your C-only code with GCC 4.4.3 and the options -O3 -march=armv5te, and it generated smlal instructions.
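
To check this with a given toolchain, it may help to isolate the multiply-accumulate loop in a small function and look at the assembly the compiler emits for it (for example with GCC's -S option alongside the flags above). The following is a minimal sketch under that assumption: `mac64` is a hypothetical name, and whether smlal actually appears depends on the compiler version and target options.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-alone version of the filter's inner loop.
// Widening one operand to int64_t before the multiply gives a
// 32x32 -> 64-bit signed product accumulated into a 64-bit sum,
// which is the pattern a compiler can map to a long multiply-accumulate.
int64_t mac64(const int32_t *h, const int32_t *z, std::size_t ntaps)
{
    int64_t accum = 0;
    for (std::size_t i = 0; i < ntaps; ++i)
        accum += static_cast<int64_t>(h[i]) * z[i];
    return accum;
}
```

Keeping the loop in plain C++ also leaves register allocation and scheduling to the compiler, which an inline assembly block marked volatile would restrict.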

answered Oct 01 '22 19:10

Nils Pipenbrinck
