How could these case conversion functions be improved?

As a learning exercise, my three functions—ToggleCase, LowerCase and UpperCase—each expect a pointer to an ASCII char string, terminated by the null character; they work as expected. Are there more efficient or faster methods of accomplishing this task? Am I breaking any unspoken rules of good C coding? I've made use of macros because, I think, it makes the code look better and it is more efficient than function calls. Is this typical or overkill?

Please feel free to nit-pick and critique the code (but do be nice).

case_conversion.h

#define CASE_FLAG 32
#define a_z(c) (c >= 'a' && c <= 'z')
#define A_Z(c) (c >= 'A' && c <= 'Z')

void ToggleCase(char* c);
void LowerCase(char* c);
void UpperCase(char* c);

case_conversion.c

#include "case_conversion.h"

void ToggleCase(char* c)
{
 while (*c)
 {
  *c ^= a_z(*c) || A_Z(*c) ? CASE_FLAG : 0;
  c++;
 }
}
void LowerCase(char* c)
{
 while (*c)
 {
  *c ^= A_Z(*c) ? CASE_FLAG : 0;
  c++;
 }
}
void UpperCase(char* c)
{
 while (*c)
 {
  *c ^= a_z(*c) ? CASE_FLAG : 0;
  c++;
 }
}
asked Oct 07 '10 by RLH

4 Answers

My favorites:

*c += (*c-'A'<26U)<<5; /* lowercase */
*c -= (*c-'a'<26U)<<5; /* uppercase */
*c ^= ((*c|32U)-'a'<26)<<5; /* toggle case */

Since your target will be embedded systems, you should be learning how to eliminate unnecessary code bloat, branches, etc. Your conditional for determining whether an ASCII character is alphabetical takes 4 comparison/branch operations; mine takes 1. I would recommend looking up some good resources on arithmetic and bit manipulation tricks.
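For concreteness, a minimal sketch of the first line dropped into a full function with a test driver (assuming plain ASCII): for any *c below 'A', (*c - 'A') wraps around to a huge unsigned value, so one unsigned comparison covers both ends of the range.

#include <stdio.h>

/* Sketch: the single-comparison range test in a full function. */
void LowerCase2(char* c)
{
    for (; *c; c++)
        *c += (*c - 'A' < 26U) << 5; /* add 32 only for 'A'..'Z' */
}

int main(void)
{
    char s[] = "Hello, WORLD!";
    LowerCase2(s);
    puts(s); /* prints: hello, world! */
    return 0;
}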

Note: I changed the *32 operations to <<5 after posting my answer, because a number of embedded systems compilers are too poor to do this for you. When writing code for a good compiler, *32 would probably illustrate your intent better.

Edit: Regarding the charge that my code has too many implicit compiler-generated operations, I believe that's completely false. Here is the pseudo-asm any half-decent compiler should generate for the first line:

  1. Load *c and zero- or sign-extend it to fill an int-sized word (depending on whether plain char is signed or unsigned).
  2. Subtract the constant 26 using an unsigned (non-overflow-trapping) sub instruction.
  3. Conditional-jump past the rest of the code if the carry flag is not set.
  4. Else, add 32 to the value at *c.

Steps 2 and 3 may be combined on architectures that use a comparing-jump operation instead of flags. The only way I can see any significant behind-the-scenes costs cropping up is if the machine cannot directly address chars, or if it uses a nasty (sign/magnitude or ones' complement) signed value representation, in which case the conversion to unsigned would be nontrivial. As far as I know, no modern embedded architecture has these problems; they're mostly isolated to legacy mainframes (and to a lesser extent, DSPs).

If anyone is worried about bad compilers actually performing arithmetic for the <<5, you might try:

if (*c-'A'<26U) *c+=32;

instead of my code. That's probably cleaner anyway, but I generally like avoiding statements so I can shove the code inside a loop condition or function-like macro.
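For instance, a minimal macro sketch of what "shoving it into a function-like macro" might look like (note the argument is read twice, so pass only a plain pointer with no side effects):

/* Sketch: the expression form fits in a function-like macro. */
#define LOWERCASE_CH(p) (*(p) += (*(p) - 'A' < 26U) << 5)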

Edit 2: By request, a branch-free version of the first line:

*c += (64U-*c & *c-91U) >> (CHAR_BIT*sizeof(unsigned)-1) << 5;

In order for this to work reliably, c should have type unsigned char *, and unsigned int should be strictly wider than unsigned char. (CHAR_BIT is defined in <limits.h>.)
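A quick self-contained test of the branch-free line (a sketch, assuming an ASCII execution character set):

#include <limits.h>
#include <stdio.h>

/* Sketch: both (64U - *c) and (*c - 91U) have their top bit set
   only when 'A' <= *c <= 'Z'; shifting that bit down to bit 0 and
   back up to bit 5 yields exactly 0 or 32. */
static void lower_branchfree(unsigned char* c)
{
    for (; *c; c++)
        *c += (64U - *c & *c - 91U) >> (CHAR_BIT * sizeof(unsigned) - 1) << 5;
}

int main(void)
{
    unsigned char s[] = "MiXeD 123!";
    lower_branchfree(s);
    puts((char*)s); /* prints: mixed 123! */
    return 0;
}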

answered Nov 19 '22 by R.. GitHub STOP HELPING ICE


There are at least two major problems with your macros. Consider what happens if I call one of them like

a_z(flag ? 'a' : 'A');

Because the macro pastes its argument directly into the expression, this expands to (flag ? 'a' : 'A' >= 'a' && flag ? 'a' : 'A' <= 'z'), which parses completely differently from what was intended due to operator precedence. This is easy to fix using brackets:

#define a_z(c) ((c) >= 'a' && (c) <= 'z')

But they can also be called like this:

a_z(i++);

This call can increment i twice, because the argument is expanded twice (short-circuiting means the second increment happens only when the first test succeeds). And that is not easily fixable (if at all) in a macro. I would recommend using inline functions instead (if needed at all; see the lookup-table approach below).
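For the range tests themselves, a minimal inline-function sketch (C99 or later):

/* Sketch: inline-function replacements for the macros. Arguments
   are evaluated exactly once, and precedence can no longer bite. */
static inline int is_lower(int c) { return c >= 'a' && c <= 'z'; }
static inline int is_upper(int c) { return c >= 'A' && c <= 'Z'; }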

The fastest way to convert between upper/lowercase I know of is using lookup tables. Of course, this trades memory for speed - pick your preference knowing your specific platform :-)

You need two arrays, one for each direction. Initialize them like

char toUpper[128]; // we care only about standard ASCII
for (int i = 0; i < 128; i++)
  toUpper[i] = i;
toUpper['a'] = 'A';
...
toUpper['z'] = 'Z';

And the conversion is trivial:

char toUpperCase(char c)
{
  return toUpper[c];
}

(for production code, this should be improved to extend the array to all possible char values on the given platform (or shrink it to only the legal values and do parameter checking), but for illustration, this will do; a fuller version is sketched below.)
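A minimal sketch of that fuller version, assuming only that 'a'..'z' and 'A'..'Z' are contiguous (as in ASCII):

#include <limits.h>

/* Sketch: full-range table, indexed through unsigned char so that
   negative plain-char values cannot read before the table. */
static unsigned char to_upper_tab[UCHAR_MAX + 1];

static void init_to_upper(void)
{
    for (int i = 0; i <= UCHAR_MAX; i++)
        to_upper_tab[i] = (unsigned char)i;
    for (int i = 'a'; i <= 'z'; i++)
        to_upper_tab[i] = (unsigned char)(i - 'a' + 'A');
}

char toUpperCase2(char c)
{
    return (char)to_upper_tab[(unsigned char)c];
}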

answered Nov 19 '22 by Péter Török


NOTE: The question title was edited; the original title, "Please Critique -- An optimal function for converting string cases in C", was about optimization, which explains why my answer deals only with optimization rather than with generically "improving" the functions.

If you are really looking for the absolute fastest way to do it, a branch-free version is going to be the way to go in the long run because it can use SIMD. Plus it avoids having tables (which may be too large on an embedded system if memory is really cramped).

Here is a simple non-SIMD branch-free example; ToLower is a trivial change from it.

char BranchFree_AsciiToUpper(char inchar) 
{ 
        // Branch-Free / No-Lookup 
        // toupper() for ASCII-only 
        const int ConvertVal = 'A' - 'a'; 
        // Bits to Shift Arithmetic to Right : 9 == (char-bits + 1) 
        const int AsrBits = 9; 

        int c=(int)inchar; 
        //if( (('a'-1)<c) && (c<('z'+1)) ) { c += 'A'-'a'; } 
        int LowerBound = ('a'-1) - c; 
        int UpperBound = c - ('z' + 1); 
        int BranchFreeMask = (LowerBound & UpperBound)>>AsrBits;
        c = c + (BranchFreeMask & ConvertVal); 
        return((char)c); 
}

My function is expanded for clarity and uses non-hardcoded constants. You can do the same thing in a single line with hardcoded values but I like readable code; however, here is the "compressed" version of my algorithm. It's not any faster since it does the EXACT same thing "compressed" to one line.

c+=(((96-(int)c)&((int)c-123))>>9)&(-32);

There are a number of optimizations you can make here to make it even faster. You can hardcode more optimal numbers for ASCII, because the example assumes nothing about the encoding except that a-z and A-Z are contiguous ranges. For example, with ASCII, if you don't have a barrel shifter, you can actually change AsrBits to 4 (9-5), since ConvertVal will be +/-32 depending on whether it's the toupper or tolower operation.
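A sketch of that AsrBits == 4 variant, hardcoded for ASCII: after shifting right by only 4, bit 5 and everything above it is still all-ones for alphabetic input, so ANDing with -32 (whose low five bits are clear) still yields exactly -32, and 0 otherwise.

/* Sketch: AsrBits == 4 variant of the toupper conversion. */
static char BranchFree_AsciiToUpper_Asr4(char inchar)
{
    int c = (int)inchar;
    c += (((96 - c) & (c - 123)) >> 4) & -32;
    return (char)c;
}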

Once you have working branch-free versions, you can use SIMD or bit-twiddling SWAR (SIMD Within A Register) techniques to convert 4-16 bytes at a time (or possibly even more, depending on how wide your registers go and whether you unroll to hide latency). This will be much faster than any lookup method, which is pretty much limited to single-byte conversion unless you have immensely large tables that grow exponentially with the number of bytes processed simultaneously.
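A minimal SWAR sketch, assuming 7-bit ASCII input so the per-byte adds cannot carry into a neighboring byte:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Sketch: SWAR uppercase on four ASCII bytes per 32-bit word. */
static uint32_t swar_toupper4(uint32_t w)
{
    uint32_t ge_a = w + 0x1f1f1f1fu;  /* bit 7 set in bytes >= 'a' */
    uint32_t gt_z = w + 0x05050505u;  /* bit 7 set in bytes >  'z' */
    uint32_t mask = (ge_a & ~gt_z) & 0x80808080u;
    return w ^ (mask >> 2);           /* 0x80 -> 0x20: clear the case bit */
}

int main(void)
{
    char s[8] = "abcZ";
    uint32_t w;
    memcpy(&w, s, 4); /* memcpy avoids alignment/aliasing trouble */
    w = swar_toupper4(w);
    memcpy(s, &w, 4);
    puts(s); /* prints: ABCZ */
    return 0;
}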

Also, you can generate the branchfree predicate without using int upcasting but then you have to do a couple more operations (with upcasting it's just one subtract per range). You may need to do the expanded operations for SWAR but most SIMD implementations have a compare operation that will generate a mask for you for free.

The SWAR/SIMD operations also can benefit from fewer reads/writes to memory and the writes that do occur can be aligned. This is much faster on processors that have load-hit-store penalties (such as the PS3 Cell Processor). Combine this with simple prefetching in an unrolled version and you can avoid memory stalls nearly altogether.

I know it seems like there is a lot of code in my example, but there are ZERO branches (implicit or explicit) and no branch mispredictions as a result. If you are on a platform with significant branch mispredict penalties (which is true for many pipelined embedded processors), then even without SIMD, your optimized release build of the above code should run faster than something that seems much less complicated but creates implicit branches.

Even without SIMD/SWAR, it is possible for a smart compiler to unroll and interleave the above implementation to hide latencies and result in a very fast version - especially on modern superscalar processors that can issue more than one non-dependent instruction per cycle. This is not usually possible with any of the branching versions.

If you manually unroll, I would group loads and gather stores to make it easier for the compiler to interleave the non-branching non-dependent instructions in between. Example:

// Unrolled inner loop where 'char *c' is the string we're converting
char c0=c[0],c1=c[1],c2=c[2],c3=c[3];  // Grouped-Loads
c[0]=BranchFree_AsciiToUpper(c0);
c[1]=BranchFree_AsciiToUpper(c1);
c[2]=BranchFree_AsciiToUpper(c2);
c[3]=BranchFree_AsciiToUpper(c3);
c+=4;

A decent compiler should be able to inline the ToUpper and fully interleave the above code since there are no branches, no memory aliasing, and no codependent instructions between each one. Just for kicks, I decided to actually compile this; a compiler targeting PowerPC generated perfect interleaving for a dual-issue superscalar core that will easily outperform any code with branches.

mr               r31,r3
mr               r13,r13
lbz              r11,0(r31)
lbz              r10,1(r31)
extsb            r11,r11
lbz              r9,2(r31)
extsb            r10,r10
lbz              r8,3(r31)
subfic           r7,r11,96
addi             r6,r11,-123
srawi            r5,r7,9
srawi            r4,r6,9
subfic           r3,r10,96
addi             r7,r10,-123
extsb            r9,r9
srawi            r6,r3,9
srawi            r3,r7,9
subfic           r7,r9,96
addi             r30,r9,-123
extsb            r8,r8
srawi            r7,r7,9
srawi            r30,r30,9
subfic           r29,r8,96
addi             r28,r8,-123
srawi            r29,r29,9
srawi            r28,r28,9
and              r5,r5,r4
and              r3,r6,r3
and              r7,r7,r30
and              r30,r29,r28
clrrwi           r4,r5,5
clrrwi           r6,r7,5
clrrwi           r5,r3,5
clrrwi           r7,r30,5
add              r4,r4,r11
add              r3,r5,r10
add              r11,r6,r9
stb              r4,0(r31)
add              r10,r7,r8
stb              r3,1(r31)
stb              r11,2(r31)
stb              r10,3(r31)

The proof is in the pudding: the compiled code above is going to be really fast compared to branching versions, even before moving on to SWAR or SIMD.

In summary, reasons why this should be the fastest method:

  1. No branch misprediction penalties
  2. Ability to SIMD-ify algorithm for 4-16 (or more) bytes at a time
  3. Compiler (or programmer) can unroll and interleave to eliminate latencies and to take advantage of superscalar (multi-issue) processors
  4. No memory latencies (i.e. table lookups)
answered Nov 19 '22 by Adisak


I hesitated to answer this because it's been over 20 years since I worked with small devices. However, I think the rules are pretty much the same (with one possible addition):

  1. Minimize memory accesses
  2. Minimize CPU cycles
  3. Minimize code size

When I was developing low-level code, rule #1 overshadowed all the others. There wasn't any on-board cache, and memory was incredibly slow relative to the processor; that's the reason the "register" storage class exists in C. Today the situation has changed somewhat, but it's still one of the top two concerns. As I commented on one post, a lookup table is a good idea, but recognize that it means an extra memory access for each test. Once it gets into cache that may not be an issue, but you're going to pay the price of several cache misses each time you enter the function (unless you're calling it so often that the lookup table can remain in cache).

Rule number 2 seems like "duh, of course you want to do that, why isn't it rule #1?" but the reasoning actually goes deeper. In fact, in some ways it's a restatement of rule #1, since each instruction has to be fetched from memory before it can be executed. There's a delicate tradeoff: on an integer-only processor, it's a clear win to use a lookup table to compute trigonometric functions; on a chip with embedded floating point, maybe not.

I'm not sure that rule #3 still applies. In my experience, there was always the scramble to cut code, fit the proverbial 20 pounds into a 10 pound sack. But it seems that today the smallest sack is 50 pounds. However, even with a 50 pound sack (or many-megabyte ROM) to hold your code/data, you still need to pull it into the cache (if you have one).

The new rule #1: keep the pipeline full

Modern processors have deep instruction pipelines (if you're not familiar with this term, see this article: http://arstechnica.com/old/content/2004/09/pipelining-1.ars/1). The general rule of thumb with deep pipelines is that the branching -- an "if" test -- is expensive, because it means that the pipeline might have to be flushed to load in the new code. So you write your code to branch on the unlikely case (see Adisak's post for a perhaps-justified no-branch implementation; +1 if I could).
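On GCC and Clang you can make the expected direction explicit with the __builtin_expect extension; a rough sketch:

/* Sketch: hint which way the branch usually goes, so the compiler
   lays out the fall-through path for the common case. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

void UpperCaseHinted(char* c)
{
    for (; *c; c++)
        if (likely(*c >= 'a' && *c <= 'z')) /* assume mostly lowercase text */
            *c -= 32;
}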

Someone with more recent experience than me will probably comment and say "modern processors load the pipeline with both branches, so there's no cost penalty." Which is all well and good, but it brings up an overarching rule:

Rule 0: optimization depends on your architecture and workload

The microprocessor inside my dishwasher probably doesn't have a pipeline and maybe doesn't have a cache. Of course, it's probably not going to do a lot of text processing either. Or maybe it has both; it seems that there are only a few major embedded processors in the market, so maybe there's a Pentium on that circuit board rather than an 8051 derivative. Even so, there's a wide range even within the Pentium-based embedded processors (http://en.wikipedia.org/wiki/List_of_Intel_Pentium_microprocessors#Embedded_processors). What is best for one might not be best for another.

Then there's the issue of what sort of data you're processing. If you're processing text, then it's likely (but not guaranteed) that most of your data will be letters, versus numbers or punctuation; so you can optimize for that.

However, there's more: I commented "ASCII only, huh?" on the OP; another commenter was more explicit: if you're processing text in 2010, you probably aren't processing ASCII. At the very least, you'll be dealing with ISO-8859-1 or a similar 8-bit character set. And in this case, maybe a no-branch or smart-branch (paying attention to the pipeline) solution will still be faster than a lookup table (yes, that's an assumption on my part). But if you're dealing with Unicode BMP (16 bits), you'll pretty much have to use the table, whatever its cost in terms of memory, because there are no easy rules to determine what's lower- versus upper-case. And if you're dealing with the higher planes of Unicode ... well, maybe capitalization of "Old Italics" isn't that important (especially because it doesn't have upper- and lower-case).
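For the BMP case, the table is usually given a two-level shape so that caseless blocks cost almost nothing; a rough sketch (the real per-block deltas would be generated from the Unicode data files):

#include <stdint.h>

/* Sketch: two-level (block) lookup for BMP case mapping. Blocks of
   256 code points with no case mapping all share one zero block,
   keeping the total size far below a flat 65536-entry table. */
#define BLOCK_BITS 8
#define NBLOCKS    (1 << (16 - BLOCK_BITS))

static const int16_t zero_block[1 << BLOCK_BITS]; /* all zeros */
/* upper_delta[b][i] holds (uppercase - codepoint); entries for
   blocks with real mappings would point at generated data. */
static const int16_t* upper_delta[NBLOCKS] = { zero_block /* , ... */ };

static uint16_t to_upper_bmp(uint16_t cp)
{
    const int16_t* blk = upper_delta[cp >> BLOCK_BITS];
    if (!blk) blk = zero_block; /* unfilled entries map to identity */
    return (uint16_t)(cp + blk[cp & ((1 << BLOCK_BITS) - 1)]);
}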

Ultimately, the only way to know for sure is to profile given realistic workloads.

Finally: Clear Code FTW

This post started when I wrote a comment to the OP that his/her use of macros was a bad idea (and couldn't enter it because SO went into maintenance mode). Peter Torok (sorry, I don't support Unicode or even ISO-8859-1) gave one reason, but there's another: they're black boxes.

The OP's code looks nice and clean: short code, heavy use of bitwise and ternary operators, easy to understand if you understand the language. But it would have been a lot easier to understand the actual work if you saw A_Z in its expanded form. That might have gotten you thinking about how much branching you were doing, particularly in the ToggleCase method. And then you might have given some thought to how you could re-arrange those branches to minimize the number of actual tests you're doing. And perhaps given some thought to maintaining the pipeline.

answered Nov 19 '22 by Anon