
Is it possible to 'trim' trailing spaces/tabs from a string in an arbitrary encoding using ICU without doing any conversions?

Specifically, given the following:

  • A pointer to a buffer containing string data in some encoding X supported by ICU
  • The length of the data in the buffer, in bytes
  • The encoding of the buffer (i.e. X)

Can I compute the length of the string, minus the trailing space/tab characters, without converting it into ICU's internal encoding and then back again? (The round trip itself could be problematic because of Unicode normalization.)

For certain encodings, such as any ASCII-based encoding along with UTF-8/16/32, the solution is pretty simple: iterate from the back of the string, comparing 1, 2, or 4 bytes at a time against the two constants.

For others it could be harder (variable-length encodings come to mind). I would like this to be as efficient as possible.

Bwmat asked Nov 29 '13 18:11


2 Answers

For a large subset of encodings, and for the limited set of U+0020 SPACE and U+0009 HORIZONTAL TAB, this is pretty simple.

In ASCII, single-byte Windows code pages, and single-byte ISO code pages, these two characters have the same byte values (9 for tab, 32 for space). You can simply work backwards, byte by byte, lopping bytes off as long as the value is either 9 or 32.

This approach also works for UTF-8, which has the nice property that a byte less than 128 is always that ASCII character. You don't have to wonder whether it's a lead byte or a continuation byte, as those always have the high bit set.
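As a minimal sketch of that backward scan (the function name `trimmed_length_bytes` is made up for illustration, and it assumes an encoding where the bytes 9 and 32 can only ever mean tab and space, as in ASCII, the single-byte code pages above, or UTF-8):

```cpp
#include <cstddef>

// Returns the length in bytes of the data once trailing spaces and tabs are dropped.
// Assumption: bytes 0x09 and 0x20 never occur as part of a multibyte character.
std::size_t trimmed_length_bytes(const char* data, std::size_t length) {
    while (length > 0) {
        const unsigned char byte = static_cast<unsigned char>(data[length - 1]);
        if (byte != 0x09 && byte != 0x20) {
            break;  // first byte from the end that is neither tab nor space
        }
        --length;
    }
    return length;
}
```

The buffer is never modified or converted; the caller just uses the returned length from then on.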

Given UTF-16, you work two bytes at a time, looking for 0x0009 and 0x0020, being careful to handle byte order. Like UTF-8, UTF-16 has the nice property that if you see one of these values, you don't have to wonder whether it's part of a surrogate pair, as surrogates always fall in their own distinct range (0xD800 to 0xDFFF).
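A rough sketch of the UTF-16 variant, assuming the caller already knows the byte order (the `ByteOrder` enum and the function name are hypothetical, not part of ICU):

```cpp
#include <cstddef>

enum class ByteOrder { Little, Big };

// Returns the length in bytes of UTF-16 data once trailing U+0009/U+0020 are dropped.
// An odd byte count is treated as malformed input and returned unchanged.
std::size_t trimmed_length_utf16(const unsigned char* data, std::size_t length,
                                 ByteOrder order) {
    if (length % 2 != 0) {
        return length;
    }
    while (length >= 2) {
        const unsigned first = data[length - 2];
        const unsigned second = data[length - 1];
        // Assemble the 16-bit code unit according to the byte order.
        const unsigned unit = (order == ByteOrder::Big) ? ((first << 8) | second)
                                                        : ((second << 8) | first);
        if (unit != 0x0009 && unit != 0x0020) {
            break;
        }
        length -= 2;
    }
    return length;
}
```

Getting the byte order wrong matters here: the bytes 0x00 0x20 are a space in big-endian UTF-16 but U+2000 in little-endian.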

The problematic cases are the variable-byte encodings that don't give you the assurance that a given unit is unique. If you see a byte with the value 9, you don't necessarily know whether it's a tab character or a trailing byte of a multibyte character. Even for some of these, however, the specific values you care about (9 and 32) may still be unique. For example, looking at Windows code page 950, it seems that lead bytes have the high bit set, and tail bytes steer clear of the lower values (it would take a lot of checking to be absolutely sure). So for your limited case, this might be sufficient.

For the general problem of stripping an arbitrary set of characters from absolutely any encoding, you need to parse the string according to the rules of that encoding (as well as knowing all the character mappings). For the general case, it's almost certainly best to convert the string to some Unicode encoding, do the trimming, and then convert back. This should round-trip correctly as long as nothing normalizes the text along the way; in particular, avoid the K (compatibility) normalization forms, which are lossy.
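To illustrate what "parse the string according to the rules of that encoding" can look like without a conversion, here is a sketch for a made-up double-byte encoding. The rule in `is_lead_byte` is purely hypothetical (any byte with the high bit set starts a two-byte character whose second byte may be anything); a real encoding would need its own rule. Because a byte 32 might be a tail byte, the scan has to run forward from the start:

```cpp
#include <cstddef>

// Hypothetical lead-byte rule, for illustration only.
bool is_lead_byte(unsigned char byte) { return byte >= 0x80; }

// Walks the buffer character by character and remembers the offset just past
// the last character that is neither a space nor a tab.
std::size_t trimmed_length_forward(const unsigned char* data, std::size_t length) {
    std::size_t keep = 0;
    std::size_t pos = 0;
    while (pos < length) {
        const unsigned char byte = data[pos];
        if (is_lead_byte(byte)) {
            // Two-byte character; a truncated one at the end is kept as content.
            pos += (pos + 1 < length) ? 2 : 1;
            keep = pos;
        } else {
            ++pos;
            if (byte != 0x09 && byte != 0x20) {
                keep = pos;
            }
        }
    }
    return keep;
}
```

The price is a pass over the whole string rather than just its tail, which is the trade-off against the backward scans that work for the friendlier encodings.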

Adrian McCarthy answered Sep 23 '22 15:09


I use the rather simplistic STL approach of:

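#include <string>
std::string mystring;
// If the string is all whitespace, find_last_not_of returns npos; npos + 1 wraps to 0 and erase(0) clears it.
mystring.erase(mystring.find_last_not_of(" \n\r\t") + 1);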

It has done the job for all my needs over years of use, though your mileage may vary :)

Let me know if you need more information:)

GMasucci answered Sep 22 '22 15:09