
Properly delimiting strings in C

I'm wondering what would be a good, efficient way to delimit strings that can contain basically any character. For instance, I need to concatenate n strings that can look like:

char *str_1 = "foo; for|* 1.234+\"@!`";
char *str_n = "bar; for|* 1.234+%\"@`";

to get a final string such as:

char *str_final = "foo; for|* 1.234+\"@!`bar; for|* 1.234+%\"@`"; // split?

Which delimiter could I use to properly split it?

Note that there could be more than 2 strings to concatenate.

I'm open to suggestions.

Thanks

Darrow11 asked Dec 12 '22 18:12


2 Answers

Because my comments kept getting longer and longer, here is a full answer:

Your char * buffer should store the length of the string in the first X bytes (like how Pascal does it). After that length comes the string data, which can contain any characters you like. After that, the next X bytes tell you the length of the next string. So on and so forth, until the end, which is delimited by an empty string (i.e. the last X bytes claim that the next string has zero length, and your application takes this as the signal to stop looking for more strings).
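A minimal sketch of that layout, assuming X = 1 (one length byte per string) to keep it short. The name pack_strings and its parameters are made up for illustration. Note that because a zero length marks the end of the list, this sketch cannot store an empty string in the middle, so it rejects one.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: packs `count` byte strings into `out` as
 * [len][data][len][data]...[0], using ONE length byte per string.
 * Returns the number of bytes written, or 0 if the buffer is too
 * small or a length is 0 or does not fit in one byte. */
size_t pack_strings(unsigned char *out, size_t cap,
                    const char *const *strs, const size_t *lens,
                    size_t count)
{
    size_t pos = 0;
    assert(out != NULL);
    for (size_t i = 0; i < count; i++) {
        /* 0 is reserved for the terminator; 255 is the 1-byte maximum. */
        if (lens[i] == 0 || lens[i] > 255) return 0;
        /* Need room for the prefix, the data, and the final 0 byte. */
        if (cap - pos < lens[i] + 2) return 0;
        out[pos++] = (unsigned char)lens[i];
        memcpy(out + pos, strs[i], lens[i]);
        pos += lens[i];
    }
    if (pos >= cap) return 0;
    out[pos++] = 0;  /* empty string marks the end of the list */
    return pos;
}
```

The caller passes the lengths explicitly, so the data itself is never scanned for a delimiter.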

One benefit is that you don't need to scan through the string data: finding the next string from the beginning of the current one takes O(1) time. Finding how many strings there are in your list takes O(n) time in the number of strings (not the number of bytes), which should still be very fast. If O(n) is unacceptable you can work around this, but I don't think that's worth getting into right now.
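To illustrate the O(1) skip and the O(n) count, here is a rough sketch, again assuming a one-byte length prefix and a well-formed, zero-terminated buffer. next_record and count_records are hypothetical names, and the code does no bounds checking, so it should only see trusted data.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: jumps from one record to the next in O(1)
 * by reading the length byte instead of scanning the payload. */
const unsigned char *next_record(const unsigned char *rec)
{
    assert(rec != NULL);
    return rec + 1 + rec[0];
}

/* Counts records by hopping from prefix to prefix until the
 * zero-length terminator: O(n) in the number of strings. */
size_t count_records(const unsigned char *buf)
{
    size_t n = 0;
    assert(buf != NULL);
    while (buf[0] != 0) {
        n++;
        buf = next_record(buf);
    }
    return n;
}
```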

Another benefit is that the string data can contain any character you like. This can be a con - if your string might contain the NUL character, you can safely extract it, but you have to be careful not to pass it to a C string function (like strlen() or strcat()), which will see the NUL character as the end of your data (which it may or may not be). You'll have to rely on memcpy() and pointer arithmetic.
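A small sketch of extracting one string with memcpy() rather than a C string function, assuming the same one-byte prefix. copy_record is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies the payload of one record into dst.
 * Returns the payload length, or (size_t)-1 if dst is too small.
 * The payload may contain NUL bytes; nothing is NUL-terminated. */
size_t copy_record(const unsigned char *rec, unsigned char *dst, size_t cap)
{
    size_t len;
    assert(rec != NULL && dst != NULL);
    len = rec[0];
    if (len > cap) return (size_t)-1;
    memcpy(dst, rec + 1, len);  /* copies all len bytes, NULs included */
    return len;
}
```

For a payload such as 'a', 'b', NUL, 'c', 'd', the function reports 5 bytes, whereas strlen() on the copied data would stop at the embedded NUL.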

The issue is the value of X (the number of bytes you use to store the string length). The easiest would be 1, which would bypass all endianness and alignment issues, but would limit your strings to 255 characters. If this is a limitation you can live with, excellent, but 255 seems a little low to me.

X could be 2 or 4 bytes, but you would need to make sure you have an (unsigned) data type that is at least that many bytes (stdint.h's uint16_t or uint32_t, or maybe uint_least16_t or uint_least32_t). A better solution would be to make X = sizeof(size_t), since the size_t type is guaranteed to be able to store the length of any string you could want to store.

Having X > 1 introduces alignment and, if network portability is an issue, endianness. The simplest way to read the first X bytes as a size_t variable would be to cast your char * data to a size_t * and just dereference. However, unless you can guarantee that your char * data is aligned properly, this will break on some systems. Even if you do guarantee the alignment of your char * data, you'll have to waste a few bytes at the end of most strings to make sure the next string's length value is aligned.
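As an aside that is not part of the approach described here: if you only care about alignment and not about endianness, copying the bytes out with memcpy() is a common way to avoid the unaligned dereference. A minimal sketch, with read_native_size as a hypothetical name:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: reads a size_t stored in the HOST's byte
 * order from a possibly misaligned address. memcpy has no alignment
 * requirement, unlike dereferencing a cast size_t pointer. */
size_t read_native_size(const unsigned char *p)
{
    size_t value;
    assert(p != NULL);
    memcpy(&value, p, sizeof value);
    return value;
}
```

This still reads whatever byte order and width the writing machine used, so it does nothing for portability between systems.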

The easiest way to overcome alignment is to manually convert the first sizeof(size_t) bytes to a size_t value. You'll have to decide if you want the data to be stored little- or big-endian. Most computers will be little-endian natively, but for a manual conversion this won't matter - just pick one. The number 65538 (2 ^ 16 + 2) stored in 4 bytes, big-endian, looks like { 0, 1, 0, 2 }; little-endian, { 2, 0, 1, 0 }.

Once you've decided that (it doesn't matter, pick whichever one you like), you just cast each of the first X bytes of data to unsigned char, then to size_t, then shift each one left by the appropriate number of bits to put it in the proper place, then add them all together. In the big-endian example above, 0 would be multiplied by 2 ^ 24, 1 by 2 ^ 16, 0 by 2 ^ 8, and 2 by 2 ^ 0 (or 1), producing 0 + 65536 + 0 + 2 or 65538. There probably will be zero efficiency difference between big- and little-endian if you're doing the manual conversion - I want to point out (again) that the choice is entirely arbitrary as far as I can tell.
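The manual conversion can be sketched as a loop that shifts the accumulated value left by 8 bits and ORs in the next byte. The function names are made up, and width plays the role of X. With these, the big-endian bytes { 0, 1, 0, 2 } and the little-endian bytes { 2, 0, 1, 0 } both decode to 2 ^ 16 + 2.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: decodes `width` bytes, most significant first. */
size_t read_be_size(const unsigned char *p, size_t width)
{
    size_t value = 0;
    assert(p != NULL && width <= sizeof(size_t));
    for (size_t i = 0; i < width; i++) {
        value = (value << 8) | (size_t)p[i];
    }
    return value;
}

/* Hypothetical helper: decodes `width` bytes, least significant first,
 * by walking the bytes from the last one back to the first. */
size_t read_le_size(const unsigned char *p, size_t width)
{
    size_t value = 0;
    assert(p != NULL && width <= sizeof(size_t));
    for (size_t i = width; i > 0; i--) {
        value = (value << 8) | (size_t)p[i - 1];
    }
    return value;
}
```

Because the bytes are read one at a time through an unsigned char pointer, neither function cares how the buffer is aligned or what the host's native byte order is.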

Doing a manual conversion avoids alignment issues, and completely bypasses concerns about cross-system endianness, so data transferred from a little-endian computer to a big-endian one will be read the same. There is still a potential problem about data being transferred from a system where sizeof(size_t) == 4 to one where sizeof(size_t) == 8. If this is a problem, you can either a) ditch size_t and choose an invariant size, or b) encode (a single byte is all you need) the value of sizeof(size_t) for the sender as the first byte of data, and have the receiver make any necessary adjustments. Choice a) may be easier, but may cause problems (what if you pick a size too low to account for legacy computers on your network, and as they're phased out you start running out of room to store your data?), so I would prefer choice b) since it scales with whatever system you're running (16-bit, 32-bit, 64-bit, maybe even in the future 128-bit), but that kind of effort may not be necessary for you.
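A rough sketch of choice b), assuming the stream starts with one byte holding the sender's sizeof(size_t) and every length after it is stored big-endian at that width. write_be and read_len are hypothetical names. The reader rejects any length that does not fit in its own size_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes `value` as `width` big-endian bytes. */
void write_be(unsigned char *p, uintmax_t value, size_t width)
{
    assert(p != NULL);
    for (size_t i = 0; i < width; i++) {
        p[width - 1 - i] = (unsigned char)(value & 0xFF);
        value >>= 8;
    }
}

/* Hypothetical helper: reads a length stored in `width` big-endian
 * bytes, where `width` is the sender's sizeof(size_t) taken from the
 * first byte of the stream. Returns 0 on success, -1 if the width is
 * unsupported or the value does not fit in the receiver's size_t. */
int read_len(const unsigned char *p, size_t width, size_t *out)
{
    uintmax_t value = 0;
    assert(p != NULL && out != NULL);
    if (width > sizeof(uintmax_t)) return -1;
    for (size_t i = 0; i < width; i++) {
        value = (value << 8) | p[i];
    }
    if (value > SIZE_MAX) return -1;
    *out = (size_t)value;
    return 0;
}
```

A sender would write (unsigned char)sizeof(size_t) as the first byte and then call write_be with that width for each length.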

</vomit> I leave it to the reader to sort out all that mess I just wrote.

Chris Lutz answered Jan 07 '23 04:01


Perhaps you could encode the length of the string followed by a special character in front of every string? This way you don't have to worry about what characters are in the next N characters. It may be a good idea to null terminate each substring as well.

The one advantage of this approach is that you'll be able to parse through the string quite fast.
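One way to read this suggestion is a decimal length followed by a ':' in front of each string, e.g. 3:a;b5:x:y|z. Both the ':' and the name parse_field are assumptions for illustration. This sketch trusts that the payload really is as long as the prefix claims, and it does not guard against overflow in the length.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: parses "<decimal length>:<payload>" at s.
 * On success stores the payload pointer and length and returns a
 * pointer just past the payload; returns NULL if the prefix is
 * malformed. The payload itself is never inspected. */
const char *parse_field(const char *s, const char **payload, size_t *len)
{
    size_t n = 0;
    assert(s != NULL && payload != NULL && len != NULL);
    if (*s < '0' || *s > '9') return NULL;
    while (*s >= '0' && *s <= '9') {
        n = n * 10 + (size_t)(*s - '0');
        s++;
    }
    if (*s != ':') return NULL;
    *payload = s + 1;
    *len = n;
    return s + 1 + n;
}
```

Since the parser skips exactly len bytes after the ':', the payload is free to contain digits, colons, or anything else.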

EDIT: An even better approach is to use the first 2-4 bytes as suggested by Chris in the comment below instead of an encoded length + special character.

GWW answered Jan 07 '23 05:01