Is it better to use char or unsigned char array for storing raw data?

Tags:

c++

arrays

c

char

When I need to buffer some raw data in memory, for example from a stream, is it better to use an array of char or of unsigned char? I have always used char, but at work they say unsigned char is better, and I don't know why.

553

asked Jun 12 '14 09:06

M310

4 Answers

UPDATE: C++17 introduced std::byte, which is more suited to "raw" data buffers than using any manner of char.
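As a rough illustration of what std::byte buys you, here is a minimal sketch assuming a C++17 compiler. The helper names xor_with_key and as_int are made up for this example; std::byte and std::to_integer come from the standard header <cstddef>.

```cpp
#include <cstddef>   // std::byte, std::to_integer
#include <vector>

// Hypothetical helper: XOR every byte of a raw buffer with a key.
// std::byte only supports bitwise operators, so arithmetic cannot be
// applied to the data by accident.
std::vector<std::byte> xor_with_key(std::vector<std::byte> data, std::byte key) {
    for (std::byte& b : data) {
        b ^= key;
    }
    return data;
}

// Hypothetical helper: turning a byte into a number has to be spelled out.
int as_int(std::byte b) {
    return std::to_integer<int>(b);
}
```

Something like `data[0] + 1` does not compile with std::byte, which is the point: the type says "bits", not "character" and not "number".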

For earlier C++ versions:

- unsigned char emphasises that the data is not "just" text
- if you've got what's effectively "byte" data from e.g. a compressed stream, a database table backup file, an executable image, a jpeg... then unsigned is appropriate for the binary-data connotation mentioned above
  - unsigned works better for some of the operations you might want to do on binary data, e.g. there are undefined and implementation-defined behaviours for some bit operations on signed types, and unsigned values can be used directly as indices in arrays
  - you can't accidentally pass an unsigned char* to a function expecting char* and have it operated on as presumed text
  - in these situations it's usually more natural to think of the values as being in the range 0..255; after all, why should the "sign" bit have a different kind of significance to the other bits in the data?
- if you're storing "raw data" that, at an application logic/design level, happens to be 8-bit numeric data, then by all means choose either unsigned or explicitly signed char as appropriate to your needs
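The "indices in arrays" point can be sketched as follows. The function name histogram is hypothetical, and the table size of 256 assumes 8-bit bytes.

```cpp
#include <array>
#include <cstddef>

// Hypothetical helper: count how often each byte value occurs in a buffer.
// Assumes 8-bit bytes, so every value fits in a 256-entry table.
std::array<std::size_t, 256> histogram(const unsigned char* data, std::size_t size) {
    std::array<std::size_t, 256> counts{};   // value-initialised to zero
    for (std::size_t i = 0; i < size; ++i) {
        // data[i] is always in 0..255, so it is a valid index as-is.
        // With a signed char buffer, a byte such as 0xE9 would be negative
        // and would index before the start of the table.
        ++counts[data[i]];
    }
    return counts;
}
```

With a char or signed char buffer you would have to remember a cast to unsigned char at every lookup.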

170

answered Oct 26 '22 15:10

Tony Delroy

As far as the structure of the buffer is concerned, there is no difference: in both cases you get an element size of one byte, mandated by the standard.

Perhaps the most important difference that you get is the behavior that you see when accessing the individual elements of the buffer, for example, for printing. With char you get implementation-defined signed or unsigned behavior; with unsigned char you always see unsigned behavior. This becomes important if you want to print the individual bytes of your "raw data" buffer.
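A small sketch of the printing difference. The two function names are invented for this example, and signed char stands in for a plain char that happens to be signed. Both functions have the same body; only the parameter type differs.

```cpp
#include <cstdio>
#include <string>

// Format one byte held in an unsigned char: always two hex digits.
std::string hex_unsigned(unsigned char b) {
    char out[16];
    std::snprintf(out, sizeof out, "%02x", static_cast<unsigned>(b));
    return out;
}

// Format one byte held in a signed char without masking it first.
// A byte such as 0xFF is stored as -1, and converting -1 to unsigned
// sets every bit, so the output is wider than two digits.
std::string hex_signed(signed char b) {
    char out[16];
    std::snprintf(out, sizeof out, "%02x", static_cast<unsigned>(b));
    return out;
}
```

With a 32-bit int, hex_unsigned(0xFF) yields "ff" while hex_signed(-1), the same bit pattern, yields "ffffffff".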

Another good alternative for buffers is the exact-width integer type uint8_t. Where it is available (the type is optional, but it exists on any platform with 8-bit bytes), it has the same width as unsigned char, its name requires less typing, and it tells the reader that you do not intend to use the individual elements of the buffer as character-based information.
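A minimal sketch of a uint8_t buffer in use; byte_sum is a hypothetical helper, not something from the answer.

```cpp
#include <cstdint>
#include <vector>

// If uint8_t exists at all, it occupies exactly one byte.
static_assert(sizeof(std::uint8_t) == sizeof(unsigned char),
              "uint8_t has the same size as unsigned char");

// Hypothetical helper: add up all bytes of a buffer.
// Every element is in 0..255, so the sum never needs sign handling.
std::uint32_t byte_sum(const std::vector<std::uint8_t>& buffer) {
    std::uint32_t sum = 0;
    for (std::uint8_t b : buffer) {
        sum += b;
    }
    return sum;
}
```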

answered Oct 26 '22 13:10

Sergey Kalinichenko

Internally, it is exactly the same: each element is a byte. The difference shows up when you operate on those values.

If your values are in the range [0,255] you should use unsigned char, but if they are in [-128,127] then you should use signed char.

Suppose you are using the first range (unsigned char); then you can perform the operation 100+100 and store the result, 200, in a buffer element. With signed char, 200 does not fit in the type, so storing the result gives you an unexpected value.
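The 100+100 case can be sketched like this (function names invented for the example). Note that the addition itself is carried out in int; the surprise happens when the result is stored back into a byte.

```cpp
// Add two bytes and store the result in a byte of the same type.
// In both functions a + b is computed as int and equals 200.
unsigned char add_unsigned(unsigned char a, unsigned char b) {
    return static_cast<unsigned char>(a + b);   // 200 fits in 0..255
}

signed char add_signed(signed char a, signed char b) {
    // 200 does not fit in -128..127. Before C++20 the result of this
    // conversion is implementation-defined; on two's complement
    // platforms (and by definition since C++20) it wraps to -56.
    return static_cast<signed char>(a + b);
}
```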

Depending on your compiler or machine type, char may be unsigned or signed by default (see "Is char signed or unsigned by default?"), so plain char has one of the two ranges described above.

If you are using this buffer just to store binary data without operating on it, there is no difference between using char and unsigned char.

EDIT

Note that you can even change the default signedness of char for the same machine and compiler using compiler flags. From the GCC documentation:

-funsigned-char Let the type char be unsigned, like unsigned char.

Each kind of machine has a default for what char should be. It is either like unsigned char by default or like signed char by default. Ideally, a portable program should always use signed char or unsigned char when it depends on the signedness of an object. But many programs have been written to use plain char and expect it to be signed, or expect it to be unsigned, depending on the machines they were written for. This option, and its inverse, let you make such a program work with the opposite default.

The type char is always a distinct type from each of signed char or unsigned char, even though its behavior is always just like one of those two.
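If code has to cope with either default, it can check the signedness of char at compile time. A small sketch follows; char_is_signed and byte_value are hypothetical names, while std::numeric_limits and std::is_same are standard.

```cpp
#include <limits>
#include <type_traits>

// True when plain char behaves like signed char with this compiler and
// these flags (for example, false under GCC's -funsigned-char).
constexpr bool char_is_signed = std::numeric_limits<char>::is_signed;

// Whatever the default is, char remains a type of its own.
static_assert(!std::is_same<char, signed char>::value,
              "char is distinct from signed char");
static_assert(!std::is_same<char, unsigned char>::value,
              "char is distinct from unsigned char");

// Hypothetical helper: read a plain char as a byte value in 0..255,
// independent of the default signedness.
constexpr int byte_value(char c) {
    return static_cast<unsigned char>(c);
}
```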

answered Oct 26 '22 14:10

Pablo Francisco Pérez Hidalgo

As @Pablo said in his answer, the key reason is that if you're doing arithmetic on the bytes, you'll get the 'right' answers if you declare the bytes as unsigned char: you want (in Pablo's example) 100 + 100 to add to 200; if you do that sum with signed char (which you might do by accident if char on your compiler is signed) there's no guarantee of that – you're asking for trouble.

Another important reason is that it can help document your code, if you're explicit about what datatypes are what. It's useful to declare

typedef unsigned char byte;

or even better

#include <stdint.h>
typedef uint8_t byte;

Using byte thereafter makes it that little bit clearer what your program's intent is. Depending on how paranoid your compiler is (-Wall is your friend), this might even cause a type warning if you give a byte* argument to a char* function argument, thus prompting you to think slightly more carefully about whether you're doing the right thing.
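A sketch of how such an alias keeps text-oriented and byte-oriented functions apart. The alias is called byte_t here only to avoid clashing with C++17's std::byte, and both function names are made up. In C a mismatch is typically reported as a warning; in C++ it is a hard error, which the static_assert below relies on.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

typedef std::uint8_t byte_t;   // hypothetical alias for raw bytes

// A text-oriented function: expects a NUL-terminated string.
std::size_t text_length(const char* s) {
    return std::strlen(s);
}

// A byte-oriented function: takes an explicit size, because a zero
// byte is ordinary data here and not a terminator.
std::size_t count_zero_bytes(const byte_t* data, std::size_t size) {
    std::size_t zeros = 0;
    for (std::size_t i = 0; i < size; ++i) {
        if (data[i] == 0) {
            ++zeros;
        }
    }
    return zeros;
}

// In C++ a byte_t* does not convert to char* implicitly, so calling
// text_length() on a raw buffer needs a deliberate cast.
static_assert(!std::is_convertible<byte_t*, char*>::value,
              "byte_t* cannot be passed where char* is expected");
```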

A 'character' is fundamentally a pretty different thing from a 'byte'. C happens to blur the distinction (because at C's level, in a mostly ASCII world, the distinction doesn't matter in many cases). This blurring isn't always helpful, but it's at least good intellectual hygiene to keep the difference clear in your head.

answered Oct 26 '22 14:10

Norman Gray
