
Storing a string as UTF8 in C#

I'm doing a lot of string manipulation in C#, and really need the strings to be stored one byte per character. This is because I need gigabytes of text simultaneously in memory and it's causing low memory issues. I know for certain that this text will never contain non-ASCII characters, so for my purposes, the fact that System.String and System.Char store everything as two bytes per character is both unnecessary and a real problem.

I'm about to start coding my own CharAscii and StringAscii classes - the string one will basically hold its data as byte[], and expose string manipulation methods similar to the ones that System.String does. However this seems a lot of work to do something that seems like a very standard problem, so I'm really posting here to check that there isn't already an easier solution. Is there for example some way I can make System.String internally store data as UTF8 that I haven't noticed, or some other way round the problem?

LondonPhantom asked Aug 27 '12 15:08



1 Answer

Well, you could create a wrapper that holds the data as UTF-8 bytes, converts pieces to System.String only as needed, and encodes them back to bytes for storage. For ASCII-only text, UTF-8 uses exactly one byte per character. The Encoding class will help you out here:

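    using System.Text;

    // Encode the string as UTF-8 bytes (one byte per ASCII character).
    var utf8 = Encoding.UTF8;
    byte[] utfBytes = utf8.GetBytes(myString);

    // Decode the bytes back into a System.String when needed.
    var myReturnedString = utf8.GetString(utfBytes);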
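To make the wrapper idea a bit more concrete, here is a minimal sketch, assuming the text really is ASCII-only as the question states. The class name AsciiString and its members are hypothetical, chosen for illustration; nothing like it ships with the framework. It keeps the text in a byte[] and decodes only the slice that is asked for.

```csharp
using System;
using System.Text;

// Hypothetical wrapper (AsciiString is an illustrative name, not a framework type).
// Stores ASCII-only text as one byte per character.
public sealed class AsciiString
{
    private readonly byte[] _bytes;

    public AsciiString(string value)
    {
        if (value == null) throw new ArgumentNullException(nameof(value));

        // Reject anything outside 7-bit ASCII so one byte per character always holds.
        foreach (char c in value)
        {
            if (c > 127)
                throw new ArgumentException("Non-ASCII character found.", nameof(value));
        }

        _bytes = Encoding.ASCII.GetBytes(value);
    }

    // Number of characters, which equals the number of bytes for ASCII text.
    public int Length => _bytes.Length;

    // Character at the given index, widened from its single byte.
    public char this[int index] => (char)_bytes[index];

    // Decode only the requested slice into a temporary System.String.
    public string Substring(int start, int length) =>
        Encoding.ASCII.GetString(_bytes, start, length);

    // Decode the whole buffer; this allocates a full two-byte-per-char string.
    public override string ToString() => Encoding.ASCII.GetString(_bytes);
}
```

With this sketch, new AsciiString("hello world").Substring(6, 5) decodes just those five bytes into a short-lived System.String, while the bulk of the data stays at one byte per character. Each decoded piece still costs two bytes per character for as long as it is alive, so the saving depends on keeping those temporary strings small.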
KeithS answered Sep 22 '22 12:09