
Convert extended ASCII character codes to UTF-8 byte codes

I'm trying to figure out how to URL-encode strings, character by character, when all I have are the extended ASCII codes.

For example, for codes below 128, that's pretty simple: the code for the char "?" is 63, which is 3F in base 16, so the URL encoding of the string "?" is "%3F".

Is it possible to do the same for char codes above 127? For instance, the code for "á" is 225 (E1 in base 16). Is it possible to get from there to the bytes %C3%A1, which constitute the URL encoding of "á"? If so, which operations need to be performed?

Edit: I should have been more specific: the character set is ISO Latin-1. I should also make it clearer that this question is about a formula or a way to do the conversion programmatically, not about how to URL-encode a char using some library in some language.

Diogo Franco asked Mar 08 '16 22:03




1 Answer

If your "extended ASCII" encoding is ISO-8859-1, then you're in luck. The first 256 Unicode code points (U+0000 to U+00FF; the code points themselves, not their UTF-8 encoding) match ISO-8859-1, i.e. á == U+00E1.
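As a minimal sketch of that identity in Python (the language is my choice, since the question names none, and the helper name `latin1_code_to_char` is hypothetical), the ISO-8859-1 code can be used directly as the Unicode code point:

```python
def latin1_code_to_char(code: int) -> str:
    """Map an ISO-8859-1 byte value (0..255) to its Unicode character.

    Assumption: the input really is ISO-8859-1, so the byte value and the
    Unicode code point are the same number.
    """
    if not 0 <= code <= 0xFF:
        raise ValueError(f"not a single-byte code: {code}")
    return chr(code)


ch = latin1_code_to_char(225)
# Decoding the raw byte as Latin-1 gives the same character.
assert bytes([225]).decode("latin-1") == ch
print(ch, f"U+{ord(ch):04X}")
```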

If you have any other encoding, then you're out of luck: the mapping of characters is arbitrary, so it requires a lookup table (a Rosetta stone) rather than a calculation.
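To illustrate why a table is needed, here is a small Python sketch that uses the codec tables shipped with the standard library as that lookup. Windows-1252 is only an example of "some other encoding", and `code_to_code_point` is a hypothetical helper name:

```python
def code_to_code_point(code: int, encoding: str) -> int:
    """Look up the Unicode code point for a single-byte code.

    The codec named by `encoding` acts as the lookup table; there is no
    arithmetic shortcut in general.
    """
    return ord(bytes([code]).decode(encoding))


# In ISO-8859-1 the code and the code point are the same number.
print(hex(code_to_code_point(0x80, "latin-1")))
# In Windows-1252 the same byte is the euro sign, far outside 0..255.
print(hex(code_to_code_point(0x80, "cp1252")))
```

Some single-byte encodings leave certain codes unassigned, in which case the decode step raises `UnicodeDecodeError`.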

Once you have a Unicode code point, you can relatively easily encode it to UTF-8 using the specification found in https://www.rfc-editor.org/rfc/rfc3629. Without a programming language defined in your question, it's out of scope to try to detail that conversion here.
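For illustration only, the part of the RFC 3629 scheme that matters for codes 0 to 255 can be sketched in Python as below. The function name `utf8_bytes` is hypothetical, and the sketch deliberately stops at two-byte sequences (code points up to U+07FF), which already covers every ISO-8859-1 code:

```python
def utf8_bytes(cp: int) -> bytes:
    """Encode a code point below U+0800 as UTF-8 (one or two bytes)."""
    if cp < 0:
        raise ValueError("code point must be non-negative")
    if cp < 0x80:
        # 0xxxxxxx: ASCII is encoded as itself.
        return bytes([cp])
    if cp < 0x800:
        # 110xxxxx 10xxxxxx: top 5 bits in the lead byte,
        # low 6 bits in the continuation byte.
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    raise ValueError("this sketch only handles code points below U+0800")


# 225 = 0b11100001 -> lead 0xC0 | 0b00011 = 0xC3, trail 0x80 | 0b100001 = 0xA1
print(utf8_bytes(225).hex().upper())
```

So for any code from 128 to 255, the first byte is always 0xC2 or 0xC3 and the second byte is the code with its two highest bits replaced by `10`.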

Percent encoding is then a matter of applying the percent-encoding specification to the resulting UTF-8 bytes.
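Putting the steps together, a hedged Python sketch might look like this. It assumes the input codes are ISO-8859-1, leaves the RFC 3986 "unreserved" characters unescaped (a design choice, not something the question requires), and uses the hypothetical name `percent_encode_latin1_code`:

```python
import string

# RFC 3986 unreserved characters, which never need escaping.
UNRESERVED = set((string.ascii_letters + string.digits + "-._~").encode("ascii"))


def percent_encode_latin1_code(code: int) -> str:
    """Percent-encode one ISO-8859-1 code via its UTF-8 bytes."""
    if not 0 <= code <= 0xFF:
        raise ValueError(f"not a single-byte code: {code}")
    # Latin-1 code == Unicode code point, so chr() gives the character,
    # and encoding it as UTF-8 gives the bytes to escape.
    utf8 = chr(code).encode("utf-8")
    return "".join(
        chr(b) if b in UNRESERVED else f"%{b:02X}"
        for b in utf8
    )


print(percent_encode_latin1_code(63))
print(percent_encode_latin1_code(225))
```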

Fortunately, most programming languages have a built-in or third-party library for this kind of conversion.
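In Python, for example, the standard library's `urllib.parse.quote` does the UTF-8 and percent-encoding steps in one call. The wrapper name `quote_latin1_codes` below is hypothetical, and the input is again assumed to be ISO-8859-1:

```python
from urllib.parse import quote


def quote_latin1_codes(codes: list[int]) -> str:
    """Percent-encode a sequence of ISO-8859-1 codes using the stdlib."""
    # Decode the raw bytes as Latin-1 to get text, then let quote()
    # encode it as UTF-8 and escape it. safe="" also escapes "/".
    text = bytes(codes).decode("latin-1")
    return quote(text, safe="")


print(quote_latin1_codes([225, 63]))
```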

Alastair McCormack answered Sep 29 '22 07:09