Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Can I recover international characters mistakenly stored in a varchar field?

My client has an old MS SQL 2000 database that uses varchar(50) fields to store names. He tried to use this database to capture some data (via a web form). Some of the form-fillers are from other countries, and the varchar fields went nutty when some of these folks entered their names. Is it possible to recover the data somehow? Maybe by guessing what the character should be based on what it resolved to in ASCII/varchar and the country the person is from? Some of the data:

Name / Country / First or Last Name?
Jiří / CZE / F
Torbjörn / FIN / F
Huszár / HUN / L
Jürgen / DEU / F
Müller / CHE / L
Bumbálková / CZE / L
Doležal / CZE / L
Loïc / DEU / L

By the way, the web form specified this content-type:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
like image 574
Chris Avatar asked Oct 28 '08 00:10

Chris


People also ask

Can we store special characters in VARCHAR?

VARCHAR columns, as the name implies, store variable-length data. They can store characters, numbers, and special characters just like a CHAR column and can support strings up to 8000 bytes in size.

Can VARCHAR store Unicode characters?

Unicode Data Types. Data types nchar, nvarchar, and long nvarchar are used to store Unicode data. They behave similarly to char, varchar, and long varchar character types respectively, except that each character in a Unicode type typically uses 16 bits.

What is the difference between VARCHAR and nvarchar?

The key difference between varchar and nvarchar is the way they are stored, varchar is stored as regular 8-bit data(1 byte per character) and nvarchar stores data at 2 bytes per character. Due to this reason, nvarchar can hold upto 4000 characters and it takes double the space as SQL varchar.

How do I find non ascii characters in SQL?

ASCII character set is captured using regex [A-Za-z0-9]. You can use this regex in your query as shown below, to find non-ASCII characters. mysql> SELECT * FROM data WHERE full_name NOT REGEXP '[A-Za-z0-9]'; You can also customize regex to include certain characters.


1 Answers

Working from the 5th example.

à is ascii #195 (C3). ¼ is ascii #188 (BC).

I'd guess that Müller is meant to be Müller.

If this is UTF-8, based upon http://en.wikipedia.org/wiki/UTF-8#Description

We've got C3 BC = 1100 0011 1011 1100

Applying the UTF-8 mapping:

(110) 00011 (10) 11 1100

0000 0000 1111 1100

00FC which is Unicode ü

U+00FC (see http://en.wikipedia.org/wiki/Latin_characters_in_Unicode)

Seems to me that you could work through this programmatically.

Now solving the first example:

Jiå™ã was actually Jiří (The final character not shown).

Ignoring the Ji, which is correct,

C5 99 c3 AD

(110)0 0101 (10)01 1001 (110)0 0011 (10)10 1101

0159 00ED

ří

So the name is: Jiří. Wikipedia says that special r is Czech and so is the i. Furthermore if I google Jiří (http://www.google.com/search?q=Ji%C5%99%C3%AD&ie=utf-8&oe=utf-8) I get plenty of hits. We're on a winner here.

The second example, Torbjörn, maps nicely to Torbjörn which sounds convincing.

IMHO there's no great need for human checking of these, they seem to just work.

like image 193
8 revs, 2 users 98% Avatar answered Sep 21 '22 14:09

8 revs, 2 users 98%