How can I remove characters from a string that are not supported by MySQL's utf8 character set? In other words, I want to remove four-byte characters, such as "𝜀", which are only supported by MySQL's utf8mb4 character set.
For example,
𝜀C = -2.4‰ ± 0.3‰; 𝜀H = -57‰
should become
C = -2.4‰ ± 0.3‰; H = -57‰
I want to load a data file into a MySQL table that has CHARSET=utf8.
The bytes 0xC0, 0xC1, and 0xF5 through 0xFF are invalid UTF-8 code units. A UTF-8 code unit is 8 bits, so if by char you mean an 8-bit byte, these are the char values that never appear in UTF-8 encoded text.
The difference between utf8 and utf8mb4 is that the former can only store characters that take at most 3 bytes in UTF-8, while the latter can also store characters that take 4 bytes. In Unicode terms, utf8 can only store characters in the Basic Multilingual Plane, while utf8mb4 can store any Unicode character.
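To make the byte-count distinction concrete, here is a minimal sketch that encodes a few characters from the question as UTF-8 and reports how many bytes each one takes. The helper names `utf8_byte_length` and `fits_mysql_utf8` are made up for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Number of bytes one character occupies once encoded as UTF-8.
sub utf8_byte_length {
    my ($char) = @_;
    return length(encode('UTF-8', $char));
}

# True if the character is in the BMP (U+0000 to U+FFFF),
# which is the range MySQL's utf8 character set can store.
sub fits_mysql_utf8 {
    my ($char) = @_;
    return ord($char) <= 0xFFFF;
}

# "C", PLUS-MINUS SIGN, PER MILLE SIGN, MATHEMATICAL ITALIC SMALL EPSILON
for my $char ("C", "\x{B1}", "\x{2030}", "\x{1D700}") {
    printf "U+%04X  %d byte(s)  %s\n",
        ord($char), utf8_byte_length($char),
        fits_mysql_utf8($char) ? "fits utf8" : "needs utf8mb4";
}
```

Only the last character needs 4 bytes, and it is the only one outside the BMP.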
Similarly, here's the command to change the character set of a MySQL table from latin1 to utf8. Replace table_name with your table's name. mysql> ALTER TABLE table_name CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci; To convert to utf8mb4 (full UTF-8) instead, use CHARACTER SET utf8mb4 with a utf8mb4 collation such as utf8mb4_unicode_ci.
MySQL's utf8mb4 encoding is what the world calls UTF-8.
MySQL's utf8 encoding is a subset of UTF-8 that only supports characters in the BMP (meaning characters U+0000 to U+FFFF inclusive).
Reference
So, the following will match the unsupported characters in question:
/[^\N{U+0000}-\N{U+FFFF}]/
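Before changing anything, it can help to see which characters the pattern actually hits. The following is a rough sketch that lists each match with its code point and character offset; `find_unsupported` is a hypothetical helper name, and the input is assumed to be an already decoded character string.

```perl
use strict;
use warnings;

# Return one hash ref per character outside U+0000..U+FFFF,
# recording its code point and its character offset in the input.
sub find_unsupported {
    my ($text) = @_;
    my @hits;
    while ($text =~ /([^\N{U+0000}-\N{U+FFFF}])/g) {
        push @hits, {
            offset    => $-[1],    # start of capture group 1
            codepoint => sprintf("U+%04X", ord($1)),
        };
    }
    return @hits;
}

for my $hit (find_unsupported("a\x{1D700}b")) {
    printf "%s at character offset %d\n", $hit->{codepoint}, $hit->{offset};
}
# prints "U+1D700 at character offset 1"
```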
Here are three different techniques you can use to clean your input:
1: Remove unsupported characters:
s/[^\N{U+0000}-\N{U+FFFF}]//g;
2: Replace unsupported characters with U+FFFD:
s/[^\N{U+0000}-\N{U+FFFF}]/\N{REPLACEMENT CHARACTER}/g;
3: Replace unsupported characters using a translation map:
my %translations = (
"\N{MATHEMATICAL ITALIC SMALL EPSILON}" => "\N{GREEK SMALL LETTER EPSILON}",
# ...
);
s{([^\N{U+0000}-\N{U+FFFF}])}{ $translations{$1} // "\N{REPLACEMENT CHARACTER}" }eg;
For example,
use utf8; # Source code is encoded using UTF-8
use open ':std', ':encoding(UTF-8)'; # Terminal and files use UTF-8.
use strict;
use warnings;
use 5.010; # say, //
use charnames ':full'; # Not needed in 5.16+
my %translations = (
"\N{MATHEMATICAL ITALIC SMALL EPSILON}" => "\N{GREEK SMALL LETTER EPSILON}",
# ...
);
$_ = "𝜀C = -2.4‰ ± 0.3‰; 𝜀H = -57‰";
say;
s{([^\N{U+0000}-\N{U+FFFF}])}{ $translations{$1} // "\N{REPLACEMENT CHARACTER}" }eg;
say;
Output:
𝜀C = -2.4‰ ± 0.3‰; 𝜀H = -57‰
εC = -2.4‰ ± 0.3‰; εH = -57‰
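Since the goal is to load a data file, the first technique (plain removal) can also be applied to a whole file before handing it to MySQL. This is only a sketch under assumptions: the input is taken to be valid UTF-8, and the sub names and the file names in the usage comment are invented for illustration.

```perl
use strict;
use warnings;

# Technique 1: drop every character outside the BMP.
sub strip_non_bmp {
    my ($text) = @_;
    $text =~ s/[^\N{U+0000}-\N{U+FFFF}]//g;
    return $text;
}

# Copy $in_path to $out_path line by line, decoding and re-encoding
# as UTF-8 and removing non-BMP characters along the way.
sub clean_file {
    my ($in_path, $out_path) = @_;
    open my $in,  '<:encoding(UTF-8)', $in_path
        or die "Cannot read $in_path: $!";
    open my $out, '>:encoding(UTF-8)', $out_path
        or die "Cannot write $out_path: $!";
    while (my $line = <$in>) {
        print {$out} strip_non_bmp($line);
    }
    close $in;
    close $out or die "Cannot close $out_path: $!";
    return;
}

# Hypothetical usage: perl clean_utf8.pl data.txt data.clean.txt
clean_file(@ARGV) if @ARGV == 2;
```

With removal instead of the translation map, the sample line from the question comes out as C = -2.4‰ ± 0.3‰; H = -57‰, which is the form the question asks for.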