How can I remove characters that are not supported by MySQL's utf8 character set?

Tags: mysql, utf-8, perl, utf8mb4

How can I remove characters from a string that are not supported by MySQL's utf8 character set? In other words, characters with four bytes, such as "𝜀", that are only supported by MySQL's utf8mb4 character set.

For example,

𝜀C = -2.4‰ ± 0.3‰; 𝜀H = -57‰

should become

C = -2.4‰ ± 0.3‰; H = -57‰

I want to load a data file into a MySQL table that has CHARSET=utf8.

asked Jan 10 '17 13:01

Matthias Munz

1 Answer

MySQL's utf8mb4 encoding is what the world calls UTF-8.

MySQL's utf8 encoding (an alias for utf8mb3) is a subset of UTF-8 that stores at most three bytes per character, so it only supports characters in the Basic Multilingual Plane (BMP), meaning U+0000 to U+FFFF inclusive.

So, the following will match the unsupported characters in question:

/[^\N{U+0000}-\N{U+FFFF}]/
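
To see that this character class lines up with the four-byte characters from the question, here is a minimal sketch that compares the UTF-8 byte length of two characters with the result of the match. The helper names has_non_bmp and utf8_length are made up for this illustration; Encode is a core module.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: 1 if the string contains a character outside the BMP, else 0.
sub has_non_bmp {
    my ($str) = @_;
    return $str =~ /[^\N{U+0000}-\N{U+FFFF}]/ ? 1 : 0;
}

# Hypothetical helper: number of bytes the string takes when encoded as UTF-8.
sub utf8_length {
    my ($str) = @_;
    return length(encode('UTF-8', $str));
}

my $epsilon = "\N{U+1D700}";   # MATHEMATICAL ITALIC SMALL EPSILON (outside the BMP)
my $permil  = "\N{U+2030}";    # PER MILLE SIGN (inside the BMP)

printf "U+1D700: %d bytes, non-BMP: %d\n", utf8_length($epsilon), has_non_bmp($epsilon);
printf "U+2030:  %d bytes, non-BMP: %d\n", utf8_length($permil),  has_non_bmp($permil);
# Prints:
# U+1D700: 4 bytes, non-BMP: 1
# U+2030:  3 bytes, non-BMP: 0
```

The per mille sign needs only three bytes, so it passes through untouched, while the mathematical epsilon needs four and is matched.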

Here are three different techniques you can use to clean your input:

1: Remove unsupported characters:

s/[^\N{U+0000}-\N{U+FFFF}]//g;

2: Replace unsupported characters with U+FFFD:

s/[^\N{U+0000}-\N{U+FFFF}]/\N{REPLACEMENT CHARACTER}/g;

3: Replace unsupported characters using a translation map:

my %translations = (
    "\N{MATHEMATICAL ITALIC SMALL EPSILON}" => "\N{GREEK SMALL LETTER EPSILON}",
    # ...
);

s{([^\N{U+0000}-\N{U+FFFF}])}{ $translations{$1} // "\N{REPLACEMENT CHARACTER}" }eg;

For example, here is a complete script that uses the third technique:

use utf8;                              # Source code is encoded using UTF-8
use open ':std', ':encoding(UTF-8)';   # Terminal and files use UTF-8.

use strict;
use warnings;
use 5.010;               # say, //
use charnames ':full';   # Not needed in 5.16+

my %translations = (
   "\N{MATHEMATICAL ITALIC SMALL EPSILON}" => "\N{GREEK SMALL LETTER EPSILON}",
   # ...
);

$_ = "𝜀C = -2.4‰ ± 0.3‰; 𝜀H = -57‰";
say;

s{([^\N{U+0000}-\N{U+FFFF}])}{ $translations{$1} // "\N{REPLACEMENT CHARACTER}" }eg;
say;

Output:

𝜀C = -2.4‰ ± 0.3‰; 𝜀H = -57‰
εC = -2.4‰ ± 0.3‰; εH = -57‰
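
Since the goal is to clean a data file before loading it, the first technique (plain removal, which gives the output asked for in the question) could be wrapped in a small filter. This is only a sketch: it assumes the data file is encoded as UTF-8, and the helper name strip_non_bmp and the file names in the usage line are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a copy of the string with all non-BMP characters removed.
sub strip_non_bmp {
    my ($str) = @_;
    $str =~ s/[^\N{U+0000}-\N{U+FFFF}]//g;
    return $str;
}

# Decode input and encode output as UTF-8 (assumption: the data file is UTF-8).
binmode(STDIN,  ':encoding(UTF-8)');
binmode(STDOUT, ':encoding(UTF-8)');

# Copy STDIN to STDOUT line by line, dropping unsupported characters.
while (my $line = <STDIN>) {
    print strip_non_bmp($line);
}
```

It would be run as something like perl strip_non_bmp.pl < data.txt > data_bmp.txt, and the resulting file could then be loaded into the CHARSET=utf8 table.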

answered Sep 28 '22 10:09

ikegami
