Convert all types of smart quotes with PHP

Tags: html, replace, php, unicode, double-quotes

I am trying to convert all types of smart quotes to regular quotes when working with text. However, the function I've put together below still seems to miss some quote characters and is not well designed.

Does anyone know how to properly get all quote characters converted?

    function convert_smart_quotes($string) {
        $quotes = array(
            "\xC2\xAB"     => '"', // « (U+00AB) in UTF-8
            "\xC2\xBB"     => '"', // » (U+00BB) in UTF-8
            "\xE2\x80\x98" => "'", // ‘ (U+2018) in UTF-8
            "\xE2\x80\x99" => "'", // ’ (U+2019) in UTF-8
            "\xE2\x80\x9A" => "'", // ‚ (U+201A) in UTF-8
            "\xE2\x80\x9B" => "'", // ‛ (U+201B) in UTF-8
            "\xE2\x80\x9C" => '"', // “ (U+201C) in UTF-8
            "\xE2\x80\x9D" => '"', // ” (U+201D) in UTF-8
            "\xE2\x80\x9E" => '"', // „ (U+201E) in UTF-8
            "\xE2\x80\x9F" => '"', // ‟ (U+201F) in UTF-8
            "\xE2\x80\xB9" => "'", // ‹ (U+2039) in UTF-8
            "\xE2\x80\xBA" => "'", // › (U+203A) in UTF-8
        );
        $string = strtr($string, $quotes);

        // Version 2
        $search = array(
            chr(145),
            chr(146),
            chr(147),
            chr(148),
            chr(151)
        );
        $replace = array("'", "'", '"', '"', ' - ');
        $string = str_replace($search, $replace, $string);

        // Version 3
        $string = str_replace(
            array('&#8216;', '&#8217;', '&#8220;', '&#8221;'),
            array("'", "'", '"', '"'),
            $string
        );

        // Version 4
        $search = array(
            '&lsquo;',
            '&rsquo;',
            '&ldquo;',
            '&rdquo;',
            '&mdash;',
            '&ndash;',
        );
        $replace = array("'", "'", '"', '"', ' - ', '-');
        $string = str_replace($search, $replace, $string);

        return $string;
    }

Note: This question is a complete query about the full gamut of quotes, including the "Microsoft" quotes asked about here. This is a "duplicate" in the same way that asking about all tire sizes is a "duplicate" of asking for a car tire size.

374

asked Nov 16 '13 23:11

Xeoncross

1 Answer

You need something like this (assuming UTF-8 input, and ignoring CJK (Chinese, Japanese, Korean)):

    $chr_map = array(
       // Windows codepage 1252
       "\xC2\x82" => "'", // U+0082⇒U+201A single low-9 quotation mark
       "\xC2\x84" => '"', // U+0084⇒U+201E double low-9 quotation mark
       "\xC2\x8B" => "'", // U+008B⇒U+2039 single left-pointing angle quotation mark
       "\xC2\x91" => "'", // U+0091⇒U+2018 left single quotation mark
       "\xC2\x92" => "'", // U+0092⇒U+2019 right single quotation mark
       "\xC2\x93" => '"', // U+0093⇒U+201C left double quotation mark
       "\xC2\x94" => '"', // U+0094⇒U+201D right double quotation mark
       "\xC2\x9B" => "'", // U+009B⇒U+203A single right-pointing angle quotation mark

       // Regular Unicode     // U+0022 quotation mark (")
                              // U+0027 apostrophe     (')
       "\xC2\xAB"     => '"', // U+00AB left-pointing double angle quotation mark
       "\xC2\xBB"     => '"', // U+00BB right-pointing double angle quotation mark
       "\xE2\x80\x98" => "'", // U+2018 left single quotation mark
       "\xE2\x80\x99" => "'", // U+2019 right single quotation mark
       "\xE2\x80\x9A" => "'", // U+201A single low-9 quotation mark
       "\xE2\x80\x9B" => "'", // U+201B single high-reversed-9 quotation mark
       "\xE2\x80\x9C" => '"', // U+201C left double quotation mark
       "\xE2\x80\x9D" => '"', // U+201D right double quotation mark
       "\xE2\x80\x9E" => '"', // U+201E double low-9 quotation mark
       "\xE2\x80\x9F" => '"', // U+201F double high-reversed-9 quotation mark
       "\xE2\x80\xB9" => "'", // U+2039 single left-pointing angle quotation mark
       "\xE2\x80\xBA" => "'", // U+203A single right-pointing angle quotation mark
    );
    $chr = array_keys  ($chr_map); // but: for efficiency you should
    $rpl = array_values($chr_map); // pre-calculate these two arrays
    $str = str_replace($chr, $rpl, html_entity_decode($str, ENT_QUOTES, "UTF-8"));
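The html_entity_decode() call is what makes the entity lists from the question (versions 3 and 4) unnecessary: named and numeric entities are first turned into UTF-8 characters, and the map then treats them like any other curly quote. The following is a minimal sketch of that two-step flow, assuming PHP 7.0 or later for the "\u{...}" escapes. The sample string and the reduced $map are made up for illustration.

```php
<?php
// Sample input mixing named and numeric entities (illustrative only).
$html = '&ldquo;Hi&rdquo; &#8216;there&#8217;';

// Step 1: entities become real UTF-8 characters.
$decoded = html_entity_decode($html, ENT_QUOTES, 'UTF-8');
echo bin2hex($decoded), "\n";

// Step 2: a reduced version of the answer's map turns them into ASCII quotes.
$map = array(
    "\u{2018}" => "'", "\u{2019}" => "'",
    "\u{201C}" => '"', "\u{201D}" => '"',
);
$plain = str_replace(array_keys($map), array_values($map), $decoded);
echo $plain, "\n"; // "Hi" 'there'

// Dashes are decoded as well, but a quote-only map leaves them untouched.
$dash = html_entity_decode('a&mdash;b', ENT_QUOTES, 'UTF-8');
var_dump($dash === "a\u{2014}b"); // bool(true)
```

Two things to keep in mind: html_entity_decode() also decodes &amp;amp;, &amp;lt; and similar entities, so it is best applied to text rather than to markup that still has to be rendered; and the dash replacements from the question need their own map entries if they are wanted.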

Here comes the background:

Every Unicode character belongs to exactly one "General Category", of which the ones that can contain quote characters are the following:

Ps "Punctuation, Open"
Pe "Punctuation, Close"
Pi "Punctuation, Initial quote (may behave like Ps or Pe depending on usage)"
Pf "Punctuation, Final quote (may behave like Ps or Pe depending on usage)"
Po "Punctuation, Other"

(these pages are handy for checking that you didn't miss anything - there is also an index of categories)

It is sometimes useful to match these categories in a Unicode-enabled regex.
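As a rough illustration of matching by category, the sketch below uses the /u modifier with \p{Pi} and \p{Pf} to list the initial and final quote characters in a sample string. It is meant for auditing input rather than for replacing; the sample text is made up, and the "\u{...}" escapes assume PHP 7.0 or later.

```php
<?php
// Sample text: curly double and single quotes plus a low-9 opening quote.
$text = "\u{201C}Hi\u{201D} \u{2018}x\u{2019} \u{201E}low\u{201C}";

// Pi and Pf are small categories, so they make a convenient audit pattern.
preg_match_all('/[\p{Pi}\p{Pf}]/u', $text, $m);
$found = array_map('bin2hex', $m[0]); // UTF-8 bytes of each match, as hex
print_r($found);

// U+201E (double low-9 quotation mark) is in Ps, so the pattern above misses it.
$in_ps   = preg_match('/\p{Ps}/u', "\u{201E}");
$in_pipf = preg_match('/[\p{Pi}\p{Pf}]/u', "\u{201E}");
var_dump($in_ps, $in_pipf); // int(1), int(0)
```

Widening the pattern to Ps, Pe and Po catches the low-9 quotes, but also brackets and a lot of other punctuation, which is why an explicit map is the safer tool for the actual replacement.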

Furthermore, Unicode characters have "properties", of which the one you are interested in is Quotation_Mark. Unfortunately, these are not accessible in a regex.

Wikipedia lists the characters that have the Quotation_Mark property. The definitive reference is PropList.txt on unicode.org, although it is a plain ASCII text file, so it gives code points and names rather than the characters themselves.

In case you need to translate CJK characters too, you only have to get their code points, decide their translation, and find their UTF-8 encoding, e.g., by looking it up in fileformat.info (e.g., for U+301E: http://www.fileformat.info/info/unicode/char/301e/index.htm).
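Looking up the UTF-8 bytes can also be done in PHP itself. The sketch below assumes PHP 7.0 or later, where "\u{...}" produces the UTF-8 encoding of a code point, and prints each key in the "\xNN" style used by $chr_map. The choice of characters and their translations in $cjk_map is hypothetical and only meant as an example.

```php
<?php
// Hypothetical translation choices for a few CJK quotation marks.
$cjk_map = array(
    "\u{300C}" => '"', // U+300C left corner bracket
    "\u{300D}" => '"', // U+300D right corner bracket
    "\u{301D}" => '"', // U+301D reversed double prime quotation mark
    "\u{301E}" => '"', // U+301E double prime quotation mark
);

// Print each entry with its key written as "\xNN" byte escapes.
foreach ($cjk_map as $char => $ascii) {
    $bytes = str_split(strtoupper(bin2hex($char)), 2);
    echo '"\x', implode('\x', $bytes), '" => ', var_export($ascii, true), ",\n";
}

// The map can be used directly with strtr().
$converted = strtr("\u{300C}quoted\u{300D}", $cjk_map);
echo $converted, "\n"; // "quoted"
```

The printed lines can be pasted into $chr_map, or the "\u{...}" keys can be kept as they are if the PHP version allows it.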

Regarding Windows codepage 1252: Unicode defines the first 256 code points to represent exactly the same characters as ISO-8859-1. However, ISO-8859-1 is often confused with Windows codepage 1252, so all browsers render the range 0x80-0x9F, which contains only control characters in ISO-8859-1, as if it were Windows codepage 1252. The table in the Wikipedia page lists the Unicode equivalents.
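The confusion is easy to reproduce. The sketch below, which assumes the iconv extension is available and knows the charset name 'CP1252', converts the same byte to UTF-8 under both interpretations. The first result is exactly the kind of key found in the "Windows codepage 1252" part of $chr_map.

```php
<?php
$byte = "\x93"; // left double quotation mark in Windows codepage 1252

// Read as ISO-8859-1, the byte is the control character U+0093.
$as_latin1 = iconv('ISO-8859-1', 'UTF-8', $byte);
echo bin2hex($as_latin1), "\n"; // c293

// Read as Windows codepage 1252, it is U+201C.
$as_cp1252 = iconv('CP1252', 'UTF-8', $byte);
echo bin2hex($as_cp1252), "\n"; // e2809c
```

Text that was really CP1252 but got converted as ISO-8859-1 therefore ends up with "\xC2\x93" where "\xE2\x80\x9C" was intended.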

Note: strtr() is often slower than str_replace(). Time it with your input and your PHP version. If it's fast enough, you can directly use a map like my $chr_map.
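A simple way to time both functions is sketched below. The reduced map and the repeated test sentence are made up; real measurements should use input that is typical for the application, and the numbers will differ between PHP versions and machines.

```php
<?php
$map = array(
    "\xE2\x80\x98" => "'", "\xE2\x80\x99" => "'",
    "\xE2\x80\x9C" => '"', "\xE2\x80\x9D" => '"',
);
$chr = array_keys($map);
$rpl = array_values($map);

// Made-up test input; replace with text typical for your application.
$input = str_repeat("\xE2\x80\x9CIt\xE2\x80\x99s fine,\xE2\x80\x9D she said. ", 20000);

$t0 = microtime(true);
$a  = strtr($input, $map);
$t1 = microtime(true);
$b  = str_replace($chr, $rpl, $input);
$t2 = microtime(true);

printf("strtr:       %.4f s\n", $t1 - $t0);
printf("str_replace: %.4f s\n", $t2 - $t1);
var_dump($a === $b); // both calls should produce the same string here
```

The two functions are not interchangeable in general: strtr() with an array scans the string once and never re-replaces its own output, while str_replace() applies the pairs one after another. For a map like this one, whose replacements are plain ASCII quotes, the results are the same.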

If you are not sure that your input is UTF-8 encoded, AND are willing to assume that if it's not, then it's ISO-8859-1 or Windows codepage 1252, then you can do this before anything else:

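    if (!preg_match('/^\\X*$/u', $str)) {
        // Not valid UTF-8: treat the input as ISO-8859-1 and convert it.
        // Note: utf8_encode() is deprecated as of PHP 8.2;
        // mb_convert_encoding($str, 'UTF-8', 'ISO-8859-1') does the same conversion.
        $str = utf8_encode($str);
    }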

Warning: in very rare cases this regex fails to detect a non-UTF-8 encoding. For example, "Gruß…" in CP-1252 is the byte string "Gru\xDF\x85", which looks like valid UTF-8 to this regex (\xDF\x85 decodes to U+07C5, the N'ko digit five). The regex can be slightly enhanced, but it can be shown that there is NO completely foolproof solution to the problem of encoding detection.
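The false positive can be reproduced directly. In the sketch below, looks_like_utf8() is a hypothetical helper name that wraps the same regex; the three test strings are valid UTF-8, CP-1252 that is correctly rejected, and the CP-1252 bytes from the warning.

```php
<?php
// Hypothetical helper wrapping the validity check from the answer.
function looks_like_utf8($str) {
    // With /u, preg_match() returns false when the subject is not valid UTF-8.
    return preg_match('/^\X*$/u', $str) === 1;
}

$valid    = looks_like_utf8("Gr\xC3\xBC\xC3\x9Fe"); // "Grüße" in UTF-8
$rejected = looks_like_utf8("Gr\xFC\xDFe");         // "Grüße" in CP-1252
$fooled   = looks_like_utf8("Gru\xDF\x85");         // "Gruß…" in CP-1252

var_dump($valid);    // bool(true)
var_dump($rejected); // bool(false)
var_dump($fooled);   // bool(true): the false positive
```

In the third case the bytes \xDF\x85 happen to form a well-formed two-byte UTF-8 sequence, so no purely structural check can tell the difference.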

If you want to normalize the range 0x80-0x9F that stems from Windows codepage 1252 to regular Unicode codepoints, you can do this (and remove the first part of the $chr_map above):

    $normalization_map = array(
       "\xC2\x80" => "\xE2\x82\xAC", // U+20AC Euro sign
       "\xC2\x82" => "\xE2\x80\x9A", // U+201A single low-9 quotation mark
       "\xC2\x83" => "\xC6\x92",     // U+0192 latin small letter f with hook
       "\xC2\x84" => "\xE2\x80\x9E", // U+201E double low-9 quotation mark
       "\xC2\x85" => "\xE2\x80\xA6", // U+2026 horizontal ellipsis
       "\xC2\x86" => "\xE2\x80\xA0", // U+2020 dagger
       "\xC2\x87" => "\xE2\x80\xA1", // U+2021 double dagger
       "\xC2\x88" => "\xCB\x86",     // U+02C6 modifier letter circumflex accent
       "\xC2\x89" => "\xE2\x80\xB0", // U+2030 per mille sign
       "\xC2\x8A" => "\xC5\xA0",     // U+0160 latin capital letter s with caron
       "\xC2\x8B" => "\xE2\x80\xB9", // U+2039 single left-pointing angle quotation mark
       "\xC2\x8C" => "\xC5\x92",     // U+0152 latin capital ligature oe
       "\xC2\x8E" => "\xC5\xBD",     // U+017D latin capital letter z with caron
       "\xC2\x91" => "\xE2\x80\x98", // U+2018 left single quotation mark
       "\xC2\x92" => "\xE2\x80\x99", // U+2019 right single quotation mark
       "\xC2\x93" => "\xE2\x80\x9C", // U+201C left double quotation mark
       "\xC2\x94" => "\xE2\x80\x9D", // U+201D right double quotation mark
       "\xC2\x95" => "\xE2\x80\xA2", // U+2022 bullet
       "\xC2\x96" => "\xE2\x80\x93", // U+2013 en dash
       "\xC2\x97" => "\xE2\x80\x94", // U+2014 em dash
       "\xC2\x98" => "\xCB\x9C",     // U+02DC small tilde
       "\xC2\x99" => "\xE2\x84\xA2", // U+2122 trade mark sign
       "\xC2\x9A" => "\xC5\xA1",     // U+0161 latin small letter s with caron
       "\xC2\x9B" => "\xE2\x80\xBA", // U+203A single right-pointing angle quotation mark
       "\xC2\x9C" => "\xC5\x93",     // U+0153 latin small ligature oe
       "\xC2\x9E" => "\xC5\xBE",     // U+017E latin small letter z with caron
       "\xC2\x9F" => "\xC5\xB8",     // U+0178 latin capital letter y with diaeresis
    );
    $chr = array_keys  ($normalization_map); // but: for efficiency you should
    $rpl = array_values($normalization_map); // pre-calculate these two arrays
    $str = str_replace($chr, $rpl, $str);
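A table like this can also be generated instead of typed, which is a handy cross-check against typos. The sketch below is an alternative to the hand-written map, not part of the original answer. It assumes the iconv extension is available and knows the charset name 'CP1252', and it skips the five bytes that codepage 1252 leaves undefined so that iconv never sees an invalid byte. The name $generated is made up.

```php
<?php
// The five bytes that Windows codepage 1252 leaves undefined.
$undefined = array(0x81, 0x8D, 0x8F, 0x90, 0x9D);

$generated = array();
for ($b = 0x80; $b <= 0x9F; $b++) {
    if (in_array($b, $undefined, true)) {
        continue;
    }
    // Key:   UTF-8 encoding of the control code point U+0080..U+009F.
    // Value: UTF-8 encoding of the character CP1252 assigns to that byte.
    $generated["\xC2" . chr($b)] = iconv('CP1252', 'UTF-8', chr($b));
}

echo count($generated), "\n";               // 27
echo bin2hex($generated["\xC2\x80"]), "\n"; // e282ac (Euro sign)
echo bin2hex($generated["\xC2\x93"]), "\n"; // e2809c (left double quotation mark)
```

The result can be compared with a hand-written map using a strict equality check, or used in its place with str_replace() or strtr().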

169

answered Oct 03 '22 06:10

Walter Tross
