Converting Microsoft Word special characters with PHP

I am trying to convert text pasted by users from Word, which contains the MS Word ellipsis and long dash, before processing it further.

I found an old proposed solution to the problem here: http://www.codingforums.com/archive/index.php/t-47163.html , but it does not work for me. After replacing the ellipsis, for example, the variable comes back empty. I have never seen anything like this before:

$src = "Long word dash – and weird Word ellipsis…";
$src = str_replace("‘", "'", $src);
$src = str_replace("’", "'", $src);
$src = str_replace("”", '"', $src);
$src = str_replace("“", '"', $src);
$src = str_replace("–", "-", $src);
$src = str_replace("…", "...", $src);
print $src;

Any ideas?

asked Sep 14 '11 15:09

giorgio79

2 Answers

For anyone getting the diamond question mark in PHP: replacing the characters by their UTF-8 byte sequences worked better for me than using the chr function, which only produces a single byte.

// Code point reference: www.fileformat.info/info/unicode/<NUM>/ (e.g. <NUM> = 2018)
$search = [
    "\xC2\xAB",     // « (U+00AB) in UTF-8
    "\xC2\xBB",     // » (U+00BB) in UTF-8
    "\xE2\x80\x98", // ‘ (U+2018) in UTF-8
    "\xE2\x80\x99", // ’ (U+2019) in UTF-8
    "\xE2\x80\x9A", // ‚ (U+201A) in UTF-8
    "\xE2\x80\x9B", // ‛ (U+201B) in UTF-8
    "\xE2\x80\x9C", // “ (U+201C) in UTF-8
    "\xE2\x80\x9D", // ” (U+201D) in UTF-8
    "\xE2\x80\x9E", // „ (U+201E) in UTF-8
    "\xE2\x80\x9F", // ‟ (U+201F) in UTF-8
    "\xE2\x80\xB9", // ‹ (U+2039) in UTF-8
    "\xE2\x80\xBA", // › (U+203A) in UTF-8
    "\xE2\x80\x93", // – (U+2013) in UTF-8
    "\xE2\x80\x94", // — (U+2014) in UTF-8
    "\xE2\x80\xA6"  // … (U+2026) in UTF-8
];

// Same order as $search: each entry replaces the sequence at the same index.
$replacements = [
    "<<",
    ">>",
    "'",
    "'",
    "'",
    "'",
    '"',
    '"',
    '"',
    '"',
    "<",
    ">",
    "-",
    "-",
    "..."
];

// str_replace returns the new string; it does not modify $string in place.
$string = str_replace($search, $replacements, $string);
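
To try this against the string from the question, here is a minimal sketch that wraps a subset of the pairs above in one function. The name normalizeWordPunctuation is made up for illustration, and it uses strtr with a single map instead of two parallel arrays so each sequence sits next to its replacement. The sample string is written with byte escapes, so the result does not depend on how the PHP file itself is saved.

```php
<?php
// Hypothetical helper (the name is not from the answer above).
// Maps UTF-8 byte sequences of common Word punctuation to plain ASCII.
function normalizeWordPunctuation(string $text): string
{
    $map = [
        "\xE2\x80\x98" => "'",   // U+2018 left single quote
        "\xE2\x80\x99" => "'",   // U+2019 right single quote
        "\xE2\x80\x9C" => '"',   // U+201C left double quote
        "\xE2\x80\x9D" => '"',   // U+201D right double quote
        "\xE2\x80\x93" => "-",   // U+2013 en dash
        "\xE2\x80\x94" => "-",   // U+2014 em dash
        "\xE2\x80\xA6" => "...", // U+2026 ellipsis
    ];

    // strtr with an array replaces every key with its value in one pass.
    return strtr($text, $map);
}

// The question's sample string, spelled out as UTF-8 bytes.
$src = "Long word dash \xE2\x80\x93 and weird Word ellipsis\xE2\x80\xA6";
echo normalizeWordPunctuation($src), "\n";
// prints: Long word dash - and weird Word ellipsis...
```

This only helps if the input really is UTF-8. If the pasted text arrives in a single-byte encoding, none of these sequences will match.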

answered Sep 22 '22 10:09

Verron Knowles

Hmm. I use this function for sanitizing text copied into a rich text editor (RTE). It may or may not work in this case. It converts to HTML entities, but you could tweak it to convert to regular characters instead:

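function convertFromCP1252($string)
{
    // Single-byte codes only: 201-213 are Mac Roman, 133-151 are Windows-1252.
    // Use this on single-byte input; the same byte values also occur inside
    // UTF-8 sequences, so running it on UTF-8 text would corrupt it.
    $search = array('&',
                    '<',
                    '>',
                    '"',
                    chr(212), // Mac Roman left single quote
                    chr(213), // Mac Roman right single quote
                    chr(210), // Mac Roman left double quote
                    chr(211), // Mac Roman right double quote
                    chr(208), // Mac Roman en dash
                    chr(209), // Mac Roman em dash
                    chr(201), // Mac Roman ellipsis
                    chr(145), // CP1252 left single quote
                    chr(146), // CP1252 right single quote
                    chr(147), // CP1252 left double quote
                    chr(148), // CP1252 right double quote
                    chr(150), // CP1252 en dash
                    chr(151), // CP1252 em dash
                    chr(133), // CP1252 ellipsis
                    chr(194)  // 0xC2, removed outright
                );

    // Same order as $search.
    $replace = array(   '&amp;',
                        '&lt;',
                        '&gt;',
                        '&quot;',
                        '&#8216;',
                        '&#8217;',
                        '&#8220;',
                        '&#8221;',
                        '&#8211;',
                        '&#8212;',
                        '&#8230;',
                        '&#8216;',
                        '&#8217;',
                        '&#8220;',
                        '&#8221;',
                        '&#8211;',
                        '&#8212;',
                        '&#8230;',
                        ''
                    );

    return str_replace($search, $replace, $string);
}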
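
If you would rather end up with regular characters than entities, one option is to bring the input to UTF-8 first and then replace the UTF-8 sequences. A minimal sketch, assuming the mbstring extension is loaded and that any input that is not valid UTF-8 is Windows-1252 (it does not handle the Mac Roman codes from the function above). The helper name toUtf8 is made up for illustration.

```php
<?php
// Hypothetical helper: returns UTF-8 whether the input arrived as
// UTF-8 or as Windows-1252. Assumes mbstring is available.
function toUtf8(string $text): string
{
    // Valid UTF-8 is passed through untouched.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }

    // Otherwise treat the bytes as Windows-1252 (an assumption, not a detection).
    return mb_convert_encoding($text, 'UTF-8', 'Windows-1252');
}

// chr(150) and chr(133) are the CP1252 en dash and ellipsis.
$pasted = "dash " . chr(150) . " ellipsis" . chr(133);
$utf8 = toUtf8($pasted);

// After conversion the UTF-8 sequences can be replaced safely.
echo str_replace(["\xE2\x80\x93", "\xE2\x80\xA6"], ["-", "..."], $utf8), "\n";
// prints: dash - ellipsis...
```

The validity check matters because converting text that is already UTF-8 a second time would turn each multi-byte character into several garbage characters.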

answered Sep 25 '22 10:09

christopher_b
