
How do I convert Word smart quotes and em dashes in a string?

I have a form with a textarea. Users enter a block of text which is stored in a database.

Occasionally a user will paste text from Word containing smart quotes or em dashes. Those characters appear in the database as: –, ’, “, â€

What function should I call on the input string to convert smart quotes to regular quotes and em dashes to regular dashes?

I am working in PHP.

Update: Thanks for all of the great responses so far. The page on Joel's site about encodings is very informative: http://www.joelonsoftware.com/articles/Unicode.html

Some notes on my environment:

The MySQL database is using UTF-8 encoding. Likewise, the HTML pages that display the content are using UTF-8 (update: by explicitly setting the meta content-type).

On those pages the smart quotes and em dashes appear as a diamond with a question mark.

Solution:

Thanks again for the responses. The solution was twofold:

  1. Make sure the database and HTML files were explicitly set to use UTF-8 encoding.
  2. Use htmlspecialchars() instead of htmlentities().
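
As a rough illustration of step 2 (a sketch, not the asker's actual code): htmlspecialchars() escapes only the HTML-significant characters and passes multibyte UTF-8 characters through unchanged. The variable names and sample text below are hypothetical. Passing 'UTF-8' explicitly is a precaution, since PHP versions before 5.4 defaulted to ISO-8859-1 for these functions, which is likely why htmlentities() mangled the smart quotes.

```php
<?php
// Hypothetical sample input: UTF-8 bytes for a left and a right smart quote
// around some text, followed by a bit of markup.
$input = "\xE2\x80\x9CSmart quotes\xE2\x80\x9D <b>bold</b>";

// Escape only &, <, > and quote characters. The smart quotes are left
// as-is because the function is told the input is UTF-8.
$safe = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

echo $safe, "\n";
```

The smart quotes reach the browser as raw UTF-8 bytes, so the page must also declare UTF-8 (step 1) for them to render correctly.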
GloryFish asked Oct 06 '08 19:10



2 Answers

This sounds like a Unicode issue. Joel Spolsky has a good jumping-off point on the topic: http://www.joelonsoftware.com/articles/Unicode.html

theraccoonbear answered Sep 19 '22 12:09


The mysql database is using UTF-8 encoding. Likewise, the html pages that display the content are using UTF-8.

The content of the HTML can be in UTF-8, yes, but are you explicitly setting the content type (encoding) of your HTML pages (generated via PHP?) to UTF-8 as well? Try returning a Content-Type header of "text/html;charset=utf-8" or add a <meta> tag to your HTML pages:

<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/> 

That way, the browser will also submit form data to PHP in the same encoding.

I had a similar issue and adding the <meta> tag worked for me.
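
For the header option mentioned above, a minimal sketch, assuming the page is generated by a PHP script and that nothing has been sent to the browser before the call:

```php
<?php
// Declare the encoding in the HTTP response header. This must run
// before any output is written, or PHP cannot modify the headers.
header('Content-Type: text/html; charset=utf-8');
?>
<!DOCTYPE html>
<html>
<head>
<!-- The meta tag repeats the same declaration inside the document. -->
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Example page</title>
</head>
<body>
</body>
</html>
```

Sending both the header and the <meta> tag is redundant but harmless as long as they agree on the charset.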

Ates Goral answered Sep 23 '22 12:09