
What is the difference between EM Dash #151; and #8212;?

I've an ASCII file that contains an EM Dash (&#151; or &#8212; in HTML). The hex value is 0x97. When we pass this file through one application it arrives as UTF-8, and it converts the character to 0xC297, which is &#151; in HTML. However, when we pass this file through a different application it converts the character to 0xE28094, or &#8212;.

What would cause these applications to convert these characters differently? Is it perhaps a code page setting?

ilitirit asked Mar 10 '09 17:03


1 Answer

&#151; is wrong. When you use numeric character references, the number refers to the Unicode codepoint. For numbers below 256 that is the same as the codepoint in ISO-8859-1. In 8859-1, character 151 is amongst the “C1 control codes”, and not a dash or any other visible character.

The confusion arises because character 151 is a dash in Windows code page 1252 (Western European). Many people think cp1252 is the same thing as ISO-8859-1, but in reality it's not: the characters in the C1 range (128 to 159) are different.
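To make the difference concrete, here is a small Python sketch (Python is my choice for illustration; the question does not say what the two applications are written in). It decodes the single byte 0x97 both ways and re-encodes each result as UTF-8, which reproduces the two byte sequences from the question.

```python
import unicodedata

raw = b"\x97"  # the single byte found in the file

# Read as ISO-8859-1: every byte maps to the code point with the same number.
as_latin1 = raw.decode("iso-8859-1")
# Read as Windows code page 1252: 0x97 is defined as the em dash.
as_cp1252 = raw.decode("cp1252")

latin1_codepoint = ord(as_latin1)                 # 0x97, i.e. 151
cp1252_codepoint = ord(as_cp1252)                 # 0x2014, i.e. 8212
latin1_utf8 = as_latin1.encode("utf-8").hex()     # "c297"
cp1252_utf8 = as_cp1252.encode("utf-8").hex()     # "e28094"

# U+0097 is a control character (category Cc); U+2014 is named EM DASH.
latin1_category = unicodedata.category(as_latin1)
cp1252_name = unicodedata.name(as_cp1252)

print(latin1_codepoint, latin1_utf8, latin1_category)
print(cp1252_codepoint, cp1252_utf8, cp1252_name)
```

So the same input byte ends up as 0xC297 or 0xE28094 depending only on which encoding the reader assumed.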

The first application is reading your “ASCII” file* as ISO-8859-1, but actually it's probably cp1252 and you'll need a way to clue the app in about what encoding it has to expect.
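How you tell the first application about the encoding depends on the application, which the question doesn't name. As a rough sketch of the idea in Python, the conversion step would name the source encoding explicitly instead of relying on a default. The helper names `to_utf8` and `to_char_refs` are made up for this example, and cp1252 as the source encoding is an assumption about your file.

```python
def to_utf8(raw: bytes, source_encoding: str = "cp1252") -> bytes:
    """Decode bytes with an explicitly named encoding, then re-encode as UTF-8.

    source_encoding defaults to cp1252 here only because that is the
    likely encoding of the file in the question (an assumption).
    """
    return raw.decode(source_encoding).encode("utf-8")


def to_char_refs(text: str) -> str:
    """Replace every non-ASCII character with a decimal numeric character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


sample = b"before \x97 after"

utf8_bytes = to_utf8(sample)                      # em dash becomes E2 80 94
html_text = to_char_refs(utf8_bytes.decode("utf-8"))

print(utf8_bytes)
print(html_text)
```

One caveat: Python's cp1252 codec leaves a handful of bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) undefined and raises `UnicodeDecodeError` on them, so a real converter would need to decide how to handle those.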

(*: “ASCII” is a misnomer if there are top-bit-set characters in the file. You probably mean “ANSI”, which is really also a misnomer, but one which has stuck in the Windows world to mean “text encoded in the current system-default code page”.)

bobince answered Sep 28 '22 01:09