I have an ASCII file that contains an em dash (— or &#151; in HTML). The hex value is 0x97. When we pass this file through one application it arrives as UTF-8, and the character has been converted to 0xC297, which is &#151; in HTML. However, when we pass this file through a different application it converts the character to 0xE28094, or &#8212;.
What would cause these applications to convert these characters differently? Is it perhaps a code page setting?
&#151; is wrong. When you use numeric character references, the number refers to the Unicode code point. For numbers below 256 that is the same as the code point in ISO-8859-1. In 8859-1, character 151 is among the “C1 control codes”, and not a dash or any other visible character.
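As a quick check of the code points involved, here is a minimal Python sketch using the standard `unicodedata` module. The variable names are only illustrative.

```python
import unicodedata

c1 = chr(151)        # U+0097, the code point that &#151; refers to
dash = chr(0x2014)   # U+2014, the real em dash

# U+0097 is in general category "Cc" (control), not punctuation.
print(unicodedata.category(c1))

# U+2014 is the character actually named EM DASH.
print(unicodedata.name(dash))

# Decimal numeric character reference for the em dash.
print(f"&#{ord(dash)};")
```

So the reference to use in HTML is the one for code point 8212 (0x2014), not 151.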
The confusion arises because character 151 is a dash in Windows code page 1252 (Western European). Many people think cp1252 is the same thing as ISO-8859-1, but in reality it's not: the characters in the C1 range (128 to 159) are different.
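The difference can be reproduced with Python's built-in codecs. This is only a sketch of what the two applications are presumably doing, not their actual code: the same byte is decoded once as ISO-8859-1 and once as cp1252, then re-encoded as UTF-8.

```python
raw = b"\x97"  # the single byte found in the file

# Interpretation 1: ISO-8859-1 maps every byte to the same code point,
# so 0x97 becomes U+0097 (a C1 control character).
as_latin1 = raw.decode("iso-8859-1")

# Interpretation 2: Windows code page 1252 maps 0x97 to U+2014 (em dash).
as_cp1252 = raw.decode("cp1252")

print(as_latin1.encode("utf-8").hex())  # c297
print(as_cp1252.encode("utf-8").hex())  # e28094
```

These are exactly the two byte sequences described in the question.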
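The first application is reading your “ASCII” file* as ISO-8859-1, so byte 0x97 becomes the control character U+0097, which is 0xC2 0x97 in UTF-8. The second application is evidently reading it as cp1252, so the same byte becomes U+2014 (the em dash), which is 0xE2 0x80 0x94 in UTF-8. The file is probably really cp1252, so you'll need a way to clue the first app in about what encoding it has to expect.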
(*: “ASCII” is a misnomer if there are top-bit-set characters in the file. You probably mean “ANSI”, which is really also a misnomer, but one which has stuck in the Windows world to mean “text encoded in the current system-default code page”.)
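If the application cannot be told which encoding to expect, one workaround is to convert the file to UTF-8 yourself before handing it over. The following is a rough sketch that assumes the file really is cp1252; the function name `convert_to_utf8` and its parameters are hypothetical, not part of either application.

```python
from pathlib import Path


def convert_to_utf8(src, dst, source_encoding="cp1252"):
    """Re-encode the file at src into UTF-8 and write it to dst.

    Assumes the input really is in source_encoding. Python's cp1252
    codec raises UnicodeDecodeError on the few bytes that code page
    leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D).
    """
    data = Path(src).read_bytes()        # raw bytes, no newline translation
    text = data.decode(source_encoding)  # 0x97 -> U+2014 under cp1252
    Path(dst).write_bytes(text.encode("utf-8"))
```

Working on bytes rather than in text mode keeps line endings untouched, so only the character encoding changes.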