I'm not too familiar with the encoding that Microsoft Word uses. If someone were to save a .doc or .docx file from Word, what is the standard encoding that is used?
I'm guessing it's not UTF-8, as the resulting text (pasted into a UTF-8 encoded text file) does not honour certain punctuation (e.g. quotes).
For example, an opening Word 'smart quote', when pasted into a UTF-8 text file, results in an ì symbol. If Word does indeed encode in UTF-8, then how does Word attempt to render the actual UTF-8 character?
Edit
After doing a little digging, I can see that a Microsoft Word .docx file is actually a compressed format. Unzipping it results in a number of .xml files to be unpacked.
However, the inability of a UTF-8 encoded text file to honour these 'smart' quotes is still perplexing. Any enlightening information would be helpful.
An encoding converts a sequence of code points to a sequence of bytes. An encoding is typically used when writing text to a file. To read it back in we have to know how it was encoded and decode it back into memory. A text encoding is basically a file format for text files.
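As a minimal sketch of that round trip, the Python snippet below encodes a left curly quote (U+201C) as UTF-8, decodes it back, and then decodes the same bytes with a different encoding. The last part is only an assumption about where the ì in the question could come from (a Windows-1252 byte read as Mac OS Roman); the actual cause depends on the programs involved.

```python
# A left curly quote (U+201C), as Word's "smart quotes" feature would insert.
text = "\u201c"

utf8_bytes = text.encode("utf-8")        # code points -> bytes
round_trip = utf8_bytes.decode("utf-8")  # bytes -> code points, same encoding

# Decoding the same bytes with a different encoding gives different characters.
wrong_decode = utf8_bytes.decode("cp1252")

print(utf8_bytes.hex(" "))   # e2 80 9c
print(round_trip == text)    # True
print(wrong_decode)          # â€œ

# Assumed scenario, not confirmed by the question: the quote is written as a
# single Windows-1252 byte and then read back as Mac OS Roman.
cp1252_bytes = text.encode("cp1252")
mac_roman_view = cp1252_bytes.decode("mac_roman")
print(mac_roman_view)        # ì
```

The point is that the bytes never change; only the table used to interpret them does.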
Go to "File" -> "Options" -> "Advanced" and scroll down until the "General" section is reached. In the "General" section, check the box that says "Confirm file format conversion on open." Exit Word, and reopen the corrupt document again.
These days a docx file is really a bunch of compressed xml files. One of these files is the document.xml file, which starts with the following line (i.e. an xml prolog):
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
As you can see, the encoding is UTF-8.
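If you want to check the prolog yourself, something like the following Python sketch should do. To keep it self-contained it builds a tiny stand-in archive in memory (a real .docx contains many more parts); the `word/document.xml` path is where the main document part normally lives.

```python
import io
import zipfile

prolog = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'

# Hypothetical stand-in for a .docx: a zip archive holding one XML part.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("word/document.xml", prolog + "\n<document/>")

# Reopen the archive and read the first line of the main document part.
with zipfile.ZipFile(buffer) as archive:
    with archive.open("word/document.xml") as part:
        first_line = part.readline().decode("utf-8").strip()

print(first_line)
```

For a real document, pass the path of the .docx file to `zipfile.ZipFile` instead of the in-memory buffer.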
UTF-8 supports the full set of Unicode characters. Just for the sake of completeness, that does not mean that every Unicode character can actually be used in an xml file. Even a CDATA block has its limitations. But having said all that, storing an ` or an ì isn't a problem.
And more importantly, the file format does not really have anything to do with copy-paste behavior of the application itself.
Nevertheless, here's how Word would store an ` and an ì symbol.
A bit confusing, but I just realized that by "smart quote" you probably refer to the mechanism that Word has to represent the curly quotes. In my previous answer I thought you meant "backticks", which is a different thing. - Sorry for the confusion.
Well, anyway, here are the Unicode code points for these smart quotes: U+2018 (‘), U+2019 (’), U+201C (“) and U+201D (”).
Let's put them in a simple UTF-8 encoded text file. The result is not that spectacular:
U+2018 is encoded in UTF-8 as E2 80 98
U+2019 is encoded in UTF-8 as E2 80 99
U+201C is encoded in UTF-8 as E2 80 9C
U+201D is encoded in UTF-8 as E2 80 9D
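These byte sequences are easy to reproduce. A small Python sketch (the `utf8_hex` helper is just a name made up for this example):

```python
def utf8_hex(char: str) -> str:
    """Return the UTF-8 bytes of a character as upper-case hex pairs."""
    return char.encode("utf-8").hex(" ").upper()

smart_quotes = {
    "U+2018": "\u2018",  # left single quotation mark
    "U+2019": "\u2019",  # right single quotation mark
    "U+201C": "\u201c",  # left double quotation mark
    "U+201D": "\u201d",  # right double quotation mark
}

for label, char in smart_quotes.items():
    print(f"{label} is encoded in UTF-8 as {utf8_hex(char)}")
```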
So, I went one step further and put them in a Word file. I entered a line with regular quotes, and one with smart quotes.
" this is a test "
“ this is another test ”
And then I saved the thing and looked at how it was stored in Word's xml structure. And it is actually stored exactly as expected.
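To illustrate what "as expected" means here, the Python sketch below parses a hand-written fragment in the style of document.xml and reads the text of a run back out. The fragment is a hypothetical simplification, not Word's actual output; only the WordprocessingML namespace URI is the real one.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical, heavily simplified fragment; a real document.xml holds far more.
fragment = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    f'<w:document xmlns:w="{W_NS}"><w:body><w:p><w:r>'
    "<w:t>\u201c this is another test \u201d</w:t>"
    "</w:r></w:p></w:body></w:document>"
).encode("utf-8")

root = ET.fromstring(fragment)
run_text = root.find(f".//{{{W_NS}}}t").text

# The smart quotes come back as ordinary Unicode characters.
print(run_text)
print(run_text.encode("utf-8")[:3].hex(" "))   # e2 80 9c
```

In other words, the curly quotes are just regular characters in the XML text, stored with the same UTF-8 bytes as in a plain text file.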