
Character encoding of Microsoft Word DOC and DOCX files?

I'm not too familiar with the encoding that Microsoft Word uses. If someone were to save a .doc or .docx file from Word, what is the standard encoding that is used?

I'm guessing it's not UTF-8, as the resulting text (pasted into a UTF-8 encoded text file) does not honour certain punctuation (e.g. quotes).

For example, an opening Word 'smart quote' when pasted in a UTF-8 text file, results in an ì symbol. If Word does indeed encode in UTF-8, then how does Word attempt to render the actual UTF-8 character?

Edit

After doing a little digging, I can see that a Microsoft Word .docx file is actually a compressed format. Unzipping it yields a number of .xml files.

However, the inability of a UTF-8 encoded text file to honour these 'smart' quotes is still perplexing. Any enlightening information would be helpful.

shennan asked Jan 27 '15 13:01




1 Answer

These days a .docx file is really a bunch of compressed XML files. One of these files is document.xml, which starts with the following line (i.e. an XML prolog):

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

As you can see, the document is encoded in UTF-8.
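To check the prolog yourself, something along these lines should work. It is only a sketch: `document_xml_prolog` and `make_minimal_docx` are names made up for this example, and the in-memory archive is a stand-in for a real file (it contains only `word/document.xml`, so Word would not open it).

```python
import io
import zipfile


def document_xml_prolog(docx_file):
    """Return the first line of word/document.xml from a .docx (a zip archive)."""
    with zipfile.ZipFile(docx_file) as archive:
        raw = archive.read("word/document.xml")
    # The prolog is plain ASCII, so decoding the first line as UTF-8 is safe.
    return raw.split(b"\n", 1)[0].decode("utf-8").strip()


def make_minimal_docx():
    """Build a stand-in archive in memory (hypothetical helper, not a full .docx)."""
    xml = (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n'
        '<w:document xmlns:w="http://schemas.openxmlformats.org/'
        'wordprocessingml/2006/main"><w:body><w:p><w:r>'
        "<w:t>\u201c test \u201d</w:t>"
        "</w:r></w:p></w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("word/document.xml", xml.encode("utf-8"))
    buffer.seek(0)
    return buffer


prolog = document_xml_prolog(make_minimal_docx())
print(prolog)
```

For a real document, pass the path of the .docx file to `document_xml_prolog` instead of the in-memory buffer.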

EDIT

UTF-8 can encode the full set of Unicode characters. Just for the sake of completeness: that does not mean that every Unicode character can actually be used in an XML file (most control characters are not allowed). Even a CDATA block has its limitations. But having said all that, storing a ` or an ì isn't a problem.
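A quick way to see that limitation is to hand a few strings to an XML parser. This is a rough sketch using Python's standard-library parser; `is_well_formed` is a helper invented for the example, and U+0001 is just one arbitrary control character picked as a test case.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False if it rejects it."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


results = {
    "backtick": is_well_formed("<t>`</t>"),
    "i_grave": is_well_formed("<t>\u00ec</t>"),
    # U+0001 is a control character that XML 1.0 does not allow.
    "control_char": is_well_formed("<t>\x01</t>"),
    # Wrapping it in CDATA does not make it legal either.
    "control_char_in_cdata": is_well_formed("<t><![CDATA[\x01]]></t>"),
}

for name, ok in results.items():
    print(name, ok)
```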

And more importantly, the file format does not really have anything to do with copy-paste behavior of the application itself.

Nevertheless, here's how Word would store a ` and an ì symbol.

[Image: xml and hex]

CORRECTION

A bit confusing, but I just realized that by "smart quote" you probably refer to the mechanism that Word has to represent the curly quotes. In my previous answer I thought you meant "backticks", which is a different thing. - Sorry for the confusion.

Well, anyway, here are the Unicode code points for these smart quotes:

[Image: the UTF smart quotes]

Let's put them in a simple UTF-8 encoded text file. The result is not that spectacular:

  • U+2018 is encoded in UTF-8 as E2 80 98
  • U+2019 is encoded in UTF-8 as E2 80 99
  • U+201C is encoded in UTF-8 as E2 80 9C
  • U+201D is encoded in UTF-8 as E2 80 9D
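The byte sequences listed above can be reproduced with a few lines of Python. This is a minimal sketch; `utf8_hex` and `SMART_QUOTES` are names chosen for the example only.

```python
# The four smart-quote code points listed above.
SMART_QUOTES = ["\u2018", "\u2019", "\u201c", "\u201d"]


def utf8_hex(char):
    """Return the UTF-8 bytes of a character as space-separated hex pairs."""
    return " ".join(f"{byte:02X}" for byte in char.encode("utf-8"))


table = {f"U+{ord(char):04X}": utf8_hex(char) for char in SMART_QUOTES}

for code_point, hex_bytes in table.items():
    print(code_point, "is encoded in UTF-8 as", hex_bytes)
```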

So, I went one step further and put them in a Word file. I entered one line with regular quotes and one with smart quotes.

" this is a test "
“ this is another test ”

Then I saved the file and looked at how it was stored in Word's XML structure. It is stored exactly as expected.
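Since the stored bytes are fine, the ì from the question more likely comes from the paste step than from the file format. One plausible explanation, and this is a guess that nothing in the question confirms, is a mismatch between two legacy single-byte encodings: in Windows-1252 the opening double smart quote is the byte 93, and that same byte read as Mac OS Roman is ì. A small sketch of that assumption:

```python
opening_quote = "\u201c"  # the opening double smart quote

# Assumption: the clipboard text was written as Windows-1252 ...
as_cp1252 = opening_quote.encode("cp1252")

# ... and then read back as Mac OS Roman by the receiving program.
misread = as_cp1252.decode("mac_roman")

print(as_cp1252.hex().upper(), "->", misread)
```

If that is what happened, the fix is on the receiving side (paste or open the text with the right encoding), not in the .docx file.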


bvdb answered Sep 27 '22 19:09