I am trying to convert a Markdown file to PDF using pandoc on Windows. Since my Markdown contains Chinese characters, I use the following command to produce the PDF:
pandoc --pdf-engine=xelatex -V CJKmainfont=KaiTi test.md -o test.pdf
But pandoc complains that the file contains non-UTF-8 characters that it cannot handle. The exact error message is:
Error producing PDF.
! Undefined control sequence.
pandoc.exe: Cannot decode byte '\xae': Data.Text.Internal.Encoding.streamDecodeUtf8With: Invalid UTF-8 stream
According to what I have found on the internet, this is largely due to the encoding of the Markdown file and may have nothing to do with pandoc. My file contains a lot of Chinese and English characters, and I have converted it to UTF-8 encoding.
Following the instructions here and here, I have verified that the system locale is set to UTF-8. The output of localectl status is:
System Locale: LANG=en_US.UTF-8
VC Keymap: us
X11 Layout: us
I tried to grep for non-UTF-8 characters using the command grep -axv '.*' test.md, but the command output nothing. (I took that to mean there are no invalid bytes that cannot be decoded as UTF-8.)
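A cross-check that does not depend on grep is to scan the raw bytes with a short Python script. The following is only a sketch and not part of the commands above; the function names find_invalid_utf8 and report_file are hypothetical, and the scan is written for clarity rather than speed.

```python
def find_invalid_utf8(data: bytes):
    """Return a list of (offset, byte_value) for bytes that break UTF-8 decoding."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            # Decode the remainder; success means no further invalid bytes.
            data[pos:].decode("utf-8")
            break
        except UnicodeDecodeError as err:
            # err.start is relative to the slice, so shift it by pos.
            bad = pos + err.start
            problems.append((bad, data[bad]))
            pos = bad + 1
    return problems


def report_file(path):
    """Print every invalid byte in the file with its offset and line number."""
    with open(path, "rb") as fh:
        data = fh.read()
    for offset, byte in find_invalid_utf8(data):
        line = data.count(b"\n", 0, offset) + 1
        print(f"invalid byte 0x{byte:02x} at offset {offset} (line {line})")
```

Calling report_file("test.md") prints one line per invalid byte. If it prints nothing, the file itself is valid UTF-8 and the undecodable byte must be coming from another step of the toolchain.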
I followed the instructions here to try to remove non-UTF-8 characters from my file. The command I used is:
iconv -f utf-8 -t utf-8 -c test.md > output.md
After that, when I tried to convert output.md to PDF using pandoc, I still got the same error message, which suggests that the file still contains non-UTF-8 characters.
How can I pinpoint which part of the file is causing the problem, or how can I really remove the non-UTF-8 characters from the file so that I can compile it without error?
You can find the markdown file here.
If you are using a Linux system, you may need to set CJKmainfont to another valid Chinese font name on your system.
The problem is caused by unescaped backslashes in the Markdown. Pandoc treats a backslash followed by text in Markdown as a LaTeX command. Use the following command, which disables the raw_tex extension, to generate the PDF:
pandoc -f markdown-raw_tex --pdf-engine=xelatex -V CJKmainfont=KaiTi test.md -o test.pdf
With that, the error disappears and the PDF file is generated successfully.
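To see which lines contain such backslashes, a rough Python sketch like the one below can list them. It is a heuristic under stated assumptions, not pandoc's actual parsing rule: it flags any backslash followed by ASCII letters, including ones inside code spans or code blocks that pandoc would not treat as raw TeX. The names TEX_LIKE and find_tex_like are hypothetical.

```python
import re

# Assumption: a backslash followed by ASCII letters approximates what the
# raw_tex extension may read as a LaTeX command.
TEX_LIKE = re.compile(r"\\[A-Za-z]+")


def find_tex_like(text):
    """Return (line_number, match) pairs for backslash-letter sequences."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for m in TEX_LIKE.finditer(line):
            hits.append((lineno, m.group()))
    return hits
```

For example, reading test.md with open("test.md", encoding="utf-8") and passing its contents to find_tex_like gives a list of candidate lines to inspect and escape by hand.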
Thanks to the gurus on tex.stackexchange, the cause has finally been found. Essentially, xelatex can produce an invalid UTF-8 sequence when it encounters an invalid control sequence while processing TeX files. For more information, see here and here.
Update 2017.12.29
With the release of Pandoc 2.0.6, this behaviour is handled more gracefully:
Allow lenient decoding of latex error logs, which are not always properly UTF8-encoded
Now it is easier to debug this kind of issue.
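To illustrate what lenient decoding means in general, here is a minimal Python sketch. It is not pandoc's implementation (pandoc is written in Haskell), and the function name decode_leniently is hypothetical. A strict decoder aborts on the first bad byte, while a lenient one substitutes a replacement character so the rest of the log stays readable.

```python
def decode_leniently(raw: bytes) -> str:
    """Decode bytes as UTF-8, replacing undecodable bytes with U+FFFD."""
    return raw.decode("utf-8", errors="replace")


# A log line containing a stray 0xAE byte, similar to the error in the question.
log = b"! Undefined control sequence. \xae"
print(decode_leniently(log))
```

With strict decoding the same bytes would raise UnicodeDecodeError, which hides the real LaTeX error ("Undefined control sequence") behind an encoding complaint.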