
pandoc complains about a UTF-8 decoding error even though my file is valid UTF-8

I am trying to convert a markdown file to PDF using pandoc on a Windows system. Since my markdown contains Chinese characters, I use the following command to produce the PDF:

pandoc --pdf-engine=xelatex -V CJKmainfont=KaiTi test.md -o test.pdf

But pandoc complains that the file contains non-UTF-8 characters that it cannot handle. The exact error message is:

Error producing PDF.
! Undefined control sequence.
pandoc.exe: Cannot decode byte '\xae': Data.Text.Internal.Encoding.streamDecodeUtf8With: Invalid UTF-8 stream

According to what I have found on the internet, this is largely due to the encoding of the markdown file and may have nothing to do with pandoc. My file contains a mix of Chinese and English characters, and I have converted it to UTF-8 encoding.

Things I have tried without success

Grep for non-UTF-8 characters

Following the instructions here and here, I have verified that the system locale is set to UTF-8. The output of localectl status is:

   System Locale: LANG=en_US.UTF-8
       VC Keymap: us
      X11 Layout: us

I tried to grep for non-UTF-8 characters with the command grep -axv '.*' test.md, but it printed nothing. (I took that to mean the file contains no bytes that fail to decode as UTF-8.)

Try to discard invalid characters

I followed the instructions here to try to remove non-UTF-8 characters from my file. The command I used is:

iconv -f utf-8 -t utf-8 -c test.md > output.md

After that, when I tried to convert output.md to PDF using pandoc, I still got the same error message, which suggests that the file still contains non-UTF-8 characters.

My question

How can I pinpoint which part of the file is causing the problem, or how can I really remove the non-UTF-8 characters from the file so that I can compile it without error?

Other information

  • You can find the markdown file here.

  • If you are using a Linux system, you may need to set CJKmainfont to another valid Chinese font name on your system.

jdhao asked Dec 23 '17 17:12


1 Answer

The problem is caused by unescaped backslashes in the markdown. Pandoc treats a backslash followed by text in markdown as a LaTeX command. Use the following command to generate the PDF:

pandoc -f markdown-raw_tex --pdf-engine=xelatex -V CJKmainfont=KaiTi test.md -o test.pdf

With this command the error disappears and the PDF file is generated successfully.
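If you would rather escape the backslashes than turn off raw_tex, it helps to find them first. The following is a rough sketch, not part of the original fix: the function name list_backslash_commands is made up for illustration, and the pattern (a backslash directly followed by an ASCII letter) is only an approximation of what pandoc may read as a LaTeX command. It can over-report, for example on backslashes inside code spans.

```shell
#!/bin/sh
# list_backslash_commands FILE   (hypothetical helper name)
# Prints "line-number:line" for every line in which a backslash is directly
# followed by an ASCII letter, e.g. "\reports" in "C:\reports".
# This is a heuristic: it does not know about code spans or code blocks.
list_backslash_commands() {
    grep -n '\\[A-Za-z]' "$1"
}

# Example: scan the markdown file from the question, if it is present.
if [ -f test.md ]; then
    list_backslash_commands test.md
fi
```

Each reported line can then be checked by hand, and the backslash either escaped as \\ or moved into a code span.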

Follow-ups

Thanks to the experts on tex.stackexchange, the cause has finally been found. Essentially, xelatex can produce an invalid UTF-8 sequence when it encounters an undefined control sequence while processing the tex file. For more information, see here and here.
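One way to convince yourself that the markdown source is not the culprit is to run iconv without the -c flag, so that it stops at the first invalid sequence instead of silently dropping it. This is a minimal sketch, assuming an iconv that exits with a non-zero status on invalid input (GNU iconv does); the function name is_valid_utf8 is made up for illustration.

```shell
#!/bin/sh
# is_valid_utf8 FILE   (hypothetical helper name)
# Succeeds (exit 0) if iconv can decode FILE as UTF-8, fails otherwise.
# Without -c, iconv aborts at the first invalid sequence rather than
# discarding it, so the exit status tells us whether the file is clean.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Example: check the markdown file from the question, if it is present.
if [ -f test.md ]; then
    if is_valid_utf8 test.md; then
        echo "test.md: valid UTF-8, the bad byte comes from a later stage"
    else
        echo "test.md: contains invalid UTF-8"
    fi
fi
```

If the source file passes this check and pandoc still reports a decoding error, the invalid bytes are coming from a later stage, which is consistent with the explanation above.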

Update 2017.12.29
With the release of Pandoc 2.0.6, this behaviour is handled better:

Allow lenient decoding of latex error logs, which are not always properly UTF8-encoded

Now it is easier to debug this kind of issue.

jdhao answered Oct 14 '22 21:10