
How do I fix "missing character" warnings when converting from docx to pdf using Pandoc and LaTeX?

Goal

I have several thousand Khmer-language .docx files and would like to convert them to .pdf format using Pandoc.

Background

I installed Pandoc using MacPorts. Pandoc requires LaTeX for PDF conversion, so I installed MacTeX. Installation appears to have gone properly, and I've been able to convert English-language .docx files into .pdf without difficulty.

Attempt 1

When I try to convert a Khmer-language file (you can find an example at https://briancroxall.net/pandoc/transcription.docx) to PDF, I use the following command:

pandoc transcription.docx  -s -o transcript.pdf

I receive the following error:

Error producing PDF.
! Package inputenc Error: Unicode character អ (U+17A2)
(inputenc)                not set up for use with LaTeX.

See the inputenc package documentation for explanation.
Type  H <return>  for immediate help.
 ...                                              

l.64 ...�នៅសម័យប៉ុល ពត។}

Try running pandoc with --pdf-engine=xelatex.

Attempt 2

Following this suggestion, I use this command:

pandoc --pdf-engine=xelatex transcription.docx  -s -o transcript.pdf

Pandoc then emits a warning for every Khmer character in the text:

[WARNING] Missing character: There is no អ in font [lmroman10-bold]:mapping=tex-text;!
[WARNING] Missing character: There is no ្ in font [lmroman10-bold]:mapping=tex-text;!
[WARNING] Missing character: There is no ន in font [lmroman10-bold]:mapping=tex-text;!
...

A PDF is produced by this process (see https://briancroxall.net/pandoc/transcript.pdf), but it is largely empty.

Issue

As best I can tell, this means the font that the LaTeX engine is using (lmroman10) has no glyphs for Khmer characters. Whether or not that is so, how can I get this file conversion to work?

Brian Croxall asked Mar 04 '20 22:03


1 Answer

mb21's comment helped me figure this out. Since my system has a couple of Khmer fonts installed, I had to set mainfont to use one of them.
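
To see which installed fonts claim Khmer coverage, something like the following may help. It is only a sketch: it assumes fontconfig's fc-list is on the PATH (it is not part of the setup described in the question and is not guaranteed to be present on macOS), and fonts_for_lang is a hypothetical helper name.

```shell
# Hypothetical helper: list the font families that fontconfig reports as
# covering a language. The default "km" is the language code for Khmer.
# Assumes fontconfig's fc-list is installed.
fonts_for_lang() {
  fc-list ":lang=${1:-km}" family | sort -u
}
```

Calling fonts_for_lang prints one family entry per line. Any of those names can be passed as mainfont, as in the following command: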

$ pandoc --pdf-engine=xelatex transcription.docx \
      -V 'mainfont:Khmer MN' -s -o transcription.pdf

This produces a PDF with Khmer characters and no error messages.
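
Since the goal is several thousand files, the same command can be wrapped in a loop. This is a minimal sketch, assuming all the .docx files sit in one directory and that Khmer MN is the font to use. convert_dir is a hypothetical helper name, not part of Pandoc.

```shell
# Hypothetical helper: run the conversion above on every .docx in a directory.
# Usage: convert_dir DIR [FONT]
# The body runs in a subshell, so its variables do not leak into the caller.
convert_dir() (
  dir="${1:-.}"
  font="${2:-Khmer MN}"      # assumed default; use whichever Khmer font is installed
  for f in "$dir"/*.docx; do
    [ -e "$f" ] || continue  # no matches: the glob stays literal, so skip it
    pandoc --pdf-engine=xelatex "$f" -V "mainfont:$font" -s -o "${f%.docx}.pdf" \
      || printf 'failed: %s\n' "$f" >&2   # report the file and keep going
  done
)
```

For example, convert_dir ./transcripts 'Khmer MN' would write each PDF next to its source file. A failed file is reported on standard error and the loop continues, which seems preferable to stopping partway through a large batch.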

The PDF still has one problem: some Khmer phrases run off the margin of the page. I think this is a segmentation issue. Khmer is written without spaces between words, so the line-break positions that Word works out on its own appear to get lost in the conversion to PDF.
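
One thing that may be worth trying for the run-off lines is XeTeX's \XeTeXlinebreaklocale primitive, which asks the engine to look for break points using the line-breaking rules of a given locale. The sketch below passes it to Pandoc through an included header file (-H). It is untested against these documents, so whether it actually fixes the margins is an assumption. convert_with_breaks and the 0pt plus 1pt stretch value are illustrative choices.

```shell
# Hypothetical helper: same conversion, plus a LaTeX header that enables
# locale-aware line breaking for Khmer ("km") in XeTeX.
# Usage: convert_with_breaks FILE.docx [FONT]
convert_with_breaks() (
  in="$1"
  font="${2:-Khmer MN}"          # assumed default font
  header=$(mktemp) || exit 1     # temporary file for the header lines
  cat > "$header" <<'EOF'
\XeTeXlinebreaklocale "km"
\XeTeXlinebreakskip = 0pt plus 1pt
EOF
  pandoc --pdf-engine=xelatex "$in" -V "mainfont:$font" \
    -H "$header" -s -o "${in%.docx}.pdf"
  status=$?
  rm -f "$header"                # clean up, then report pandoc's exit status
  exit "$status"
)
```

For example, convert_with_breaks transcription.docx would write transcription.pdf alongside the source file.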

Brian Croxall answered Oct 03 '22 13:10