
PDFBox: CPU at 100% while extracting text

Tags: java, pdfbox

I'm using PDFBox 2.0.1 to parse a PDF document like this:

        // fileContent is a byte[] holding the PDF; it must be (effectively) final
        // to be captured by the anonymous Runnable.
        for (int i = 0; i < 5; i++) {
            new Thread(new Runnable() {
                @Override
                public void run() {
                    // Each thread parses its own PDDocument from the shared bytes.
                    // try-with-resources closes the document and the stream.
                    try (InputStream in = new ByteArrayInputStream(fileContent);
                         PDDocument document = PDDocument.load(in)) {
                        PDFTextStripper stripper = new PDFTextStripper();
                        String content = stripper.getText(document).trim();
                        System.out.println(content);
                    } catch (IOException e) {
                        // run() cannot throw checked exceptions, so report it here.
                        e.printStackTrace();
                    }
                }
            }).start();
        }

Sometimes the CPU runs at 100% while PDFs are being parsed concurrently. The stack trace is as follows:

java.lang.Thread.State: RUNNABLE
at java.util.HashMap.get(HashMap.java:303)
at org.apache.pdfbox.pdmodel.font.encoding.GlyphList.toUnicode(GlyphList.java:231)
at org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:308)
at org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:273)
at org.apache.pdfbox.contentstream.PDFStreamEngine.showText(PDFStreamEngine.java:668)
at org.apache.pdfbox.contentstream.PDFStreamEngine.showTextStrings(PDFStreamEngine.java:609)
at org.apache.pdfbox.contentstream.operator.text.ShowTextAdjusted.process(ShowTextAdjusted.java:52)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:815)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:472)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:446)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
at org.apache.pdfbox.text.PDFTextStreamEngine.processPage(PDFTextStreamEngine.java:136)
at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
at org.apache.pdfbox.text.PDFTextStripper.getText(PDFTextStripper.java:227)

The relevant code in GlyphList.java is:

// Adobe Glyph List (AGL)
private static final GlyphList DEFAULT = load("glyphlist.txt", 4281);


/**
 * Returns the Unicode character sequence for the given glyph name, or null if there isn't any.
 *
 * @param name PostScript glyph name
 * @return Unicode character(s), or null.
 */
public String toUnicode(String name)
{
    if (name == null)
    {
        return null;
    }

    String unicode = nameToUnicode.get(name);
    if (unicode != null)
    {
        return unicode;
    }

    // separate read/write cache for thread safety
    unicode = uniNameToUnicodeCache.get(name);
    if (unicode == null)
    {
        // test if we have a suffix and if so remove it
        if (name.indexOf('.') > 0)
        {
            unicode = toUnicode(name.substring(0, name.indexOf('.')));
        }
        else if (name.startsWith("uni") && name.length() == 7)
        {
            // test for Unicode name in the format uniXXXX where X is hex
            int nameLength = name.length();
            StringBuilder uniStr = new StringBuilder();
            try
            {
                for (int chPos = 3; chPos + 4 <= nameLength; chPos += 4)
                {
                    int codePoint = Integer.parseInt(name.substring(chPos, chPos + 4), 16);
                    if (codePoint > 0xD7FF && codePoint < 0xE000)
                    {
                        LOG.warn("Unicode character name with disallowed code area: " + name);
                    }
                    else
                    {
                        uniStr.append((char) codePoint);
                    }
                }
                unicode = uniStr.toString();
            }
            catch (NumberFormatException nfe)
            {
                LOG.warn("Not a number in Unicode character name: " + name);
            }
        }
        else if (name.startsWith("u") && name.length() == 5)
        {
            // test for an alternate Unicode name representation uXXXX
            try
            {
                int codePoint = Integer.parseInt(name.substring(1), 16);
                if (codePoint > 0xD7FF && codePoint < 0xE000)
                {
                    LOG.warn("Unicode character name with disallowed code area: " + name);
                }
                else
                {
                    unicode = String.valueOf((char) codePoint);
                }
            }
            catch (NumberFormatException nfe)
            {
                LOG.warn("Not a number in Unicode character name: " + name);
            }
        }
        uniNameToUnicodeCache.put(name, unicode);
    }
    return unicode;
}

So, when toUnicode is called concurrently like this

GlyphList.DEFAULT.toUnicode(code)

a concurrency error can occur (note the uniNameToUnicodeCache field), and PDSimpleFont.toUnicode does exactly that.

However, it seems that nobody else has run into the same problem. I don't know whether what I said above is right or wrong. And if it really is a bug, has it been fixed?

villa asked Mar 27 '26 23:03



1 Answer

Reviewing the GlyphList class code, it becomes apparent that it has not been prepared for multi-threaded use. On the other hand, its DEFAULT instance is used as a singleton via getAdobeGlyphList, concurrently, by the text extraction code.

This can become an issue in its toUnicode(String) method if the documents in question use glyph names following the unofficial scheme uniXXXX or uXXXX (or, going by the code quoted in the question, any other name missing from the standard list), because in that case the method not only reads from the HashMap uniNameToUnicodeCache but also writes to it (caching the resolved name for quick lookup later).

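If such a write happens concurrently with another thread's read or write, the map's internal state can be corrupted. In the HashMap implementations before Java 8, two racing resizes can leave a cycle in a bucket's linked list, after which HashMap.get can spin forever; that would match the 100% CPU and the RUNNABLE thread inside HashMap.get in the stack trace. (A ConcurrentModificationException is not the symptom to expect here: it is raised by the map's iterators, not by plain get and put calls.)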

I'd propose changing GlyphList to either

  • not write to uniNameToUnicodeCache anymore, or
  • synchronize toUnicode(String) or, more precisely, the uniNameToUnicodeCache reads and writes therein, or
  • make uniNameToUnicodeCache a ConcurrentHashMap instead of a HashMap.

I would expect the third option to perform better than the second one.
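The third option could look roughly like the sketch below. This is not PDFBox code: GlyphNameCache and parse are hypothetical stand-ins for the cache part of GlyphList, and the parser is reduced to the uniXXXX and uXXXX forms. One detail to keep in mind is that ConcurrentHashMap rejects null values, whereas the quoted toUnicode may put a null result into the cache, so the sketch only caches successful lookups.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the cache part of GlyphList; not PDFBox code.
final class GlyphNameCache {
    // ConcurrentHashMap supports concurrent get/put without external locking.
    private final Map<String, String> uniNameToUnicodeCache = new ConcurrentHashMap<>();

    String toUnicode(String name) {
        if (name == null) {
            return null;
        }
        String unicode = uniNameToUnicodeCache.get(name);
        if (unicode == null) {
            unicode = parse(name);
            // ConcurrentHashMap does not accept null values,
            // so only successful lookups are cached.
            if (unicode != null) {
                uniNameToUnicodeCache.put(name, unicode);
            }
        }
        return unicode;
    }

    // Simplified parser (assumption): handles only uniXXXX and uXXXX.
    private static String parse(String name) {
        String hex;
        if (name.startsWith("uni") && name.length() == 7) {
            hex = name.substring(3);
        } else if (name.startsWith("u") && name.length() == 5) {
            hex = name.substring(1);
        } else {
            return null;
        }
        try {
            int codePoint = Integer.parseInt(hex, 16);
            // Code points in the surrogate range are not allowed.
            if (codePoint > 0xD7FF && codePoint < 0xE000) {
                return null;
            }
            return String.valueOf((char) codePoint);
        } catch (NumberFormatException e) {
            return null;
        }
    }
}
```

Two threads may still parse the same name at the same time, but they compute the same value, so the duplicate put is harmless. The price of skipping nulls is that unresolvable names are parsed again on every call.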

mkl answered Mar 31 '26 04:03



