I'm using PDFBox 2.0.1 to parse PDF documents like this:
for (int i = 0; i < 5; i++) {
    new Thread(new Runnable() {
        @Override
        public void run() {
            // fileContent is the byte[] holding the PDF; each thread loads its own PDDocument from it
            try (InputStream in = new ByteArrayInputStream(fileContent);
                 PDDocument document = PDDocument.load(in)) {
                PDFTextStripper stripper = new PDFTextStripper();
                String content = stripper.getText(document).trim();
                System.out.println(content);
            } catch (IOException e) {
                // run() cannot throw checked exceptions, so report the failure here
                e.printStackTrace();
            }
        }
    }).start();
}
Sometimes the CPU runs at 100% while PDFs are being parsed concurrently. The stack trace is as follows:
java.lang.Thread.State: RUNNABLE
at java.util.HashMap.get(HashMap.java:303)
at org.apache.pdfbox.pdmodel.font.encoding.GlyphList.toUnicode(GlyphList.java:231)
at org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:308)
at org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:273)
at org.apache.pdfbox.contentstream.PDFStreamEngine.showText(PDFStreamEngine.java:668)
at org.apache.pdfbox.contentstream.PDFStreamEngine.showTextStrings(PDFStreamEngine.java:609)
at org.apache.pdfbox.contentstream.operator.text.ShowTextAdjusted.process(ShowTextAdjusted.java:52)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:815)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:472)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:446)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
at org.apache.pdfbox.text.PDFTextStreamEngine.processPage(PDFTextStreamEngine.java:136)
at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
at org.apache.pdfbox.text.PDFTextStripper.getText(PDFTextStripper.java:227)
GlyphList.java code is:
// Adobe Glyph List (AGL)
private static final GlyphList DEFAULT = load("glyphlist.txt", 4281);

/**
 * Returns the Unicode character sequence for the given glyph name, or null if there isn't any.
 *
 * @param name PostScript glyph name
 * @return Unicode character(s), or null.
 */
public String toUnicode(String name)
{
    if (name == null)
    {
        return null;
    }
    String unicode = nameToUnicode.get(name);
    if (unicode != null)
    {
        return unicode;
    }
    // separate read/write cache for thread safety
    unicode = uniNameToUnicodeCache.get(name);
    if (unicode == null)
    {
        // test if we have a suffix and if so remove it
        if (name.indexOf('.') > 0)
        {
            unicode = toUnicode(name.substring(0, name.indexOf('.')));
        }
        else if (name.startsWith("uni") && name.length() == 7)
        {
            // test for Unicode name in the format uniXXXX where X is hex
            int nameLength = name.length();
            StringBuilder uniStr = new StringBuilder();
            try
            {
                for (int chPos = 3; chPos + 4 <= nameLength; chPos += 4)
                {
                    int codePoint = Integer.parseInt(name.substring(chPos, chPos + 4), 16);
                    if (codePoint > 0xD7FF && codePoint < 0xE000)
                    {
                        LOG.warn("Unicode character name with disallowed code area: " + name);
                    }
                    else
                    {
                        uniStr.append((char) codePoint);
                    }
                }
                unicode = uniStr.toString();
            }
            catch (NumberFormatException nfe)
            {
                LOG.warn("Not a number in Unicode character name: " + name);
            }
        }
        else if (name.startsWith("u") && name.length() == 5)
        {
            // test for an alternate Unicode name representation uXXXX
            try
            {
                int codePoint = Integer.parseInt(name.substring(1), 16);
                if (codePoint > 0xD7FF && codePoint < 0xE000)
                {
                    LOG.warn("Unicode character name with disallowed code area: " + name);
                }
                else
                {
                    unicode = String.valueOf((char) codePoint);
                }
            }
            catch (NumberFormatException nfe)
            {
                LOG.warn("Not a number in Unicode character name: " + name);
            }
        }
        uniNameToUnicodeCache.put(name, unicode);
    }
    return unicode;
}
So when the method is called like this:
GlyphList.DEFAULT.toUnicode(code)
a concurrency error can occur (note the variable uniNameToUnicodeCache), and PDSimpleFont.toUnicode does exactly that.
However, nobody else seems to have run into the same problem. I don't know whether what I said above is right or wrong. And if it really is a bug, has it been fixed?
Reviewing the GlyphList class code, it becomes apparent that it has not been prepared for multi-threaded use. On the other hand, its DEFAULT instance is used as a singleton, obtained via getAdobeGlyphList, concurrently by the text extraction code.
This can become an issue in its toUnicode(String) method if the documents in question use glyph names following the unofficial scheme uniXXXX or uXXXX, because in that case the method not only reads from the HashMap uniNameToUnicodeCache but also writes to it (adding the resolved unofficial glyph name for quick lookup later).
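If such a write happens concurrently with another thread's read from or write to the map, the internal structure of the HashMap can get corrupted, since HashMap makes no guarantees under unsynchronized concurrent modification. A known consequence, at least in older JDKs, is a get call that loops forever, which would match the 100% CPU usage and the RUNNABLE thread sitting in HashMap.get in your stack trace.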
I'd propose changing the GlyphList to either
1. no longer use uniNameToUnicodeCache, or
2. synchronize toUnicode(String), or more precisely the uniNameToUnicodeCache reads and writes therein, or
3. make uniNameToUnicodeCache a ConcurrentHashMap instead of a HashMap.
I would expect the third option to perform better than the second one.
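The ConcurrentHashMap variant could look roughly like the following. This is only a minimal sketch, not the actual PDFBox code: GlyphNameCache and parse are hypothetical names, and the parser is simplified to the uniXXXX and uXXXX forms with a single code point. One detail to keep in mind is that ConcurrentHashMap, unlike HashMap, does not accept null values, so the unconditional put from the original method would need a null check.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the caching part of GlyphList; not PDFBox code.
final class GlyphNameCache {
    // ConcurrentHashMap supports concurrent reads and writes without external locking.
    private final Map<String, String> uniNameToUnicodeCache = new ConcurrentHashMap<>();

    public String toUnicode(String name) {
        if (name == null) {
            return null;
        }
        String unicode = uniNameToUnicodeCache.get(name);
        if (unicode == null) {
            unicode = parse(name);
            // ConcurrentHashMap rejects null values, so only successful lookups are cached.
            if (unicode != null) {
                uniNameToUnicodeCache.put(name, unicode);
            }
        }
        return unicode;
    }

    // Simplified parser (assumption): handles only uniXXXX and uXXXX with one code point.
    private static String parse(String name) {
        String hex;
        if (name.startsWith("uni") && name.length() == 7) {
            hex = name.substring(3);
        } else if (name.startsWith("u") && name.length() == 5) {
            hex = name.substring(1);
        } else {
            return null;
        }
        try {
            int codePoint = Integer.parseInt(hex, 16);
            // Reject negative values and the surrogate range, as the original code does for surrogates.
            if (codePoint < 0 || (codePoint > 0xD7FF && codePoint < 0xE000)) {
                return null;
            }
            return String.valueOf((char) codePoint);
        } catch (NumberFormatException e) {
            return null;
        }
    }
}
```

Two threads may still resolve the same name at the same time, but both compute the same value, so the duplicate work is harmless and no locking is needed around the get and put pair.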