Replacing a string in a PDF file using iText, but the letter X is not replaced

I'm trying to replace a piece of text in the content of a PDF, but the letter 'X' is not being replaced.

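import java.io.FileOutputStream;
import java.io.IOException;

import com.itextpdf.text.DocumentException;
import com.itextpdf.text.pdf.PRStream;
import com.itextpdf.text.pdf.PdfDictionary;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfObject;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfStamper;

public class ReplaceText {

    public static void main(String[] args) {

        String DEST = "/home/diego/Documentos/teste.pdf";

        try {
            PdfReader reader = new PdfReader("termoAdesaoCartao.pdf");
            // Fetch the content stream of page 1
            PdfDictionary dictionary = reader.getPageN(1);
            PdfObject object = dictionary.getDirectObject(PdfName.CONTENTS);
            if (object instanceof PRStream) {
                PRStream stream = (PRStream) object;
                byte[] data = PdfReader.getStreamBytes(stream);
                // Treat the stream as text and replace the placeholder
                stream.setData(new String(data).replace("Nome Completo", "A-B-C-D-E-F-G-H-I-J-K-L-M-N-O-P-Q-R-S-T-U-V-W-X-Y-Z").getBytes());
            }
            // Write the modified document to DEST
            PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(DEST));
            stamper.close();
            reader.close();
        } catch (IOException | DocumentException e) {
            e.printStackTrace();
        }

    }
}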

[Screenshot of the resulting PDF]

user3503888 asked Dec 12 '15 11:12


2 Answers

In general

Basically, the OP's approach cannot work in general. His code is built on two major misunderstandings:

  • He assumes that one can translate a complete content stream from byte[] to String (with all string operands of text-showing operators being legible) using a single character encoding.

    This assumption is wrong: each font may have its own encoding, so if multiple fonts are used on the same page, the same byte value in the string operands of different text-showing operators may represent completely different characters. In fact, fonts do not even need to contain a mapping to characters; they merely need to map numeric values to glyph-painting instructions.

    Cf. section 9.4.3 Text-Showing Operators in ISO 32000-1:

    A string operand of a text-showing operator shall be interpreted as a sequence of character codes identifying the glyphs to be painted.

    With a simple font, each byte of the string shall be treated as a separate character code. The character code shall then be looked up in the font’s encoding to select the glyph, as described in 9.6.6, "Character Encoding".

    With a composite font (PDF 1.2), multiple-byte codes may be used to select glyphs. In this instance, one or more consecutive bytes of the string shall be treated as a single character code. The code lengths and the mappings from codes to glyphs are defined in a data structure called a CMap,

    Simple PDF generators often just use the standard encodings (which are ASCII-like and may give rise to assumptions like the OP's), but there are more and more non-simple PDF generators out there...

  • He assumes he can simply edit the string operands of text-showing operators and the matching glyphs will be shown in the PDF viewer.

    This assumption is wrong: fonts usually support only a fairly limited character set, and a text-showing operator uses only a single font, the currently selected one. If one replaces a code in a string operand of such an operator with a different one for which the font has no matching glyph, one will at best see a gap!

    While complete fonts usually at least contain glyphs for all characters of a kind (e.g. Latin letters with all Western European variations thereof), PDF allows embedding fonts partially, cf. section 9.6.4 Font Subsets in ISO 32000-1:

    PDF documents may include subsets of Type 1 and TrueType fonts.

    This option is now often used to embed painting instructions only for the glyphs actually used in the existing text. Thus, one cannot count on an embedded font containing all characters of a kind just because it contains some of them. There may be a glyph for A and C but not for B.
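Both points can be illustrated with a toy model that uses only the JDK and no PDF library. This is merely a sketch: the class FontModelSketch, its methods, and the two code tables are made up for illustration, and real fonts are of course far more involved. It shows the same operand bytes decoding to different text under two fonts, and a subset font lacking a glyph for B.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Toy model only: a "font" is reduced to a code table and a set of embedded glyphs.
final class FontModelSketch {

    /** Decodes a string operand with ONE font's code-to-character table. */
    static String decode(byte[] operand, Map<Integer, Character> encoding) {
        StringBuilder sb = new StringBuilder();
        for (byte b : operand) {
            Character c = encoding.get(b & 0xFF); // treat the byte as an unsigned code
            sb.append(c == null ? '?' : c);       // '?' marks a code the font does not map
        }
        return sb.toString();
    }

    /** Returns the characters of text for which the subset font embeds no glyph. */
    static Set<Character> missingGlyphs(String text, Set<Character> embedded) {
        Set<Character> missing = new LinkedHashSet<>();
        for (char c : text.toCharArray()) {
            if (!embedded.contains(c)) {
                missing.add(c);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Misunderstanding 1: identical bytes, two fonts, two different texts.
        byte[] operand = {0x01, 0x02, 0x03};
        Map<Integer, Character> fontA = Map.of(1, 'N', 2, 'o', 3, 'm');
        Map<Integer, Character> fontB = Map.of(1, 'A', 2, 'B', 3, 'C');
        System.out.println(decode(operand, fontA)); // prints "Nom"
        System.out.println(decode(operand, fontB)); // prints "ABC"

        // Misunderstanding 2: a subset font with glyphs for A, C and '-' but not B.
        Set<Character> embedded = Set.of('A', 'C', '-');
        System.out.println(missingGlyphs("A-B-C", embedded)); // prints "[B]"
    }
}
```

A plain String.replace() on the content stream consults neither kind of table, which is why it can appear to work for some letters and fail for others.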

In the case at hand

Unfortunately, the OP has not supplied his sample PDF. The symptoms, though:

  • his call replace("Nome Completo", "A-B-C-D-E-F-G-H-I-J-K-L-M-N-O-P-Q-R-S-T-U-V-W-X-Y-Z") makes a difference as can be seen in his screenshot

    and his comment to Viacheslav Vedenin's answer

    Before the text was (Nome Completo)Tj and after (A-B-C-D-E-F-G-H-I-J-K-L-M-N-O-P-Q-R-S-T-U-V-W-X-Y-Z)Tj

  • but some codes do not show as the expected glyphs as can also be seen in the screenshot above

point to the second of the two false assumptions described above as the reason the OP's code fails: most likely the font in question uses a standard encoding (probably WinAnsiEncoding) but is only partially embedded, in particular without the capital letters K, W, X, and Y.

How to do it correctly

Instead of blindly editing the content stream, the OP (who already is using iText) can use the following iText concepts:

  • the text extraction classes can also be used to extract the coordinates of text, in particular the bounding rectangle of the text he wants to replace (cf. multiple answers on Stack Overflow);
  • the iText xtra library class PdfCleanUpProcessor can be used to remove all content existing in that bounding rectangle;
  • PdfStamper.getOverContent() can then be used to properly add new content at those coordinates.

This may sound complicated, but it takes care of a number of additional minor misconceptions visible in the OP's approach.
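The last two steps listed above might look roughly like the following. This is an untested sketch under several assumptions: it needs iText 5 plus the itext-xtra artifact on the classpath, the class name ReplaceByRedrawing is made up, and the rectangle coordinates are hypothetical placeholders that would in practice come from the text extraction step. The new text is drawn with iText's default font, not with the font of the original PDF.

```java
import com.itextpdf.text.BaseColor;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Element;
import com.itextpdf.text.Phrase;
import com.itextpdf.text.Rectangle;
import com.itextpdf.text.pdf.ColumnText;
import com.itextpdf.text.pdf.PdfContentByte;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfStamper;
import com.itextpdf.text.pdf.pdfcleanup.PdfCleanUpLocation;
import com.itextpdf.text.pdf.pdfcleanup.PdfCleanUpProcessor;

import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Collections;
import java.util.List;

public class ReplaceByRedrawing {

    public static void main(String[] args) throws IOException, DocumentException {
        PdfReader reader = new PdfReader("termoAdesaoCartao.pdf");
        PdfStamper stamper = new PdfStamper(reader,
                new FileOutputStream("/home/diego/Documentos/teste.pdf"));

        // ASSUMPTION: hypothetical bounding box of "Nome Completo" on page 1
        // (lower-left x/y, upper-right x/y). Real values come from text extraction.
        Rectangle region = new Rectangle(100f, 700f, 300f, 715f);

        // Remove everything drawn inside the rectangle and fill it with white.
        List<PdfCleanUpLocation> locations = Collections.singletonList(
                new PdfCleanUpLocation(1, region, BaseColor.WHITE));
        new PdfCleanUpProcessor(locations, stamper).cleanUp();

        // Draw the replacement text on top of the page at the same position.
        PdfContentByte canvas = stamper.getOverContent(1);
        ColumnText.showTextAligned(canvas, Element.ALIGN_LEFT,
                new Phrase("A-B-C-D-E-F-G-H-I-J-K-L-M-N-O-P-Q-R-S-T-U-V-W-X-Y-Z"),
                region.getLeft(), region.getBottom(), 0);

        stamper.close();
        reader.close();
    }
}
```

Because the new text brings its own font, it does not depend on which glyphs happen to be embedded in the original document.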

mkl answered Nov 15 '22 03:11


Instead of

stream.setData(new String(data).replace("Nome Completo", "A-B-C-D-E-F-G-H-I-J-K-L-M-N-O-P-Q-R-S-T-U-V-W-X-Y-Z").getBytes());

try the following code, which names the encoding explicitly:

stream.setData(new String(data, "UTF8").replace("Nome Completo", "A-B-C-D-E-F-G-H-I-J-K-L-M-N-O-P-Q-R-S-T-U-V-W-X-Y-Z").getBytes("UTF8"));

According to this page in the Oracle manual, using new String(data) and getBytes() without specifying an encoding can lead to errors:

Byte Encodings and Strings

If a byte array contains non-Unicode text, you can convert the text to Unicode with one of the String constructor methods. Conversely, you can convert a String object into a byte array of non-Unicode characters with the String.getBytes method. When invoking either of these methods, you specify the encoding identifier as one of the parameters.

The example that follows converts characters between UTF-8 and Unicode. UTF-8 is a transmission format for Unicode that is safe for UNIX file systems. The full source code for the example is in the file StringConverter.java.
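Whether such a byte[] to String to byte[] round trip leaves the data intact depends on the charset, which matters because a content stream may contain arbitrary byte values. A minimal JDK-only sketch (the class RoundTripSketch and its method are made up for illustration) that checks this for a given charset:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class RoundTripSketch {

    /** Returns true if decoding and re-encoding with cs reproduces the same bytes. */
    static boolean survivesRoundTrip(byte[] data, Charset cs) {
        byte[] back = new String(data, cs).getBytes(cs);
        return Arrays.equals(data, back);
    }

    public static void main(String[] args) {
        // A string operand containing the byte 0xFF, which is never valid in UTF-8.
        byte[] data = {'(', (byte) 0xFF, ')', 'T', 'j'};

        // ISO-8859-1 maps every byte value to exactly one char and back.
        System.out.println(survivesRoundTrip(data, StandardCharsets.ISO_8859_1)); // prints "true"

        // UTF-8 replaces the invalid byte while decoding, so the bytes change.
        System.out.println(survivesRoundTrip(data, StandardCharsets.UTF_8)); // prints "false"
    }
}
```

If the check fails for the chosen charset, bytes outside the replaced text can be altered as a side effect of the replacement.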

Update: If that doesn't work, can you replace the code

byte[] data = PdfReader.getStreamBytes(stream);
stream.setData(new String(data).replace("Nome Completo", "A-B-C-D-E-F-G-H-I-J-K-L-M-N-O-P-Q-R-S-T-U-V-W-X-Y-Z").getBytes());

with the code

byte[] data = PdfReader.getStreamBytes(stream);
String str = new String(data);
System.out.println(str);
String newStr = str.replace("Nome Completo", "A-B-C-D-E-F-G-H-I-J-K-L-M-N-O-P-Q-R-S-T-U-V-W-X-Y-Z");
System.out.println(newStr);
stream.setData(newStr.getBytes());

and post what it prints to the console?

Slava Vedenin answered Nov 15 '22 04:11