
pdfjs: get raw text from pdf with correct newline/whitespace

Using pdf.js, I have written a simple function to extract the raw text from a PDF:

async function getPdfText(path) {
    const pdf = await PDFJS.getDocument(path);

    // Request the text of every page in parallel
    const pagePromises = [];
    for (let j = 1; j <= pdf.numPages; j++) {
        pagePromises.push(
            pdf.getPage(j)
                .then((page) => page.getTextContent())
                // Joining the items with '' puts the whole page on one line
                .then((text) => text.items.map((s) => s.str).join(''))
        );
    }

    const texts = await Promise.all(pagePromises);
    return texts.join('');
}

// usage
getPdfText("C:\\my.pdf").then((text) => { console.log(text); });

However, I can't find a way to extract the line breaks correctly: all the text comes out on a single line.

How can I extract the text correctly? I want the same result as on a desktop PC:

Open the PDF (double-click the file) -> select all text (CTRL + A) -> copy the selected text (CTRL + C) -> paste the copied text (CTRL + V)

ar099968 asked Mar 05 '23 13:03



1 Answer

I know the question is more than a year old, but I'm answering in case anyone has the same problem.

As this post says:

In PDF there is no such thing as controlling layout using control chars such as '\n'; glyphs in PDF are positioned using exact coordinates. Use the text y-coordinate (it can be extracted from the transform matrix) to detect a line change.

So with pdf.js, you can use the transform property of each element of textContent.items. Specifically, look at index 5 of that array, which holds the y-coordinate. If this value changes from one item to the next, a new line has started.

Here's my code:

page.getTextContent().then(function (textContent) {
    var textItems = textContent.items;
    var finalString = "";
    var line = null; // y-coordinate of the current line (null before the first item)

    // Concatenate the string of each item to the final string
    for (var i = 0; i < textItems.length; i++) {
        var item = textItems[i];
        var y = item.transform[5]; // vertical position of this item

        // A different y-coordinate means a new line has started
        if (line !== null && y !== line) {
            finalString += '\r\n';
        }
        line = y;

        finalString += item.str;
    }

    // Expects an element with id "output" that has a value, e.g. a <textarea>
    var node = document.getElementById('output');
    node.value = finalString;
});
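
A strict comparison of y-coordinates can split one visual line in two if some items sit slightly off the baseline. As a rough sketch of a more forgiving variant, the helper below treats items whose y-values differ by no more than a tolerance as the same line. The name `joinItemsByLine`, the default tolerance of 2 units, and the sample items are all assumptions made for illustration; they are not part of pdf.js.

```javascript
// Hypothetical helper, not part of pdf.js: joins text items into lines,
// treating items whose y-coordinates differ by at most `tolerance` as one line.
function joinItemsByLine(items, tolerance = 2) {
  const lines = [];
  let lastY = null;
  for (const item of items) {
    const y = item.transform[5]; // vertical position of the item
    if (lastY === null || Math.abs(y - lastY) > tolerance) {
      lines.push(item.str); // y moved by more than the tolerance: start a new line
    } else {
      lines[lines.length - 1] += item.str; // same line: append to it
    }
    lastY = y;
  }
  return lines.join('\n');
}

// Sample items shaped like pdf.js text items (the values are made up).
const sample = [
  { str: 'Hello ', transform: [12, 0, 0, 12, 72, 700] },
  { str: 'world', transform: [12, 0, 0, 12, 110, 700.4] },
  { str: 'Second line', transform: [12, 0, 0, 12, 72, 686] },
];
console.log(joinItemsByLine(sample));
```

The right tolerance probably depends on the font sizes in the document, so it is left as a parameter rather than hard-coded.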

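As weird as it sounds, instead of using transform, you could also check the fontName property: in my documents, the fontName changed with each new line. That is not guaranteed for every PDF, though, so the y-coordinate is the safer signal.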
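
To tie this back to the function in the question, here is a minimal sketch that applies the y-coordinate check to every page. It assumes `pdf` is an already loaded document object with `numPages` and `getPage(n)`, and that each page offers `getTextContent()`, as in the question's code. The names `pageItemsToText` and `getPdfTextWithLines` are hypothetical, and separating pages with a single '\n' is an arbitrary choice.

```javascript
// Hypothetical helper: turns one page's text items into text with line breaks.
function pageItemsToText(items) {
  let text = '';
  let lastY = null;
  for (const item of items) {
    const y = item.transform[5]; // y-coordinate of this item
    if (lastY !== null && y !== lastY) {
      text += '\n'; // the y-coordinate changed, so a new line starts here
    }
    text += item.str;
    lastY = y;
  }
  return text;
}

// Hypothetical wrapper: reads the pages one after another and joins their text.
// `pdf` is assumed to be the document object obtained as in the question.
async function getPdfTextWithLines(pdf) {
  const pages = [];
  for (let n = 1; n <= pdf.numPages; n++) {
    const page = await pdf.getPage(n);
    const content = await page.getTextContent();
    pages.push(pageItemsToText(content.items));
  }
  return pages.join('\n');
}
```

Pages are read sequentially here to keep the sketch short; collecting the page promises and awaiting them with Promise.all, as the question does, should work just as well.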

LinkmanXBP answered Mar 24 '23 17:03
