I found this question, but it uses the command line, and I do not want to call a Python script from the command line via subprocess and then parse HTML files to get the font information. I want to use PDFMiner as a library. I also found this question, but its answers are all about extracting plain text, without other information such as font name and font size.


PDFminer: extract text with its font information

2 Answers

#!/usr/bin/env python
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
import pdfminer


def createPDFDoc(fpath):
    fp = open(fpath, 'rb')
    parser = PDFParser(fp)
    document = PDFDocument(parser, password='')
    # Check if the document allows text extraction. If not, abort.
    # (Raising a plain string is not valid; raise an exception instance.)
    if not document.is_extractable:
        raise ValueError("Not extractable")
    return document


def createDeviceInterpreter():
    rsrcmgr = PDFResourceManager()
    laparams = LAParams()
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    return device, interpreter


def parse_obj(objs):
    for obj in objs:
        if isinstance(obj, pdfminer.layout.LTTextBox):
            for o in obj._objs:
                if isinstance(o, pdfminer.layout.LTTextLine):
                    text = o.get_text()
                    # skip lines that contain only whitespace
                    if text.strip():
                        for c in o._objs:
                            # each LTChar carries the font name of one character
                            if isinstance(c, pdfminer.layout.LTChar):
                                print("fontname %s" % c.fontname)
        # if it's a container, recurse
        elif isinstance(obj, pdfminer.layout.LTFigure):
            parse_obj(obj._objs)


document = createPDFDoc("/tmp/simple.pdf")
device, interpreter = createDeviceInterpreter()
pages = PDFPage.create_pages(document)
# process only the first page
interpreter.process_page(next(pages))
layout = device.get_result()


parse_obj(layout._objs)
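
Since the question also asks for font size, here is a minimal sketch of how the same layout tree could be turned into runs of text that share one font name and size, instead of printing one line per character. The names FontRun, group_runs and iter_text_items are hypothetical helpers, not part of PDFMiner; the sketch assumes that LTChar exposes fontname and size and that LTAnno (the spaces and newlines added by layout analysis) carries no font information.

```python
from collections import namedtuple

# Hypothetical record type for this sketch; not part of PDFMiner.
FontRun = namedtuple("FontRun", ["text", "fontname", "size"])


def group_runs(items):
    """Merge consecutive characters that share a font name and size."""
    runs = []
    for item in items:
        fontname = getattr(item, "fontname", None)
        if fontname is None:
            # Assumed to be an LTAnno (layout-inserted space or newline):
            # it has no font, so attach its text to the current run, if any.
            if runs:
                runs[-1] = runs[-1]._replace(text=runs[-1].text + item.get_text())
            continue
        size = round(item.size, 1)
        if runs and (runs[-1].fontname, runs[-1].size) == (fontname, size):
            runs[-1] = runs[-1]._replace(text=runs[-1].text + item.get_text())
        else:
            runs.append(FontRun(item.get_text(), fontname, size))
    return runs


def iter_text_items(layout_obj):
    """Yield every LTChar and LTAnno below a layout object, in layout order."""
    # Imported lazily so that group_runs can be used without PDFMiner installed.
    from pdfminer.layout import LTAnno, LTChar, LTContainer
    if isinstance(layout_obj, (LTChar, LTAnno)):
        yield layout_obj
    elif isinstance(layout_obj, LTContainer):
        # Containers (pages, figures, text boxes, text lines) are iterable.
        for child in layout_obj:
            for item in iter_text_items(child):
                yield item
```

With the layout object obtained above, group_runs(iter_text_items(layout)) would return the runs for that page. The reported size is the glyph box size as PDFMiner computes it, so it may differ slightly from the nominal font size; rounding to one decimal is a choice made for this sketch.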


answered Sep 30 '22 01:09

Emilia Apostolova

This approach does not use PDFMiner but does the trick.

First, convert the PDF document into docx. Using python-docx, you can then retrieve the font information. Here's an example that prints all the bold text:

from docx import Document

document = Document('/path/to/file.docx')

# a run is a span of text within a paragraph that shares one set of formatting
for para in document.paragraphs:
    for run in para.runs:
        if run.bold:
            print(run.text)
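
The same loop can report font name and size as well as bold text. The following is a rough sketch; describe_runs is a hypothetical helper, and it assumes python-docx's run.font.name and run.font.size attributes, both of which are None when the value is inherited from a style rather than set on the run itself.

```python
def describe_runs(paragraphs):
    """Return (text, font name, size in points, bold) for each non-empty run."""
    rows = []
    for para in paragraphs:
        for run in para.runs:
            if not run.text.strip():
                continue  # ignore whitespace-only runs
            size = run.font.size  # a Length object, or None when inherited
            rows.append((
                run.text,
                run.font.name,  # None when inherited from the style
                size.pt if size is not None else None,
                bool(run.bold),  # run.bold may be None (inherited); treated as False
            ))
    return rows

# Possible usage, with python-docx installed:
#   from docx import Document
#   for row in describe_runs(Document('/path/to/file.docx').paragraphs):
#       print(row)
```

Because inherited values come back as None, a converted document may show many runs without an explicit font; resolving those would require looking at the paragraph and document styles, which this sketch does not attempt.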

If you really want to use PDFMiner, you can try this. Passing the '-t' option with the HTML output type converts the PDF into HTML with all the font information.

answered Sep 30 '22 02:09

Samkit Jain


Tags:

python

text-extraction

pdfminer

aristotll
