
PDFplumber omits rightmost column in table

Does anyone have any clues about the missing column? I have been using pdfplumber to extract table data with good results, apart from one particular set of PDFs. The problem is that while page.search finds text in the rightmost column of the table, extract_table omits that column. This is on Windows 11.

Here is an image of the PDF: (Image: the PDF)

Link to the PDF file on Dropbox: https://www.dropbox.com/scl/fi/d3cg802h7cawl6vw9i7cm/testdoc.pdf?rlkey=tmz390ly5fbug0xi0kx06b2kt&dl=0

Here is the page image with vertical lines superimposed: (Image: pdfplumber's image debugging showing where the vertical lines are)

Here is the minimal code:

# pdftesting.py
import pdfplumber
import sys

print('pdfplumber version:', pdfplumber.__version__)
print('Python version:', sys.version)
filepath  = 'C:/ProgramData/PythonProgs/testing/testdoc.pdf'
fn = pdfplumber.open(filepath)
page = fn.pages[0]

vlines = [26.0, 106.25, 152.25, 251.25, 395.5, 467.25, 539.5, 624.65, 692.5,
          760.15, 818.9811199999999]
imagefile = 'C:/ProgramData/PythonProgs/testing/testdoc.png'
im = page.to_image(resolution=300)
im.draw_vlines(vlines, stroke_width=3)
im.save(imagefile)
lines = page.extract_table(table_settings={
    "vertical_strategy": "explicit",
    "explicit_vertical_lines": vlines,
    "horizontal_strategy": "text",
    "snap_tolerance": 5,
})
for item in lines:
    print('line:', item)
    
print('page width:', page.width)
target = 'inc'
match = page.search(target)[0]  # first match only
X0 = match['x0']
X1 = match['x1']
size = match['chars'][0]['size']
print('Found:', target, X0, X1, size)

Here is the output from the code:

pdfplumber version: 0.11.0
Python: 3.12.0 (tags/v3.12.0:0fb18b0, Oct  2 2023, 13:03:39) [MSC v.1935 64 bit (AMD64)]
line: ['', '', '', 'tne minute, rou', 'naea up to tn', 'e nearest mi', 'nute', '', '']
line: ['UK calls', '', '', '', '', '', '', '', '']
line: ['', '', '', '', '', '', '', '', '']
line: ['Date', 'Time', 'Phone number', 'Destination', 'Duration', 'Charged', 'Included?', 'VAT', 'VAT']
line: ['', '', '', '', 'hh:mm:ss', 'hh:mm:ss', '', 'ex', 'rate']
line: ['', '', '', '', '', '', '', '', '']
line: ['Sun 17 May', '15:55', '07755221961', 'UK mobile', '00:05:26', '00:05:26', 'Yes', '£0.000', '20%']
line: ['', '', '', '', '', '', '', '', '']
line: ['Thu 21 May', '11:15', '07818818242', 'Vodafone mobile', '00:00:07', '00:01:00', 'Yes', '£0.000', '20%']
line: ['', '', '', '', '', '', '', '', '']
line: ['Fri 22 May', '15:44', '05706000459', 'Landline', '00:00:04', '00:01:00', 'Yes', '£0.000', '20%']
line: ['', '', '', '', '', '', '', '', '']
line: ['Mon 25 May', '20:48', '02085462206', 'Landline', '00:15:12', '00:15:12', 'Yes', '£0.000', '20%']
line: ['', '', '', '', '', '', '', '', '']
line: ['Sat 50 May', '10:58', '02056549856', 'Landline', '00:00:08', '00:01:00', 'Yes', '£0.000', '20%']
line: ['', '', '', '', '', '', '', '', '']
line: ['Fri 5 Jun', '09:58', '07818818242', 'Vodafone mobile', '00:00:11', '00:01:00', 'Yes', '£0.000', '20%']
line: ['', '', '', '', '', '', '', '', '']
line: ['Sat 6 Jun', '07:17', '07716065665', 'Vodafone mobile', '00:01:14', '00:01:14', 'Yes', '£0.000', '20%']
line: ['', '', '', '', '', '', '', '', '']
line: ['', '', '', 'Tot', 'al of 7 calls', '23 mins 52 s', '', '£0.000', '']
page width: 856.800048828
Found: inc 761.15 773.9811199999999 9.961000000000013
Asked by PMSK, Sep 01 '25 03:09



1 Answer

Resolving table extraction issues with combined "text" and "explicit" strategies

Here's what happens when you choose the horizontal strategy "text":

  1. Words on the page are clustered based on their "top" parameter.
  2. These clusters are filtered by a minimum word count.
  3. Two horizontal edges are constructed at the top and bottom of each remaining cluster, using a fixed leftmost "x0" and rightmost "x1" for all.
  4. Cells are created at the intersections of these horizontal edges with the given vertical lines.
  5. Tables are formed by combining adjacent cells.
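Steps 1 to 3 can be sketched in plain Python. This is a simplified illustration written for this answer, not pdfplumber's actual code: the function name text_rows_to_h_edges, the min_words and tolerance parameters, and the sample words are all hypothetical.

```python
def text_rows_to_h_edges(words, min_words=1, tolerance=1.0):
    """Simplified sketch of steps 1-3 (hypothetical helper, not the library code)."""
    # Step 1: cluster words whose "top" is within `tolerance` of the previous word's.
    clusters = []
    for word in sorted(words, key=lambda w: w["top"]):
        if clusters and word["top"] - clusters[-1][-1]["top"] <= tolerance:
            clusters[-1].append(word)
        else:
            clusters.append([word])

    # Step 2: drop clusters that contain too few words.
    rows = [c for c in clusters if len(c) >= min_words]
    if not rows:
        return []

    # Step 3: all edges share one extent, from the leftmost x0 to the rightmost x1.
    x0 = min(w["x0"] for row in rows for w in row)
    x1 = max(w["x1"] for row in rows for w in row)
    edges = []
    for row in rows:
        top = min(w["top"] for w in row)
        bottom = max(w["bottom"] for w in row)
        edges.append({"x0": x0, "x1": x1, "top": top})     # edge above the row
        edges.append({"x0": x0, "x1": x1, "top": bottom})  # edge below the row
    return edges


# Made-up words: one three-word row plus a stray single word higher up.
words = [
    {"text": "Date", "x0": 30.0, "x1": 50.0, "top": 100.0, "bottom": 110.0},
    {"text": "Time", "x0": 110.0, "x1": 130.0, "top": 100.2, "bottom": 110.2},
    {"text": "VAT", "x0": 761.15, "x1": 773.98, "top": 100.0, "bottom": 110.0},
    {"text": "stray", "x0": 300.0, "x1": 320.0, "top": 50.0, "bottom": 60.0},
]
edges = text_rows_to_h_edges(words, min_words=2)
```

The point to notice is that the edges never extend past the rightmost word: their x1 comes from the text, not from the vertical lines you supply.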

The issue you experienced occurs at step 4: the horizontal edges built from words span only the text, so they are shorter than the distance between the outermost vertical lines. Let's visualize this:

import pdfplumber

filepath  = '/home/jakito/Desktop/testdoc.pdf'
pdf = pdfplumber.open(filepath)
page = pdf.pages[0]

v_lines = [26.0, 106.25, 152.25, 251.25, 395.5, 467.25, 539.5, 624.65, 692.5, 760.15, 818.9811199999999]
table_settings={
    "vertical_strategy": "explicit",
    "explicit_vertical_lines": v_lines,
    "horizontal_strategy": "text",
    "snap_tolerance": 5}
page.to_image(resolution=400).debug_tablefinder(table_settings).show()

(Image: the horizontals do not intersect with the vertical on the right)

Note that the horizontal edges do not touch the rightmost vertical line. As a result, there are not enough vertices to construct cells in the last column.
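The same check can be done numerically. The sketch below is only an illustration: crossing_vlines is a hypothetical helper, the tolerance of 3 is assumed to mirror pdfplumber's default intersection tolerance, and the extent of the horizontal edges is an assumption (26.0 on the left, and on the right the x1 that page.search reported for 'inc' in the question, taken here to be the rightmost text).

```python
def crossing_vlines(h_x0, h_x1, v_lines, tolerance=3.0):
    """Return the vertical lines that a horizontal edge spanning [h_x0, h_x1] reaches."""
    return [x for x in v_lines if h_x0 - tolerance <= x <= h_x1 + tolerance]


v_lines = [26.0, 106.25, 152.25, 251.25, 395.5, 467.25, 539.5, 624.65, 692.5,
           760.15, 818.9811199999999]

# Assumed extent of the text-based horizontal edges (see the note above).
h_x0, h_x1 = 26.0, 773.9811199999999

reached = crossing_vlines(h_x0, h_x1, v_lines)
missed = [x for x in v_lines if x not in reached]

print("missed vertical lines:", missed)    # [818.9811199999999]
print("columns left:", len(reached) - 1)   # 9
```

Under these assumptions only the last vertical line is missed, which leaves nine columns and is consistent with the nine-item rows in the question's output.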

Here are several ways to resolve the issue:

  1. Adjust the position of the last vertical line so that it touches the words on the right
v_lines[-1] = max(char['x1'] for char in page.chars)

(Image: replaced position of the right vertical line)

  2. Apply the "text" strategy to both directions with adjusted word limits
table_settings = {
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
    "min_words_vertical": 3,
    "min_words_horizontal": 11,
}

page.to_image(resolution=300).debug_tablefinder(table_settings).show()

(Image: text strategy)

  3. Use the "explicit" horizontal strategy
from pdfplumber.table import words_to_edges_h

words = page.extract_words()
h_edges = words_to_edges_h(words, word_threshold=6)
h_lines = [x['top'] for x in h_edges[::2] if 0 <= x['top'] <= page.height]

v_lines = [26.0, 106.25, 152.25, 251.25, 395.5, 467.25, 539.5, 624.65, 692.5, 760.15, 818.9811199999999]

table_settings = {
    "vertical_strategy": "explicit",
    "explicit_vertical_lines": v_lines,
    "horizontal_strategy": "explicit",
    "explicit_horizontal_lines": h_lines,
}

page.to_image(resolution=300).debug_tablefinder(table_settings).show()

(Image: explicit horizontal strategy)

Note: Currently, words_to_edges_h returns two edges for each cluster (its top and its bottom), which is more than we need here, so I kept only every other edge with h_edges[::2]. The lowest line can be added manually if needed, but in this case it can be omitted. I also filtered the lines to the range from 0 to the page height because the sample document appears to be a cropped version of a larger one. word_threshold=6 was added to avoid splitting "hh:mm:ss ..." into a separate line.
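To make the filtering in the note concrete, here is a small sketch on made-up data. It assumes the edges arrive as (top, bottom) pairs per row, in that order; row_tops, its add_last_bottom flag, and the sample values are hypothetical and only illustrate the idea.

```python
def row_tops(h_edges, page_height, add_last_bottom=False):
    """Keep one line per row and drop lines that lie outside the page."""
    # Assumption: edges come in (top, bottom) pairs, so even indices are row tops.
    tops = [edge["top"] for edge in h_edges[::2]]
    if add_last_bottom and h_edges:
        # Optionally close the table with the bottom edge of the last row.
        tops.append(h_edges[-1]["top"])
    return [y for y in tops if 0 <= y <= page_height]


# Made-up edges for three rows; the first row lies above the visible page.
sample_edges = [{"top": y} for y in (-12.0, -2.0, 100.0, 110.0, 120.0, 130.0)]

print(row_tops(sample_edges, page_height=600.0))
# [100.0, 120.0]
print(row_tops(sample_edges, page_height=600.0, add_last_bottom=True))
# [100.0, 120.0, 130.0]
```

With add_last_bottom=True the final row gets a closing line, which is the manual addition mentioned in the note.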

Answered by Vitalizzare, Sep 02 '25 17:09
