I've just discovered the joy of tabula-py (and tabula-java, of course) for extracting tables from PDFs. I am now writing a script for my job that reads some data from a PDF table, cleans it a little, and then exports it to Excel. The PDF has the same format every day, and the table is always in the same area. To find the area, I use tabula.exe: I select the table, check the preview (which looks good), and then export the script to see the -a parameter that tabula.exe uses. I then use this value in my Python call:
df = tabula.read_pdf(os.fsdecode(directory) + filename, encoding='ISO-8859-1',
                     stream=True, area="81.106,302.475,384.697,552.491", pages=2,
                     pandas_options={'header': None})
I am using the encoding parameter because the default UTF-8 raises an error, and the stream method because it is the one that shows a nicely extracted table in tabula.exe. However, the DataFrame has a problem: the first two columns (which are displayed correctly as two separate columns in the tabula.exe preview) come out as a single column, so names and values get mixed together.
Do you have any idea why the same area yields two different results in tabula-py and tabula.exe? Thank you very much!
Figured it out on GitHub: tabula-py has the "guess" option set to True by default. To remove the discrepancy, just add guess=False, and the output will be the same as in tabula.exe:
df = tabula.read_pdf(os.fsdecode(directory) + filename, encoding='ISO-8859-1',
                     stream=True, area="81.106,302.475,384.697,552.491", pages=2,
                     guess=False, pandas_options={'header': None})
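For a script that runs every day, the call above could be wrapped roughly as follows. This is only a sketch: `build_read_options` and `read_daily_table` are names made up for illustration, the area is passed as a list of floats instead of a string, and `os.path.join` replaces the string concatenation (an assumption about how `directory` and `filename` fit together).

```python
import os


def build_read_options(area, page, guess=False):
    """Collect the keyword arguments from the call above (hypothetical helper)."""
    return {
        "encoding": "ISO-8859-1",
        "stream": True,
        "area": area,          # [top, left, bottom, right] in points
        "pages": page,
        "guess": guess,        # False keeps tabula from re-guessing the table area
        "pandas_options": {"header": None},
    }


def read_daily_table(directory, filename, area, page=2):
    """Read the fixed-position table from one daily PDF (hypothetical helper)."""
    # Imported here so build_read_options can be used without tabula-py/Java.
    import tabula

    path = os.path.join(os.fsdecode(directory), filename)
    return tabula.read_pdf(path, **build_read_options(area, page))


# The area exported from tabula.exe in the question.
AREA = [81.106, 302.475, 384.697, 552.491]
options = build_read_options(AREA, page=2)
```

Depending on the installed tabula-py version, `read_pdf` may return a list of DataFrames rather than a single one, so it is worth checking the type of the result before cleaning it.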
In case anyone else struggles with where to delineate tables and columns, you can very easily find exact dimensions with Adobe Acrobat. Open the PDF in Adobe Acrobat, turn on rulers, and set the units to Points. Zoom way the heck in, and you can see the exact point measurements to split the area/tables on.
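Once you have the ruler readings, they can be turned into the values tabula-py expects. The sketch below assumes the rulers measure in points from the top-left corner of the page, which is the orientation tabula's area uses (top, left, bottom, right). `make_area` and `make_columns` are hypothetical helpers, and the split at x = 400.0 is a made-up value, not one measured from the PDF in the question.

```python
def make_area(top, left, bottom, right):
    """Return [top, left, bottom, right] in points, after a sanity check."""
    if not (top < bottom and left < right):
        raise ValueError("expected top < bottom and left < right")
    return [top, left, bottom, right]


def make_columns(area, splits):
    """Return sorted x coordinates of column boundaries that lie inside the area."""
    _, left, _, right = area
    columns = sorted(splits)
    for x in columns:
        if not left < x < right:
            raise ValueError(f"column boundary {x} is outside the area")
    return columns


# Area taken from the question; the column split is purely illustrative.
area = make_area(81.106, 302.475, 384.697, 552.491)
columns = make_columns(area, [400.0])
```

The results can then be passed along as `tabula.read_pdf(path, area=area, columns=columns, stream=True, guess=False, pages=2)`. Giving explicit column boundaries is another way to keep two adjacent columns from being merged.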