
Tabula-py is not splitting columns right

I've just discovered the joy of tabula-py (and tabula-java, of course) for extracting tables from PDFs. I am now writing a script for my job that reads some data from a PDF table, cleans it up a little, and then exports it to Excel. The PDF I am using has the same format every day, and the table is always in the same area. To find the area, I use tabula.exe: I select the table, check the preview (which looks good), and then export the script to see the -a parameter that tabula.exe uses. I then use that value in my Python call:

    df = tabula.read_pdf(os.fsdecode(directory) + filename, encoding='ISO-8859-1',
                         stream=True, area="81.106,302.475,384.697,552.491", pages=2,
                         pandas_options={'header': None})

I am using the encoding parameter because the default utf-8 raises an error, and the stream method because it is the one that produces a clean table in tabula.exe. However, the resulting dataframe has a problem: the first 2 columns, which appear as 2 separate columns in the tabula.exe preview, come out as one single column, so names and values get mixed together.

Do you have any idea why the same area yields 2 different results in tabula-py and tabula.exe? Thank you very much!

giga asked Nov 17 '17 18:11


2 Answers

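Figured it out on GitHub: tabula-py has the "guess" option set to True by default, so it tries to guess the portion of the page to analyze instead of relying only on the area you pass. To remove the discrepancy, just add guess=False and the output will match the tabula.exe preview: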

    df = tabula.read_pdf(os.fsdecode(directory) + filename, encoding='ISO-8859-1',
                         stream=True, area="81.106,302.475,384.697,552.491", pages=2,
                         guess=False, pandas_options={'header': None})
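
For a script that runs every day, the same options can be kept in one place. The following is only a minimal sketch: `fixed_area_options` and `read_fixed_table` are hypothetical helper names, and passing the area as a list of floats (top, left, bottom, right) instead of a string is an assumption based on the form the tabula-py documentation describes.

```python
import os


def fixed_area_options(area_str, page):
    """Build keyword arguments for tabula.read_pdf from a 'top,left,bottom,right' string."""
    area = [float(v) for v in area_str.split(",")]
    if len(area) != 4:
        raise ValueError("area needs four values: top, left, bottom, right")
    return {
        "encoding": "ISO-8859-1",
        "stream": True,
        "guess": False,  # use exactly the given area, no auto-detection
        "area": area,
        "pages": page,
        "pandas_options": {"header": None},
    }


def read_fixed_table(directory, filename, area_str, page):
    """Hypothetical wrapper around tabula.read_pdf for a fixed table area."""
    import tabula  # imported lazily so the helper above can be used without Java

    path = os.path.join(os.fsdecode(directory), filename)
    # Depending on the tabula-py version, this returns a DataFrame
    # or a list of DataFrames.
    return tabula.read_pdf(path, **fixed_area_options(area_str, page))


options = fixed_area_options("81.106,302.475,384.697,552.491", page=2)
print(options["area"])
```

Keeping guess=False next to the area makes it harder to drop one of them by accident when the call is copied elsewhere.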

giga answered Oct 19 '22 23:10


In case anyone else struggles with where to delineate tables and columns, you can very easily find exact dimensions with Adobe Acrobat. Open the pdf in Adobe Acrobat, turn on rulers, and set it to Points. Zoom way the heck in, and you can see the exact point measurements to split the area/tables on.
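
If the rulers are left in inches or millimetres, the readings can be converted to points (1 point = 1/72 inch) before they go into the area option. This is a rough sketch with hypothetical helper names (`area_from_rulers`, `area_to_cli`), and it assumes the ruler readings are measured from the top-left corner of the page, in the top, left, bottom, right order that tabula uses.

```python
POINTS_PER_INCH = 72.0  # one PDF point is 1/72 inch


def area_from_rulers(top, left, bottom, right, unit="pt"):
    """Return [top, left, bottom, right] in points from ruler readings."""
    scale = {"pt": 1.0, "in": POINTS_PER_INCH, "mm": POINTS_PER_INCH / 25.4}[unit]
    area = [round(v * scale, 3) for v in (top, left, bottom, right)]
    # Assumption: the origin is the top-left corner, so top < bottom and left < right.
    if not (area[0] < area[2] and area[1] < area[3]):
        raise ValueError("expected top < bottom and left < right")
    return area


def area_to_cli(area):
    """Format the area as a 'top,left,bottom,right' string, as used with -a."""
    return ",".join(str(v) for v in area)


area = area_from_rulers(1.0, 4.0, 5.0, 7.5, unit="in")
print(area_to_cli(area))
```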

MonkeySeeMonkeyDo answered Oct 20 '22 01:10