How to convert PDF to CSV with tabula-py?

Tags: python, csv, pdf, tabula

I'm using Python 3 on an Ubuntu machine. I have a PDF file, "Ativos_Fevereiro_2018_servidores_rj.pdf", with 6,041 pages.

Each page has two lines of text at the top, followed by a table with a header and two columns. Each table has 36 rows, with fewer on the last page.

At the end of each page, after the table, there is also a line of text.

I want to create a CSV from this PDF using only the tables on each page and ignoring the text before and after them.

Initially I tried tabula-py, but it generates an empty file:

from tabula import convert_into

convert_into("Ativos_Fevereiro_2018_servidores_rj.pdf", "test_s.csv", output_format="csv")

Does anyone know of another way to use tabula-py for this kind of task?

Or another way to convert this kind of PDF to CSV?

328

asked Mar 29 '18 16:03

Reinaldo Chaves

1 Answer

Ok, I've found the issue: you have to set spreadsheet=True and keep utf-8 encoding:

import tabula
df = tabula.read_pdf("Ativos_Fevereiro_2018_servidores_rj.pdf", encoding='utf-8', spreadsheet=True, pages='1-6041')

I tested it with just the first page (because your file is huge).

You can save the DataFrame as csv afterwards:

df.to_csv('output.csv', encoding='utf-8')

Edit:

OK, the error could be a Java memory issue. To make it faster, I added the pages option. There was also an encoding problem, so I added encoding='utf-8' to the CSV export. If you keep running into the Java error, try parsing the file in chunks, e.g. pages='1-300'. I just ran all 6041 pages (on a machine with 64 GB of RAM) and it worked fine.
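The chunked approach could look roughly like the sketch below. It is only an illustration under some assumptions: the helper names page_ranges and pdf_to_csv_in_chunks are made up for this example, the chunk size of 300 is just the value suggested above, and it assumes a recent tabula-py release, where lattice=True takes the place of the older spreadsheet=True and read_pdf returns a list of DataFrames instead of a single one. On an older release you would keep spreadsheet=True and handle a single DataFrame.

```python
def page_ranges(total_pages, chunk_size):
    """Yield tabula-style page strings such as '1-300', '301-600', ..."""
    if total_pages < 1 or chunk_size < 1:
        raise ValueError("total_pages and chunk_size must be positive")
    for start in range(1, total_pages + 1, chunk_size):
        end = min(start + chunk_size - 1, total_pages)
        # A chunk that covers one page is written as a plain page number.
        yield str(start) if start == end else f"{start}-{end}"


def pdf_to_csv_in_chunks(pdf_path, csv_path, total_pages, chunk_size=300):
    """Parse the PDF chunk by chunk and append every table to one CSV file."""
    # Imported lazily so page_ranges() works without tabula-py or Java installed.
    import tabula

    first_write = True
    for pages in page_ranges(total_pages, chunk_size):
        # Assumption: recent tabula-py, where read_pdf returns a list of DataFrames
        # and lattice=True is the replacement for spreadsheet=True.
        tables = tabula.read_pdf(pdf_path, pages=pages, lattice=True, encoding="utf-8")
        for table in tables:
            table.to_csv(
                csv_path,
                mode="w" if first_write else "a",  # create the file once, then append
                header=first_write,                # write the column names only once
                index=False,
                encoding="utf-8",
            )
            first_write = False


# Example call (needs tabula-py, Java and the PDF from the question):
# pdf_to_csv_in_chunks("Ativos_Fevereiro_2018_servidores_rj.pdf", "output.csv", total_pages=6041)
```

Writing each chunk straight to the CSV keeps only one chunk of tables in memory at a time, which is the point of splitting the job when Java runs out of memory. If the column names are not detected the same way on every page, the appended rows may need some cleanup afterwards.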

107

answered Oct 20 '22 05:10

ilja

