I have a 5 pages pdf file, each page has a table that I need extract. I need to extract all the tables from each page and save them as a data-frame file all using python so i converted the file, to a csv file using tabula
tabula.convert_into('input.pdf', "output.csv", output_format="csv", pages='all')
The main issue with the file output.csv is that there are several extra commas.
Example
Id,Name,Age,,Score,Rang,Bonus
181,ALEX,,,,20,987
182,Julia,,,,18,8.390
183,Marian,,,,21,9.170
184,Julien,,0,175,60,9.095
Id,Name,Age,,Score,Rang,Bonus
215,Asma,26,,35,19,3.807
216,Juan,,,,20,7.982
217,Rami,,,,10,1.832
Id,Name,Age,,Score,Rang,Bonus
415,Jessica,,4 920,8 873,538,7.994
416,Karen,,890,6,12,9.993
417,Andrea,,0,69,283,7.200
Id,Name,Age,,Score,Rang,Bonus
419,Rym,10,,18,,10,7.196
420,Noor,10,,70,,910,8.291
421,Nathalie,0,,5,,0,0.900
"",Id,Name,Age,,Score,Rang,Bonus
456,,Joe,,10,13,0,74.917
457,,Loula,,0,18,11,9.990
458,,Maria,,0,15,172,6.425
459,,Carl,,15,17,11,3.349
Id,Name,Age,,Score,Rang,Bonus
566,Diego,,,,0,3.680
567,Carla,0,,26,1,19.361
When i convert the csv file into row/columns i got some lines offset
Check the image below to get the problem:
As you can see in the image there are some lines offset(each table in each page of file has specific lines offset) how can i fix this problem
NB: The dataframe should have 6 columns with empty fields. I guess the extra commas comes from space in the pdf file. how can i remove extra commas from csv file or removing extra space on pdf file.
The expected output in the image below:

I would really appreciate your help.
I am basing my strategy on a short regular expression to capture the first 2 columns, and the numbers at the end.
(\d+,[^,]+,) → numbers + comma + anything but comma + comma
,* → zero or more commas
(\d.+) → the rest of the line starting from the first number
Then I am concatenating those two groups inserting enough commas in the middle so that the total is 5 (= 6 columns).
The seems like a pretty straightforward approach to me. It will work for any variation of input where random spaces and commas are inserted as long as the numeric data is right-aligned.
import re,io
def fix_line(line):
# remove duplicate commas and spaces
line = re.sub(',,', ',', line.replace(' ', ''))
# groups: first two rows / middle (non-captured) / numbers
match = re.match(r'(\d+,[^,]+,),*(\d.+)', line)
if not match: # removes the headers
return ''
# align numbers to right: 6 columns = 5 commas
return match.groups()[0]+(','*(5-2-match.groups()[1].count(',')))+match.groups()[1]
data_corr = [fix_line(line) for line in lines]
df = pd.read_csv(io.StringIO('\n'.join(data_corr)),
names=re.sub(',,+', ',', lines[0]).split(',') # assign column names
)
Assuming this input as the variable lines:
['Id,Name,Age,,Score,Rang,Bonus',
'181,ALEX,,,,20,987',
'182,Julia,,,,18,8.390',
'183,Marian,,,,21,9.170',
'184,Julien,,0,175,60,9.095',
'Id,Name,Age,,Score,Rang,Bonus',
'215,Asma,26,,35,19,3.807',
'216,Juan,,,,20,7.982',
'217,Rami,,,,10,1.832',
'Id,Name,Age,,Score,Rang,Bonus',
'415,Jessica,,4 920,8 873,538,7.994',
'416,Karen,,890,6,12,9.993',
'417,Andrea,,0,69,283,7.200',
'Id,Name,Age,,Score,Rang,Bonus',
'419,Rym,10,,18,,10,7.196',
'420,Noor,10,,70,,910,8.291',
'421,Nathalie,0,,5,,0,0.900',
'"",Id,Name,Age,,Score,Rang,Bonus',
'456,,Joe,,10,13,0,74.917',
'457,,Loula,,0,18,11,9.990',
'458,,Maria,,0,15,172,6.425',
'459,,Carl,,15,17,11,3.349',
'Id,Name,Age,,Score,Rang,Bonus',
'566,Diego,,,,0,3.680',
'567,Carla,0,,26,1,19.361']
output:
Id Name Age Score Rang Bonus
0 181 ALEX NaN NaN 20 987.000
1 182 Julia NaN NaN 18 8.390
2 183 Marian NaN NaN 21 9.170
3 184 Julien 0.0 175.0 60 9.095
4 215 Asma 26.0 35.0 19 3.807
5 216 Juan NaN NaN 20 7.982
6 217 Rami NaN NaN 10 1.832
7 415 Jessica 4920.0 8873.0 538 7.994
8 416 Karen 890.0 6.0 12 9.993
9 417 Andrea 0.0 69.0 283 7.200
10 419 Rym 10.0 18.0 10 7.196
11 420 Noor 10.0 70.0 910 8.291
12 421 Nathalie 0.0 5.0 0 0.900
13 456 Joe 10.0 13.0 0 74.917
14 457 Loula 0.0 18.0 11 9.990
15 458 Maria 0.0 15.0 172 6.425
16 459 Carl 15.0 17.0 11 3.349
17 566 Diego NaN NaN 0 3.680
18 567 Carla 0.0 26.0 1 19.361
NB. if the input is a file then first read lines using:
with open('/path/to/file', 'r') as f:
lines = f.readlines()
I found this a bit easier to follow than Martin Evans' answer
It's a the generator yields lines of the same length as the cleaned up first line. And removes the first empty string until a line has the right length.
Like Martin's answer it yields your expected dataframe from your example data.
import pandas as pd
from io import StringIO
import csv
f = StringIO("""Id,Name,Age,,Score,Rang,Bonus
181,ALEX,,,,20,987
182,Julia,,,,18,8.390
183,Marian,,,,21,9.170
184,Julien,,0,175,60,9.095
Id,Name,Age,,Score,Rang,Bonus
215,Asma,26,,35,19,3.807
216,Juan,,,,20,7.982
217,Rami,,,,10,1.832
Id,Name,Age,,Score,Rang,Bonus
415,Jessica,,4 920,8 873,538,7.994
416,Karen,,890,6,12,9.993
417,Andrea,,0,69,283,7.200
Id,Name,Age,,Score,Rang,Bonus
419,Rym,10,,18,,10,7.196
420,Noor,10,,70,,910,8.291
421,Nathalie,0,,5,,0,0.900
"",Id,Name,Age,,Score,Rang,Bonus
456,,Joe,,10,13,0,74.917
457,,Loula,,0,18,11,9.990
458,,Maria,,0,15,172,6.425
459,,Carl,,15,17,11,3.349
Id,Name,Age,,Score,Rang,Bonus
566,Diego,,,,0,3.680
567,Carla,0,,26,1,19.361""")
def clean_up(csv_file):
header = None
for line in csv_file:
if not header:
header = [v for v in line if v]
length = len(header)
continue
while len(line) > length:
line.remove('')
if line != header:
yield(dict(zip(header,line)))
df = pd.DataFrame(clean_up(csv.reader(f)))
print(df)
Which gives you:
Id Name Age Score Rang Bonus
0 181 ALEX 20 987
1 182 Julia 18 8.390
2 183 Marian 21 9.170
3 184 Julien 0 175 60 9.095
4 215 Asma 26 35 19 3.807
5 216 Juan 20 7.982
6 217 Rami 10 1.832
7 415 Jessica 4 920 8 873 538 7.994
8 416 Karen 890 6 12 9.993
9 417 Andrea 0 69 283 7.200
10 419 Rym 10 18 10 7.196
11 420 Noor 10 70 910 8.291
12 421 Nathalie 0 5 0 0.900
13 456 Joe 10 13 0 74.917
14 457 Loula 0 18 11 9.990
15 458 Maria 0 15 172 6.425
16 459 Carl 15 17 11 3.349
17 566 Diego 0 3.680
18 567 Carla 0 26 1 19.361
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With