Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Remove extra commas, space & lines offset in .csv file using python

I have a 5 pages pdf file, each page has a table that I need extract. I need to extract all the tables from each page and save them as a data-frame file all using python so i converted the file, to a csv file using tabula

tabula.convert_into('input.pdf', "output.csv", output_format="csv", pages='all')

The main issue with the file output.csv is that there are several extra commas.

Example

Id,Name,Age,,Score,Rang,Bonus
181,ALEX,,,,20,987
182,Julia,,,,18,8.390
183,Marian,,,,21,9.170
184,Julien,,0,175,60,9.095
Id,Name,Age,,Score,Rang,Bonus
215,Asma,26,,35,19,3.807
216,Juan,,,,20,7.982
217,Rami,,,,10,1.832
Id,Name,Age,,Score,Rang,Bonus
415,Jessica,,4 920,8 873,538,7.994
416,Karen,,890,6,12,9.993
417,Andrea,,0,69,283,7.200
Id,Name,Age,,Score,Rang,Bonus
419,Rym,10,,18,,10,7.196
420,Noor,10,,70,,910,8.291
421,Nathalie,0,,5,,0,0.900
"",Id,Name,Age,,Score,Rang,Bonus
456,,Joe,,10,13,0,74.917
457,,Loula,,0,18,11,9.990
458,,Maria,,0,15,172,6.425
459,,Carl,,15,17,11,3.349
Id,Name,Age,,Score,Rang,Bonus
566,Diego,,,,0,3.680
567,Carla,0,,26,1,19.361

When i convert the csv file into row/columns i got some lines offset

Check the image below to get the problem: enter image description here As you can see in the image there are some lines offset(each table in each page of file has specific lines offset) how can i fix this problem

NB: The dataframe should have 6 columns with empty fields. I guess the extra commas comes from space in the pdf file. how can i remove extra commas from csv file or removing extra space on pdf file.

The expected output in the image below: enter image description here

I would really appreciate your help.

like image 827
Learner Avatar asked Jun 26 '26 01:06

Learner


2 Answers

I am basing my strategy on a short regular expression to capture the first 2 columns, and the numbers at the end.

(\d+,[^,]+,) → numbers + comma + anything but comma + comma
,*           → zero or more commas
(\d.+)       → the rest of the line starting from the first number

Then I am concatenating those two groups inserting enough commas in the middle so that the total is 5 (= 6 columns).

The seems like a pretty straightforward approach to me. It will work for any variation of input where random spaces and commas are inserted as long as the numeric data is right-aligned.

import re,io

def fix_line(line):
    # remove duplicate commas and spaces 
    line = re.sub(',,', ',', line.replace(' ', ''))
    # groups: first two rows / middle (non-captured) / numbers
    match = re.match(r'(\d+,[^,]+,),*(\d.+)', line)
    if not match: # removes the headers
        return ''
    # align numbers to right: 6 columns = 5 commas
    return match.groups()[0]+(','*(5-2-match.groups()[1].count(',')))+match.groups()[1]
    

data_corr = [fix_line(line) for line in lines]

df = pd.read_csv(io.StringIO('\n'.join(data_corr)),
                 names=re.sub(',,+', ',', lines[0]).split(',') # assign column names
                )

Assuming this input as the variable lines:

['Id,Name,Age,,Score,Rang,Bonus',
 '181,ALEX,,,,20,987',
 '182,Julia,,,,18,8.390',
 '183,Marian,,,,21,9.170',
 '184,Julien,,0,175,60,9.095',
 'Id,Name,Age,,Score,Rang,Bonus',
 '215,Asma,26,,35,19,3.807',
 '216,Juan,,,,20,7.982',
 '217,Rami,,,,10,1.832',
 'Id,Name,Age,,Score,Rang,Bonus',
 '415,Jessica,,4 920,8 873,538,7.994',
 '416,Karen,,890,6,12,9.993',
 '417,Andrea,,0,69,283,7.200',
 'Id,Name,Age,,Score,Rang,Bonus',
 '419,Rym,10,,18,,10,7.196',
 '420,Noor,10,,70,,910,8.291',
 '421,Nathalie,0,,5,,0,0.900',
 '"",Id,Name,Age,,Score,Rang,Bonus',
 '456,,Joe,,10,13,0,74.917',
 '457,,Loula,,0,18,11,9.990',
 '458,,Maria,,0,15,172,6.425',
 '459,,Carl,,15,17,11,3.349',
 'Id,Name,Age,,Score,Rang,Bonus',
 '566,Diego,,,,0,3.680',
 '567,Carla,0,,26,1,19.361']

output:

     Id      Name     Age   Score  Rang    Bonus
0   181      ALEX     NaN     NaN    20  987.000
1   182     Julia     NaN     NaN    18    8.390
2   183    Marian     NaN     NaN    21    9.170
3   184    Julien     0.0   175.0    60    9.095
4   215      Asma    26.0    35.0    19    3.807
5   216      Juan     NaN     NaN    20    7.982
6   217      Rami     NaN     NaN    10    1.832
7   415   Jessica  4920.0  8873.0   538    7.994
8   416     Karen   890.0     6.0    12    9.993
9   417    Andrea     0.0    69.0   283    7.200
10  419       Rym    10.0    18.0    10    7.196
11  420      Noor    10.0    70.0   910    8.291
12  421  Nathalie     0.0     5.0     0    0.900
13  456       Joe    10.0    13.0     0   74.917
14  457     Loula     0.0    18.0    11    9.990
15  458     Maria     0.0    15.0   172    6.425
16  459      Carl    15.0    17.0    11    3.349
17  566     Diego     NaN     NaN     0    3.680
18  567     Carla     0.0    26.0     1   19.361

NB. if the input is a file then first read lines using:

with open('/path/to/file', 'r') as f:
    lines = f.readlines()
like image 133
mozway Avatar answered Jun 28 '26 16:06

mozway


I found this a bit easier to follow than Martin Evans' answer

It's a the generator yields lines of the same length as the cleaned up first line. And removes the first empty string until a line has the right length.

Like Martin's answer it yields your expected dataframe from your example data.

import pandas as pd
from io import StringIO
import csv

f = StringIO("""Id,Name,Age,,Score,Rang,Bonus
181,ALEX,,,,20,987
182,Julia,,,,18,8.390
183,Marian,,,,21,9.170
184,Julien,,0,175,60,9.095
Id,Name,Age,,Score,Rang,Bonus
215,Asma,26,,35,19,3.807
216,Juan,,,,20,7.982
217,Rami,,,,10,1.832
Id,Name,Age,,Score,Rang,Bonus
415,Jessica,,4 920,8 873,538,7.994
416,Karen,,890,6,12,9.993
417,Andrea,,0,69,283,7.200
Id,Name,Age,,Score,Rang,Bonus
419,Rym,10,,18,,10,7.196
420,Noor,10,,70,,910,8.291
421,Nathalie,0,,5,,0,0.900
"",Id,Name,Age,,Score,Rang,Bonus
456,,Joe,,10,13,0,74.917
457,,Loula,,0,18,11,9.990
458,,Maria,,0,15,172,6.425
459,,Carl,,15,17,11,3.349
Id,Name,Age,,Score,Rang,Bonus
566,Diego,,,,0,3.680
567,Carla,0,,26,1,19.361""")


def clean_up(csv_file):
    header = None
    for line in csv_file:
        if not header:
            header = [v for v in line if v]
            length = len(header)
            continue
        while len(line) > length:
            line.remove('')
        if line != header:
            yield(dict(zip(header,line)))

df = pd.DataFrame(clean_up(csv.reader(f)))
print(df)

Which gives you:

     Id      Name    Age  Score Rang   Bonus
0   181      ALEX                 20     987
1   182     Julia                 18   8.390
2   183    Marian                 21   9.170
3   184    Julien      0    175   60   9.095
4   215      Asma     26     35   19   3.807
5   216      Juan                 20   7.982
6   217      Rami                 10   1.832
7   415   Jessica  4 920  8 873  538   7.994
8   416     Karen    890      6   12   9.993
9   417    Andrea      0     69  283   7.200
10  419       Rym     10     18   10   7.196
11  420      Noor     10     70  910   8.291
12  421  Nathalie      0      5    0   0.900
13  456       Joe     10     13    0  74.917
14  457     Loula      0     18   11   9.990
15  458     Maria      0     15  172   6.425
16  459      Carl     15     17   11   3.349
17  566     Diego                  0   3.680
18  567     Carla      0     26    1  19.361
like image 29
Chris Wesseling Avatar answered Jun 28 '26 15:06

Chris Wesseling