
Parsing a very large CSV file. Need to break apart one field into lots of smaller rows & keep ID in each row.

I have a large CSV that is made up of an "ID" column and a "History" column.

The ID is simple, just an integer.

The History, though, is a single cell made up of up to hundreds of entries that are separated by *** NOTE *** in the text area.

I want to parse this with Python and the CSV module to read the data in and export it out as a new CSV as below.

EXISTING DATA STRUCTURE:

ID,History

56457827, "*** NOTE ***
2014-02-25
Long note here.  This is just a stand in to give you an idea
*** NOTE ***
2014-02-20
Another example.
This one has carriage returns.

Demonstrates they're all a bit different, though are really just text based"
56457896, "*** NOTE ***
2015-03-26
Another example of a note here.  This is the text portion.
*** NOTE ***
2015-05-24
Another example yet again."

REQUIRED DATA STRUCTURE:

ID, Date, History

56457827, 2014-02-25, "Long note here.  This is just a stand in to give you an idea"
56457827, 2014-02-20, "Another example.
This one has carriage returns.

Demonstrates they're all a bit different, though are really just text based"
56457896, 2015-03-26, "Another example of a note here.  This is the text portion."
56457896, 2015-05-24, "Another example yet again."

So I will need to master some commands. I'm guessing I need a loop that brings the data in, which I'm sure I'll be able to manage, but then I need to analyse the data.

I believe I'll need to:

  1. start looping through the CSV structure
  2. make note of the first ID
  3. search for *** NOTE *** in the History field
  4. somehow grab the date string and make a note of it
  5. add all following string data we find after the date string to a variable (let's call it "historyShaper") until...
  6. ... until I find the next *** NOTE ***
  7. remove all instances of *** NOTE *** from the new variable "historyShaper"
  8. write the ID and the "historyShaper" to a new line in a new CSV file
  9. repeat steps 2-8 until the end of the History field

This file is about 5MB. Is this the best approach to do this? I'm relatively new to programming and data manipulation, so I'm open to any constructive criticism before I kick into this tonight when I crack open the laptop and dig in.

Thanks so much, all feedback greatly appreciated.

robster asked Dec 02 '25 13:12



1 Answer

You can easily parse the input file with the csv module, but you will need to set skipinitialspace=True, because your file has a space after each comma. I also assume that the empty line after the header should not be there.
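To see why skipinitialspace matters for this layout, here is a minimal sketch that parses one made-up record shaped like the question's data, once with the default settings and once with skipinitialspace=True. The sample text is an assumption for illustration only, not a line from the real file.

```python
import csv
import io

# One record shaped like the question's data: a space follows the comma,
# and the quoted field spans two lines. (Made-up sample text.)
text = '56457827, "first line\nsecond line"\n'

# Default dialect: the space makes the field start unquoted, so the
# quote character is kept as plain text and the newline ends the record.
default_rows = list(csv.reader(io.StringIO(text)))

# skipinitialspace=True drops the space, so the quote opens a quoted
# field and the embedded newline stays inside that field.
skip_rows = list(csv.reader(io.StringIO(text), skipinitialspace=True))

print(default_rows)
print(skip_rows)
```

With the defaults the record is broken in two at the embedded newline, while with skipinitialspace=True the whole History cell comes back as a single field.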

Then, you should split the History column on '*** NOTE ***'. The first line of each note's text should be the date, and the rest is the actual History. The code could be:

import csv

# input_file_name and output_file_name must hold the source and target paths
with open(input_file_name, newline='') as fd, \
     open(output_file_name, "w", newline='') as fdout:
    rd = csv.reader(fd, skipinitialspace=True)
    ID, Hist = next(rd)    # read the header line
    wr = csv.writer(fdout)
    wr.writerow((ID, 'Date', Hist))  # write header of output file
    for row in rd:
        if not row:        # skip blank lines, if any
            continue
        # print(row)      # uncomment for debug traces
        hists = row[1].split('*** NOTE ***')
        for h in hists:
            h = h.strip()
            if len(h) == 0:     # skip the empty text before the first marker
                continue
            # each note should begin with a date line; partition also
            # copes with a note that has a date but no text after it
            date, _, h2 = h.partition('\n')
            wr.writerow((row[0], date.strip(), h2.strip()))
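If you want to try the approach without touching the real file, the same logic can be sketched as two small functions and run against the second record from the question, using io.StringIO objects in place of files. The names split_history and reshape are hypothetical helpers introduced for this sketch, and the marker and column names are taken from the question's sample data.

```python
import csv
import io

MARKER = "*** NOTE ***"  # separator seen in the question's sample data


def split_history(history, marker=MARKER):
    """Yield (date, text) pairs from one History cell (hypothetical helper)."""
    for chunk in history.split(marker):
        chunk = chunk.strip()
        if not chunk:  # nothing but whitespace before the first marker
            continue
        # The first line is assumed to be the date, the rest the note text.
        date, _, text = chunk.partition("\n")
        yield date.strip(), text.strip()


def reshape(src, dst):
    """Read ID/History rows from src, write ID/Date/History rows to dst."""
    reader = csv.reader(src, skipinitialspace=True)
    writer = csv.writer(dst)
    next(reader)  # skip the input header
    writer.writerow(("ID", "Date", "History"))
    for row in reader:
        if not row:  # tolerate blank lines
            continue
        for date, text in split_history(row[1]):
            writer.writerow((row[0], date, text))


# Second record from the question, held in memory instead of a file.
sample = (
    "ID,History\n"
    '56457896, "*** NOTE ***\n'
    "2015-03-26\n"
    "Another example of a note here.  This is the text portion.\n"
    "*** NOTE ***\n"
    "2015-05-24\n"
    'Another example yet again."\n'
)

out = io.StringIO()
reshape(io.StringIO(sample), out)
rows = list(csv.reader(io.StringIO(out.getvalue())))
for r in rows:
    print(r)
```

Keeping the splitting in its own function makes it easy to check odd cells on their own before running the whole file, for example a note that has a date but no text.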
Serge Ballesta answered Dec 04 '25 02:12



