Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to split the csv based on multiple columns

Tags:

python

csv

filter

I am trying to split a csv into multiple files based on two column values. For example,

Source file:

Header1  Header2             Header3
Alpha    energy              0.1
Alpha    energy              0.34
Beta     energy_imbalance    0.66
Beta     energy              0.7
Beta     energy              0.1
Gamma    energy_imbalance    0.3

Expected output:

Outfile1:

Header1  Header2             Header3
Alpha    energy              0.1
Alpha    energy              0.34

Outfile2:

Header1  Header2             Header3
Beta     energy_imbalance    0.66

Outfile3:

Header1  Header2             Header3
Beta     energy              0.7
Beta     energy              0.1

Outfile4:

Header1  Header2             Header3
Gamma    energy_imbalance    0.3

The following is what I started with:

filein = open('test.csv')
csvin = csv.DictReader(filein)

outputs = {}
for row in csvin:
    primaryValue = row['Header1']
    secondaryValue = row['Header2']
    if primaryValue not in outputs:
        fileout = open('{}_{}.csv'.format(primaryValue,secondaryValue),'w')
        dw = csv.DictWriter(fileout, fieldnames=csvin.fieldnames)
        dw.writeheader()
        outputs[primaryValue] = fileout, dw
    outputs[primaryValue][1].writerow(row)

for fileout, _ in outputs.values():
    fileout.close()

I was able to split the file based on column = Header1, however I am not sure how to proceed further.

like image 656
JPcodes Avatar asked May 14 '26 05:05

JPcodes


1 Answers

Here's how to implement in a manner along the lines of what @Barmar's suggested (i.e. using the two column values as a dictionary key). As shown, the key is used to look-up which csv.DictWriter instance that gets used to write the row — creating new ones as necessary. It also closes all the associated files that were opened at the end by using a separate list that keesp track of those.

import csv

infile_name = 'multicol_test.csv'

with open(infile_name, newline='') as infile:
    csv_writers = {}
    files = []
    reader = csv.DictReader(infile)

    for row in reader:
        if (key := f"{row['Header1']}_{row['Header2']}") not in csv_writers:
            # Create the csv file and a corresponding DictWriter.
            outfile_name = f'{key}.csv'
            fileout = open(outfile_name, 'w', newline='')
            files.append(fileout)  # To have it closed later.
            writer = csv.DictWriter(fileout, fieldnames=reader.fieldnames)
            writer.writeheader()
            csv_writers[key] = writer

        # Write the line to corresponding csv writer.
        csv_writers[key].writerow(row)

    # Close all CSV output files.
    for f in files:
        f.close()

Applied to the sample input file, this would produce the following csv output files:

Alpha_energy.csv
Beta_energy.csv
Beta_energy_imbalance.csv
Gamma_energy_imbalance.csv

with the data in them you expect.

like image 87
martineau Avatar answered May 16 '26 19:05

martineau



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!