My source data is in a TSV file, 6 columns and greater than 2 million rows.
Here's what I'm trying to accomplish:
Below is what I came up with.
My question: is this an efficient way to do it? It seems like it might be intensive when attempted on 2 million rows.
First, I made a sample tab separate file to work with, and called it 'sample.txt'. It's basic and only has four rows:
Row1_Column1 Row1-Column2 Row1-Column3 Row1-Column4 2 Row1-Column6 Row2_Column1 Row2-Column2 Row2-Column3 Row2-Column4 3 Row2-Column6 Row3_Column1 Row3-Column2 Row3-Column3 Row3-Column4 1 Row3-Column6 Row4_Column1 Row4-Column2 Row4-Column3 Row4-Column4 2 Row4-Column6
then I have this code:
import csv with open('sample.txt','r') as tsv: AoA = [line.strip().split('\t') for line in tsv] for a in AoA: count = int(a[4]) while count > 0: with open('sample_new.csv', 'a', newline='') as csvfile: csvwriter = csv.writer(csvfile, delimiter=',') csvwriter.writerow([a[2], a[3]]) count = count - 1
CSV uses an escape syntax to represent commas and newlines in the data. TSV takes a different approach, disallowing TABs and newlines in the data. The escape syntax enables CSV to fully represent common written text. This is a good fit for human edited documents, notably spreadsheets.
The very simple way to read data from TSV File in Python is using split(). We can read a given TSV file and store its data into a list.
You should use the csv
module to read the tab-separated value file. Do not read it into memory in one go. Each row you read has all the information you need to write rows to the output CSV file, after all. Keep the output file open throughout.
import csv with open('sample.txt', newline='') as tsvin, open('new.csv', 'w', newline='') as csvout: tsvin = csv.reader(tsvin, delimiter='\t') csvout = csv.writer(csvout) for row in tsvin: count = int(row[4]) if count > 0: csvout.writerows([row[2:4] for _ in range(count)])
or, using the itertools
module to do the repeating with itertools.repeat()
:
from itertools import repeat import csv with open('sample.txt', newline='') as tsvin, open('new.csv', 'w', newline='') as csvout: tsvin = csv.reader(tsvin, delimiter='\t') csvout = csv.writer(csvout) for row in tsvin: count = int(row[4]) if count > 0: csvout.writerows(repeat(row[2:4], count))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With