reading and parsing a TSV file, then manipulating it for saving as CSV (efficiently)

Tags: python, file, csv

My source data is in a TSV file, 6 columns and greater than 2 million rows.

Here's what I'm trying to accomplish:

1. I need to read the data in three of the columns (3, 4, and 5) of this source file.
2. The fifth column is an integer. I need to use this value to repeat each row's entry, built from the third and fourth columns, that many times.
3. I want to write the output of #2 to an output file in CSV format.

Below is what I came up with.

My question: is this an efficient way to do it? It seems like it might be intensive when attempted on 2 million rows.

First, I made a sample tab-separated file to work with and called it 'sample.txt'. It's basic and only has four rows:

    Row1_Column1    Row1-Column2    Row1-Column3    Row1-Column4    2   Row1-Column6
    Row2_Column1    Row2-Column2    Row2-Column3    Row2-Column4    3   Row2-Column6
    Row3_Column1    Row3-Column2    Row3-Column3    Row3-Column4    1   Row3-Column6
    Row4_Column1    Row4-Column2    Row4-Column3    Row4-Column4    2   Row4-Column6

Then I have this code:

    import csv

    with open('sample.txt','r') as tsv:
        AoA = [line.strip().split('\t') for line in tsv]

    for a in AoA:
        count = int(a[4])
        while count > 0:
            with open('sample_new.csv', 'a', newline='') as csvfile:
                csvwriter = csv.writer(csvfile, delimiter=',')
                csvwriter.writerow([a[2], a[3]])
            count = count - 1

304

asked Dec 21 '12 15:12

CJH

1 Answer

You should use the csv module to read the tab-separated value file. Do not read it into memory in one go. Each row you read has all the information you need to write rows to the output CSV file, after all. Keep the output file open throughout.

    import csv

    with open('sample.txt', newline='') as tsvin, open('new.csv', 'w', newline='') as csvout:
        tsvin = csv.reader(tsvin, delimiter='\t')
        csvout = csv.writer(csvout)

        for row in tsvin:
            count = int(row[4])
            if count > 0:
                csvout.writerows([row[2:4] for _ in range(count)])

or, using the itertools module to do the repeating with itertools.repeat():

    from itertools import repeat
    import csv

    with open('sample.txt', newline='') as tsvin, open('new.csv', 'w', newline='') as csvout:
        tsvin = csv.reader(tsvin, delimiter='\t')
        csvout = csv.writer(csvout)

        for row in tsvin:
            count = int(row[4])
            if count > 0:
                csvout.writerows(repeat(row[2:4], count))
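To sanity-check this streaming approach before running it on the full 2-million-row file, here is a minimal sketch that applies the same loop to in-memory text instead of files on disk. The helper name `expand_rows` and the use of `io.StringIO` are assumptions made for illustration only; the sample rows are the four from the question, written with real tab characters.

```python
import csv
import io
from itertools import repeat


def expand_rows(tsv_file, csv_file):
    """Hypothetical helper: stream rows from an open TSV file object and
    write columns 3 and 4 to an open CSV file object, repeating each pair
    as many times as the integer in column 5 says."""
    reader = csv.reader(tsv_file, delimiter='\t')
    writer = csv.writer(csv_file)
    for row in reader:
        count = int(row[4])
        if count > 0:
            # repeat() yields the same two-column slice `count` times
            writer.writerows(repeat(row[2:4], count))


# The four sample rows from the question, tab-separated.
sample = (
    "Row1_Column1\tRow1-Column2\tRow1-Column3\tRow1-Column4\t2\tRow1-Column6\n"
    "Row2_Column1\tRow2-Column2\tRow2-Column3\tRow2-Column4\t3\tRow2-Column6\n"
    "Row3_Column1\tRow3-Column2\tRow3-Column3\tRow3-Column4\t1\tRow3-Column6\n"
    "Row4_Column1\tRow4-Column2\tRow4-Column3\tRow4-Column4\t2\tRow4-Column6\n"
)

source = io.StringIO(sample, newline='')
target = io.StringIO(newline='')
expand_rows(source, target)

output_lines = target.getvalue().splitlines()
print(len(output_lines))   # 2 + 3 + 1 + 2 = 8 rows
print(output_lines[0])     # Row1-Column3,Row1-Column4
```

Because the helper only ever holds one input row at a time, memory use should stay roughly constant regardless of how many rows the source file has; passing real file objects opened with `newline=''` in place of the `StringIO` buffers gives the on-disk behaviour.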

113

answered Oct 01 '22 09:10

Martijn Pieters
