Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

reading and parsing a TSV file, then manipulating it for saving as CSV (*efficiently*)

Tags:

python

file

csv

My source data is in a TSV file, 6 columns and greater than 2 million rows.

Here's what I'm trying to accomplish:

  1. I need to read the data in 3 of the columns (3, 4, 5) in this source file
  2. The fifth column is an integer. I need to use this integer value to duplicate a row entry with using the data in the third and fourth columns (by the number of integer times).
  3. I want to write the output of #2 to an output file in CSV format.

Below is what I came up with.

My question: is this an efficient way to do it? It seems like it might be intensive when attempted on 2 million rows.

First, I made a sample tab separate file to work with, and called it 'sample.txt'. It's basic and only has four rows:

Row1_Column1    Row1-Column2    Row1-Column3    Row1-Column4    2   Row1-Column6 Row2_Column1    Row2-Column2    Row2-Column3    Row2-Column4    3   Row2-Column6 Row3_Column1    Row3-Column2    Row3-Column3    Row3-Column4    1   Row3-Column6 Row4_Column1    Row4-Column2    Row4-Column3    Row4-Column4    2   Row4-Column6 

then I have this code:

import csv   with open('sample.txt','r') as tsv:     AoA = [line.strip().split('\t') for line in tsv]  for a in AoA:     count = int(a[4])     while count > 0:         with open('sample_new.csv', 'a', newline='') as csvfile:             csvwriter = csv.writer(csvfile, delimiter=',')             csvwriter.writerow([a[2], a[3]])         count = count - 1 
like image 304
CJH Avatar asked Dec 21 '12 15:12

CJH


People also ask

Is a TSV file the same as CSV?

CSV uses an escape syntax to represent commas and newlines in the data. TSV takes a different approach, disallowing TABs and newlines in the data. The escape syntax enables CSV to fully represent common written text. This is a good fit for human edited documents, notably spreadsheets.

How does Python read TSV?

The very simple way to read data from TSV File in Python is using split(). We can read a given TSV file and store its data into a list.


1 Answers

You should use the csv module to read the tab-separated value file. Do not read it into memory in one go. Each row you read has all the information you need to write rows to the output CSV file, after all. Keep the output file open throughout.

import csv  with open('sample.txt', newline='') as tsvin, open('new.csv', 'w', newline='') as csvout:     tsvin = csv.reader(tsvin, delimiter='\t')     csvout = csv.writer(csvout)      for row in tsvin:         count = int(row[4])         if count > 0:             csvout.writerows([row[2:4] for _ in range(count)]) 

or, using the itertools module to do the repeating with itertools.repeat():

from itertools import repeat import csv  with open('sample.txt', newline='') as tsvin, open('new.csv', 'w', newline='') as csvout:     tsvin = csv.reader(tsvin, delimiter='\t')     csvout = csv.writer(csvout)      for row in tsvin:         count = int(row[4])         if count > 0:             csvout.writerows(repeat(row[2:4], count)) 
like image 113
Martijn Pieters Avatar answered Oct 01 '22 09:10

Martijn Pieters