
Importing large tab-delimited .txt file into Python

I have a tab-delimited .txt file that I'm trying to import into a matrix (2-D array) in Python, keeping the same layout as the text file. The file looks like this:

123088 266 248 244 266 244 277

123425 275 244 241 289 248 231

123540 156 654 189 354 156 987

Note that there are many more rows like the ones above (roughly 200). I want to pass all of them into Python and keep the same layout when creating the matrix.

The current code that I have for this is:

d = {}
with open('file name', 'rb') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter='\t')
    for row in csv_reader:
        d[row[0]] = row[1:]

This does roughly what I need, but it is not quite my goal. I want to end up with code where I can type print(d[0,3]) and it will spit out 248.

Harley asked Jun 07 '13 17:06


2 Answers

First, you are loading it into a dictionary, which will not give you the list of lists that you want.

It's dead simple to use the CSV module to generate a list of lists like this:

import csv
# path is the location of your .txt file
with open(path, newline="") as f:
    reader = csv.reader(f, delimiter="\t")
    d = list(reader)
print(d[0][2])  # 248 (still a string at this point)

That would give you a list of lists of strings, so if you wanted to get numbers, you'd have to convert to int.
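Here is a minimal sketch of that conversion, assuming every field in the file is a whole number. The three rows are copied from the question, and io.StringIO is only a stand-in for the real file so the snippet runs on its own:

```python
import csv
import io

# Stand-in for the real file: three tab-delimited rows from the question.
sample = io.StringIO(
    "123088\t266\t248\t244\t266\t244\t277\n"
    "123425\t275\t244\t241\t289\t248\t231\n"
    "123540\t156\t654\t189\t354\t156\t987\n"
)

reader = csv.reader(sample, delimiter="\t")
# Convert every field to int while building the list of lists.
d = [[int(field) for field in row] for row in reader]

print(d[0][2])  # 248
```

With a real file you would pass the opened file object to csv.reader instead of sample. If some fields can be empty or non-numeric, int() will raise ValueError, so you would need to handle that case.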

That said, if you have a large array (or are doing any kind of numeric calculations), you should consider using something like NumPy or pandas. If you wanted to use NumPy, you could do

import numpy as np
d = np.loadtxt(path, delimiter="\t")
print(d[0, 2])  # 248.0 (loadtxt returns floats by default)

As a bonus, NumPy arrays allow you to do quick vector/matrix operations. (Also, note that d[0][2] would work with the NumPy array too).
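To illustrate the vector/matrix point, here is a small sketch under a couple of assumptions: the data is all integers (hence dtype=int), and an in-memory io.StringIO buffer stands in for the file, since np.loadtxt accepts file-like objects as well as paths. The rows are the three from the question:

```python
import io

import numpy as np

# Stand-in for the real file: three tab-delimited rows from the question.
sample = io.StringIO(
    "123088\t266\t248\t244\t266\t244\t277\n"
    "123425\t275\t244\t241\t289\t248\t231\n"
    "123540\t156\t654\t189\t354\t156\t987\n"
)

# dtype=int keeps the values as integers instead of the default floats.
d = np.loadtxt(sample, delimiter="\t", dtype=int)

print(d[0, 2])         # 248
print(d.shape)         # (3, 7)
print(d[:, 1].sum())   # sum of the second column: 697
print(d[0, 1:].max())  # largest reading in the first row: 277
```

Slices such as d[:, 1] (a whole column) or d[0, 1:] (a row without its leading ID) are what make the array form convenient compared with a plain list of lists.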

Jeff Tratner answered Sep 28 '22 10:09


Try this:

d = []
with open(sourcefile) as source:
    for line in source:
        # strip the trailing newline so it does not end up in the last field
        fields = line.rstrip('\n').split('\t')
        d.append(fields)

print(d[0][1]) will print 266.

print(d[0][2]) (remember your arrays are 0-based) will print 248.

To output the data in the same format as your input:

for line in d:
    print("\t".join(line))
jsucsy answered Sep 28 '22 10:09