How to get the pivot lines from two pairs of files in Python?

Question

In "How to get the pivot lines from two tab-separated files?", there is a quick way to pivot lines from two files using Unix commands.

If we have two pairs of files:

f1a and f1b
f2a and f2b

The goal is to produce a 3-column tab-separated file that comprises:

f1a / f2a
f1b
f2b

Where f1a / f2a are the lines that occur in both f1a and f2a.

I've tried the following, which works, but if the files are extremely large (e.g. billions of lines), storing the f1 and f2 dictionaries takes a significant amount of memory.

import sys
from tqdm import tqdm

f1a, f1b, f2a, f2b = sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4]


# Read the first pair of files into memory.
with open(f1a) as fin_f1a, open(f1b) as fin_f1b:
  f1 = {s.strip().replace('\t', ' '): t.strip().replace('\t', ' ') for s, t in tqdm(zip(fin_f1a, fin_f1b))}

# Read the second pair of files into memory.
with open(f2a) as fin_f2a, open(f2b) as fin_f2b:
  f2 = {s.strip().replace('\t', ' '): t.strip().replace('\t', ' ') for s, t in tqdm(zip(fin_f2a, fin_f2b))}


# Write only the keys that occur in both pairs.
with open('pivoted.tsv', 'w') as fout:
  for s in tqdm(f1.keys() & f2.keys()):
    print('\t'.join([s, f1[s], f2[s]]), end='\n', file=fout)

Is there a faster/better/easier way to produce the same 3-column tab-separated file in Python? Are there libraries that can do such operations efficiently for huge files?

Using turicreate.SFrame, I could also do:

import sys

from turicreate import SFrame

f1a, f1b, f2a, f2b = sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4]

sf1a = SFrame.read_csv(f1a, delimiter='\0', header=False)
sf1b = SFrame.read_csv(f1b, delimiter='\0', header=False)

sf2a = SFrame.read_csv(f2a, delimiter='\0', header=False)
sf2b = SFrame.read_csv(f2b, delimiter='\0', header=False)

sf1 = sf1a.join(sf1b) 
sf2 = sf2a.join(sf2b)

sf = sf1.join(sf2, on='X1', how='left') 
sf.save('pivoted')

Bob · Accepted Answer

Generic merge

The zip function is lazy: it does not store a whole copy of its iterables, so we can use it safely on large files.
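As a small illustration of that point, here is a sketch that zips two unbounded counters; it only finishes because zip produces pairs on demand. The helper name numbered_pairs is made up for this example.

```python
import itertools


def numbered_pairs():
    # Two unbounded generators: building a full list of either would never finish.
    evens = itertools.count(0, 2)
    odds = itertools.count(1, 2)
    # zip pulls one item from each input only when the next pair is requested.
    return zip(evens, odds)


# Take just the first three pairs; nothing beyond them is ever computed.
first_three = list(itertools.islice(numbered_pairs(), 3))
print(first_three)
```

File objects behave the same way when iterated, so zipping two open files reads them line by line.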

Assuming you have two iterables that are sorted in ascending order by the first column, you can join the two tables as follows.

def merge(t1, t2):
    end = object()
    end_ = end, None
    a1, b1 = next(t1, end_)
    a2, b2 = next(t2, end_)
    while a1 is not end and a2 is not end:
        if a1 < a2:
            a1, b1 = next(t1, end_)
        elif a1 > a2:
            a2, b2 = next(t2, end_)
        else:
            yield a1, b1, b2
            a1, b1 = next(t1, end_)
            a2, b2 = next(t2, end_)

merge is invoked with two iterators and produces a third iterator; only one element of each input iterator needs to be stored at a time.

list(merge(iter([(0, 1), (1, 1), (3, 2)]), 
  iter([(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd'), (4, 'e')])))

[(0, 1, 'a'), (1, 1, 'b'), (3, 2, 'd')]
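Since merge silently skips matches when its inputs are not sorted, it may be worth guarding each input. The following is only a sketch under that assumption; check_sorted is a hypothetical helper, not part of the answer above. It passes pairs through unchanged and raises as soon as a key is smaller than the one before it.

```python
def check_sorted(pairs):
    """Yield (key, value) pairs unchanged; raise if the keys ever decrease."""
    previous = None
    first = True
    for key, value in pairs:
        # Equal neighbouring keys are allowed; only a strict decrease is an error.
        if not first and key < previous:
            raise ValueError(f"input not sorted: {key!r} came after {previous!r}")
        previous, first = key, False
        yield key, value
```

It could then wrap each input, as in merge(check_sorted(t1), check_sorted(t2)), at the cost of one extra comparison per row.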

Scanning and writing

To avoid loading a whole file into memory, the scan function below reads and yields one line from each file at a time.

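def scan(fa, fb):
    # Walk both files in lockstep and normalise tabs inside each line.
    for a, b in zip(fa, fb):
        a = a.strip().replace('\t', ' ')
        b = b.strip().replace('\t', ' ')
        yield a, b


def scan_by_name(fa, fb):
    # Open the two files by path and delegate to scan.
    with open(fa) as fha, open(fb) as fhb:
        yield from scan(fha, fhb)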

Then you could apply it to your problem this way (untested, I don't have your files):

with open('pivoted.tsv', 'w') as fout:
    t1 = scan_by_name(f1a, f1b)
    t2 = scan_by_name(f2a, f2b)
    for row in merge(t1, t2):
        print('\t'.join(row), end='\n', file=fout)
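To try the whole pipeline without the real files, here is a self-contained sketch that repeats merge and a condensed scan_by_name and runs them on toy data in a temporary directory. The names pivot, write_lines and demo, and the sample words, are assumptions made for this example. Each "a" file is assumed to be already sorted in Python string order.

```python
import os
import tempfile


def merge(t1, t2):
    # Same sorted-merge join as above: emit a row whenever both keys match.
    end = object()
    end_ = end, None
    a1, b1 = next(t1, end_)
    a2, b2 = next(t2, end_)
    while a1 is not end and a2 is not end:
        if a1 < a2:
            a1, b1 = next(t1, end_)
        elif a1 > a2:
            a2, b2 = next(t2, end_)
        else:
            yield a1, b1, b2
            a1, b1 = next(t1, end_)
            a2, b2 = next(t2, end_)


def scan_by_name(fa, fb):
    # Read two parallel files in lockstep, one line of each at a time.
    with open(fa) as fha, open(fb) as fhb:
        for a, b in zip(fha, fhb):
            yield a.strip().replace('\t', ' '), b.strip().replace('\t', ' ')


def pivot(f1a, f1b, f2a, f2b, out_path):
    # Hypothetical wrapper: write the 3-column TSV and return the row count.
    count = 0
    with open(out_path, 'w') as fout:
        for row in merge(scan_by_name(f1a, f1b), scan_by_name(f2a, f2b)):
            fout.write('\t'.join(row) + '\n')
            count += 1
    return count


def write_lines(path, lines):
    # Test helper: one item per line, with a trailing newline.
    with open(path, 'w') as fh:
        fh.write('\n'.join(lines) + '\n')


def demo():
    # Toy data only; the keys in each "a" file are in ascending order.
    with tempfile.TemporaryDirectory() as d:
        p = {n: os.path.join(d, n) for n in ('f1a', 'f1b', 'f2a', 'f2b', 'out')}
        write_lines(p['f1a'], ['apple', 'banana', 'cherry'])
        write_lines(p['f1b'], ['pomme', 'banane', 'cerise'])
        write_lines(p['f2a'], ['banana', 'cherry', 'date'])
        write_lines(p['f2b'], ['Banane', 'Kirsche', 'Dattel'])
        pivot(p['f1a'], p['f1b'], p['f2a'], p['f2b'], p['out'])
        with open(p['out']) as fh:
            return fh.read()
```

Calling demo() returns the contents of the pivoted file, which should contain only the keys shared by both pairs. If the real files are not sorted, they would need an external sort first, keeping each "a" line paired with its "b" line.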


Tags: performance, python, dictionary, csv, memory-efficient

alvas
