Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to get the pivot lines from two tab-separated files?

Given two files file1.txt

abc def \t 123 456
jkl mno \t 987 654
foo bar \t 789 123
bar bar \t 432 

and file2.txt

foo bar \t hello world
abc def \t good morning
xyz \t 456

The task is to extract the lines where the first column matches and achieve:

abc def \t 123 456 \t good morning
foo bar \t 789 123 \t hello world

I can do it in Python as such:

from io import StringIO

file1 = """abc def \t 123 456
jkl mno \t 987 654
foo bar \t 789 123
bar bar \t 432"""


file2 = """foo bar \t hello world
abc def \t good morning
xyz \t 456"""

map1, map2 = {}, {}

with StringIO(file1) as fin1:
    for line in file1.split('\n'):
        one, two = line.strip().split('\t')
        map1[one] = two
    
    
with StringIO(file2) as fin2:
    for line in file2.split('\n'):
        one, two = line.strip().split('\t')
        map2[one] = two
        
        
for k in set(map1).intersection(set(map2)):
    print('\t'.join([k, map1[k], map2[k]]))

The actual task files have billions of lines, are there faster solution without loading everything and keeping the hashmaps/dictionaries?

Maybe using unix/bash commands? Would pre-sorting the files help?

like image 719
alvas Avatar asked Feb 09 '21 14:02

alvas


People also ask

How do I create a pivot table from multiple files?

Steps to Create a Pivot Table using Data from Multiple Workbooks. 1 Step 1 – Combine Files using Power Query. First of all, we need to combine all the files into one file with power query. 2 Step 2 – Prepare Data for the Pivot Table. 3 Step 3 – Insert the Pivot Table.

How to separate a PivotTable from a preexisting pivot table?

If you select Yes, then the PivotTable will be calculated on the same Data Cache as preexisting Pivot Tables and it will suffer from all the symptoms described above. If you select No, then you will create a new Data Cache for this Pivot Table and therefor it will be separate from the preexisting Pivot Tables! The method is quite simple.

How to use power query for a pivot table?

But, the simple and best way is to use refresh button from the data tab. It will refersh both (Pivot Table + Query). Here I have listed some points which you need to remember while using power query for a pivot table. Files should be in a single folder. Data should be in the same format in all the files.

What is a pivot table in Microsoft Access?

Pivot Tables from multiple sheets is a concept where there should be two or more tables to be added to one table and the fields can be selected according to the requirement from one place. In one word different tables, data can be appended from different sheets using some techniques and shortcuts.


2 Answers

The join command is sometimes hard to work with, but here it's straightforward:

join -t $'\t' <(sort file1.txt) <(sort file2.txt)

That uses bash's ANSI-C quoting to specify the tab separator, and process substitutions to treat program output as a file.

To see the output, pipe the above into cat -A to see the tabs represented as ^I:

abc def^I123 456^Igood morning$
foo bar^I789 123^Ihello world$
like image 158
glenn jackman Avatar answered Sep 27 '22 22:09

glenn jackman


You may try this awk:

awk '{key = $1 FS $2} FNR==NR {sub(/^([^[:blank:]]+[[:blank:]]+){2}/, ""); map[key] = $0; next} key in map {print $0, map[key]}' file2.txt file1.txt

abc def \t 123 456 \t good morning
foo bar \t 789 123 \t hello world

A more readable version:

awk '{
   key = $1 FS $2
}
FNR == NR {
   sub(/^([^[:blank:]]+[[:blank:]]+){2}/, "")
   map[key] = $0
   next
}
key in map {
   print $0, map[key]
}' file2.txt file1.txt

It only loads data from file2 into memory and processes records of file1 line by line.

like image 38
anubhava Avatar answered Sep 27 '22 22:09

anubhava