I have N tab-separated files. Each file has a header line giving the names of its columns. Some of the columns are common to all of the files, but some are unique.
I want to combine all of the files into one big file containing all of the relevant headers.
Example:
> cat file1.dat
a b c
5 7 2
3 9 1
> cat file2.dat
a b e f
2 9 8 3
2 8 3 3
1 0 3 2
> cat file3.dat
a c d g
1 1 5 2
> merge file*.dat
a b c d e f g
5 7 2 - - - -
3 9 1 - - - -
2 9 - - 8 3 -
2 8 - - 3 3 -
1 0 - - 3 2 -
1 - 1 5 - - 2
The - can be replaced by anything, for example NA.
Caveat: the files are so big that I cannot load all of them into memory simultaneously.
I had a solution in R using
write.table(do.call(plyr::rbind.fill,
                    Map(function(filename)
                            read.table(filename, header=1, check.names=0),
                        filename=list.files('.'))),
            'merged.dat', quote=FALSE, sep='\t', row.names=FALSE)
but this fails with a memory error when the data are too large.
What is the best way to accomplish this?
I am thinking the best route will be to first loop through all the files to collect the column names, then loop through the files to put them into the right format, and write them to disc as they are encountered. However, is there perhaps already some code available that performs this?
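To illustrate that two-pass idea, here is a minimal sketch in Python (assuming tab-separated input matching file*.dat, an output file merged.dat, and - as the placeholder, all taken from the example above):
import glob

# Pass 1: collect the union of all column names from the header lines only.
files = sorted(glob.glob('file*.dat'))          # input pattern assumed from the example
columns = set()
for path in files:
    with open(path) as fh:
        columns.update(fh.readline().rstrip('\n').split('\t'))
columns = sorted(columns)

# Pass 2: stream each file line by line and write the fields in the merged order.
with open('merged.dat', 'w') as out:            # output name assumed
    out.write('\t'.join(columns) + '\n')
    for path in files:
        with open(path) as fh:
            header = fh.readline().rstrip('\n').split('\t')
            pos = {name: i for i, name in enumerate(header)}
            for line in fh:
                fields = line.rstrip('\n').split('\t')
                out.write('\t'.join(fields[pos[c]] if c in pos else '-'
                                    for c in columns) + '\n')
Only the set of column names and a single line are held in memory at any time, so file size is not an issue. The awk approach below follows the same two-pass structure.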
From an algorithm point of view I would take the following steps:
Process the headers:
- read all headers of all input files and extract all column names
- sort the column names in the order you want
- create a lookup table which returns the column name when a field number is given (h[n] -> "name")
Process the files: after the headers, you can reprocess the files
- read the header of the file
- create a lookup table which returns the field number when given a column name; an associative array is useful here (a["name"] -> field_number)
Process the remainder of the file:
- loop over all fields of the merged file
- get the column name with h
- check if the column name is in a; if not, print the symbol -, if so, print the field corresponding with a
This is easily done with GNU awk, making use of the extensions nextfile and asorti. The nextfile statement allows us to read only the header and move on to the next file without processing the rest of it. Since we need to process each file twice (step 1 reading the header, step 2 reading the body), we ask awk to dynamically manipulate its argument list: every time a file's header is processed, we add the file at the end of the argument list ARGV so it can be used again for step 2.
BEGIN { s="-" } # define symbol
BEGIN { f=ARGC-1 } # get total number of files
f { for (i=1;i<=NF;++i) h[$i] # read headers in associative array h[key]
ARGV[ARGC++] = FILENAME # add file at end of argument list
if (--f == 0) { # did we process all headers?
n=asorti(h) # sort header into h[idx] = key
for (i=1;i<=n;++i) # print header
printf "%s%s", h[i], (i==n?ORS:OFS)
}
nextfile # end of processing headers
}
# Start of processing the files
(FNR==1) { delete a; for(i=1;i<=NF;++i) a[$i]=i; next } # read header
{ for(i=1;i<=n;++i) printf "%s%s", (h[i] in a ? $(a[h[i]]) : s), (i==n?ORS:OFS) }
If you store the above in a file merge.awk, you can use the command:
awk -f merge.awk f1 f2 f3 f4 ... fx
A similar way, but with less hassle with f:
BEGIN { s="-" } # define symbol
BEGIN { # modify argument list from
c=ARGC; # from: arg1 arg2 ... argx
ARGV[ARGC++]="f=1" # to: arg1 arg2 ... argx f=1 arg1 arg2 ... argx
for(i=1;i<c;++i) ARGV[ARGC++]=ARGV[i]
}
!f { for (i=1;i<=NF;++i) h[$i] # read headers in associative array h[key]
nextfile
}
(f==1) && (FNR==1) { # process merged header
n=asorti(h) # sort header into h[idx] = key
for (i=1;i<=n;++i) # print header
printf "%s%s", h[i], (i==n?ORS:OFS)
f=2
}
# Start of processing the files
(FNR==1) { delete a; for(i=1;i<=NF;++i) a[$i]=i; next } # read header
{ for(i=1;i<=n;++i) printf "%s%s", (h[i] in a ? $(a[h[i]]) : s), (i==n?ORS:OFS) }
This method is slightly different, but allows processing files with different field separators, as in
awk -f merge.awk f1 FS="," f2 f3 FS="|" f4 ... fx
If your argument list becomes too long, you can use awk to create it for you:
BEGIN { s="-" } # define symbol
BEGIN { # read argument list from input file:
fname=(ARGC==1 ? "-" : ARGV[1])
ARGC=1 # from: filelist or /dev/stdin
while ((getline < fname) > 0) # to: arg1 arg2 ... argx
ARGV[ARGC++]=$0
}
BEGIN { # modify argument list from
c=ARGC; # from: arg1 arg2 ... argx
ARGV[ARGC++]="f=1" # to: arg1 arg2 ... argx f=1 arg1 arg2 ... argx
for(i=1;i<c;++i) ARGV[ARGC++]=ARGV[i]
}
!f { for (i=1;i<=NF;++i) h[$i] # read headers in associative array h[key]
nextfile
}
(f==1) && (FNR==1) { # process merged header
n=asorti(h) # sort header into h[idx] = key
for (i=1;i<=n;++i) # print header
printf "%s%s", h[i], (i==n?ORS:OFS)
f=2
}
# Start of processing the files
(FNR==1) { delete a; for(i=1;i<=NF;++i) a[$i]=i; next } # read header
{ for(i=1;i<=n;++i) printf "%s%s", (h[i] in a ? $(a[h[i]]) : s), (i==n?ORS:OFS) }
which can be run as:
$ awk -f merge.awk filelist
$ find . | awk -f merge.awk "-"
$ find . | awk -f merge.awk
or any similar command.
As you can see, by adding only a tiny block of code, we were able to flexibly adjust the awk code to our needs.
Miller (johnkerl/miller) is underused when dealing with huge files. It has tons of features from all the useful file-processing tools out there. As the official documentation says:
Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON. You get to work with your data using named fields, without needing to count positional column indices.
For this particular case, it supports the verb unsparsify, which the documentation describes as follows:
Prints records with the union of field names over all input records. For field names absent in a given record but present in others, fills in a value. This verb retains all input before producing any output.
You just need to do the following and reorder the columns as you desire:
mlr --tsvlite --opprint unsparsify then reorder -f a,b,c,d,e,f file{1..3}.dat
which produces the output in one shot:
a b c d e f g
5 7 2 - - - -
3 9 1 - - - -
2 9 - - 8 3 -
2 8 - - 3 3 -
1 0 - - 3 2 -
1 - 1 5 - - 2
You can even customize the character used to fill the empty fields; the default is -. For a custom fill character, use unsparsify --fill-with '#'.
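For instance, to use NA as the filler (as mentioned in the question), the command above becomes something like:
mlr --tsvlite --opprint unsparsify --fill-with 'NA' then reorder -f a,b,c,d,e,f file{1..3}.dat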
A brief explanation of the options used:
- --tsvlite reads the tab-delimited input files
- --opprint pretty-prints the output
- unsparsify, as explained above, takes the union of all field names over the whole input stream
- reorder is needed because the column headers appear in a different order across the files; to define the order explicitly, use the -f option with the column headers you want in the output
Installation of the package is straightforward. Miller is written in portable, modern C, with zero runtime dependencies, and it can be installed via all the major package managers: Homebrew, MacPorts, apt-get, apt and yum.
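For example (assuming the package is simply named miller in your package manager):
$ brew install miller
$ sudo apt-get install miller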