Combining big data files with different columns into one big file

I have N tab-separated files. Each file has a header line saying the names of the columns. Some of the columns are common to all of the files, but some are unique.

I want to combine all of the files into one big file containing all of the relevant headers.

Example:

> cat file1.dat
a b c
5 7 2
3 9 1

> cat file2.dat
a b e f
2 9 8 3
2 8 3 3
1 0 3 2

> cat file3.dat
a c d g
1 1 5 2

> merge file*.dat
a b c d e f g
5 7 2 - - - -
3 9 1 - - - -
2 9 - - 8 3 -
2 8 - - 3 3 -
1 0 - - 3 2 -
1 - 1 5 - - 2

The - can be replaced by anything, for example NA.

Caveat: the files are so big that I cannot load all of them into memory simultaneously.

I had a solution in R using

write.table(do.call(plyr::rbind.fill,
            Map(function(filename)
                    read.table(filename, header=TRUE, check.names=FALSE),
                filename=list.files('.'))),
    'merged.dat', quote=FALSE, sep='\t', row.names=FALSE)

but this fails with a memory error when the data are too large.

What is the best way to accomplish this?

I am thinking the best route will be to first loop through all the files to collect the column names, then loop through them again, putting each one into the right format and writing it to disk as it is encountered. However, is there perhaps already some code available that performs this?
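For example, the first pass (collecting the union of all column names) could be a shell one-liner; a sketch assuming GNU coreutils and tab- or space-separated headers:

# print the first line of every file, one column name per line, deduplicated
head -q -n1 file*.dat | tr '\t ' '\n\n' | sort -u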

asked Jun 12 '19 by rhombidodecahedron

2 Answers

From an algorithmic point of view, I would take the following steps:

  1. Process the headers:

    • read the headers of all input files and collect all column names
    • sort the column names in the order you want
    • create a lookup table which returns the column name when given a field number (h[n] -> "name")
  2. Process the files: once all headers are known, reprocess the files

    • read the header of the file
    • create a lookup table which returns the field number when given a column name; an associative array is useful here (a["name"] -> field_number)
    • process the remainder of the file:

      1. loop over all fields of the merged file
      2. look up the column name in h
      3. check whether that column name appears in a; if not, print -, otherwise print the field whose number a gives.
This is easily done with GNU awk, making use of the extensions nextfile and asorti. The nextfile statement lets us read only the header and jump to the next file without processing the rest of it. Since we need to process each file twice (step 1 reads the header, step 2 reads the data), we ask awk to dynamically manipulate its argument list: every time a file's header has been processed, the file is appended to the end of the argument list ARGV so it can be picked up again for step 2.

BEGIN { s="-" }                # define symbol
BEGIN { f=ARGC-1 }             # get total number of files
f { for (i=1;i<=NF;++i) h[$i]  # read headers in associative array h[key]
    ARGV[ARGC++] = FILENAME    # add file at end of argument list
    if (--f == 0) {            # did we process all headers?
       n=asorti(h)             # sort header into h[idx] = key
       for (i=1;i<=n;++i)      # print header
           printf "%s%s", h[i], (i==n?ORS:OFS)
    }
    nextfile                   # end of processing headers
}           
# Start of processing the files
(FNR==1) { delete a; for(i=1;i<=NF;++i) a[$i]=i; next } # read header
{ for(i=1;i<=n;++i) printf "%s%s", (h[i] in a ? $(a[h[i]]) : s), (i==n?ORS:OFS) }

If you store the above in a file merge.awk you can use the command:

awk -f merge.awk f1 f2 f3 f4 ... fx
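Run against the sample files from the question, this reproduces the expected merged table, the column order being the sorted union of all header names:

awk -f merge.awk file1.dat file2.dat file3.dat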

A similar approach, but with less hassle around f:

BEGIN { s="-" }                 # define symbol
BEGIN {                         # modify argument list from
        c=ARGC;                 #   from: arg1 arg2  ... argx
        ARGV[ARGC++]="f=1"      #   to:   arg1 arg2  ... argx f=1 arg1 arg2  ... argx
        for(i=1;i<c;++i) ARGV[ARGC++]=ARGV[i]
}
!f { for (i=1;i<=NF;++i) h[$i]  # read headers in associative array h[key]
     nextfile
}
(f==1) && (FNR==1) {            # process merged header
     n=asorti(h)                # sort header into h[idx] = key
     for (i=1;i<=n;++i)         # print header
        printf "%s%s", h[i], (i==n?ORS:OFS)
     f=2                         
}
# Start of processing the files
(FNR==1) { delete a; for(i=1;i<=NF;++i) a[$i]=i; next } # read header
{ for(i=1;i<=n;++i) printf "%s%s", (h[i] in a ? $(a[h[i]]) : s), (i==n?ORS:OFS) }

This method is slightly different, but it allows processing files with different field separators:

awk -f merge.awk f1 FS="," f2 f3 FS="|" f4 ... fx

If your argument list becomes too long, you can use awk to create it for you:

BEGIN { s="-" }                 # define symbol
BEGIN {                         # read argument list from input file:
  fname=(ARGC==1 ? "-" : ARGV[1])
  ARGC=1                        # from: filelist or /dev/stdin
  while ((getline < fname) > 0) #   to:   arg1 arg2 ... argx
     ARGV[ARGC++]=$0
}
BEGIN {                         # modify argument list from
        c=ARGC;                 #   from: arg1 arg2  ... argx
        ARGV[ARGC++]="f=1"      #   to:   arg1 arg2  ... argx f=1 arg1 arg2  ... argx
        for(i=1;i<c;++i) ARGV[ARGC++]=ARGV[i]
}
!f { for (i=1;i<=NF;++i) h[$i]  # read headers in associative array h[key]
     nextfile
}
(f==1) && (FNR==1) {            # process merged header
     n=asorti(h)                # sort header into h[idx] = key
     for (i=1;i<=n;++i)         # print header
        printf "%s%s", h[i], (i==n?ORS:OFS)
     f=2                         
}
# Start of processing the files
(FNR==1) { delete a; for(i=1;i<=NF;++i) a[$i]=i; next } # read header
{ for(i=1;i<=n;++i) printf "%s%s", (h[i] in a ? $(a[h[i]]) : s), (i==n?ORS:OFS) }

which can be run as:

$ awk -f merge.awk filelist
$ find . | awk -f merge.awk "-"
$ find . | awk -f merge.awk

or any similar command.
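If the directory contains more than just the data files, you may want to restrict find accordingly, for example:

find . -type f -name '*.dat' | awk -f merge.awk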

As you can see, by adding only a tiny block of code, we were able to flexibly adapt the awk code to our needs.

answered Oct 22 '22 by kvantour


Miller (johnkerl/miller) is underused when dealing with huge files. It bundles features from many of the useful file-processing tools out there. As the official documentation says:

Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON. You get to work with your data using named fields, without needing to count positional column indices.

For this particular case, it supports the verb unsparsify, which the documentation describes as follows:

Prints records with the union of field names over all input records. For field names absent in a given record but present in others, fills in a value. This verb retains all input before producing any output.

You just need the following command, using the reorder verb to put the columns back in the order you want. (Note the caveat in the quote above: unsparsify retains all input before producing any output, so if memory is still a concern, check the Miller documentation for its streaming options.)

mlr --tsvlite --opprint unsparsify then reorder -f a,b,c,d,e,f file{1..3}.dat

which produces the output in one shot:

a   b   c   d   e   f   g
5   7   2   -   -   -   -
3   9   1   -   -   -   -
2   9   -   -   8   3   -
2   8   -   -   3   3   -
1   0   -   -   3   2   -
1   -   1   5   -   -   2

You can even customize the string used to fill the empty fields; the default is -. For a custom fill string, use unsparsify --fill-with '#'.
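For instance, to fill with NA as suggested in the question, a sketch combining the documented flags above:

mlr --tsvlite --opprint unsparsify --fill-with 'NA' then reorder -f a,b,c,d,e,f file{1..3}.dat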

A brief explanation of the flags and verbs used:

  • --tsvlite treats the input stream as tab-delimited content
  • --opprint pretty-prints the tabular output
  • unsparsify, as explained above, takes the union of the field names over the whole input stream
  • the reorder verb is needed because the column headers appear in varying order across the files. To define the order explicitly, pass the -f option the column headers you want the output to lead with.

Installation of the package is straightforward. Miller is written in portable, modern C, with zero runtime dependencies, and it is available through all the major package managers: Homebrew, MacPorts, apt-get, apt and yum.
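For example (a sketch; the package name may vary by platform):

# Debian/Ubuntu
sudo apt-get install miller
# macOS via Homebrew
brew install miller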

answered Oct 22 '22 by Inian