
awk associative array grows fast

I have a file that assigns numbers to md5sums as follows:

0   0000001732816557DE23435780915F75
1   00000035552C6F8B9E7D70F1E4E8D500
2   00000051D63FACEF571C09D98659DC55
3   0000006D7695939200D57D3FBC30D46C
4   0000006E501F5CBD4DB56CA48634A935
5   00000090B9750D99297911A0496B5134
6   000000B5AEA2C9EA7CC155F6EBCEF97F
7   00000100AD8A7F039E8F48425D9CB389
8   0000011ADE49679AEC057E07A53208C1

Another file contains three md5sums per line, as follows:

00000035552C6F8B9E7D70F1E4E8D500    276EC96E149571F8A27F4417D7C6BC20    9CFEFED8FB9497BAA5CD519D7D2BB5D7
00000035552C6F8B9E7D70F1E4E8D500    44E48C092AADA3B171CE899FFC6943A8    1B757742E1BF2AA5DB6890E5E338F857

What I want to do is replace the first and third md5sums in the second file with the integers from the first file. Currently I am trying the following awk script:

awk '{OFS="\t"}FNR==NR{map[$2]=$1;next}
{print map[$1],$2,map[$3]}' mapping.txt relation.txt

The problem is that the script needs more than 16 GB of RAM, despite the fact that the first file is only 5.7 GB on disk.

pNRuag


3 Answers

If you don't have enough memory to store the first file, then you need to write something like this to look up the 1st file for each value in the 2nd file:

awk 'BEGIN{OFS="\t"}
{
    # For each line of relation.txt, scan mapping.txt for the indexes
    # of the md5sums in fields 1 and 3.
    val1 = val3 = ""
    while ( (getline line < "mapping.txt") > 0 ) {
        split(line,flds)
        if (flds[2] == $1) {
            val1 = flds[1]
        }
        if (flds[2] == $3) {
            val3 = flds[1]
        }
        if ( (val1 != "") && (val3 != "") ) {
            break   # both found, stop scanning early
        }
    }
    close("mapping.txt")

    print val1,$2,val3

}' relation.txt

It will be slow, since it re-reads mapping.txt for every line of relation.txt. You could add a cache of already looked-up values to speed it up if you like.
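For example, here is a minimal sketch of such a cache (my addition, not part of the original answer; the max cap and its value are arbitrary assumptions to keep the cache's own memory use bounded):

awk 'BEGIN { OFS = "\t"; max = 1000000 }    # cache at most 1M distinct md5sums
function lookup(key,    line, flds, val) {
    if (key in cache)                       # seen before: skip the file scan
        return cache[key]
    val = ""
    while ( (getline line < "mapping.txt") > 0 ) {
        if (split(line, flds) >= 2 && flds[2] == key) {
            val = flds[1]
            break
        }
    }
    close("mapping.txt")
    if (cached < max) {                     # remember the result, up to the cap
        cache[key] = val
        cached++
    }
    return val
}
{ print lookup($1), $2, lookup($3) }' relation.txt

Since the same md5sum tends to recur in the data, each repeat costs a single array lookup instead of a pass over the 5.7 GB mapping file.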

Ed Morton


This problem can be solved as follows (file1.txt is the file with the integers and md5sums, while file2.txt is the file with the three columns of md5sums):

#!/bin/sh
# First sort each of file 1 and the first and third columns of file 2 by MD5
awk '{ print $2 "\t" $1}' file1.txt | sort >file1_n.txt
# Before we sort the file 2 columns, we number the rows so we can put them
# back into the original order later
cut -f1 file2.txt | cat -n - | awk '{ print $2 "\t" $1}' | sort >file2_1n.txt
cut -f3 file2.txt | cat -n - | awk '{ print $2 "\t" $1}' | sort >file2_3n.txt
# Now do a join between them, extract the two columns we want, and put them back in order
# join needs a literal tab as its delimiter; $(printf '\t') avoids embedding one in the script
join -t"$(printf '\t')" file2_1n.txt file1_n.txt | awk '{ print $2 "\t" $3}' | sort -n | cut -f2 >file2_1.txt
join -t"$(printf '\t')" file2_3n.txt file1_n.txt | awk '{ print $2 "\t" $3}' | sort -n | cut -f2 >file2_3.txt
cut -f2 file2.txt | paste file2_1.txt - file2_3.txt >file2_new1.txt
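One practical aside (my addition, assuming GNU coreutils; the flag values are illustrative): forcing the C locale makes sorting pure-ASCII data like hex digests noticeably faster, and sort and join must agree on collation anyway, so it should be applied to the whole script. A bigger sort buffer and parallel merging also help at this scale:

export LC_ALL=C    # byte-wise collation: faster, and keeps sort/join order consistent

# e.g. the first sort above, with a larger buffer and parallel merge:
awk '{ print $2 "\t" $1 }' file1.txt | sort -S 4G --parallel=4 >file1_n.txt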

For a case where file1.txt and file2.txt are each 1 million lines long, this solution and Ed Morton's awk-only solution take about the same length of time on my system. My system would take a very long time to solve the problem for 140 million lines regardless of the approach used, but I did run a test case for files with 10 million lines.

I had assumed that a solution relying on sort (which automatically uses temporary files when required) would be faster for large numbers of lines, because its runtime would be O(N log N), whereas a solution that re-reads the mapping file for each line of the input would be O(N^2) if the two files are of similar size.
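To put rough numbers on that assumption (an illustration only, using the 140 million lines mentioned above): an O(N^2) rescan performs on the order of 1.4x10^8 * 1.4x10^8 = 2x10^16 comparisons, whereas O(N log N) sorting needs on the order of 1.4x10^8 * log2(1.4x10^8) = 1.4x10^8 * 27 = 4x10^9, a difference of roughly six orders of magnitude.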

Timing results

My assumption about the relative performance of the two candidate solutions turned out to be faulty for the test cases I tried. On my system, the sort-based solution and the awk-only solution took similar (within 30%) amounts of time on both the 1-million and 10-million-line input files, with the awk-only solution being faster in each case. I don't know whether that relationship will hold when the input size goes up by another factor of more than 10, of course.

Strangely, the 10-million-line problem took about 10 times as long to run as the 1-million-line problem with both solutions, which puzzles me: I would have expected a non-linear relationship with file length for both.

Simon


If the size of a file causes awk to run out of memory, then either use another tool, or another approach entirely.

The sed command might succeed with much less memory usage. The idea is to read the index file and create a sed script which performs the remapping, and then invoke sed on the generated sedscript.
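For the sample mapping above, the generated sed script contains a pair of substitutions per index (one anchored at the start of the line for field 1, one at the end for field 3), for example:

s/^[[:space:]]*[[:<:]]00000035552C6F8B9E7D70F1E4E8D500[[:>:]]/1/
s/[[:<:]]00000035552C6F8B9E7D70F1E4E8D500[[:>:]]$/1/

(A portability note, my addition: the [[:<:]]/[[:>:]] word boundaries and the "sed -E ... -i .bak" form used below are BSD/macOS sed syntax; with GNU sed you would write \< and \> and -i.bak instead.)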

The bash script below is an implementation of this idea. It includes some STDERR output to help track progress; I like to produce progress-tracking output for problems with large data sets or other kinds of time-consuming processing.

This script has been tested on a small set of data; it may work on your data. Please give it a try.

#!/bin/bash

# md5-indexes.txt
# 0   0000001732816557DE23435780915F75
# 1   00000035552C6F8B9E7D70F1E4E8D500
# 2   00000051D63FACEF571C09D98659DC55
# 3   0000006D7695939200D57D3FBC30D46C
# 4   0000006E501F5CBD4DB56CA48634A935
# 5   00000090B9750D99297911A0496B5134
# 6   000000B5AEA2C9EA7CC155F6EBCEF97F
# 7   00000100AD8A7F039E8F48425D9CB389
# 8   0000011ADE49679AEC057E07A53208C1

# md5-data.txt
# 00000035552C6F8B9E7D70F1E4E8D500    276EC96E149571F8A27F4417D7C6BC20    9CFEFED8FB9497BAA5CD519D7D2BB5D7
# 00000035552C6F8B9E7D70F1E4E8D500    44E48C092AADA3B171CE899FFC6943A8    1B757742E1BF2AA5DB6890E5E338F857

# Goal replace field 1 and field 3 with indexes to md5 checksums from md5-indexes

md5_indexes='md5-indexes.txt'
md5_data='md5-data.txt'

talk()  { echo 1>&2 "$*" ; }
talkf() { printf 1>&2 "$@" ; }
track() {
  local var="$1" interval="$2"
  local val
  eval "val=\$$var"
  if (( interval == 0 || val % interval == 0 )); then
    shift 2
    talkf "$@"
  fi
  eval "(( $var++ ))"   # increment the counter
}

# Build a sed script to translate all occurrences of the 1st & 3rd MD5 sums into their
# corresponding indexes

talk "Building the sedscript from the md5 indexes.."

sedscript=/tmp/$$.sed

linenum=0
lines=`wc -l <$md5_indexes`
interval=$(( lines / 100 ))

while read index md5sum ; do
  track linenum $interval "..$linenum"
  echo "s/^[[:space:]]*[[:<:]]$md5sum[[:>:]]/$index/" >>$sedscript
  echo "s/[[:<:]]$md5sum[[:>:]]\$/$index/"            >>$sedscript
done <$md5_indexes
talk ''

sedlength=`wc -l <$sedscript`

talkf "The sedscript is %d lines\n" $sedlength

cmd="sed -E -f $sedscript -i .bak $md5_data"
talk "Invoking: $cmd"

$cmd

changes=`diff -U 0 $md5_data.bak $md5_data | tail +3 | grep -c '^+'`

talkf "%d lines changed in $md5_data\n" $changes

exit

Here are the two files:

cat md5-indexes.txt
0   0000001732816557DE23435780915F75
1   00000035552C6F8B9E7D70F1E4E8D500
2   00000051D63FACEF571C09D98659DC55
3   0000006D7695939200D57D3FBC30D46C
4   0000006E501F5CBD4DB56CA48634A935
5   00000090B9750D99297911A0496B5134
6   000000B5AEA2C9EA7CC155F6EBCEF97F
7   00000100AD8A7F039E8F48425D9CB389
8   0000011ADE49679AEC057E07A53208C1

cat md5-data.txt
00000035552C6F8B9E7D70F1E4E8D500    276EC96E149571F8A27F4417D7C6BC20    9CFEFED8FB9497BAA5CD519D7D2BB5D7
00000035552C6F8B9E7D70F1E4E8D500    44E48C092AADA3B171CE899FFC6943A8    1B757742E1BF2AA5DB6890E5E338F857

Here is the sample run:

$ ./md5-reindex.sh
Building the sedscript from the md5 indexes..
..0..1..2..3..4..5..6..7..8
The sedscript is 18 lines
Invoking: sed -E -f /tmp/83800.sed -i .bak md5-data.txt
2 lines changed in md5-data.txt

Finally, the resulting file:

$ cat md5-data.txt
1    276EC96E149571F8A27F4417D7C6BC20    9CFEFED8FB9497BAA5CD519D7D2BB5D7
1    44E48C092AADA3B171CE899FFC6943A8    1B757742E1BF2AA5DB6890E5E338F857
aks