 

Appending the datetime to the end of every line in a 600 million row file

Tags: sed, awk, bigdata

I have a 680 million row (19 GB) file that I need the datetime appended to every line. I receive this file every night, and I have to add the time that I processed it to the end of each line. I have tried many ways to do this, including sed/awk and loading it into a SQL database with the last column defaulted to the current timestamp.

I was wondering if there is a faster way to do this? My fastest way so far takes two hours, and that is just not fast enough given the urgency of the information in this file. It is a flat CSV file.

edit1:

Here's what I've done so far:

awk -v date="$(date +"%Y-%m-%d %r")" '{ print $0","date}' lrn.ae.txt > testoutput.txt

Time = 117 minutes

perl -ne 'chomp; printf "%s.pdf\n", $_' EXPORT.txt > testoutput.txt

Time = 135 minutes

mysql load data local infile '/tmp/input.txt' into table testoutput

Time = 211 minutes

Asked by MJCS


2 Answers

You don't specify whether the timestamps have to be different for each line. Would a single "start of processing" time be enough?

If so, a simple solution is to use the paste command with a pre-generated file of timestamps that is exactly the same length as the file you're processing. Then just paste the two together. Also, if the whole process is I/O bound, as others are speculating, running this on a box with an SSD drive would help speed things up.
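A minimal sketch of that invocation, using the file names from the question (an assumption on my part): -d',' keeps the output comma-separated to match the CSV input, and timestamps.txt must have exactly as many lines as lrn.ae.txt.

    # Pair each input line with the matching timestamp line, comma-separated.
    paste -d',' lrn.ae.txt timestamps.txt > testoutput.txt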

I just tried it locally on a 6 million row file (roughly 1% of yours), and it finished in less than one second on a MacBook Pro with an SSD drive.

 ~> date; time paste file1.txt timestamps.txt > final.txt; date
 Mon Jun  5 10:57:49 MDT 2017

 real   0m0.944s
 user   0m0.680s
 sys    0m0.222s
 Mon Jun  5 10:57:49 MDT 2017

I'm going to now try a ~500 million row file, and see how that fares.

Updated:

OK, the results are in: paste is blazing fast compared to your solution. It took just over 90 seconds total to process the whole thing, 600 million rows of simple data.

~> wc -l huge.txt 
600000000 huge.txt
~> wc -l hugetimestamps.txt 
600000000 hugetimestamps.txt
~> date; time paste huge.txt hugetimestamps.txt > final.txt; date
Mon Jun  5 11:09:11 MDT 2017

real    1m35.652s
user    1m8.352s
sys 0m22.643s
Mon Jun  5 11:10:47 MDT 2017

You still need to prepare the timestamps file ahead of time, but that's a trivial bash loop. I created mine in less than one minute.
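For example, one way to generate it (a sketch only, not necessarily the exact loop I used; it leans on awk rather than a literal bash loop, since awk handles hundreds of millions of writes quickly, and huge.txt / hugetimestamps.txt match the file names above):

    # Capture one timestamp for the whole run.
    ts="$(date +"%Y-%m-%d %r")"
    # Print that same timestamp once per line of the input file.
    awk -v n="$(wc -l < huge.txt)" -v ts="$ts" 'BEGIN { for (i = 0; i < n; i++) print ts }' > hugetimestamps.txt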

Answered by mjuarez


A solution that simplifies mjuarez' helpful approach:

yes "$(date +"%Y-%m-%d %r")" | paste -d',' file - | head -n "$(wc -l < file)" > out-file

Note that, as with the approach in the answer above, you must know the number of input lines in advance. Here I'm using wc -l to count them, but if the number is fixed, simply use that fixed number.

  • yes keeps repeating its argument indefinitely, each on its own output line, until it is terminated.

  • paste -d',' file - pastes a corresponding pair of lines from file and stdin (-) onto a single output line, separated by a comma.

  • Since yes produces "endless" output, head -n "$(wc -l < file)" ensures that processing stops once all input lines have been processed.

The use of a pipeline acts as a memory throttle, so running out of memory shouldn't be a concern.
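For example, with the file names from the question (illustrative only):

    # Repeat the run's timestamp forever, join it to each CSV line, stop at the input's line count.
    yes "$(date +"%Y-%m-%d %r")" | paste -d',' lrn.ae.txt - | head -n "$(wc -l < lrn.ae.txt)" > testoutput.txt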

Answered by mklement0