
combine multiple text files and remove duplicates

I have around 350 text files, each around 75 MB. I'm trying to combine all the files and remove duplicate entries. Each file is in the following format:

ip1,dns1
ip2,dns2
...

I wrote a small shell script to do this:

#!/bin/bash
# concatenate all the input files into one big file
for file in data/*
do
    cat "$file" >> dnsFull
done
sort dnsFull > dnsSorted   # sort so duplicate lines become adjacent
uniq dnsSorted dnsOut      # drop adjacent duplicate lines
rm dnsFull dnsSorted       # remove the intermediate files

I do this processing often and was wondering if there is anything I could do to speed it up the next time I run it. I'm open to any programming language and suggestions. Thanks!

asked Jun 01 '13 by drk



1 Answer

First off, you're not using the full power of cat. The loop can be replaced by just

cat data/* > dnsFull

This also removes the original script's hidden assumption: unlike >>, the > redirection truncates dnsFull before writing, so the file doesn't need to be empty beforehand.
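
A quick illustration of the two redirection operators (the file name here is arbitrary):

echo one > demo.txt    # > truncates: demo.txt now contains only "one"
echo two >> demo.txt   # >> appends: demo.txt now contains "one" and "two"
echo three > demo.txt  # truncated again: demo.txt contains only "three"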

Then there are all those temporary files, which force programs to wait for hard disks (commonly the slowest part of a modern computer system). Use a pipeline instead:

cat data/* | sort | uniq > dnsOut

This is still wasteful since sort alone can do what you're using cat and uniq for; the whole script can be replaced by

sort -u data/* > dnsOut
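
With roughly 26 GB of input, sort will spill to temporary files on disk anyway. If you're on GNU coreutils, its buffer-size, temp-directory, and parallelism options can help; the buffer size and thread count below are illustrative, not recommendations:

# -S sets the in-memory buffer, -T picks where temp files go, --parallel sets the number of sort threads
sort -u -S 4G -T /tmp --parallel=4 data/* > dnsOut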

If this is still not fast enough, realize that sorting takes O(n log n) time, while deduplication can be done in linear time with Awk:

awk '{if (!a[$0]++) print}' data/* > dnsOut
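
Since Awk's default action is to print the current line, the bare condition is an equivalent shorthand. Note that the array a holds every unique line in memory, so this trades RAM for speed:

# a[$0]++ is 0 (false) the first time a line is seen, so !a[$0]++ prints only first occurrences
awk '!a[$0]++' data/* > dnsOut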
answered Sep 25 '22 by Fred Foo