I have around 350 text files, each around 75 MB. I'm trying to combine all the files and remove duplicate entries. Each file is in the following format:
ip1,dns1
ip2,dns2
...
I wrote a small shell script to do this:
#!/bin/bash
# Concatenate every file under data/ into one big file
for file in data/*
do
    cat "$file" >> dnsFull
done
# Sort, drop adjacent duplicate lines, then remove the temporaries
sort dnsFull > dnsSorted
uniq dnsSorted dnsOut
rm dnsFull dnsSorted
I run this processing often and was wondering whether there is anything I could do to speed it up the next time. I'm open to any programming language and suggestions. Thanks!
First off, you're not using the full power of cat. The loop can be replaced by just
cat data/* > dnsFull
assuming that file is initially empty.
Then there are all those temporary files, which force programs to wait for the hard disk (commonly the slowest part of a modern computer system). Use a pipeline:
cat data/* | sort | uniq > dnsOut
This is still wasteful, since sort alone can do what you're using cat and uniq for; the whole script can be replaced by
sort -u data/* > dnsOut
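If GNU coreutils is available, two common tweaks can speed the sort up further; a hedged sketch (LC_ALL=C switches to plain byte comparison, --parallel is GNU-specific, and the thread count of 4 is an arbitrary choice, not something from your script):
LC_ALL=C sort -u --parallel=4 data/* > dnsOut
Byte-order collation is usually much faster than locale-aware collation, and it is safe here as long as you only need duplicates removed and don't care about a particular alphabetical order.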
If this is still not fast enough, then realize that sorting takes O(n lg n) time while deduplication can be done in linear time with Awk:
awk '{if (!a[$0]++) print}' data/* > dnsOut
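As a side note, that Awk program is often written more tersely; a minimal equivalent sketch (seen is just an arbitrary array name, and Awk's default action for a true pattern is to print the line):
awk '!seen[$0]++' data/* > dnsOut
One caveat: unlike sort -u, this approach keeps every unique line in memory, so with roughly 350 × 75 MB of input it is only a win if the deduplicated result fits comfortably in RAM.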