
combine multiple text files and remove duplicates

I have around 350 text files, each around 75 MB. I'm trying to combine all the files and remove duplicate entries. Each file is in the following format:

ip1,dns1
ip2,dns2
...

I wrote a small shell script to do this:

#!/bin/bash
# concatenate all the input files into one big file
for file in data/*
do
    cat "$file" >> dnsFull
done
sort dnsFull > dnsSorted   # sort so duplicate lines become adjacent
uniq dnsSorted dnsOut      # drop adjacent duplicate lines
rm dnsFull dnsSorted       # remove the intermediate files

I do this processing often and was wondering if there is anything I could do to speed it up the next time I run it. I'm open to any programming language and suggestions. Thanks!

asked Jun 01 '13 by drk



1 Answer

First off, you're not using the full power of cat. The loop can be replaced by just

cat data/* > dnsFull

This also removes the original script's hidden assumption: unlike >>, the > redirection truncates dnsFull before writing, so the file doesn't need to be empty beforehand.
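
A quick illustration of the two redirection operators (the file name here is arbitrary):

echo one > demo.txt    # > truncates: demo.txt now contains only "one"
echo two >> demo.txt   # >> appends: demo.txt now contains "one" and "two"
echo three > demo.txt  # truncated again: demo.txt contains only "three"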

Then there are all those temporary files, which force programs to wait for hard disks (commonly the slowest part of a modern computer system). Use a pipeline instead:

cat data/* | sort | uniq > dnsOut

This is still wasteful since sort alone can do what you're using cat and uniq for; the whole script can be replaced by

sort -u data/* > dnsOut
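
With roughly 26 GB of input, sort will spill to temporary files on disk anyway. If you're on GNU coreutils, its buffer-size, temp-directory, and parallelism options can help; the buffer size and thread count below are illustrative, not recommendations:

# -S sets the in-memory buffer, -T picks where temp files go, --parallel sets the number of sort threads
sort -u -S 4G -T /tmp --parallel=4 data/* > dnsOut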

If this is still not fast enough, realize that sorting takes O(n log n) time, while deduplication can be done in linear time with Awk:

awk '{if (!a[$0]++) print}' data/* > dnsOut
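
Since Awk's default action is to print the current line, the bare condition is an equivalent shorthand. Note that the array a holds every unique line in memory, so this trades RAM for speed:

# a[$0]++ is 0 (false) the first time a line is seen, so !a[$0]++ prints only first occurrences
awk '!a[$0]++' data/* > dnsOut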
answered Sep 25 '22 by Fred Foo