
Remove Duplicates from Text File

I want to remove duplicate words from a text file.

I have a text file that contains something like the following:

None_None

ConfigHandler_56663624
ConfigHandler_56663624
ConfigHandler_56663624
ConfigHandler_56663624

None_None

ColumnConverter_56963312
ColumnConverter_56963312

PredicatesFactory_56963424
PredicatesFactory_56963424

PredicateConverter_56963648
PredicateConverter_56963648

ConfigHandler_80134888
ConfigHandler_80134888
ConfigHandler_80134888
ConfigHandler_80134888

The resulting output needs to be:

None_None

ConfigHandler_56663624

ColumnConverter_56963312

PredicatesFactory_56963424

PredicateConverter_56963648

ConfigHandler_80134888

I have tried just this command: en=set(open('file.txt') but it does not work.

Could anyone help me extract only the unique entries from the file?

Thank you

Kaushik asked Dec 05 '22 11:12



2 Answers

Here is a simple solution using sets to remove the duplicates from the text file.

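# Read every line, closing the file before it is reopened for writing.
with open('workfile.txt', 'r') as f:
    lines = f.readlines()

# A set drops duplicates but does not preserve the original line order.
lines_set = set(lines)

# Overwrite the same file with the unique lines.
with open('workfile.txt', 'w') as out:
    for line in lines_set:
        out.write(line)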
StuGrey answered Dec 20 '22 11:12



Here's another option that preserves order (unlike a set), but otherwise has the same behaviour (note that the EOL character is deliberately stripped and blank lines are ignored)...

from collections import OrderedDict

with open('/home/jon/testdata.txt') as fin:
    # Strip trailing whitespace (including the EOL character) from each line.
    lines = (line.rstrip() for line in fin)
    # Keep the first occurrence of each non-blank line, in file order.
    unique_lines = OrderedDict.fromkeys(line for line in lines if line)

print(list(unique_lines))
# ['None_None', 'ConfigHandler_56663624', 'ColumnConverter_56963312', 'PredicatesFactory_56963424', 'PredicateConverter_56963648', 'ConfigHandler_80134888']

Then you just need to write the above to your output file.
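
As a rough sketch of that last step, the snippet below wraps the same OrderedDict approach in a helper and writes the result out. The function name write_unique_lines and both file paths are hypothetical, chosen only for illustration, and the newline that rstrip() removed is added back on write.

```python
from collections import OrderedDict


def write_unique_lines(src_path, dst_path):
    """Write the first occurrence of each non-blank line of src_path to dst_path.

    Hypothetical helper for illustration; not part of any library.
    """
    with open(src_path) as fin:
        # Strip trailing whitespace (including the newline) and skip blank lines.
        stripped = (line.rstrip() for line in fin)
        # fromkeys consumes the generator here, while the file is still open.
        unique = OrderedDict.fromkeys(line for line in stripped if line)

    with open(dst_path, 'w') as fout:
        for line in unique:
            # Re-add the newline that rstrip() removed.
            fout.write(line + '\n')

    return list(unique)
```

It could be called as write_unique_lines('file.txt', 'unique.txt'). Writing to a separate output file, rather than over the input, keeps the original data intact if something goes wrong.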

Jon Clements answered Dec 20 '22 11:12
