
Remove duplicate lines from a string

Tags: python

I'm not very familiar with Python, but I want to remove duplicate lines from a string.

Example:

str = "aaa
       aaa
       aaa
       abb
       abb
       ccc"

The list is already sorted. The desired result is:

str = "aaa
       abb
       ccc"

I have millions of such lines. I know the long way of removing duplicates, but I would like to know whether there is a shorter way.

user1919035 asked Nov 18 '25 22:11


2 Answers

  1. Don't use str as a variable name, since it shadows the builtin type.
  2. Use '''...''' to wrap multi-line strings.
  3. In your case, use sorted, set, and split.

e.g.:

In [895]: print('\n'.join(sorted(set(ss.split()))))
aaa
abb
ccc
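
Note that split() with no arguments splits on any whitespace, so it only works here because no line contains a space. A minimal sketch of a variant that splits on line breaks instead, assuming leading and trailing whitespace on each line should be ignored (the function name unique_sorted_lines is made up for illustration):

```python
def unique_sorted_lines(text):
    """Return the distinct lines of text, stripped and sorted."""
    # splitlines() breaks only at line boundaries, so lines with spaces stay whole
    stripped = (line.strip() for line in text.splitlines())
    # skip empty lines, deduplicate with a set, then sort the result
    return sorted({line for line in stripped if line})


ss = '''aaa
       aaa
       aaa
       abb
       abb
       ccc'''
print('\n'.join(unique_sorted_lines(ss)))
```

Building a set holds every distinct line in memory, which is usually fine, but may matter if most of the millions of lines are unique.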

Thanks to @user2357112 for mentioning this: if you want to preserve the order in which the words appear, use OrderedDict:

In [910]: ss = '''zzz
     ...:        aaa
     ...:        aaa
     ...:        aaa
     ...:        abb
     ...:        abb
     ...:        ccc'''

In [911]: from collections import OrderedDict
     ...: print('\n'.join(OrderedDict.fromkeys(ss.split())))
zzz
aaa
abb
ccc

Here zzz comes first because it appears first in the input.
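
On Python 3.7 and later, a plain dict also keeps insertion order, so the same idea may work without the import. A rough sketch under that assumption (unique_lines_in_order is a hypothetical helper name, and stripping each line is my own choice, not part of the answer above):

```python
def unique_lines_in_order(text):
    """Return the distinct lines of text in order of first appearance."""
    stripped = (line.strip() for line in text.splitlines())
    # dict.fromkeys keeps the first occurrence of each key, in insertion order
    return list(dict.fromkeys(line for line in stripped if line))


ss = '''zzz
       aaa
       aaa
       abb
       ccc
       zzz'''
print('\n'.join(unique_lines_in_order(ss)))
```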
zhangxaochen answered Nov 20 '25 11:11


If the list is sorted, you don't need a set, because all the duplicates will be grouped together. Just track the last element:

prev_line = None
for line in lines:  # lines is any iterable of strings
    if line != prev_line:
        print(line)  # output the line only when it differs from the previous one
    prev_line = line

(My Python is rusty, so don't trust the syntax here. I'll check it.)
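
The same "compare with the previous line" idea is available in the standard library as itertools.groupby, which groups consecutive equal items. A minimal sketch, assuming the input is already sorted (or at least has its duplicates adjacent) and that surrounding whitespace on each line can be stripped; dedupe_sorted is a name invented for this example:

```python
from itertools import groupby


def dedupe_sorted(lines):
    """Yield each distinct line once from an iterable whose duplicates are adjacent."""
    # groupby starts a new group whenever the stripped line changes
    for key, _group in groupby(line.strip() for line in lines):
        yield key


sample = ['aaa\n', '       aaa\n', '       aaa\n',
          '       abb\n', '       abb\n', '       ccc']
print('\n'.join(dedupe_sorted(sample)))
```

Because it is a generator, it can be fed any iterable of lines, such as an open file object, and it only keeps the current line in memory, which suits inputs with millions of lines.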

torquestomp answered Nov 20 '25 11:11


