I'm still working with this huge list of URLs, and all the help I have received so far has been great.
At the moment I have the list looking like this (17,000 URLs, though):
http://www.example.com/page?CONTENT_ITEM_ID=1
http://www.example.com/page?CONTENT_ITEM_ID=3
http://www.example.com/page?CONTENT_ITEM_ID=2
http://www.example.com/page?CONTENT_ITEM_ID=1
http://www.example.com/page?CONTENT_ITEM_ID=2
http://www.example.com/page?CONTENT_ITEM_ID=3
http://www.example.com/page?CONTENT_ITEM_ID=3
I can filter out the duplicates no problem with a couple of methods (awk, etc.). What I am really looking to do is remove the duplicate URLs while also counting how many times each URL occurs in the list, and print the count next to the URL with a pipe separator. After processing, the list should look like this:
| url | count |
|---|---|
| http://www.example.com/page?CONTENT_ITEM_ID=1 | 2 |
| http://www.example.com/page?CONTENT_ITEM_ID=2 | 2 |
| http://www.example.com/page?CONTENT_ITEM_ID=3 | 3 |
What would be the fastest way to achieve this?
This is probably as fast as you can get without writing code.
$ cat foo.txt
http://www.example.com/page?CONTENT_ITEM_ID=1
http://www.example.com/page?CONTENT_ITEM_ID=3
http://www.example.com/page?CONTENT_ITEM_ID=2
http://www.example.com/page?CONTENT_ITEM_ID=1
http://www.example.com/page?CONTENT_ITEM_ID=2
http://www.example.com/page?CONTENT_ITEM_ID=3
http://www.example.com/page?CONTENT_ITEM_ID=3
$ sort foo.txt | uniq -c
2 http://www.example.com/page?CONTENT_ITEM_ID=1
2 http://www.example.com/page?CONTENT_ITEM_ID=2
3 http://www.example.com/page?CONTENT_ITEM_ID=3
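Note that uniq -c puts the count first (with leading spaces) rather than producing the url|count format asked for. A small awk step can rearrange it into that format; this is just a sketch, and it assumes the URLs contain no whitespace:
$ sort foo.txt | uniq -c | awk '{print $2 "|" $1}'
http://www.example.com/page?CONTENT_ITEM_ID=1|2
http://www.example.com/page?CONTENT_ITEM_ID=2|2
http://www.example.com/page?CONTENT_ITEM_ID=3|3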
I did a bit of testing of the sort | uniq -c approach, and it's not particularly fast, although for 17k lines it takes little more than a second (on a loaded P4 2.8 GHz machine):
$ wc -l foo.txt
174955 foo.txt
vinko@mithril:~/i3media/2008/product/Pending$ time sort foo.txt | uniq -c
54482 http://www.example.com/page?CONTENT_ITEM_ID=1
48212 http://www.example.com/page?CONTENT_ITEM_ID=2
72261 http://www.example.com/page?CONTENT_ITEM_ID=3
real 0m23.534s
user 0m16.817s
sys 0m0.084s
$ wc -l foo.txt
14955 foo.txt
$ time sort foo.txt | uniq -c
4233 http://www.example.com/page?CONTENT_ITEM_ID=1
4290 http://www.example.com/page?CONTENT_ITEM_ID=2
6432 http://www.example.com/page?CONTENT_ITEM_ID=3
real 0m1.349s
user 0m1.216s
sys 0m0.012s
Although O(n) wins the game hands down, as usual. I tested S.Lott's solution:
$ cat pythoncount.py
from collections import defaultdict

myFile = open("foo.txt", "rU")  # "rU" = read with universal newline handling
fq = defaultdict(int)
# Tally each line (URL) as it is read from the file.
for n in myFile:
    fq[n] += 1
# Print each distinct URL and its count, pipe-separated.
for url, count in fq.items():
    print "%s|%s" % (url.strip(), count)
$ wc -l foo.txt
14955 foo.txt
$ time python pythoncount.py
http://www.example.com/page?CONTENT_ITEM_ID=2|4290
http://www.example.com/page?CONTENT_ITEM_ID=1|4233
http://www.example.com/page?CONTENT_ITEM_ID=3|6432
real 0m0.072s
user 0m0.028s
sys 0m0.012s
$ wc -l foo.txt
1778955 foo.txt
$ time python pythoncount.py
http://www.example.com/page?CONTENT_ITEM_ID=2|504762
http://www.example.com/page?CONTENT_ITEM_ID=1|517557
http://www.example.com/page?CONTENT_ITEM_ID=3|756636
real 0m2.718s
user 0m2.440s
sys 0m0.072s
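For completeness, the same O(n) hash-counting idea also fits in a single awk pass, with no sort step at all, which the question already hinted at. A sketch (note that the output order of the URLs is unspecified):
$ awk '{count[$0]++} END {for (url in count) print url "|" count[url]}' foo.txt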