 

Perl beats Python in fetching HTML pages? [closed]

Tags:

python

html

perl

I have a Perl script that fetches HTML pages. I tried rewriting it in Python (just because I am trying to learn Python) and found it to be really slow!

Here's the test script in Perl:

#!/usr/bin/perl
use LWP::Simple;

$url = "http://majorgeeks.com/page.php?id=";

# Append every fetched page to perldata.txt
open(WEB, ">>", "perldata.txt");

for ($column = 1; $column <= 20; $column++) {
    $temp = $url . $column;
    print "val = $temp\n\n";

    $response = get($temp) or die("[-] Failed!!\n");
    print WEB "$response\n\n";
}

close(WEB);

And here's the equivalent code in Python:

import urllib2

url = "http://majorgeeks.com/page.php?id="

f = open("pydata.txt", 'w')

for i in range(20):
    tempurl = url + str(i + 1)
    print "Val : " + tempurl + "\n\n"

    #req = urllib2.Request(tempurl)
    res = urllib2.urlopen(tempurl)

    f.write(res.read())

f.close()
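To narrow down where the time goes before reaching for a profiler, it helps to time each request individually. Here is a minimal sketch of that idea; `fetch` is a stand-in for the real `urllib2.urlopen(tempurl).read()` call, so the numbers it produces are illustrative only:

```python
import time

def fetch(url):
    # Stand-in for urllib2.urlopen(url).read(); swap in the real call
    # to time actual network requests.
    time.sleep(0.01)
    return "<html></html>"

url = "http://majorgeeks.com/page.php?id="
timings = []

for i in range(1, 21):
    tempurl = url + str(i)
    start = time.time()
    fetch(tempurl)
    timings.append((tempurl, time.time() - start))

slowest = max(timings, key=lambda t: t[1])
print("slowest request: %s (%.3fs)" % slowest)
```

If one request dominates (for example the first, due to DNS resolution or connection setup), that points at the network path rather than Python itself.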

The difference I found is huge! The Perl script finished in approximately 30 seconds, while the Python script took approximately 7 minutes (420 seconds)!

I am using Ubuntu 11.10 (64-bit) on a Core i7, tested on a 12 Mbps connection. I tried this several times, and every time I get the same difference.

Am I doing something wrong here? Do I need to change something? Or is the difference justified? (I hope not.)

Thanks a lot for your help.

Update 3: I just came home, booted my laptop, and ran the code again. It finished in 11 seconds! Was it because I rebooted my machine? Here's the profiler output.

Note: Perl still took 31 seconds for the same thing!

Update 2: As suggested by @Makoto, here's the profiler data from my run. It really is slow! I know some Python configuration must have to do with this, but I don't know what. One simple request shouldn't take 20 seconds!

Update: Changed url to tempurl. Commented out the urllib2.Request line as suggested. Not much difference at all.

firesofmay, asked Jan 06 '12

2 Answers

Your code could be improved, although I am not sure it will fix all the performance problems:

from urllib2 import urlopen

url = "http://majorgeeks.com/page.php?id={}"

with open("pydata.txt", 'w') as f:
    for i in xrange(1, 21):
        tempurl = url.format(i)
        print "Val : {}\n\n".format(tempurl)
        f.write(urlopen(tempurl).read())

I also changed it logically: it now requests a different URL on each iteration (built into tempurl), whereas it used to request the same URL (url) 20 times. I also used string formatting, although I am not sure how much it influences efficiency.
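For illustration, here is how the format-based URL building behaves (a trivial sketch, separate from any timing claims):

```python
url = "http://majorgeeks.com/page.php?id={}"

# Build all twenty URLs up front; format() fills the {} placeholder.
urls = [url.format(i) for i in range(1, 21)]

print(urls[0])
print(urls[-1])
```

Each entry equals the concatenation-based form from the question (base + str(i)), so the two styles are interchangeable here; format() mainly reads better.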

I tested it on my system (Windows 7 64-bit, Python 2.7.2, within IDLE, moderate internet connection) and it took 40 seconds (40.262) to finish.

Tadeck, answered Nov 20 '22


I still have to scratch my head and figure out why this code is taking so long for both @mayjune and @Tadeck. I've had a chance to run both pieces of code formally through a profiler, and here are the results. I strongly encourage you to run the profiler yourself on your own machine, since mine will produce different results (AMD Athlon II X4 @ 3GHz, 8GB RAM, Ubuntu 11.04 x64, 7 Mbit line).

To run:

python -m cProfile -o profile.dat <path/to/code.py>; python -m pstats profile.dat

(From inside the profiler, type help to see the available commands.)
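The same statistics can also be collected programmatically, which is handy for scripting repeated measurements. A small sketch using cProfile and pstats on a trivial stand-in function (io.StringIO is the Python 3 spelling; on Python 2 use StringIO.StringIO):

```python
import cProfile
import io
import pstats

def work():
    # Trivial stand-in for the code under test.
    total = 0
    for i in range(100000):
        total += i
    return total

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

# Render the top entries, sorted by cumulative time, into a string.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

This produces the same "ncalls / tottime / cumtime" table as the command-line invocation, without leaving the script.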


Original Code:

Fri Jan  6 17:49:29 2012    profile.dat

20966 function calls (20665 primitive calls) in 13.566 CPU seconds

Ordered by: cumulative time
List reduced from 306 to 15 due to restriction <15>

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
1    0.001    0.001   13.567   13.567 websiteretrieval.py:1(<module>)
20    0.000    0.000    7.874    0.394 /usr/lib/python2.7/urllib2.py:122(urlopen)
20    0.000    0.000    7.874    0.394 /usr/lib/python2.7/urllib2.py:373(open)
20    0.000    0.000    7.870    0.394 /usr/lib/python2.7/urllib2.py:401(_open)
40    0.000    0.000    7.870    0.197 /usr/lib/python2.7/urllib2.py:361(_call_chain)
20    0.000    0.000    7.870    0.393 /usr/lib/python2.7/urllib2.py:1184(http_open)
20    0.001    0.000    7.870    0.393 /usr/lib/python2.7/urllib2.py:1112(do_open)
1178    7.596    0.006    7.596    0.006 {method 'recv' of '_socket.socket' objects}
20    0.000    0.000    5.911    0.296 /usr/lib/python2.7/httplib.py:953(request)
20    0.000    0.000    5.911    0.296 /usr/lib/python2.7/httplib.py:974(_send_request)
20    0.000    0.000    5.911    0.296 /usr/lib/python2.7/httplib.py:938(endheaders)
20    0.000    0.000    5.911    0.296 /usr/lib/python2.7/httplib.py:796(_send_output)
20    0.000    0.000    5.910    0.296 /usr/lib/python2.7/httplib.py:769(send)
20    0.000    0.000    5.909    0.295 /usr/lib/python2.7/httplib.py:751(connect)
20    0.001    0.000    5.909    0.295 /usr/lib/python2.7/socket.py:537(create_connection)

...so from observation, the only things that could slow you down are urlopen and open. I/O is slow, so that's understandable.

Revised Code

Fri Jan  6 17:52:36 2012    profileTadeck.dat

21008 function calls (20707 primitive calls) in 13.249 CPU seconds

Ordered by: cumulative time
List reduced from 305 to 15 due to restriction <15>

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
1    0.002    0.002   13.249   13.249 websiteretrievalTadeck.py:1(<module>)
20    0.000    0.000    7.706    0.385 /usr/lib/python2.7/urllib2.py:122(urlopen)
20    0.000    0.000    7.706    0.385 /usr/lib/python2.7/urllib2.py:373(open)
20    0.000    0.000    7.702    0.385 /usr/lib/python2.7/urllib2.py:401(_open)
40    0.000    0.000    7.702    0.193 /usr/lib/python2.7/urllib2.py:361(_call_chain)
20    0.000    0.000    7.702    0.385 /usr/lib/python2.7/urllib2.py:1184(http_open)
20    0.001    0.000    7.702    0.385 /usr/lib/python2.7/urllib2.py:1112(do_open)
1178    7.348    0.006    7.348    0.006 {method 'recv' of '_socket.socket' objects}
20    0.000    0.000    5.841    0.292 /usr/lib/python2.7/httplib.py:953(request)
20    0.000    0.000    5.841    0.292 /usr/lib/python2.7/httplib.py:974(_send_request)
20    0.000    0.000    5.840    0.292 /usr/lib/python2.7/httplib.py:938(endheaders)
20    0.000    0.000    5.840    0.292 /usr/lib/python2.7/httplib.py:796(_send_output)
20    0.000    0.000    5.840    0.292 /usr/lib/python2.7/httplib.py:769(send)
20    0.000    0.000    5.839    0.292 /usr/lib/python2.7/httplib.py:751(connect)
20    0.001    0.000    5.839    0.292 /usr/lib/python2.7/socket.py:537(create_connection)

Again, the two largest time sinks are urlopen and open. This leads me to believe that I/O plays a major role in bogging down your code. However, the difference is not substantial on the machine I tested on: the Perl script executes in roughly the same time.

real    0m11.129s
user    0m0.230s
sys 0m0.070s

I'm not convinced that the software is at fault for your slow run times, especially since your machine is pretty beefy. I strongly encourage you to run the profiler (command included above) to see if you can find any bottlenecks that I've missed.
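Since both profiles show the time dominated by waiting on the socket, one mitigation worth noting (not in either answer above, just a sketch) is to issue the requests concurrently with a thread pool. `fetch` here is a stand-in that blocks the way network I/O would; with a real urlopen call the speedup depends on the server:

```python
import time
from concurrent.futures import ThreadPoolExecutor  # Python 3 (or the 'futures' backport)

def fetch(url):
    # Stand-in for a real HTTP GET; each call blocks ~50 ms like network I/O.
    time.sleep(0.05)
    return url, "<html></html>"

urls = ["http://majorgeeks.com/page.php?id=%d" % i for i in range(1, 21)]

start = time.time()
with ThreadPoolExecutor(max_workers=10) as pool:
    # map() preserves input order even though requests overlap in time.
    results = list(pool.map(fetch, urls))
elapsed = time.time() - start

print("fetched %d pages in %.2fs" % (len(results), elapsed))
```

Serially these twenty blocking calls would take about a second; ten workers cut that to roughly two rounds of waiting.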

Makoto, answered Nov 20 '22