I have a Perl script which fetches HTML pages. I tried rewriting it in Python (just because I am trying to learn Python) and found it to be really slow!
Here's the test script in Perl:
#!/usr/bin/perl
use LWP::Simple;
$url = "http://majorgeeks.com/page.php?id=";
open(WEB, ">>" . "perldata.txt");
for ($column = 1; $column <= 20; $column++)
{
    $temp = $url . $column;
    print "val = $temp\n\n";
    $response = get($temp) or die("[-] Failed!!\n");
    print WEB "$response\n\n";
}
And here's the equivalent code in Python:
import urllib2
url = "http://majorgeeks.com/page.php?id="
f = open("pydata.txt", 'w')
for i in range(20):
    tempurl = url + str(i+1)
    print "Val : " + tempurl + "\n\n"
    #req = urllib2.Request(tempurl)
    res = urllib2.urlopen(tempurl)
    f.write(res.read())
f.close()
The difference I found is huge! The Perl script finishes in approximately 30 seconds, while the Python script takes approximately 7 minutes (420 seconds)!
I am using Ubuntu 11.10, 64-bit, Core i7, and tested this on a 12 Mbps connection. I tried it several times, and every time I get the same difference.
Am I doing something wrong here? Do I need to change something? Or is the difference justified? (I hope not.)
Thanks a lot for your help.
Update 3: I just came home, booted my laptop, ran the code again, and it finished in 11 seconds! Was it because I rebooted my computer? Here's the profiler output.
Note: Perl still took 31 seconds for the same thing!
Update 2: As suggested by @Makoto, here's the profiler data I collected. And it really is slow! I know some Python configuration has to do with this, but I don't know what. One simple request shouldn't take 20 seconds!
Update: Fixed url to tempurl. Commented out urllib2.Request as suggested here. Not much difference at all.
Your code could be improved, although I am not sure whether it will fix all of the performance problems:
from urllib2 import urlopen
url = "http://majorgeeks.com/page.php?id={}"
with open("pydata.txt", 'w') as f:
    for i in xrange(1, 21):
        tempurl = url.format(i)
        print "Val : {}\n\n".format(tempurl)
        f.write(urlopen(tempurl).read())
I also changed it logically: it now requests different URLs (defined by tempurl), whereas it used to request the same URL 20 times (defined by url). I also used string formatting, although I am not sure how it influences efficiency.
I tested it on my system (Windows 7 64-bit, Python 2.7.2, within IDLE, moderate internet connection) and it took 40 seconds (40.262) to finish.
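If you want to reproduce the timing yourself, a minimal sketch (the timing wrapper is my own addition, not part of the original snippet) is to wrap the same download loop in a wall-clock timer:
import time
from urllib2 import urlopen

# Wall-clock timing of the same download loop as above.
url = "http://majorgeeks.com/page.php?id={}"
start = time.time()
with open("pydata.txt", 'w') as f:
    for i in xrange(1, 21):
        f.write(urlopen(url.format(i)).read())
print "Total: {:.1f} seconds".format(time.time() - start)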
I still have to scratch my head and figure out why this code is taking so long for both @mayjune and @Tadeck. I've had a chance to run both pieces of code formally through a profiler, and here are the results. I strongly encourage you to run these tests for yourself on your machine, since mine will produce different results (AMD Athlon II X4 @ 3GHz, 8GB RAM, Ubuntu 11.04 x64, 7Mbit line).
To run:
python -m cProfile -o profile.dat <path/to/code.py>; python -m pstats profile.dat
(From inside of the profiler, you can check help for commands.)
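If you prefer not to drive pstats interactively, the same report can be produced from a short script. This is just a sketch that assumes profile.dat was written by the cProfile command above:
import pstats

# Load the saved profile and print the same view as the interactive session:
# sorted by cumulative time, restricted to the top 15 entries.
stats = pstats.Stats("profile.dat")
stats.sort_stats("cumulative")
stats.print_stats(15)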
Fri Jan 6 17:49:29 2012 profile.dat
20966 function calls (20665 primitive calls) in 13.566 CPU seconds
Ordered by: cumulative time
List reduced from 306 to 15 due to restriction <15>
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.001 0.001 13.567 13.567 websiteretrieval.py:1(<module>)
20 0.000 0.000 7.874 0.394 /usr/lib/python2.7/urllib2.py:122(urlopen)
20 0.000 0.000 7.874 0.394 /usr/lib/python2.7/urllib2.py:373(open)
20 0.000 0.000 7.870 0.394 /usr/lib/python2.7/urllib2.py:401(_open)
40 0.000 0.000 7.870 0.197 /usr/lib/python2.7/urllib2.py:361(_call_chain)
20 0.000 0.000 7.870 0.393 /usr/lib/python2.7/urllib2.py:1184(http_open)
20 0.001 0.000 7.870 0.393 /usr/lib/python2.7/urllib2.py:1112(do_open)
1178 7.596 0.006 7.596 0.006 {method 'recv' of '_socket.socket' objects}
20 0.000 0.000 5.911 0.296 /usr/lib/python2.7/httplib.py:953(request)
20 0.000 0.000 5.911 0.296 /usr/lib/python2.7/httplib.py:974(_send_request)
20 0.000 0.000 5.911 0.296 /usr/lib/python2.7/httplib.py:938(endheaders)
20 0.000 0.000 5.911 0.296 /usr/lib/python2.7/httplib.py:796(_send_output)
20 0.000 0.000 5.910 0.296 /usr/lib/python2.7/httplib.py:769(send)
20 0.000 0.000 5.909 0.295 /usr/lib/python2.7/httplib.py:751(connect)
20 0.001 0.000 5.909 0.295 /usr/lib/python2.7/socket.py:537(create_connection)
...so from observation, the only thing that could slow you down is... urlopen and open. I/O is slow, so that's sort of understandable.
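Since almost all of the cumulative time sits in the socket recv calls, a quick follow-up check (a sketch of my own, not part of either original script) is to time each request separately, to see whether the delay is per-request or spread across the downloads:
import time
import urllib2

url = "http://majorgeeks.com/page.php?id="
for i in xrange(1, 21):
    start = time.time()
    data = urllib2.urlopen(url + str(i)).read()
    # Roughly constant per-request times point at connection setup overhead;
    # times that grow with page size point at the transfer itself.
    print "id=%d: %.2f s, %d bytes" % (i, time.time() - start, len(data))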
Fri Jan 6 17:52:36 2012 profileTadeck.dat
21008 function calls (20707 primitive calls) in 13.249 CPU seconds
Ordered by: cumulative time
List reduced from 305 to 15 due to restriction <15>
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.002 0.002 13.249 13.249 websiteretrievalTadeck.py:1(<module>)
20 0.000 0.000 7.706 0.385 /usr/lib/python2.7/urllib2.py:122(urlopen)
20 0.000 0.000 7.706 0.385 /usr/lib/python2.7/urllib2.py:373(open)
20 0.000 0.000 7.702 0.385 /usr/lib/python2.7/urllib2.py:401(_open)
40 0.000 0.000 7.702 0.193 /usr/lib/python2.7/urllib2.py:361(_call_chain)
20 0.000 0.000 7.702 0.385 /usr/lib/python2.7/urllib2.py:1184(http_open)
20 0.001 0.000 7.702 0.385 /usr/lib/python2.7/urllib2.py:1112(do_open)
1178 7.348 0.006 7.348 0.006 {method 'recv' of '_socket.socket' objects}
20 0.000 0.000 5.841 0.292 /usr/lib/python2.7/httplib.py:953(request)
20 0.000 0.000 5.841 0.292 /usr/lib/python2.7/httplib.py:974(_send_request)
20 0.000 0.000 5.840 0.292 /usr/lib/python2.7/httplib.py:938(endheaders)
20 0.000 0.000 5.840 0.292 /usr/lib/python2.7/httplib.py:796(_send_output)
20 0.000 0.000 5.840 0.292 /usr/lib/python2.7/httplib.py:769(send)
20 0.000 0.000 5.839 0.292 /usr/lib/python2.7/httplib.py:751(connect)
20 0.001 0.000 5.839 0.292 /usr/lib/python2.7/socket.py:537(create_connection)
Again, the two largest culprits of time being spent are urlopen and open. This leads me to believe that I/O has a major role in bogging down your code. However, the difference is not substantial on the machine that I've tested it on - Perl's script executes in roughly the same time.
real 0m11.129s
user 0m0.230s
sys 0m0.070s
I'm not convinced that it's the software's fault that your code is slow, although your machine is pretty beefy. I strongly encourage you to run the profiler suite (the command is included above) to see if you can find any bottlenecks that I've missed.