Following an earlier thread I boiled down my problem to it's bare bones, in migrating from a Perl script to a Python one I found a huge performance issue with slurping files in Python. Running this on Ubuntu Server.
NB: this is not a X vs. Y thread I need to know fundamentally if this is how it is or if I'm doing something stupid.
I created my test data, 50,000 10kb files (this mirrors the avg file size of what I'm processing):
mkdir 1 cd 1 for i in {1..50000}; do dd if=/dev/zero of=$i.xml bs=1 count=10000; done cd .. cp -r 1 2
Created my 2 scripts as simply as possible:
Perl
foreach my $file (<$ARGV[0]/*.xml>){ my $fh; open($fh, "< $file"); my $contents = do { local $/; <$fh> }; close($fh); }
Python
import glob, sys for file in glob.iglob(sys.argv[1] + '/*.xml'): with open(file) as x: f = x.read()
I then cleared the caches and ran my 2 slurp scripts, between each run I cleaned the caches again using:
sync; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
Then monitored to ensure it was reading everything from disk each time:
sudo iotop -a -u me
I tried this on a physical machine with RAID 10 disks and on a brand new VM I setup where the VM is on RAID 1 SSDs, have just included the test runs from my VM as the physical server was much the same just faster.
$ time python readFiles.py 1 real 5m2.493s user 0m1.783s sys 0m5.013s $ time perl readFiles.pl 2 real 0m13.059s user 0m1.690s sys 0m2.471s $ time perl readFiles.pl 2 real 0m13.313s user 0m1.670s sys 0m2.579s $ time python readFiles.py 1 real 4m43.378s user 0m1.772s sys 0m4.731s
I noticed on iotop when Perl was running DISK READ was around 45 M/s and IOWAIT approx 70%, when running Python DISK READ was 2M/s and IOWAIT 97%. I'm not sure where to go from here having boiled them down to as simple as I can get.
In case it is relevant
$ python Python 2.7.6 (default, Mar 22 2014, 22:59:56) [GCC 4.8.2] on linux2 $ perl -v This is perl 5, version 18, subversion 2 (v5.18.2) built for x86_64-linux-gnu-thread-multi
FURTHER INFO AS REQUESTED
I ran strace and grabbed the info for file 1000.xml but all seem to do the same things:
Perl
$strace -f -T -o trace.perl.1 perl readFiles.pl 2 32303 open("2/1000.xml", O_RDONLY) = 3 <0.000020> 32303 ioctl(3, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, 0x7fff7f6f7b90) = -1 ENOTTY (Inappropriate ioctl for device) <0.000016> 32303 lseek(3, 0, SEEK_CUR) = 0 <0.000016> 32303 fstat(3, {st_mode=S_IFREG|0664, st_size=10000, ...}) = 0 <0.000016> 32303 fcntl(3, F_SETFD, FD_CLOEXEC) = 0 <0.000017> 32303 fstat(3, {st_mode=S_IFREG|0664, st_size=10000, ...}) = 0 <0.000030> 32303 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192 <0.005323> 32303 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 1808 <0.000022> 32303 read(3, "", 8192) = 0 <0.000019> 32303 close(3) = 0 <0.000017>
Python
$strace -f -T -o trace.python.1 python readFiles.py 1 32313 open("1/1000.xml", O_RDONLY) = 3 <0.000021> 32313 fstat(3, {st_mode=S_IFREG|0664, st_size=10000, ...}) = 0 <0.000017> 32313 fstat(3, {st_mode=S_IFREG|0664, st_size=10000, ...}) = 0 <0.000019> 32313 lseek(3, 0, SEEK_CUR) = 0 <0.000018> 32313 fstat(3, {st_mode=S_IFREG|0664, st_size=10000, ...}) = 0 <0.000018> 32313 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa18820a000 <0.000019> 32313 lseek(3, 0, SEEK_CUR) = 0 <0.000018> 32313 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192 <0.006795> 32313 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 1808 <0.000031> 32313 read(3, "", 4096) = 0 <0.000018> 32313 close(3) = 0 <0.000027> 32313 munmap(0x7fa18820a000, 4096) = 0 <0.000022>
One difference I noticed, not sure if it's relevant, is that Perl appears to run this against all files before it starts opening them whereas python doesn't:
32303 lstat("2/1000.xml", {st_mode=S_IFREG|0664, st_size=10000, ...}) = 0 <0.000022>
Also ran strace with -c (just took top few calls):
Perl
$ time strace -f -c perl readFiles.pl 2 % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 44.07 3.501471 23 150018 read 12.54 0.996490 10 100011 fstat 9.47 0.752552 15 50000 lstat 7.99 0.634904 13 50016 open 6.89 0.547016 11 50017 close 6.19 0.491944 10 50008 50005 ioctl 6.12 0.486208 10 50014 3 lseek 6.10 0.484374 10 50001 fcntl real 0m37.829s user 0m6.373s sys 0m25.042s
Python
$ time strace -f -c python readFiles.py 1 % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 42.97 4.186173 28 150104 read 15.58 1.518304 10 150103 fstat 10.51 1.023681 20 50242 174 open 10.12 0.986350 10 100003 lseek 7.69 0.749387 15 50047 munmap 6.85 0.667576 13 50071 close 5.90 0.574888 11 50073 mmap real 5m5.237s user 0m7.278s sys 0m30.736s
Did some parsing of the strace output with -T turned on and counted the first 8192 byte read for each file and it's clear this is where the time is going, below is the total time spent for the 50000 first reads of a file followed by the average time for each read.
300.247128000002 (0.00600446220302379) - Python 11.6845620000003 (0.000233681892724297) - Perl
Not sure if that helps!
UPDATE 2 Updated code in Python to use os.open and os.read and just do a single read of first 4096 bytes (that would work for me as info I want is in top part of file), also eliminates all the other calls in strace:
18346 open("1/1000.xml", O_RDONLY) = 3 <0.000026> 18346 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096 <0.007206> 18346 close(3) = 0 <0.000024> $ time strace -f -c python readFiles.py 1 % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 55.39 2.388932 48 50104 read 22.86 0.986096 20 50242 174 open 20.72 0.893579 18 50071 close real 4m48.751s user 0m3.078s sys 0m12.360s Total Time (avg read call) 282.28626 (0.00564290374812595)
Still no better...next up I'm going to create a VM on Azure and try there for another example!!
UPDATE 3 - Apologies for the size of this!!
Ok some interesting results using your (@J.F.Sebastian) script on 3 setups, stripped the output at start for brevity and also removed all the tests which just run super fast from cache and look like:
0.23user 0.26system 0:00.50elapsed 99%CPU (0avgtext+0avgdata 9140maxresident)k 0inputs+0outputs (0major+2479minor)pagefaults 0swaps
Azure A2 Standard VM (2 cores 3.5GB RAM Disk Unknown but slow)
$ uname -a Linux servername 3.13.0-35-generic #62-Ubuntu SMP Fri Aug 15 01:58:42 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux $ python Python 2.7.6 (default, Mar 22 2014, 22:59:56) [GCC 4.8.2] on linux2 $ perl -v This is perl 5, version 18, subversion 2 (v5.18.2) built for x86_64-linux-gnu-thread-multi (with 41 registered patches, see perl -V for more detail) + /usr/bin/time perl slurp.pl 1 1.81user 2.95system 3:11.28elapsed 2%CPU (0avgtext+0avgdata 9144maxresident)k 1233840inputs+0outputs (20major+2461minor)pagefaults 0swaps + clearcache + sync + sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' + /usr/bin/time python slurp.py 1 1.56user 3.76system 3:06.05elapsed 2%CPU (0avgtext+0avgdata 8024maxresident)k 1232232inputs+0outputs (14major+52273minor)pagefaults 0swaps + /usr/bin/time perl slurp.pl 2 1.90user 3.11system 6:02.17elapsed 1%CPU (0avgtext+0avgdata 9144maxresident)k 1233776inputs+0outputs (16major+2465minor)pagefaults 0swaps
Comparable first slurp results for both, not sure what was going on during the 2nd Perl slurp?
My VMWare Linux VM (2 cores 8GB RAM Disk RAID1 SSD)
$ uname -a Linux servername 3.13.0-32-generic #57-Ubuntu SMP Tue Jul 15 03:51:08 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux $ python Python 2.7.6 (default, Mar 22 2014, 22:59:56) [GCC 4.8.2] on linux2 $ perl -v This is perl 5, version 18, subversion 2 (v5.18.2) built for x86_64-linux-gnu-thread-multi (with 41 registered patches, see perl -V for more detail) + /usr/bin/time perl slurp.pl 1 1.66user 2.55system 0:13.28elapsed 31%CPU (0avgtext+0avgdata 9136maxresident)k 1233152inputs+0outputs (20major+2460minor)pagefaults 0swaps + clearcache + sync + sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' + /usr/bin/time python slurp.py 1 2.10user 4.67system 4:45.65elapsed 2%CPU (0avgtext+0avgdata 8012maxresident)k 1232056inputs+0outputs (14major+52269minor)pagefaults 0swaps + /usr/bin/time perl slurp.pl 2 2.13user 4.11system 5:01.40elapsed 2%CPU (0avgtext+0avgdata 9140maxresident)k 1233264inputs+0outputs (16major+2463minor)pagefaults 0swaps
This time, as before, Perl is way faster on first slurp, unsure what is happening on second Perl slurp though not seen this behaviour before. Ran measure.sh again and result was exactly the same give or take a few seconds. I then did what any normal person would do and updated the kernel to match the Azure machine 3.13.0-35-generic and ran measure.sh again and made no difference to results.
Out of curiosity I then swapped the 1 and 2 parameter in measure.sh and something strange happened..Perl slowed down and Python sped up!
+ /usr/bin/time perl slurp.pl 2 1.78user 3.46system 4:43.90elapsed 1%CPU (0avgtext+0avgdata 9140maxresident)k 1234952inputs+0outputs (21major+2458minor)pagefaults 0swaps + clearcache + sync + sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' + /usr/bin/time python slurp.py 2 1.19user 3.09system 0:10.67elapsed 40%CPU (0avgtext+0avgdata 8012maxresident)k 1233632inputs+0outputs (14major+52269minor)pagefaults 0swaps + /usr/bin/time perl slurp.pl 1 1.36user 2.32system 0:13.40elapsed 27%CPU (0avgtext+0avgdata 9136maxresident)k 1232032inputs+0outputs (17major+2465minor)pagefaults 0swaps
This has just confused me even further :-(
Physical Server (32 cores 132 GB RAM Disk RAID10 SAS)
$ uname -a Linux servername 3.5.0-23-generic #35~precise1-Ubuntu SMP Fri Jan 25 17:13:26 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux $ python Python 2.7.3 (default, Aug 1 2012, 05:14:39) [GCC 4.6.3] on linux2 $ perl -v This is perl 5, version 14, subversion 2 (v5.14.2) built for x86_64-linux-gnu-thread-multi (with 55 registered patches, see perl -V for more detail) + /usr/bin/time perl slurp.pl 1 2.22user 2.60system 0:15.78elapsed 30%CPU (0avgtext+0avgdata 43728maxresident)k 1233264inputs+0outputs (15major+2984minor)pagefaults 0swaps + clearcache + sync + sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' + /usr/bin/time python slurp.py 1 2.51user 4.79system 1:58.53elapsed 6%CPU (0avgtext+0avgdata 34256maxresident)k 1234752inputs+0outputs (16major+52385minor)pagefaults 0swaps + /usr/bin/time perl slurp.pl 2 2.17user 2.95system 0:06.96elapsed 73%CPU (0avgtext+0avgdata 43744maxresident)k 1232008inputs+0outputs (14major+2987minor)pagefaults 0swaps
Here Perl seems to win every time.
baffled
Given the oddity on my local VM, when I swapped directories, which is the machine I have most control over I'm going to try a binary approach on all the possible options of running python vs perl using 1 or 2 as the data directory and try to run them multiple times for consistency but it'll take a while and I'm going a little crazy so break may be required first! All I want is consistency :-(
UPDATE 4 - Consistency
(Below is run on an ubuntu-14.04.1-server VM, Kernel is 3.13.0-35-generic #62-Ubuntu)
I think I've found some consistency, running the tests every way possible for Python/Perl slurp on data dir 1/2 I found the following:
So I looked at OS level copying and it seems like on Ubuntu 'cp' behaves in the same way as Python, i.e. slow on original files and fast on copied files.
This is what I ran and the results, I did this a few times on a machine with a single SATA HD and on a RAID10 system, results:
$ mkdir 1 $ cd 1 $ for i in {1..50000}; do dd if=/dev/urandom of=$i.xml bs=1K count=10; done $ cd .. $ cp -r 1 2 $ sync; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' $ time strace -f -c -o trace.copy2c cp -r 2 2copy real 0m28.624s user 0m1.429s sys 0m27.558s $ sync; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' $ time strace -f -c -o trace.copy1c cp -r 1 1copy real 5m21.166s user 0m1.348s sys 0m30.717s
Trace results show where time is being spent
$ head trace.copy1c trace.copy2c ==> trace.copy1c <== % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 60.09 2.541250 25 100008 read 12.22 0.516799 10 50000 write 9.62 0.406904 4 100009 open 5.59 0.236274 2 100013 close 4.80 0.203114 4 50004 1 lstat 4.71 0.199211 2 100009 fstat 2.19 0.092662 2 50000 fadvise64 0.72 0.030418 608 50 getdents ==> trace.copy2c <== % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 47.86 0.802376 8 100008 read 13.55 0.227108 5 50000 write 13.02 0.218312 2 100009 open 7.36 0.123364 1 100013 close 6.83 0.114589 1 100009 fstat 6.31 0.105742 2 50004 1 lstat 3.38 0.056634 1 50000 fadvise64 1.62 0.027191 544 50 getdents
So it seems copying copies is much faster than copying original files, my current guess is that when copied the files get aligned on disk better than when they were originally created making them more efficient to read?
Interestingly 'rsyn' and 'cp' seem to work in opposite ways speedwise, much like Perl and Python!
$ rm -rf 1copy 2copy; sync; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'; echo "Rsync 1"; /usr/bin/time rsync -a 1 1copy; sync; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'; echo "Rsync 2"; /usr/bin/time rsync -a 2 2copy Rsync 1 3.62user 3.76system 0:13.00elapsed 56%CPU (0avgtext+0avgdata 5072maxresident)k 1230600inputs+1200000outputs (13major+2684minor)pagefaults 0swaps Rsync 2 4.87user 6.52system 5:06.24elapsed 3%CPU (0avgtext+0avgdata 5076maxresident)k 1231832inputs+1200000outputs (13major+2689minor)pagefaults 0swaps $ rm -rf 1copy 2copy; sync; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'; echo "Copy 1"; /usr/bin/time cp -r 1 1copy; sync; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'; echo "Copy 2"; /usr/bin/time cp -r 2 2copy Copy 1 0.48user 6.42system 5:05.30elapsed 2%CPU (0avgtext+0avgdata 1212maxresident)k 1229432inputs+1200000outputs (6major+415minor)pagefaults 0swaps Copy 2 0.33user 4.17system 0:11.13elapsed 40%CPU (0avgtext+0avgdata 1212maxresident)k 1230416inputs+1200000outputs (6major+414minor)pagefaults 0swaps
In summary: code is slowed down by the compilation and interpretation that occurs during runtime. Compare this to a statically typed, compiled language which runs just the CPU instructions once compilated. It's actually possible to extend Python with compiled modules that are written in C.
Why is Python slow? The default implementation of Python 'CPython' uses GIL (Global Interpreter Lock) to execute exactly one thread at the same time, even if run on a multi-core processor as GIL works only on one core regardless of the number of cores present in the machine.
Python programs are generally expected to run slower than Java programs, but they also take much less time to develop. Python programs are typically 3-5 times shorter than equivalent Java programs. This difference can be attributed to Python's built-in high-level data types and its dynamic typing.
I will focus on only one of your examples, because rest things should be analogical:
What I think, may matter in this situation is Read-Ahead (or maybe another technique related to this) feature:
Let consider such example:
I have created 1000 xml files in "1" dir (names 1.xml to 1000.xml) as you did by dd command and then I copied orginal dir 1 to dir 2
$ mkdir 1 $ cd 1 $ for i in {1..1000}; do dd if=/dev/urandom of=$i.xml bs=1K count=10; done $ cd .. $ cp -r 1 2 $ sync; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' $ time strace -f -c -o trace.copy2c cp -r 2 2copy $ sync; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' $ time strace -f -c -o trace.copy1c cp -r 1 1copy
In the next step I debugged cp command (by strace) to found out in what order data are copied:
So cp does it in following order (only first 4 files, because I saw that the second read from original directory is more time consuming that second read from copied directory)
100.xml 150.xml 58.xml 64.xml ... * in my example
Now, take a look on filesystem blocks which are used by these files (debugfs output - ext3 fs):
Original directory:
BLOCKS: (0-9):63038-63047 100.xml (0-9):64091-64100 150.xml (0-9):57926-57935 58.xml (0-9):60959-60968 64.xml .... Copied directory: BLOCKS: (0-9):65791-65800 100.xml (0-9):65801-65810 150.xml (0-9):65811-65820 58.xml (0-9):65821-65830 64.xml
....
As you can see, in the "Copied directory" the block are adjacent, so it means that during reading of the first file 100.xml the "Read Ahead" technique (controller or system settings) can increase performance.
dd create file in order 1.xml to 1000.xml, but cp command copies it in another order (100.xml, 150.xml, 58.xml,64.xml). So when you execute:
cp -r 1 1copy
to copy this dir to another, the blocks of files which you are copied are not adjacent, so read of such files take more time.
When you copy dir which you copied by cp command (so files are not created by dd command), then file are adjacent so creating:
cp -r 2 2copy
copy of the copy is faster.
Summary: So to test performance python/perl you should use the same dir (or two dirs copied by cp command) and also you can use option O_DIRECT to read bypassing all kernel buffers and read data from disk directly.
Please remember, that results can be different on different type of kernel, system, disk controller, system settings, fs and so on.
Additions:
[debugfs] [root@dhcppc3 test]# debugfs /dev/sda1 debugfs 1.39 (29-May-2006) debugfs: cd test debugfs: stat test.xml Inode: 24102 Type: regular Mode: 0644 Flags: 0x0 Generation: 3385884179 User: 0 Group: 0 Size: 4 File ACL: 0 Directory ACL: 0 Links: 1 Blockcount: 2 Fragment: Address: 0 Number: 0 Size: 0 ctime: 0x543274bf -- Mon Oct 6 06:53:51 2014 atime: 0x543274be -- Mon Oct 6 06:53:50 2014 mtime: 0x543274bf -- Mon Oct 6 06:53:51 2014 BLOCKS: (0):29935 TOTAL: 1 debugfs:
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With