 

Should I use Python or Assembly for a super fast copy program

As a maintenance issue I need to routinely (3-5 times per year) copy a repository that now has over 20 million files and exceeds 1.5 terabytes in total disk space. I am currently using RICHCOPY, but have tried others. RICHCOPY seems the fastest, but I do not believe I am getting close to the limits of the capabilities of my XP machine.

I am toying around with using what I have read in The Art of Assembly Language to write a program to copy my files. My other thought is to start learning how to multi-thread in Python to do the copies.

I am toying around with the idea of doing this in Assembly because it seems interesting, but while my time is not incredibly precious, it is precious enough that I am trying to get a sense of whether I will see significant enough gains in copy speed. I am assuming that I would, but I only started really learning to program 18 months ago and it is still more or less a hobby. Thus I may be missing some fundamental concept of what happens with interpreted languages.

Any observations or experiences would be appreciated. Note, I am not looking for any code. I have already written a basic copy program in Python 2.6 that is no slower than RICHCOPY. I am looking for some observations on which will give me more speed. Right now it takes me over 50 hours to make a copy from a disk to a Drobo and then back from the Drobo to a disk. I have a LogicCube for when I am simply duplicating a disk, but sometimes I need to go from a disk to the Drobo or the reverse. I am thinking that, given that I can sector-copy a 3/4-full 2 terabyte drive using the LogicCube in under seven hours, I should be able to get close to that using Assembly, but I don't know enough to know if this is valid. (Yes, sometimes ignorance is bliss.)

The reason I need to speed it up is I have had two or three cycles where something has happened during copy (fifty hours is a long time to expect the world to hold still) that has caused me to have to trash the copy and start over. For example, last week the water main broke under our building and shorted out the power.

Thanks for the early responses, but I don't think it is an I/O limitation. I am not going over a network; the drive is plugged into my motherboard with a SATA connection, and my Drobo is plugged into a FireWire port. My thinking is that both connections should allow faster transfer.

Actually, I can't use a sector copy except when going from a single disk to the Drobo. It won't work the other way, since the Drobo file structure is a mystery. My unscientific observation is that a copy from one internal disk to another is no faster than a copy between the Drobo and an internal disk.

I am bound by the hardware, I can't afford 10K rpm 2 terabyte drives (if they even make them).

A number of you are suggesting a file-syncing solution, but that does not solve my problem. First off, the file-syncing solutions I have played with build a map (for want of a better term) of the data first; I have too many little files, so they choke. One of the reasons I use RICHCOPY is that it starts copying immediately; it does not use memory to build a map. Second, I had one of my three Drobo backups fail a couple of weeks ago. My rule is that if I have a backup failure, the other two have to stay offline until the new one is built. So I need to copy from one of the three backup single-drive copies I have that I use with the LogicCube.

At the end of the day I have to have a good copy on a single drive because that is what I deliver to my clients. Because my clients have diverse systems I deliver to them on SATA drives.

I rent some cloud space from someone where my data is also stored as the deepest backup, but it is expensive to pull it off of there.

asked Jun 06 '10 by PyNEwbie


2 Answers

Copying files is an I/O-bound process. It is unlikely that you will see any speed-up from rewriting it in assembly, and even multithreading may just cause things to go slower, as different threads requesting different files at the same time will result in more disk seeks.
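To illustrate the point (a minimal sketch, not a claim about the asker's actual program; the paths are placeholders): a plain single-threaded walk in Python keeps both disks doing sequential I/O, which is about the best a copy program can do.

    import os
    import shutil

    def copy_tree(src_root, dst_root, buf_size=16 * 1024 * 1024):
        # One reader and one writer keep both disks streaming sequentially;
        # extra threads would interleave requests and force more seeks.
        for dirpath, dirnames, filenames in os.walk(src_root):
            rel = os.path.relpath(dirpath, src_root)
            dst_dir = os.path.normpath(os.path.join(dst_root, rel))
            if not os.path.isdir(dst_dir):
                os.makedirs(dst_dir)
            for name in filenames:
                with open(os.path.join(dirpath, name), 'rb') as fsrc:
                    with open(os.path.join(dst_dir, name), 'wb') as fdst:
                        shutil.copyfileobj(fsrc, fdst, buf_size)

    copy_tree(r'D:\source_repo', r'E:\dest_repo')  # placeholder paths

No amount of assembly changes this picture: a loop like the one above spends nearly all of its time waiting on the disks, not executing Python.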

Using a standard tool is probably the best way to go here. If there is anything to optimize, you might want to consider changing your file system or your hardware.

answered by Mark Byers


As the other answer mentions (+1 to Mark), when copying files, disk I/O is the bottleneck. The language you use won't make much of a difference; how you've laid out your files will make a difference, and how you're transferring the data will make a difference.

You mentioned copying to a Drobo. How is your Drobo connected? Check out this graph of connection speeds.

Let's look at the max copy rates you can get over certain wire types:

  • USB 1.0 (low speed) = ~93 days (1.5 TB / 1.5 Mbps). Lame; at least your performance is not this bad.
  • USB 2.0 = ~7 hrs (1.5 TB / 480 Mbps). Maybe the LogicCube?
  • Fast SCSI = ~40 hrs (1.5 TB / 80 Mbps). Maybe your hard drive speed?
  • 100 Mbps ethernet = ~1.4 days (1.5 TB / 100 Mbps).
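If you want to redo the arithmetic for other link speeds, it is just bits to move divided by the link rate. A quick sketch (decimal units, ignoring protocol overhead, so real-world numbers will be somewhat worse):

    def hours_to_copy(terabytes, megabits_per_sec):
        # bits to move / link rate, ignoring protocol overhead
        bits = terabytes * 1e12 * 8
        return bits / (megabits_per_sec * 1e6) / 3600.0

    for link, mbps in [('USB low speed', 1.5), ('USB 2.0', 480),
                       ('Fast SCSI', 80), ('100 Mbps ethernet', 100)]:
        print '%-18s %8.1f hours' % (link, hours_to_copy(1.5, mbps))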

So, depending on the constraints of your problem, it's possible you can't do better. But you may want to start doing a raw disk copy (like Unix's dd), which should be much faster than a file-system level copy (it's faster because there are no random disk seeks for directory walks or fragmented files).

To use dd, you could live-boot Linux on your machine (or maybe use Cygwin?). See this page for reference, or this one about backing up from Windows using a live-boot of Ubuntu.
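Conceptually, dd with a large block size (something like dd if=/dev/sdb of=/dev/sdc bs=64M, with hypothetical device names) just streams raw blocks from one device to the other. A Python sketch of the same idea, to show why it avoids seeks:

    # Rough equivalent of a dd-style raw copy: one sequential read stream
    # and one sequential write stream, with no per-file or directory work.
    # The device paths are hypothetical; a raw copy overwrites the whole
    # destination disk, so be absolutely sure of them.
    SRC = '/dev/sdb'
    DST = '/dev/sdc'
    BUF = 64 * 1024 * 1024  # a large buffer keeps both disks streaming

    with open(SRC, 'rb') as src:
        with open(DST, 'wb') as dst:
            while True:
                chunk = src.read(BUF)
                if not chunk:
                    break
                dst.write(chunk)

Note that, as you point out, a sector-level copy only works going to the Drobo, not from it, since the Drobo's on-disk layout isn't a plain filesystem.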

If you were to organize your 1.5 TB data on a RAID, you could probably speed up the copy (because the disks will be reading in parallel), and (depending on the configuration) it'll have the added benefit of protecting you from drive failures.

answered by Stephen