I have never used threads before, but think I may have encountered an opportunity:
I have written a script that chews through an array of ~500 Excel files, and uses Spreadsheet::ParseExcel to pull values from specific sheets in the workbook (on average, two sheets per workbook; one cell extracted per sheet).
Running it now, where I just go through the array of files one by one and extract the relevant info from the file, it takes about 45 minutes to complete.
My question is: is this an opportunity to use threads, and have more than one file get hit at a time*, or should I maybe just accept the 45 minute run time?
(* - if this is a gross misunderstanding of what I can do with threads, please say so!)
Thanks in advance for any guidance you can offer!
Edit - adding example code. The code below is a sub that is called in a foreach loop for each file location stored in an array:
# Init the parser
my $parser   = Spreadsheet::ParseExcel->new;
my $workbook = $parser->parse($inputFile)
    or die "Unable to load $inputFile: " . $parser->error();

# Get a list of any sheets that have 'QA' in the sheet name
my @sheetsToScan;
foreach my $sheet ($workbook->worksheets) {
    if ($sheet->get_name =~ m/QA/) {
        push @sheetsToScan, $sheet->get_name;
    }
}
shift @sheetsToScan;    # skip the first matching sheet

# Extract the value from the appropriate cell
foreach (@sheetsToScan) {
    my $worksheet = $workbook->worksheet($_);
    my $cell;
    if ($_ =~ m/Production/ or $_ =~ m/Prod/) {
        $cell = $worksheet->get_cell(1, 1);
    } else {
        $cell = $worksheet->get_cell(6, 1);
    }
    my $value = $cell ? $cell->value : undef;
    $value = "Not found." if not defined $value;
    push @outputBuffer, $value;    # collect the extracted value for this sheet
}
Threads (or multiple processes created with fork) allow your script to use more than one CPU at a time. For many tasks, this can save a lot of wall-clock time, but it will not reduce the total CPU time (and the overhead of starting and managing threads and processes may even increase it). Here are the situations where threading/multiprocessing will not be helpful:
the task of your script does not lend itself to parallelization -- when each step of your algorithm depends on the previous steps
the task your script performs is fast and lightweight compared to the overhead of creating and managing a new thread or new process
your system only has one CPU or your script is only enabled to use one CPU
your task is constrained by a different resource than CPU, such as disk access, network bandwidth, or memory -- if your task involves processing large files that you download through a slow network connection, then your network is the bottleneck, and processing the file on multiple CPUs will not help. Likewise, if your task consumes 70% of your system's memory, then using a second and third thread will require paging to your swap space and will not save any time. Parallelization will also be less effective if your threads compete for some synchronized resource -- file locks, database access, etc.
you need to be considerate of other users on your system -- if you are using all the cores on a machine, then other users will have a poor experience
[added, threads only] your code uses any package that is not thread-safe. Most pure Perl code will be thread-safe, but packages that use XS may not be
[added] when you are still actively developing your core task. Debugging is a lot harder in parallel code
Even if none of these apply, it is sometimes hard to tell how much a task will benefit from parallelization, and the only way to be sure is to actually implement the parallel task and benchmark it. But the task you have described looks like it could be a good candidate for parallelization.
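If it helps, here is a minimal sketch of such a benchmark; the glob pattern and the extract_from_workbook() sub are assumptions standing in for your own file list and parsing code:

use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Placeholder for the real per-file work (e.g. the sub shown in the question)
sub extract_from_workbook { my ($file) = @_; return "value from $file" }

my @files = glob('*.xls');                 # assumption: your list of workbooks
my $t0    = [gettimeofday];
extract_from_workbook($_) for @files;      # the current serial loop
printf "serial run: %.2f s\n", tv_interval($t0);
# Time the parallel version the same way and compare the wall-clock numbers.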
It seems to me that your task should benefit from multiple lines of execution (processes or threads), as it appears to have a very roughly even blend of I/O and CPU. I would expect a speedup by a factor of a few, but it is hard to tell without knowing the details.
One way is to break the list of files into groups, as many as there are cores that you can spare. Then process each group in a fork, which assembles its results and passes them back to the parent once done, via a pipe or files. There are modules that do this and much more, for example Forks::Super or Parallel::ForkManager. They also offer a queue, another approach you can use.
I do this regularly when a lot of data in files is involved and get near linear speedup with up to 4 or 5 cores (on NFS), or even with more cores depending on the job details and on hardware.
I would cautiously assert that this may be simpler than threads, so it may be worth trying first.
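For instance, a minimal sketch with Parallel::ForkManager could look like the following; the glob pattern, the core count, and the extract_from_workbook() sub are assumptions standing in for your own file list and parsing code:

use strict;
use warnings;
use Parallel::ForkManager;

# Placeholder for the real per-file parsing (e.g. the sub shown in the question)
sub extract_from_workbook { my ($file) = @_; return "value from $file" }

my @files  = glob('*.xls');    # assumption: your list of workbooks
my $ncores = 4;                # assumption: cores you can spare

# Round-robin the files into one group per core
my @groups;
my $i = 0;
push @{ $groups[ $i++ % $ncores ] }, $_ for @files;

my @output;
my $pm = Parallel::ForkManager->new($ncores);
# Runs in the parent when a child exits; collects the results it passed back
$pm->run_on_finish(sub {
    my ($pid, $exit, $ident, $signal, $core, $data) = @_;
    push @output, @$data if $data;
});

for my $group (@groups) {
    $pm->start and next;                                       # parent: move on to next group
    my @results = map { extract_from_workbook($_) } @$group;   # child: process its group
    $pm->finish(0, \@results);                                 # child: exit, sending results back
}
$pm->wait_all_children;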
Another way would be to create a thread queue (Thread::Queue) and feed it groups of filenames. Note that Perl's threads are not the lightweight "threads" one might expect; quite the opposite, they are heavy: each new thread copies the whole interpreter and its data (so start them up front, before there is much data in the program), and they come with other subtleties. Have a small number of workers with a sizable job (a nice list of files) for each, instead of many threads rapidly hitting the queue.
In this approach, too, be careful about how to pass results back since frequent communication poses a significant overhead for (Perl's) threads.
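A minimal sketch of that approach, again assuming a hypothetical extract_from_workbook() sub, a worker count, and a batch size that you would tune for your machine:

use strict;
use warnings;
use threads;
use Thread::Queue;

# Placeholder for the real per-file parsing (e.g. the sub shown in the question)
sub extract_from_workbook { my ($file) = @_; return "value from $file" }

my @files   = glob('*.xls');    # assumption: your list of workbooks
my $workers = 4;                # assumption: adjust for your machine
my $queue   = Thread::Queue->new();

# Start the workers up front, before the program holds much data
my @threads = map {
    threads->create({ context => 'list' }, sub {
        my @out;
        # Each queue item is a reference to a batch of filenames
        while (defined(my $batch = $queue->dequeue())) {
            push @out, extract_from_workbook($_) for @$batch;
        }
        return @out;            # results go back once, via join
    });
} 1 .. $workers;

# Feed the queue batches of files (the batch size is an assumption to tune),
# then one undef per worker to signal "no more work"
while (my @batch = splice(@files, 0, 20)) {
    $queue->enqueue(\@batch);
}
$queue->enqueue((undef) x $workers);

my @results = map { $_->join() } @threads;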
In either case it is important that the groups are formed so as to provide a balanced workload per thread/process. If this is not possible (you may not know which files will take much longer than others), then have threads take smaller batches, while for forks use a queue from a module.
Handing only a file or a few to a thread or a process is most likely far too light a workload, in which case the overhead of managing them may erase (or reverse) possible speed gains. The I/O overlap across threads/processes would also increase, which is the main limit to speedup here.
The optimal number of files to pass to a thread/process is hard to estimate, even with all the details on hand; you just have to try. I assume that the reported runtime (over 5 seconds per file) is due to some inefficiency that can be removed, so first check your code for undue inefficiencies. If a file really does take that long to process, then start by passing a single file at a time to the queue.
Also, please consider mob's answer carefully. And note that these are advanced techniques.