
How can I make PowerShell parse XML faster or further optimize my script?

I have a setup that contains 7 million XML files, varying in size from a few KB to multiple MB. All in all, it's about 180GB of XML files. The job I need performed is to analyze each XML file and determine whether it contains the string <ref>, and if it does not, to move it out of the Chunk folder it is currently in and into the Referenceless folder.

The script I have created works well enough, but it's extremely slow for my purposes. It's slated to finish analyzing all 7 million files in about 24 days, going at a rate of about 3 files per second. Is there anything I can change in my script to eke out more performance?

Also, to make matters more complicated, I do not have the permissions on my server box to run .ps1 files, so the script needs to be runnable from the PowerShell prompt in one command. I would set the permissions if I had the authorization to.

# This script will iterate through the Chunk folders, removing pages that contain no
# references and putting them into the Referenceless folder.

# Change this variable to start the program on a different chunk. This is the first
# command to be run in Windows PowerShell.
$chunknumber = 1
# This while loop is the second command to be run in Windows PowerShell. It will stop after completing Chunk 113.
while ($chunknumber -le 113) {
    # Jumps the terminal to the correct folder.
    cd C:\Wiki_Pages
    # Creates an index for the chunk being worked on.
    $items = Get-ChildItem -Path "Chunk_$chunknumber"
    echo "Chunk $chunknumber Indexed"
    # Jumps to chunk folder.
    cd C:\Wiki_Pages\Chunk_$chunknumber
    # Loops through the index. Each entry is one of the pages.
    foreach ($page in $items) {
        # Creates a variable holding the page's content.
        $content = Get-Content $page
        # If the page has a reference, then it's echoed.
        if ($content | Select-String "<ref>" -quiet) { echo "Referenced!" }
        # If the page doesn't have a reference, it's copied to Referenceless then deleted.
        else {
            Copy-Item $page C:\Wiki_Pages\Referenceless -force
            Remove-Item $page -force
            echo "Moved to Referenceless!"
        }
    }
    # The chunk number is increased by one and the cycle continues.
    $chunknumber = $chunknumber + 1
}

I have very little knowledge of PowerShell; yesterday was the first time I had ever opened the program.

user55397 asked Sep 17 '25 20:09


2 Answers

Add the -ReadCount 0 argument to your Get-Content commands to speed them up (it helps tremendously). With -ReadCount 0, Get-Content sends all the lines of a file through the pipeline as a single array instead of one object per line. I learned this tip from this great article, which shows that running a foreach over a whole file's contents is faster than parsing it through a pipeline.
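To make that concrete, here is a minimal sketch of how the loop from the question could look with -ReadCount 0 and a plain foreach instead of a pipeline. The function names Test-HasRef and Move-ReferencelessPages are made up for this example. It also assumes the chunk folders contain only files, that the Referenceless folder already exists, and it swaps the Copy-Item/Remove-Item pair for a single Move-Item. I have not timed it on a data set of this size, so treat it as a starting point.

```powershell
# Hypothetical helper (not from the original post): returns $true when any
# line of the file contains "<ref>" (case-insensitive, like Select-String).
function Test-HasRef {
    param([string]$Path)
    # -ReadCount 0 delivers every line of the file as a single array.
    $lines = Get-Content -LiteralPath $Path -ReadCount 0
    foreach ($line in $lines) {
        if ($line -like '*<ref>*') { return $true }
    }
    return $false
}

# Hypothetical driver: assumes $Root contains Chunk_<n> folders holding only
# files, plus an existing Referenceless folder.
function Move-ReferencelessPages {
    param([string]$Root, [int]$FirstChunk = 1, [int]$LastChunk = 113)
    $target = Join-Path $Root 'Referenceless'
    foreach ($n in $FirstChunk..$LastChunk) {
        $chunk = Join-Path $Root "Chunk_$n"
        foreach ($page in Get-ChildItem -LiteralPath $chunk) {
            if (-not (Test-HasRef -Path $page.FullName)) {
                # Move-Item stands in for the original copy-then-delete pair.
                Move-Item -LiteralPath $page.FullName -Destination $target -Force
            }
        }
    }
}

# Example call for the layout described in the question:
# Move-ReferencelessPages -Root 'C:\Wiki_Pages'
```

Both functions can be pasted straight into the console, so no .ps1 file is needed. Dropping the per-file echo calls is deliberate: writing a line to the console for every one of 7 million files adds work of its own.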

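Also, you can use Set-ExecutionPolicy Bypass -Scope Process to run scripts in your current PowerShell session without needing extra permissions! The one exception is an execution policy enforced through Group Policy, which takes precedence over the Process scope.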
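For reference, the session would look roughly like this. It is only a sketch: whether it works on your server depends on how the box is managed, and if a policy is set at the MachinePolicy or UserPolicy scope, the first command will report that the setting is overridden.

```powershell
# Applies only to the current PowerShell process; nothing is written to the
# machine or user configuration, so it ends when the window is closed.
Set-ExecutionPolicy -ExecutionPolicy Bypass -Scope Process -Force

# Lists the policy at every scope so you can check which one is in effect.
Get-ExecutionPolicy -List
```

After that, .ps1 files can be run from the same window for as long as it stays open.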

SpellingD answered Sep 21 '25 02:09


The PowerShell pipeline can be markedly slower than native system calls.

PowerShell: pipeline performance

In this article, a performance test compares two equivalent commands, one executed in PowerShell and one in a classic Windows command prompt.

PS> grep [0-9] numbers.txt | wc -l > $null
CMD> cmd /c "grep [0-9] numbers.txt | wc -l > nul"

Here's a sample of its output.

PS C:\temp> 1..5 | % { .\perf.ps1 ([Math]::Pow(10, $_)) }

10 iterations

   30 ms  (   0 lines / ms)  grep in PS
   15 ms  (   1 lines / ms)  grep in cmd.exe

100 iterations

   28 ms  (   4 lines / ms)  grep in PS
   12 ms  (   8 lines / ms)  grep in cmd.exe

1000 iterations

  147 ms  (   7 lines / ms)  grep in PS
   11 ms  (  89 lines / ms)  grep in cmd.exe

10000 iterations

 1347 ms  (   7 lines / ms)  grep in PS
   13 ms  ( 786 lines / ms)  grep in cmd.exe

100000 iterations

13410 ms  (   7 lines / ms)  grep in PS
   22 ms  (4580 lines / ms)  grep in cmd.exe
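
For a rough check that does not depend on grep or wc being installed, a sketch along these lines times the same trivial work once through the pipeline and once through a plain foreach statement. The variable names and the 100,000-element range are my own choices, not taken from the article, and the timings will vary by machine and PowerShell version.

```powershell
# Sample data: 100,000 integers held in memory.
$numbers = 1..100000

# Same work twice: once through the pipeline, once with a foreach statement.
$pipeline = Measure-Command { $numbers | ForEach-Object { $_ * 2 } | Out-Null }
$loop     = Measure-Command { foreach ($n in $numbers) { $null = $n * 2 } }

# Report both timings side by side.
'{0:N0} ms pipeline, {1:N0} ms foreach' -f $pipeline.TotalMilliseconds, $loop.TotalMilliseconds
```

If the pipeline figure comes out noticeably higher on your machine, the same per-object overhead is being paid for every line of every XML file in the original script.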

EDIT: The original answer to this question mentioned pipeline performance along with some other suggestions. To keep this post succinct I've removed the other suggestions that didn't actually have anything to do with pipeline performance.

Sean Glover answered Sep 21 '25 01:09