Find and Replace in a Large File

Tags:

powershell

I want to find a piece of text in a large XML file and replace it with some other text. The file is around 50GB. I want to do this from the command line. I am looking at PowerShell and want to know if it can handle a file of this size.

Currently I am trying something like this, but it does not like it:

Get-Content C:\File1.xml | Foreach-Object {$_ -replace "xmlns:xsi=\"http:\/\/www\.w3\.org\/2001\/XMLSchema-instance\"", ""} | Set-Content C:\File1.xml

I want to replace the text xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" with an empty string "".

Questions

  1. Can PowerShell handle large files?
  2. I don't want the replace to happen in memory and prefer streaming assuming that will not bring the server to its knees.
  3. Are there any other approaches I can take (different tools/strategy?)

Thanks

Raj73 asked May 06 '10 19:05



4 Answers

I had a similar need (and similar lack of powershell experience) but cobbled together a complete answer from the other answers on this page plus a bit more research.

I also wanted to avoid the regex processing, since I didn't need it either -- just a simple string replace -- but on a large file, so I didn't want it loaded into memory.

Here's the command I used (adding linebreaks for readability):

Get-Content sourcefile.txt |
    Foreach-Object {$_.Replace('http://example.com', 'http://another.example.com')} |
    Set-Content result.txt

Worked perfectly! Never sucked up much memory (it very obviously didn't load the whole file into memory), and just chugged along for a few minutes then finished.
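
For anyone comparing the two options, here is a small sketch of the difference between the .Replace() method and the -replace operator. The sample line is invented for illustration; [regex]::Escape is the standard .NET helper for turning literal text into a safe regex pattern.

```powershell
# Sample input, made up for this illustration.
$line = '<root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" a.b="1">'
$find = 'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"'

# String.Replace treats $find as literal text (case-sensitive); no regex is involved.
$literal = $line.Replace($find, '')

# -replace treats its pattern as a regex (case-insensitive by default),
# so the literal text is escaped first.
$viaRegex = $line -replace [regex]::Escape($find), ''

# Both give the same result for this input.
$literal -eq $viaRegex
```

If the search text is a fixed string, the method form avoids having to think about regex metacharacters such as the dots in the URL.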

Rob Whelan answered Oct 06 '22 07:10


Aside from worrying about reading the file in chunks to avoid loading it into memory, you need to dump to disk often enough that you aren't storing the entire contents of the resulting file in memory.

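Get-Content sourcefile.txt -ReadCount 10000 |
    Foreach-Object {
        # With -ReadCount, $_ is an array of up to 10000 lines; .Replace() runs on each one (PowerShell 3.0+)
        $lines = $_.Replace('http://example.com', 'http://another.example.com')
        # Append the batch to the output file right away; result.txt should not exist beforehand
        Add-Content -Path result.txt -Value $lines
    }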

The -ReadCount <number> parameter sets how many lines are read at a time, so each object that reaches ForEach-Object is a batch of up to that many lines, and each batch is appended to the output file as soon as it is read. For a 30GB file filled with SQL inserts, I topped out around 200MB of memory and 8% CPU. By contrast, piping everything into Set-Content hit 3GB of memory before I killed it.
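
As a variation on the same idea, here is a minimal sketch that keeps a single output handle open through the .NET StreamReader and StreamWriter classes instead of reopening the file for every batch. The function name and parameter names are hypothetical, and it assumes PowerShell 5 or later (for ::new), full paths (.NET resolves relative paths against the process working directory), and that default UTF-8 handling is acceptable. Note that WriteLine normalizes line endings to the platform default.

```powershell
# Hypothetical helper; the name and parameters are illustrative only.
function Replace-TextStreaming {
    param(
        [Parameter(Mandatory)] [string] $Source,
        [Parameter(Mandatory)] [string] $Target,
        [Parameter(Mandatory)] [string] $Find,
        [string] $Replacement = ''
    )
    $reader = $null
    $writer = $null
    try {
        $reader = [System.IO.StreamReader]::new($Source)
        $writer = [System.IO.StreamWriter]::new($Target)
        # ReadLine returns $null at end of file, so only one line is held at a time.
        while ($null -ne ($line = $reader.ReadLine())) {
            $writer.WriteLine($line.Replace($Find, $Replacement))
        }
    }
    finally {
        # Dispose flushes buffered output and releases the file handles.
        if ($writer) { $writer.Dispose() }
        if ($reader) { $reader.Dispose() }
    }
}
```

It could be called as: Replace-TextStreaming -Source C:\data\sourcefile.txt -Target C:\data\result.txt -Find 'http://example.com' -Replacement 'http://another.example.com' (paths invented for the example).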

Digital Coyote answered Oct 06 '22 07:10


It does not like it because you can't read from a file and write back to it at the same time using Get-Content/Set-Content. I recommend using a temp file and then at the end, rename file1.xml to file1.xml.bak and rename the temp file to file1.xml.
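
A rough sketch of that temp-file-and-rename sequence might look like the following. The function name, the .tmp suffix, and the parameter names are assumptions made for illustration, and it expects that no .bak file already exists next to the original.

```powershell
# Hypothetical helper; the name, parameters and .tmp suffix are illustrative only.
function Update-FileWithBackup {
    param(
        [Parameter(Mandatory)] [string] $Path,
        [Parameter(Mandatory)] [string] $Find,
        [string] $Replacement = ''
    )
    # Stop on the first error so a failed rewrite never replaces the original.
    $ErrorActionPreference = 'Stop'
    $temp = "$Path.tmp"

    # Stream from the original into a separate temp file, 1000 lines per batch.
    Get-Content -LiteralPath $Path -ReadCount 1000 |
        ForEach-Object { $_.Replace($Find, $Replacement) } |
        Set-Content -LiteralPath $temp

    # Keep the original as a backup, then move the temp file into its place.
    Move-Item -LiteralPath $Path -Destination "$Path.bak"
    Move-Item -LiteralPath $temp -Destination $Path
}
```

For the file in the question this could be called as: Update-FileWithBackup -Path C:\File1.xml -Find 'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"'. Keep in mind that a 50GB input temporarily needs roughly the same amount of free disk space again for the temp file.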

  1. Yes, as long as you don't try to load the whole file at once. Line-by-line will work but is going to be a bit slow. Use the -ReadCount parameter and set it to 1000 to improve performance.
  2. Which command line? PowerShell? If so, you can invoke your script like so: .\myscript.ps1, and if it takes parameters: c:\users\joe\myscript.ps1 c:\temp\file1.xml.
  3. In general, for regexes I would use single quotes if you don't need to reference PowerShell variables. Then you only need to worry about regex escaping and not PowerShell escaping as well. If you need to use double quotes, the back-tick character is the escape character inside them, e.g. "`$p1 is set to $p1". In your example, single quoting simplifies your regex to (note: forward slashes aren't metacharacters in regex):

    'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"'

  4. Absolutely, you want to stream this, since 50GB won't fit into memory. However, this poses an issue if you process line-by-line: what if the text you want to replace is split across multiple lines?

  5. If you don't have the split line issue then I think PowerShell can handle this.
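
If the search text could span line breaks, one possible workaround is to read fixed-size character chunks instead of lines and hold back the tail of each chunk in case a match straddles the boundary. The following is only a sketch under stated assumptions: the function name, parameter names and chunk size are invented, it needs PowerShell 5 or later, it expects full paths, and it relies on the default UTF-8 handling of StreamReader and StreamWriter.

```powershell
# Hypothetical helper; the name, parameters and default chunk size are illustrative only.
function Replace-AcrossChunks {
    param(
        [Parameter(Mandatory)] [string] $Source,
        [Parameter(Mandatory)] [string] $Target,
        [Parameter(Mandatory)] [string] $Find,
        [string] $Replacement = '',
        [int] $ChunkChars = 1MB
    )
    $ordinal = [System.StringComparison]::Ordinal
    $reader = $null
    $writer = $null
    try {
        $reader = [System.IO.StreamReader]::new($Source)
        $writer = [System.IO.StreamWriter]::new($Target)
        $buffer = [char[]]::new($ChunkChars)
        $carry  = ''
        while (($read = $reader.Read($buffer, 0, $buffer.Length)) -gt 0) {
            # Prepend whatever was held back from the previous chunk.
            $text = $carry + [string]::new($buffer, 0, $read)
            $pos  = 0
            # Replace every complete match in the current window, left to right.
            while (($idx = $text.IndexOf($Find, $pos, $ordinal)) -ge 0) {
                $writer.Write($text.Substring($pos, $idx - $pos))
                $writer.Write($Replacement)
                $pos = $idx + $Find.Length
            }
            # Hold back the last (Find.Length - 1) characters: they may be the
            # start of a match that only completes in the next chunk.
            $safeEnd = [Math]::Max($pos, $text.Length - ($Find.Length - 1))
            $writer.Write($text.Substring($pos, $safeEnd - $pos))
            $carry = $text.Substring($safeEnd)
        }
        # Whatever is still held back at end of file cannot contain a match.
        $writer.Write($carry)
    }
    finally {
        if ($writer) { $writer.Dispose() }
        if ($reader) { $reader.Dispose() }
    }
}
```

Because the file is copied character for character apart from the matches, line endings are left as they were. The search text itself would have to contain the exact line break sequence used in the file.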

Keith Hill answered Oct 06 '22 06:10


This is my take on it, building on some of the other answers here:

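Function ReplaceTextIn-File{
  Param(
    $infile,
    $outfile,
    $find,
    $replace
  )

  # Default to replacing the input file in place
  if( -Not $outfile)
  {
    $outfile = $infile
  }

  # Write to a temp file first, since the input can't be read and overwritten in the same pipeline
  $temp_out_file = "$outfile.temp"

  Get-Content $infile | Foreach-Object {$_.Replace($find, $replace)} | Set-Content $temp_out_file

  # Swap the finished temp file into place
  if( Test-Path $outfile)
  {
    Remove-Item $outfile
  }

  Move-Item $temp_out_file $outfile
}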

And called like so:

ReplaceTextIn-File -infile "c:\input.txt" -find 'http://example.com' -replace 'http://another.example.com' 

Willem van Ketwich answered Oct 06 '22 05:10