I wrote a simple script that is supposed to read every file in a directory, strip the HTML tags from each one, and write the resulting plain text into a single output file.
I have 8 GB of RAM and plenty of available virtual memory. When I run the script, more than 5 GB of RAM is free. The largest file in the directory is 3.8 GB.
The script is:
file_count = 1
File.open("allscraped.txt", 'w') do |out1|
  for file_name in Dir["allParts/*.dat"] do
    puts "#{file_name}#:#{file_count}"
    file_count += 1
    File.open(file_name, "r") do |file|
      source = ""
      tmp_src = ""
      counter = 0
      file.each_line do |line|
        scraped_content = line.gsub(/<.*?\/?>/, '')
        tmp_src << scraped_content
        if (counter % 10000) == 0
          tmp_src = tmp_src.gsub( /\s{2,}/, "\n" )
          source << tmp_src
          tmp_src = ""
          counter = 0
        end
        counter += 1
      end
      source << tmp_src.gsub( /\s{2,}/, "\n" )
      out1.write(source)
      break
    end
  end
end
The full error output is:
realscraper.rb:33:in `block (4 levels) in <main>': failed to allocate memory (NoMemoryError)
from realscraper.rb:27:in `each_line'
from realscraper.rb:27:in `block (3 levels) in <main>'
from realscraper.rb:23:in `open'
from realscraper.rb:23:in `block (2 levels) in <main>'
from realscraper.rb:13:in `each'
from realscraper.rb:13:in `block in <main>'
from realscraper.rb:12:in `open'
from realscraper.rb:12:in `<main>'
where line #27 is file.each_line do |line| and line #33 is source << tmp_src. The failing file is the largest one (3.8 GB). What is the problem here? Why am I getting this error even though I have enough memory, and how can I fix it?
The problem is on these two lines:
source << tmp_src
source << tmp_src.gsub( /\s{2,}/, "\n" )
As you read the file you keep appending to source, so for the 3.8 GB input you end up building a string of roughly that size in memory. On top of that, each gsub call allocates a brand-new copy of its input, and appending to an ever-growing string forces periodic reallocations, so peak memory use is considerably higher than the final string's size — easily enough to exhaust your 5 GB of free RAM.
The simplest fix is not to accumulate the results in the temporary source string at all, but to write each chunk directly to the output file. Replace those two lines with this instead:
# source << tmp_src
out1.write(tmp_src)
# source << tmp_src.gsub( /\s{2,}/, "\n" )
out1.write(tmp_src.gsub( /\s{2,}/, "\n" ))
This way you never build a multi-gigabyte string in memory, and the script should run faster as well. (You can also delete the now-unused source = "" line.)
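Putting it together, here is one way the whole script could be restructured as a streaming pipeline. This is a sketch, not the original author's code: the helper names (scrape_file, scrape_all) and the chunk-size constant are my own, but the two regexes and the flush-every-10,000-lines behavior match the question's script.

```ruby
CHUNK_LINES = 10_000 # flush the buffer to disk every N input lines

# Strip HTML tags (same regex as the original script) from one input
# file and append the cleaned text to out_io, chunk by chunk, so the
# whole file is never held in memory at once.
def scrape_file(in_path, out_io, chunk_lines: CHUNK_LINES)
  buffer  = +""   # +"" gives an unfrozen, appendable string
  counter = 0
  File.foreach(in_path) do |line|
    buffer << line.gsub(/<.*?\/?>/, '')
    counter += 1
    if counter == chunk_lines
      # Collapse whitespace runs and write this chunk out immediately.
      out_io.write(buffer.gsub(/\s{2,}/, "\n"))
      buffer  = +""
      counter = 0
    end
  end
  out_io.write(buffer.gsub(/\s{2,}/, "\n")) # flush the final partial chunk
end

# Process every file matching the glob into a single output file.
def scrape_all(glob, out_path)
  File.open(out_path, "w") do |out1|
    Dir[glob].each.with_index(1) do |file_name, file_count|
      puts "#{file_name}:#{file_count}"
      scrape_file(file_name, out1)
    end
  end
end

# scrape_all("allParts/*.dat", "allscraped.txt")
```

Because each chunk is written as soon as it is cleaned, peak memory use is bounded by the size of one 10,000-line chunk rather than by the size of the input file.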