I am collecting full HTML from a service that provides access to a very large collection of blogs and news websites. I am checking the HTML as it comes (in real-time) to see if it contains some keywords. If it contains one of the keywords, I am writing the HTML to a text file to store it.
I want to do this for a week. Therefore I am collecting a large amount of data. Testing the program for 3 minutes yielded a text file of 100MB. I have 4 TB of space, and I can't use more than this.
Also, I don't want the text files to become too large, because I assume they'll become un-openable.
What I am proposing is to open a text file, and write HTML to it, frequently checking its size. If it becomes bigger than, let's say 200MB, I close the text file and open another. I also need to keep a running log of how much space I've used in total, so that I can make sure that I don't get close to 4 TB.
The question I have at this point is how to check the size of the text file before the file has been closed (using FileWriter.close()). Is there a function for this or should I count the number of characters written to the file and use that to estimate the file size?
A separate question: are there ways of minimising the amount of space my text files take up? I am working in Java.
Create a writer which counts the number of characters written and use that to wrap your OutputStreamWriter.
[EDIT] Note: The correct way to save text to a file is:
new BufferedWriter( new OutputStreamWriter( new FileOutputStream( file ), encoding ) );
The encoding is important; it's usually "UTF-8".
This chain gives you two places where you can inject your wrapper: you can wrap the writer to get the number of characters written, or wrap the inner OutputStream to get the number of bytes written.
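As a rough sketch of the second option, the inner OutputStream could be wrapped in a byte-counting filter along these lines. The names CountingOutputStream, getCount and CountingDemo are made up for illustration, and a ByteArrayOutputStream stands in for the FileOutputStream so the example does not touch the disk. The count only covers bytes that have actually reached the wrapper, so the BufferedWriter needs to be flushed before the count is read.

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: counts every byte passed through to the wrapped stream. */
class CountingOutputStream extends FilterOutputStream {
    private long count = 0;

    CountingOutputStream(OutputStream out) {
        super(out);
    }

    @Override
    public void write(int b) throws IOException {
        out.write(b);
        count++;
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        // Forward the chunk in one call; the inherited version writes byte by byte.
        out.write(b, off, len);
        count += len;
    }

    /** Number of bytes written so far (after any upstream buffers were flushed). */
    long getCount() {
        return count;
    }
}

class CountingDemo {
    public static void main(String[] args) throws IOException {
        // In the real program the inner stream would be new FileOutputStream(file).
        CountingOutputStream counter = new CountingOutputStream(new ByteArrayOutputStream());
        Writer writer = new BufferedWriter(new OutputStreamWriter(counter, StandardCharsets.UTF_8));

        // 12 characters, but the accented letter takes two bytes in UTF-8.
        writer.write("<p>h\u00e9llo</p>");
        writer.flush(); // push buffered characters through the encoder to the counter

        System.out.println("bytes written: " + counter.getCount());
        writer.close();
    }
}
```

Comparing getCount() against a threshold such as 200MB after each write would tell you when to close the current file and open the next one, and adding each file's final count to a running total gives the overall disk usage. Counting bytes rather than characters avoids having to estimate, since a character can take more than one byte depending on the encoding.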