I have a very large file (several GB) in AWS S3, and I only need a small number of lines from the file which satisfy a certain condition. I don't want to load the entire file into memory and then search for and print those few lines - the memory load for this would be too high. The right way would be to load into memory only the lines that are needed.
As per the AWS documentation, to read from the file:
S3Object fullObject = s3Client.getObject(new GetObjectRequest(bucketName, key));
displayTextInputStream(fullObject.getObjectContent());

private static void displayTextInputStream(InputStream input) throws IOException {
    // Read the text input stream one line at a time and display each line.
    BufferedReader reader = new BufferedReader(new InputStreamReader(input));
    String line = null;
    while ((line = reader.readLine()) != null) {
        System.out.println(line);
    }
    System.out.println();
}
Here we are using a BufferedReader. It is not clear to me what is happening under the hood here.
Are we making a network call to S3 each time we read a new line, keeping only the current line in the buffer? Or is the entire file loaded into memory and then read line by line by the BufferedReader? Or is it somewhere in between?
Part of the answer to your question is already given in the documentation you linked:
Your network connection remains open until you read all of the data or close the input stream.
A BufferedReader doesn't know where the data it reads is coming from, because you're passing another Reader to it. A BufferedReader creates a buffer of a certain size (e.g. 4096 characters) and fills this buffer by reading from the underlying Reader before handing data out to calls of read() or read(char[] buf).

The Reader you pass to the BufferedReader is, by the way, using another buffer of its own to do the conversion from a byte-based stream to a char-based reader. It works the same way as BufferedReader does, so the internal buffer is filled by reading from the passed InputStream, which is the InputStream returned by your S3 client.
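As a side note, the buffer size of the BufferedReader can be chosen explicitly if the default does not suit you. A minimal sketch, assuming the same input variable as in your snippet (the 64 KiB size and the UTF-8 charset are arbitrary examples, not recommendations):

// Requires java.nio.charset.StandardCharsets in addition to the imports above.
// Only this buffer plus the decoder's internal byte buffer are held in memory.
BufferedReader reader = new BufferedReader(
        new InputStreamReader(input, StandardCharsets.UTF_8),
        64 * 1024);   // buffer size in chars; the default (8192) is usually fine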
What exactly happens within this client when you read from the stream is implementation-dependent. One approach is to keep a single network connection open and let you read from it as you go; another is to close the connection after a chunk of data has been read and open a new one when you request the next chunk.
The documentation quoted above seems to say that we have the former situation here, so: no, calls to readLine do not lead to individual network calls.
And to answer your other question: no, a BufferedReader, the InputStreamReader, and most likely the InputStream returned by the S3 client do not load the whole document into memory. That would contradict the whole purpose of using streams in the first place, and the S3 client could otherwise simply return a byte[][] instead (to get around the limit of roughly 2^31 bytes per byte array).
Edit: There is an exception to the last paragraph. If the whole multi-gigabyte document contains no line breaks, calling readLine will actually read all of the data into memory (and most likely lead to an OutOfMemoryError). I assumed a "regular" text document while answering your question.
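To tie this back to your original goal of keeping only the few matching lines in memory, a sketch along these lines should work; s3Client, bucketName and key are taken over from your question, and the contains("needle") check merely stands in for whatever condition you need:

// Same com.amazonaws and java.io imports as in your snippet, plus java.util.List/ArrayList.
S3Object object = s3Client.getObject(new GetObjectRequest(bucketName, key));
List<String> matches = new ArrayList<>();
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(object.getObjectContent()))) {
    String line;
    while ((line = reader.readLine()) != null) {
        if (line.contains("needle")) {   // placeholder for your condition
            matches.add(line);           // only the matching lines are kept in memory
        }
    }
}

The try-with-resources block also makes sure the underlying network connection is closed even if you stop reading early.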
If you are not searching for specific words but instead know the byte range you need, you can also use the Range header in S3. This is especially useful since you are working with a single file of several GB. Specifying a Range not only reduces memory usage, it is also faster, because only the specified part of the file is read.
See Is there "S3 range read function" that allows to read assigned byte range from AWS-S3 file?
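For completeness, a sketch of such a ranged read with the v1 Java SDK; the byte offsets below are made-up values, and the range is inclusive on both ends:

// Fetch only the requested slice of the object instead of the whole file.
GetObjectRequest rangedRequest = new GetObjectRequest(bucketName, key)
        .withRange(1_000_000, 1_999_999);   // hypothetical start and end offsets
S3Object part = s3Client.getObject(rangedRequest);
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(part.getObjectContent()))) {
    String line;
    while ((line = reader.readLine()) != null) {
        System.out.println(line);
    }
}

Keep in mind that a ranged read can start or end in the middle of a line, so the first and last lines of the returned chunk may be partial.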
Hope this helps.
Sreram