I didn't think there was a difference between an InputStream object read from a local file and one read from a network source (Amazon S3 in this case), so hopefully someone can enlighten me. These programs were run on a VM running CentOS 6.3. The test file in both cases is 10 MB.
Local file code:
InputStream is = new FileInputStream("/home/anyuser/test.jpg");
int read = 0;
int buf_size = 1024 * 1024 * 2;
byte[] buf = new byte[buf_size];
ByteArrayOutputStream baos = new ByteArrayOutputStream(buf_size);
long t3 = System.currentTimeMillis();
int i = 0;
while ((read = is.read(buf)) != -1) {
    baos.write(buf, 0, read);
    System.out.println("reading for the " + i + "th time");
    i++;
}
long t4 = System.currentTimeMillis();
System.out.println("Time to read = " + (t4 - t3) + "ms");
The output of this code is as follows: it reads 5 times, which makes sense since the read buffer is 2 MB and the file is 10 MB.
reading for the 0th time
reading for the 1th time
reading for the 2th time
reading for the 3th time
reading for the 4th time
Time to read = 103ms
Now we run the same code with the same 10 MB test file, except this time the source is Amazon S3. We don't start reading until we have finished getting the stream from S3. This time, however, the read loop runs through thousands of iterations, when I expected it to read only 5 times.
InputStream is;
long t1 = System.currentTimeMillis();
is = getS3().getFileFromBucket(S3Path, input);
long t2 = System.currentTimeMillis();
System.out.print("Time to get file " + input + " from S3: ");
System.out.println((t2 - t1) + "ms");
int read = 0;
int buf_size = 1024 * 1024 * 2;
byte[] buf = new byte[buf_size];
ByteArrayOutputStream baos = new ByteArrayOutputStream(buf_size);
long t3 = System.currentTimeMillis();
int i = 0;
while ((read = is.read(buf)) != -1) {
    baos.write(buf, 0, read);
    if ((i % 100) == 0)
        System.out.println("reading for the " + i + "th time");
    i++;
}
long t4 = System.currentTimeMillis();
System.out.println("Time to read = " + (t4 - t3) + "ms");
The output is as follows:
Time to get file test.jpg from S3: 2456ms
reading for the 0th time
reading for the 100th time
reading for the 200th time
reading for the 300th time
reading for the 400th time
reading for the 500th time
reading for the 600th time
reading for the 700th time
reading for the 800th time
reading for the 900th time
reading for the 1000th time
reading for the 1100th time
reading for the 1200th time
reading for the 1300th time
reading for the 1400th time
Time to read = 14471ms
The amount of time taken to read the stream changes from run to run: sometimes it takes 60 seconds, sometimes 15 seconds, but it never gets faster than about 15 seconds. The read loop still iterates 1400+ times on every test run of the program, even though I think it should only be 5 times, like in the local file example.
Is this how an InputStream works when the source comes over the network, even though we had already finished getting the file from the network source? Thanks in advance for your help.
I don't think it's specific to Java. When you read from the network, the actual read call to the operating system returns a packet's worth of data at a time, no matter how big a buffer you allocated. If you check the size of each read (your read variable), it should show the size of the network packets being used.
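For example, a quick way to confirm this is to log how many bytes each call actually returns. This is just an illustrative sketch that reuses the is, buf, and baos variables from the question; the per-read counts from the S3 stream will typically be far smaller than the 2 MB buffer:
int read;
int i = 0;
long total = 0;
while ((read = is.read(buf)) != -1) {
    baos.write(buf, 0, read);
    total += read;
    // 'read' is usually much smaller than buf.length when the source is a socket
    System.out.println("read #" + i + " returned " + read + " bytes (total " + total + ")");
    i++;
}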
This is one of the reasons why people read from the network on a separate thread, or avoid blocking altogether by using asynchronous I/O techniques.
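As a rough sketch of the separate-thread approach (not from the original post; getFileFromBucket is the method from the question, everything else is illustrative, and exception handling is omitted), the download can be handed off to a background thread so the caller only blocks when it actually needs the bytes:
ExecutorService pool = Executors.newSingleThreadExecutor();
// Run the slow network read in the background.
Future<byte[]> download = pool.submit(() -> {
    try (InputStream in = getS3().getFileFromBucket(S3Path, input)) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }
});
// ... do other work here ...
byte[] fileBytes = download.get(); // blocks only when the result is needed
pool.shutdown();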