I am using my own algorithm and loading data in JSON format from S3. Because of the huge data size, I need to set up Pipe mode. I followed the instructions at https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/pipe_bring_your_own/train.py and, as a result, I am able to set up the pipe and read data successfully. The only problem is that the FIFO pipe does not read the specified number of bytes. For example, given the path to the S3 FIFO channel:
number_of_bytes_to_read = 555444333

with open(fifo_path, "rb", buffering=0) as fifo:
    while True:
        data = fifo.read(number_of_bytes_to_read)
The length of data should be 555444333 bytes, but it is always much smaller, around 12,123,123 bytes or so. The data in S3 looks like this:
s3://s3-bucket/1122/part1.json
s3://s3-bucket/1122/part2.json
s3://s3-bucket/1133/part1.json
s3://s3-bucket/1133/part2.json
and so on. Is there any way to enforce the number of bytes to be read? Any suggestion would be helpful. Thanks.
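For context, with buffering=0 each read() performs at most one underlying system call, so it can legitimately return fewer bytes than requested; that is normal pipe behavior rather than data loss. A generic way to force an exact byte count is to accumulate chunks in a loop, roughly like the sketch below (read_exact is a hypothetical helper, not part of the SageMaker example):

def read_exact(fifo, n):
    """Accumulate chunks until exactly n bytes are read or the pipe is exhausted.
    Illustrative helper only."""
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = fifo.read(remaining)
        if not chunk:              # writer closed the pipe / end of stream
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)

with open(fifo_path, "rb", buffering=0) as fifo:
    while True:
        data = read_exact(fifo, number_of_bytes_to_read)
        if not data:
            break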
We just needed to pass a positive value for buffering and the problem was solved. The code below buffers 555444333 bytes and then processes 111222333 bytes on each read. Since our files are JSON, we can easily convert the incoming bytes to a string and then clean the string by removing incomplete JSON parts. The final code looks like this:
number_of_bytes_to_read = 111222333
number_of_bytes_to_buffer = 555444333

with open(fifo_path, "rb", buffering=number_of_bytes_to_buffer) as fifo:
    while True:
        data = fifo.read(number_of_bytes_to_read)
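The incomplete-JSON cleanup step is not shown above. A minimal sketch of how it might look, assuming each record is a complete JSON object on its own line (the process_records helper is illustrative, not from the original code):

import json

def process_records(raw_bytes, leftover=b""):
    """Decode a chunk read from the pipe, parse only complete newline-delimited
    JSON records, and carry the trailing partial record over to the next chunk.
    Illustrative only; assumes one JSON object per line."""
    buffer = leftover + raw_bytes
    lines = buffer.split(b"\n")
    leftover = lines.pop()           # last piece may be an incomplete record
    records = []
    for line in lines:
        line = line.strip()
        if line:
            records.append(json.loads(line.decode("utf-8")))
    return records, leftover

Each call returns the parsed records from the current chunk plus the trailing bytes to prepend to the next read, so no partial record is lost between reads.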