Does anyone have a quick way of getting the line count of a file hosted in S3? Preferably using the CLI (s3api), but I am open to python/boto as well. Note: the solution must run non-interactively, i.e. in an overnight batch.
Right now I am doing this; it works, but takes around 10 minutes for a 20 GB file:
aws s3 cp s3://foo/bar - | wc -l
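For reference, a rough python/boto3 equivalent of that pipeline (the bucket and key names are placeholders): it streams the object body and counts newlines chunk by chunk, so the file never has to fit in memory or touch disk.

```python
import io

try:
    import boto3  # AWS SDK; only needed for the actual S3 call
except ImportError:
    boto3 = None


def count_newlines(stream, chunk_size=1 << 20) -> int:
    """Count b"\\n" occurrences in any binary file-like object, one chunk at a time."""
    total = 0
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        total += chunk.count(b"\n")
    return total


def count_lines_s3(bucket: str, key: str) -> int:
    """Stream s3://bucket/key and count its lines (placeholder names)."""
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"]
    return count_newlines(body)
```

This still downloads every byte, so it will not be faster than the `wc -l` pipeline; it just avoids the subprocess.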
Here are two methods that might work for you:
Amazon S3 has a new feature called S3 Select that allows you to query files stored on S3.
You can perform a count of the number of records (lines) in a file, and it can even work on GZIP files. Results may vary depending on your file format.
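A minimal boto3 sketch of the S3 Select approach (bucket/key names are placeholders, and `InputSerialization` must be adjusted to match your actual file format):

```python
try:
    import boto3  # AWS SDK; only needed for the actual S3 call
except ImportError:
    boto3 = None


def parse_count(event_stream) -> int:
    """Pull the COUNT(*) result out of an S3 Select event stream."""
    payload = b""
    for event in event_stream:
        if "Records" in event:
            payload += event["Records"]["Payload"]
    return int(payload.decode().strip())


def count_lines_select(bucket: str, key: str, compression: str = "NONE") -> int:
    """Count records via S3 Select; pass compression="GZIP" for .gz objects."""
    s3 = boto3.client("s3")
    resp = s3.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        Expression="SELECT COUNT(*) FROM s3object",
        InputSerialization={"CSV": {}, "CompressionType": compression},
        OutputSerialization={"CSV": {}},
    )
    return parse_count(resp["Payload"])
```

Because S3 Select does the counting server-side, only the tiny result crosses the network, which is the main win over streaming the whole 20 GB file.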
Amazon Athena is a similar option that might also be suitable. It can run SQL queries against files stored in Amazon S3.
Yes, Amazon S3 has the S3 Select feature, but keep an eye on cost when executing any query from the SELECT tab. For example, here is the pricing as of June 2018 (this may vary): S3 Select pricing is based on the size of the input, the output, and the data transferred. Each query costs 0.002 USD per GB scanned, plus 0.0007 USD per GB returned.
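Plugging the asker's 20 GB file into those June 2018 numbers gives a back-of-the-envelope estimate (a `COUNT(*)` query returns only a few bytes, so the return cost is effectively zero):

```python
# June 2018 S3 Select prices quoted above -- check current pricing before relying on this.
PRICE_PER_GB_SCANNED = 0.002    # USD
PRICE_PER_GB_RETURNED = 0.0007  # USD

scanned_gb = 20.0  # the 20 GB file from the question
returned_gb = 0.0  # COUNT(*) returns only a handful of bytes

cost = scanned_gb * PRICE_PER_GB_SCANNED + returned_gb * PRICE_PER_GB_RETURNED
print(f"{cost:.4f} USD")  # 0.0400 USD
```

So a single count over the whole file would run roughly four cents at those rates.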