Is it possible to loop through the file/key in Amazon S3 bucket, read the contents and count the number of lines using Python?
For Example:
1. My bucket: "my-bucket-name"
2. File/Key : "test.txt"
I need to loop through the file "test.txt" and count the number of lines in the raw file.
Sample Code:
import boto

conn = boto.connect_s3()
for bucket in conn.get_all_buckets():
    if bucket.name == "my-bucket-name":
        for file in bucket.list():
            # need to count the number of lines in each file and print to a log
Amazon S3 is only a storage service; you must fetch the file in order to perform actions on it (such as counting its lines). You can loop through a bucket using the boto3 list_objects_v2 API.
Invoke the list_objects_v2() method with the bucket name to list all the objects in the S3 bucket. It returns a dictionary with the object details; iterate over its Contents list and read each entry's Key to get the object names.
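As a minimal sketch combining list_objects_v2 with get_object to count lines per object (using the bucket name from the question; a paginator handles listings longer than 1,000 keys):

import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

# list_objects_v2 returns at most 1000 keys per call;
# the paginator follows the continuation tokens for us
for page in paginator.paginate(Bucket='my-bucket-name'):
    for entry in page.get('Contents', []):
        body = s3.get_object(Bucket='my-bucket-name', Key=entry['Key'])['Body']
        line_count = sum(1 for _ in body.iter_lines())
        print(entry['Key'], line_count)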
When you enable an S3 Bucket Key for your bucket, new objects that you upload to the bucket use an S3 Bucket Key for server-side encryption using AWS KMS. If you upload, modify, or copy an object in a bucket that has an S3 Bucket Key enabled, the S3 Bucket Key settings for that object might be updated to align with bucket configuration.
Using boto3, you can do the following:
import boto3
# create the s3 resource
s3 = boto3.resource('s3')
# get the file object
obj = s3.Object('bucket_name', 'key')
# read the file contents in memory
file_contents = obj.get()["Body"].read()
# count the occurrences of the newline byte to get the number of lines
# (read() returns bytes in Python 3, hence b'\n')
print(file_contents.count(b'\n'))
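One caveat: counting newline characters undercounts by one if the file's last line has no trailing newline; the iter_lines() approach shown further down counts that final line as well.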
If you want to do this for all objects in a bucket, you can use the following code snippet:
bucket = s3.Bucket('bucket_name')
for obj in bucket.objects.all():
    file_contents = obj.get()["Body"].read()
    print(file_contents.count(b'\n'))
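Note that bucket.objects.all() pages through the listing for you, so this also works for buckets holding more than 1,000 objects.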
Here is the boto3 documentation reference for more functionality: http://boto3.readthedocs.io/en/latest/reference/services/s3.html#object
Update: (Using boto 2)
import boto
s3 = boto.connect_s3() # establish connection
bucket = s3.get_bucket('bucket_name') # get bucket
for key in bucket.list(prefix='key'): # list objects at a given prefix
    file_contents = key.get_contents_as_string() # get file contents
    print(file_contents.count('\n')) # count newline occurrences to get the number of lines
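Note that boto 2 is no longer maintained; for new code, prefer the boto3 snippets above.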
Reading large files into memory is sometimes far from ideal. Instead, you may find the following streaming approach more useful:
import boto3

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucketname', Key=fileKey)
# stream the body line by line instead of loading it all into memory
nlines = 0
for _ in obj['Body'].iter_lines():
    nlines += 1
print(nlines)
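If all you need is the newline count, a rough sketch like the following tallies b'\n' in fixed-size chunks and skips the per-line splitting work; iter_chunks is part of botocore's StreamingBody, and bucketname/fileKey are the same placeholders as above:

import boto3

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucketname', Key=fileKey)
# count newline bytes chunk by chunk; memory use stays bounded
nlines = 0
for chunk in obj['Body'].iter_chunks(chunk_size=1024 * 1024):
    nlines += chunk.count(b'\n')
print(nlines)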