
Split S3 file into smaller files of 1000 lines

I have a text file on S3 with around 300 million lines. I'm looking to split this file into smaller files of 1,000 lines each (with the last file containing the remainder), which I'd then like to put into another folder or bucket on S3.

So far, I've been running this on my local drive using the Linux command:

split -l 1000 file

which splits the original file into smaller files of 1,000 lines each. However, with a file this large, it seems inefficient to download it and then re-upload the pieces from my local drive back to S3.

What would be the most efficient way to split this S3 file, ideally using Python (in a Lambda function) or using other S3 commands? Is it faster to just run this on my local drive?

asked May 14 '19 by octothorpe_not_hashtag


2 Answers

Anything that you do will have to download the file, split it, and re-upload it. The only question is where, and whether local disk is involved.

John Rotenstein gave you an example using local disk on an EC2 instance. This has the benefit of running in the AWS datacenters, so it gets a high-speed connection, but has the limitations that (1) you need disk space to store the original file and its pieces, and (2) you need an EC2 instance where you can do this.

One small optimization is to avoid the local copy of the big file by using a hyphen as the destination of the s3 cp: this sends the output to standard output, which you can then pipe into split (here I'm also using a hyphen to tell split to read from standard input):

aws s3 cp s3://my-bucket/big-file.txt - | split -l 1000 - output.
aws s3 cp . s3://dest-bucket/ --recursive --exclude "*" --include "output.*"

Again, this requires an EC2 instance to run it on, and the storage space for the output files. There is, however, a flag to split that will let you run a shell command for each file in the split:

aws s3 cp s3://src-bucket/src-file - | split -l 1000 --filter 'aws s3 cp - s3://dst-bucket/result.$FILE' -

So now you've eliminated the issue of local storage, but are left with the issue of where to run it. My recommendation would be AWS Batch, which can spin up an EC2 instance for just the time needed to perform the command.

You can, of course, write a Python script to do this on Lambda, and that would have the benefit of being triggered automatically when the source file has been uploaded to S3. I'm not that familiar with the Python SDK (boto), but it appears that get_object will return the original file's body as a stream of bytes, which you can then iterate over as lines, accumulating however many lines you want into each output file.
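
As a rough illustration of that approach, here is a minimal sketch using boto3. The bucket names, key names, and helper function are hypothetical (not from the original answers); it streams the source object line by line and writes every 1,000 lines to a new object in the destination bucket:

import boto3

s3 = boto3.client("s3")

def split_object(src_bucket, src_key, dst_bucket, dst_prefix, lines_per_file=1000):
    """Stream an S3 object and re-upload it as chunks of lines_per_file lines."""
    body = s3.get_object(Bucket=src_bucket, Key=src_key)["Body"]

    chunk = []   # lines accumulated for the current output file
    part = 0     # running index used to name the output objects

    # iter_lines() streams the body without loading the whole object into memory
    for line in body.iter_lines():
        chunk.append(line)
        if len(chunk) == lines_per_file:
            _upload_chunk(dst_bucket, dst_prefix, part, chunk)
            part += 1
            chunk = []

    if chunk:  # the remainder: the last, possibly shorter, file
        _upload_chunk(dst_bucket, dst_prefix, part, chunk)

def _upload_chunk(bucket, prefix, part, lines):
    # hypothetical naming scheme: part-000000.txt, part-000001.txt, ...
    key = f"{prefix}/part-{part:06d}.txt"
    s3.put_object(Bucket=bucket, Key=key, Body=b"\n".join(lines) + b"\n")

# Example invocation (hypothetical names):
# split_object("src-bucket", "big-file.txt", "dst-bucket", "split-output")

In a Lambda function, split_object would be called from the handler that receives the S3 upload event; treat this as a sketch of the streaming idea described above rather than a drop-in solution.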

answered Nov 23 '22 by guest

Your method seems sound (download, split, upload).

You should run the commands from an Amazon EC2 instance in the same region as the Amazon S3 bucket.

Use the AWS Command-Line Interface (CLI) to download/upload the files:

aws s3 cp s3://my-bucket/big-file.txt .

aws s3 cp --recursive folder-with-files s3://my-bucket/destination-folder/

answered Nov 23 '22 by John Rotenstein