
Read and parse CSV file in S3 without downloading the entire file

I am using Node.js and intend to run this module as an AWS Lambda function.

Using s3.getObject() from aws-sdk, I am able to successfully pick up a very large CSV file from Amazon S3. The intention is to read each line in the file and emit an event with the body of each line.

In all examples I could find, it looks like the entire CSV file in S3 has to be buffered or streamed, converted to a string and then read line by line.

s3.getObject(params, function(err, data) {
   if (err) return console.error(err);
   // Converts the entire object body to one string in memory
   var body = data.Body.toString('utf-8');
});

This operation takes a very long time, given the size of the source CSV file. Also, the CSV rows are of varying length, and I'm not certain if I can use the buffer size as an option.

Question

Is there a way to pick up the S3 file in node.js and read/transform it line by line, which avoids stringifying the entire file in memory first?

Ideally, I'd prefer to use the better capabilities of fast-csv and/or node-csv, instead of looping manually.

changingrainbows asked Oct 04 '16 20:10



2 Answers

You should just be able to use the createReadStream method and pipe it into fast-csv:

const s3Stream = s3.getObject(params).createReadStream()
require('fast-csv').fromStream(s3Stream)
  .on('data', (data) => {
    // do something here
  })
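To illustrate why streaming avoids stringifying the whole file, here is a minimal sketch that uses only Node core modules. The fakeS3Stream function is a hypothetical stand-in for s3.getObject(params).createReadStream(), and readLines is a helper name made up for this example. It uses readline rather than fast-csv, so it yields raw lines, not parsed CSV rows (quoted fields containing newlines would need a real CSV parser).

```javascript
const { Readable } = require('stream');
const readline = require('readline');

// Hypothetical stand-in for s3.getObject(params).createReadStream().
// The chunks split a line in the middle on purpose, to mimic how data
// arrives from the network in arbitrary pieces.
function fakeS3Stream() {
  return Readable.from([
    Buffer.from('id,name\n1,al'),
    Buffer.from('ice\n2,bob\n'),
  ]);
}

// Reads any readable stream line by line and calls onLine for each one.
// The body is never concatenated into a single string.
async function readLines(stream, onLine) {
  const rl = readline.createInterface({ input: stream, crlfDelay: Infinity });
  let count = 0;
  for await (const line of rl) {
    onLine(line);
    count += 1;
  }
  return count; // number of lines seen
}
```

Calling readLines(fakeS3Stream(), console.log) would print each line as soon as enough chunks have arrived to complete it, regardless of how the rows vary in length.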
idbehold answered Sep 29 '22 17:09


I do not have enough reputation to comment, but the fromStream method used in the accepted answer no longer exists in fast-csv. Use the parseStream method instead:

const s3Stream = s3.getObject(params).createReadStream()
require('fast-csv').parseStream(s3Stream)
  .on('data', (data) => {
    // use rows
  })
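Since the question mentions running this as a Lambda function, one thing that may help is wrapping the row stream in a Promise so that an async handler does not return before parsing has finished. The sketch below is an assumption about how that could look, not part of either answer: consumeRows and emitRow are hypothetical names, and the test stream merely stands in for whatever parseStream returns.

```javascript
const { Readable } = require('stream');

// Resolves with the row count once the stream ends; rejects on error.
// rowStream is assumed to be an object-mode stream, such as the one
// returned by require('fast-csv').parseStream(s3Stream).
function consumeRows(rowStream, onRow) {
  return new Promise((resolve, reject) => {
    let count = 0;
    rowStream
      .on('error', reject)
      .on('data', (row) => {
        onRow(row);
        count += 1;
      })
      .on('end', () => resolve(count));
  });
}

// Hypothetical Lambda-style usage (emitRow is a placeholder callback):
// exports.handler = async () => {
//   const s3Stream = s3.getObject(params).createReadStream();
//   return consumeRows(require('fast-csv').parseStream(s3Stream), emitRow);
// };
```

Note that pipe does not forward errors from the source stream, so the S3 read stream may need its own 'error' handler in addition to the one attached here.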
Kai Durai answered Sep 29 '22 19:09