DataPipeline: Use only first 4 values from CSV in pipeline

I have a CSV with a variable structure, stored in S3, that contains between 7 and 8 fields per row, and I only want to load the first 4 values. I have attempted to use the following prepared statement:

INSERT INTO locations (timestamp, item_id, latitude, longitude) VALUES (?, ?, ?, ?);

However, I am getting the following error:

Parameter index out of range (5 > number of parameters, which is 4).

I believe this means it is attempting to load the other variables from the CSV as well. Is it possible to take just the first 4 values, or otherwise deal with a variable-length CSV?
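One way to deal with a variable-length CSV, if you are able to pre-process the file before the pipeline loads it, is to trim every row down to its first 4 fields. A minimal sketch (the sample row and its column order are assumptions based on the INSERT statement in the question, not taken from the actual data):

```python
import csv
import io


def trim_rows(text, n=4):
    """Parse CSV text and keep only the first n fields of each row."""
    reader = csv.reader(io.StringIO(text))
    return [row[:n] for row in reader]


# A hypothetical 7-field row is cut down to the first 4 fields.
sample = "2025-12-06T05:12:00,42,51.5,-0.12,extra1,extra2,extra3\n"
print(trim_rows(sample))
# → [['2025-12-06T05:12:00', '42', '51.5', '-0.12']]
```

Rows shorter than 4 fields are left as-is by the slice, so ragged input does not raise an error here.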

dojogeorge asked Dec 06 '25 05:12
1 Answer

Use the transformSql option. You didn't mention what you are loading into, but from the Redshift docs:

The SQL SELECT expression used to transform the input data. When you copy data from DynamoDB or Amazon S3, AWS Data Pipeline creates a table called staging and initially loads it in there. Data from this table is used to update the target table. If the transformSql option is specified, a second staging table is created from the specified SQL statement. The data from this second staging table is then updated in the final target table. transformSql must be run on the table named staging and the output schema of transformSql must match the final target table's schema.
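In practice, that means setting transformSql on the copy activity to a SELECT that pulls only the first 4 columns out of the staging table. A sketch of what the pipeline object might look like (the object ids, the node references, and the staging column names are assumptions; the staging table's column names depend on how Data Pipeline maps the CSV input):

```json
{
  "id": "RedshiftLoadActivity",
  "type": "RedshiftCopyActivity",
  "input": { "ref": "S3InputDataNode" },
  "output": { "ref": "LocationsTableDataNode" },
  "insertMode": "KEEP_EXISTING",
  "transformSql": "SELECT timestamp, item_id, latitude, longitude FROM staging;"
}
```

The output schema of the SELECT (4 columns) then matches the target locations table, so the extra 3 or 4 trailing CSV fields never reach it.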

Stormbringer answered Dec 07 '25 18:12
