I am trying to query data from json data in s3-select.
{
person = [
{
"Id": 1,
"Name": "Anshu",
"Address": "Templestowe",
"Car": "Jeep"
}
{
"Id": 2,
"Name": "Ben Mostafa",
"Address": "Las Vegas",
"Car": "Mustang"
}
{
"Id": 3,
"Name": "Rohan Wood",
"Address": "Wooddon",
"Car": "VW"
}
]
}
QUERY = "select * from S3Object s"
QUERY = "select s.person from S3Object s"
QUERY = "select s.person[0] from S3Object s"
QUERY = "select s.person[0].Name from S3Object s"
All these queries work fine and returns the respective object as desired but when i am trying to search data on name/Car, it doesn't work.
QUERY = "select * from S3Object s where s.person.Name = \"Anshu\" "
error: com.amazonaws.services.s3.model.AmazonS3Exception: The column index at line 1, column 32 is invalid.
There is not much related content available on s3-select online. Wondering whether we can query on field name or not! There are no examples of select query with where clause for s3-select given in the documentation
Amazon S3 Select and Amazon S3 Glacier Select enable customers to run structured query language SQL queries directly on data stored in S3 and Amazon S3 Glacier. With S3 Select, you simply store your data on S3 and query using SQL statements to filter the contents of S3 objects, retrieving only the data that you need.
In the Amazon S3 console, choose your S3 bucket, choose the file that you want to open or download, choose Actions, and then choose Open or Download. If you are downloading an object, specify where you want to save it. The procedure for saving the object depends on the browser and operating system that you are using.
Amazon S3 Select does not support the DISTINCT operator. S3 Select is not a full query engine. Instead, it is a quick way to filter the contents of a file.
Athena supports more file formats and more forms of file compression than S3-Select. For instance, S3-select supports only CSV, JSON, and Parquet, while Athena additionally allows TSV, ORC files, and more. S3-select works only with the S3 API (ex.
I can't find this in any of AWS documentations, but I was just playing around and discovered a working syntax:
QUERY = "select * from S3Object s where 'Anshu' in s.person[*].Name"
Based on some deductions:
Proof with Python and Boto3:
import boto3
S3_BUCKET = 'your-bucket-name'
s3 = boto3.client('s3')
r = s3.select_object_content(
Bucket=S3_BUCKET,
Key='your-file-name.json',
ExpressionType='SQL',
Expression="select * from s3object s where 'Anshu' in s.person[*].Name",
InputSerialization={'JSON': {"Type": "Lines"}},
OutputSerialization={'JSON': {}}
)
for event in r['Payload']:
if 'Records' in event:
records = event['Records']['Payload'].decode('utf-8')
print(records)
Weird, I know. Remember to set [default] credentials in ~/.aws/credentials file.
After reading the AWS doc, I find following SQL works fine.
select * from S3Object[*].person[*] as p where p.Name='Anshu'
This SQL will give you all the person whose name is 'Anshu', like:
{
"Id": 1,
"Name": "Anshu",
"Address": "Templestowe",
"Car": "Jeep"
}
When you see [*]
, it means a json array.
Amazon S3 Select always treats a JSON document as an array of root-level values, so we use S3Object[*]
in the SQL. And person
value is a array, so we use person[*]
in the SQL.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With