Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

s3-select querying data on field name

I am trying to query data from json data in s3-select.

   {
    person = [
    {
            "Id": 1,
            "Name": "Anshu",
            "Address": "Templestowe",
            "Car": "Jeep"
    }
    {
            "Id": 2,
            "Name": "Ben Mostafa",
            "Address": "Las Vegas",
            "Car": "Mustang"
    }
    {
                    "Id": 3,
                    "Name": "Rohan Wood",
                    "Address": "Wooddon",
                    "Car": "VW"
    }
]
}

QUERY = "select * from S3Object s"
QUERY = "select s.person from S3Object s"
QUERY = "select s.person[0] from S3Object s"
QUERY = "select s.person[0].Name from S3Object s"

All these queries work fine and returns the respective object as desired but when i am trying to search data on name/Car, it doesn't work.

QUERY = "select * from S3Object s where s.person.Name = \"Anshu\" "

error: com.amazonaws.services.s3.model.AmazonS3Exception: The column index at line 1, column 32 is invalid.

There is not much related content available on s3-select online. Wondering whether we can query on field name or not! There are no examples of select query with where clause for s3-select given in the documentation

like image 844
yogas Avatar asked May 14 '18 06:05

yogas


People also ask

Can you query data from S3 bucket?

Amazon S3 Select and Amazon S3 Glacier Select enable customers to run structured query language SQL queries directly on data stored in S3 and Amazon S3 Glacier. With S3 Select, you simply store your data on S3 and query using SQL statements to filter the contents of S3 objects, retrieving only the data that you need.

How do I pull data from AWS S3?

In the Amazon S3 console, choose your S3 bucket, choose the file that you want to open or download, choose Actions, and then choose Open or Download. If you are downloading an object, specify where you want to save it. The procedure for saving the object depends on the browser and operating system that you are using.

Does S3 SELECT support distinct?

Amazon S3 Select does not support the DISTINCT operator. S3 Select is not a full query engine. Instead, it is a quick way to filter the contents of a file.

Does S3 SELECT use Athena?

Athena supports more file formats and more forms of file compression than S3-Select. For instance, S3-select supports only CSV, JSON, and Parquet, while Athena additionally allows TSV, ORC files, and more. S3-select works only with the S3 API (ex.


2 Answers

I can't find this in any of AWS documentations, but I was just playing around and discovered a working syntax:

QUERY = "select * from S3Object s where 'Anshu' in s.person[*].Name"

Based on some deductions:

  1. I know that syntax like WHERE ('blah' in s.tags) work when tags property is an array of string.
  2. AWS documentation also say that s.person[#] should work when # is a valid index/digit. Based on this, I discovered that using star (*) between square bracket, as in s.person[*].Name, also work. This is after failed testing of various syntax like s.Person[], s.Person[#], s.Person[?], etc...

Proof with Python and Boto3:

import boto3

S3_BUCKET = 'your-bucket-name'

s3 = boto3.client('s3')

r = s3.select_object_content(
        Bucket=S3_BUCKET,
        Key='your-file-name.json',
        ExpressionType='SQL',
        Expression="select * from s3object s where 'Anshu' in s.person[*].Name",
        InputSerialization={'JSON': {"Type": "Lines"}},
        OutputSerialization={'JSON': {}}
)

for event in r['Payload']:
    if 'Records' in event:
        records = event['Records']['Payload'].decode('utf-8')
        print(records)

Weird, I know. Remember to set [default] credentials in ~/.aws/credentials file.

like image 181
Noogen Avatar answered Sep 30 '22 13:09

Noogen


After reading the AWS doc, I find following SQL works fine.

select * from S3Object[*].person[*] as p where p.Name='Anshu'

This SQL will give you all the person whose name is 'Anshu', like:

{
    "Id": 1,
    "Name": "Anshu",
    "Address": "Templestowe",
    "Car": "Jeep"
}

When you see [*], it means a json array.

Amazon S3 Select always treats a JSON document as an array of root-level values, so we use S3Object[*] in the SQL. And person value is a array, so we use person[*] in the SQL.

like image 45
loic Avatar answered Sep 30 '22 13:09

loic