Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Amazon Athena not parsing cloudfront logs

I'm following the Athena getting started guide and trying to parse my own Cloudfront logs. However, the fields are not being parsed.

I used a small test file, as follows:

#Version: 1.0
#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query cs(Cookie) x-edge-result-type x-edge-request-id x-host-header cs-protocol cs-bytes time-taken x-forwarded-for ssl-protocol ssl-cipher x-edge-response-result-type
2016-02-02  07:57:45    LHR5    5001    86.177.253.38   GET d3g47gpj5mj0b.cloudfront.net    /foo    404 -   Mozilla/5.0%2520(Macintosh;%2520Intel%2520Mac%2520OS%2520X%252010_10_5)%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko)%2520Chrome/47.0.2526.111%2520Safari/537.36   -   -   Error   -tHYQ3YpojqpR8yFHCUg5YW4OC_yw7X0VWvqwsegPwDqDFkIqhZ_gA==    d3g47gpj5mj0b.cloudfront.net    https421    0.076   -   TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 Error
2016-02-02  07:57:45    LHR5    1158241 86.177.253.38   GET d3g47gpj5mj0b.cloudfront.net    /images/posts/cover/404.jpg 200 https://d3g47gpj5mj0b.cloudfront.net/foo    Mozilla/5.0%2520(Macintosh;%2520Intel%2520Mac%2520OS%2520X%252010_10_5)%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko)%2520Chrome/47.0.2526.111%2520Safari/537.36   -   -   Miss    oUdDIjmA1ON1GjWmFEKlrbNzZx60w6EHxzmaUdWEwGMbq8V536O4WA==    d3g47gpj5mj0b.cloudfront.net    https   419 0.440   -   TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 Miss

And created the table with this SQL:

CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs (
  `Date` DATE,
  Time STRING,
  Location STRING,
  Bytes INT,
  RequestIP STRING,
  Method STRING,
  Host STRING,
  Uri STRING,
  Status INT,
  Referrer STRING,
  os STRING,
  Browser STRING,
  BrowserVersion STRING
  ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
  WITH SERDEPROPERTIES (
  "input.regex" = "^(?!#)([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+[^\(]+[\(]([^\;]+).*\%20([^\/]+)[\/](.*)$"
  ) LOCATION 's3://test/athena-csv/'

But no data comes back:

athena screen shot with no data

I can see it returns 4 rows, but the first 2 should be excluded because they start with a #, so it's like the regex isn't being parsed correctly.

Am I doing something wrong? Or is the regex wrong (seems unlikely, as it's in the docs, and looks fine to me)?

like image 788
andrewrjones Avatar asked Mar 04 '17 21:03

andrewrjones


1 Answers

This is what I ended up with:

CREATE EXTERNAL TABLE logs (
  `date` date,
  `time` string,
  `location` string,
  `bytes` int,
  `request_ip` string,
  `method` string,
  `host` string,
  `uri` string,
  `status` int,
  `referer` string,
  `useragent` string,
  `uri_query` string,
  `cookie` string,
  `edge_type` string,
  `edget_requiest_id` string,
  `host_header` string,
  `cs_protocol` string,
  `cs_bytes` int,
  `time_taken` string,
  `x_forwarded_for` string,
  `ssl_protocol` string,
  `ssl_cipher` string,
  `result_type` string,
  `protocol` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex' = '^(?!#.*)(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s*(\\S*)'
) LOCATION 's3://logs'

Note the double backslashes are intentional.

The format of the cloudfront logs changed at some point to add the protocol. This handles older and newer files.

like image 116
andrewrjones Avatar answered Oct 15 '22 12:10

andrewrjones