
How to load compressed files into BigQuery

I want to upload the wikipedia pageviews from http://dumps.wikimedia.org/other/pagecounts-raw/ into BigQuery, what's the fastest way?

Asked Jun 10 '14 by Felipe Hoffa


1 Answer

This is a classic demo I do to show how easy it is to load data into BigQuery.

To get an hour of pageviews from Wikipedia, just wget the file:

wget http://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-06/pagecounts-20140602-180000.gz
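If you're curious what the raw file looks like (and why the load flags below are set the way they are), here is a minimal Python sketch that prints the first few rows of the file downloaded above:

import gzip

# Each row has four space-separated fields and no quoting:
# language, title, requests, content_size
with gzip.open("pagecounts-20140602-180000.gz", "rt", encoding="utf-8", errors="replace") as f:
    for _ in range(5):
        print(f.readline().rstrip())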

Then load it into BigQuery (it should take around 5 minutes or less):

bq load -F" " --quote "" fh-bigquery:wikipedia.pagecounts_20140602_18 pagecounts-20140602-180000.gz language,title,requests:integer,content_size:integer

Note that this file weighs around 100 MB compressed (gz), and you don't need to decompress files this size to load them into BigQuery. It contains around 8 million rows of hourly pageview counts.

  • -F" ": This file is separated by spaces, not commas.
  • --quote "": This file doesn't use quotes for strings.
  • fh-bigquery:wikipedia.pagecounts_20140602_18: The destination table, in project:dataset.table form.
  • pagecounts-20140602-180000.gz: The name of the gz file downloaded from Wikimedia.
  • language,title,requests:integer,content_size:integer: The columns. By default every column is loaded as a string, except those explicitly declared as another type (integers in this case).

(Requires the bq command-line tool to be installed. A rough equivalent using the BigQuery Python client is sketched below.)
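If you'd rather stay in code than use the bq CLI, roughly the same load can be expressed with the google-cloud-bigquery Python client. This is only a sketch: the project id is a placeholder, and it assumes the client accepts the gzipped file the same way the bq tool does.

from google.cloud import bigquery

client = bigquery.Client(project="your-project")  # placeholder project id

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    field_delimiter=" ",   # same as -F" "
    quote_character="",    # same as --quote ""
    schema=[
        bigquery.SchemaField("language", "STRING"),
        bigquery.SchemaField("title", "STRING"),
        bigquery.SchemaField("requests", "INTEGER"),
        bigquery.SchemaField("content_size", "INTEGER"),
    ],
)

# Upload the gzipped file as-is; no need to decompress it first.
with open("pagecounts-20140602-180000.gz", "rb") as f:
    job = client.load_table_from_file(
        f, "your-project.wikipedia.pagecounts_20140602_18", job_config=job_config
    )
job.result()  # wait for the load job to finish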

I made this dataset public; you can find this table at https://bigquery.cloud.google.com/table/fh-bigquery:wikipedia.pagecounts_20140602_18.

I also left all the August 2013 pageviews at https://bigquery.cloud.google.com/table/fh-bigquery:wikipedia.wikipedia_views_201308 (53 billion pageviews if you SELECT SUM(requests) FROM [fh-bigquery:wikipedia.wikipedia_views_201308]).
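To run that sum yourself, here is a minimal sketch with the same Python client, again with a placeholder project id; it uses legacy SQL to match the [project:dataset.table] syntax above:

from google.cloud import bigquery

client = bigquery.Client(project="your-project")  # placeholder project id

# Legacy SQL, to match the bracketed table reference used above.
job_config = bigquery.QueryJobConfig(use_legacy_sql=True)
query = "SELECT SUM(requests) FROM [fh-bigquery:wikipedia.wikipedia_views_201308]"

for row in client.query(query, job_config=job_config).result():
    print(row[0])  # total pageviews in the table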

Answered Oct 02 '22 by Felipe Hoffa