I want to upload the Wikipedia pageviews from http://dumps.wikimedia.org/other/pagecounts-raw/ into BigQuery. What's the fastest way?
This is a classic demo I do to show how easy it is to load data into BigQuery.
To get an hour of pageviews from Wikipedia, just wget the file:
wget http://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-06/pagecounts-20140602-180000.gz
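If you want a whole day instead of a single hour, a loop like this works as a sketch, assuming every hourly file follows the same pagecounts-YYYYMMDD-HH0000.gz naming pattern (the timestamps in the real directory listing occasionally differ by a few seconds, so check it first):

# download all 24 hourly files for 2014-06-02 (filenames assumed; verify against the directory listing)
for h in $(seq -w 0 23); do
  wget http://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-06/pagecounts-20140602-${h}0000.gz
done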
Then load it into BigQuery (it should take around 5 minutes or less):
bq load -F" " --quote "" fh-bigquery:wikipedia.pagecounts_20140602_18 pagecounts-20140602-180000.gz language,title,requests:integer,content_size:integer
Note that this file weighs around 100 MB compressed (gz), and you don't need to decompress files this size to load them into BigQuery. It contains around 8 million rows of hourly pageview counts.
-F" "
: This file is separated by spaces, not comma.--quote ""
: This file doesn't use quotes for strings.language,title,requests:integer,content_size:integer
: The columns. By default everything is loaded as a string, except when we explicitly cast it to a different type (integers in this case).(Requires previous installation of the bq tool)
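If you downloaded several hourly files, the same bq load line works in a loop. This is only a sketch: it assumes you point it at a project and dataset you own (you can't write into fh-bigquery), with one table per hour named like the example above:

# load each hourly file into its own table (your-project:your_dataset is a placeholder)
for f in pagecounts-20140602-*.gz; do
  t=$(echo $f | sed 's/pagecounts-\([0-9]*\)-\([0-9][0-9]\).*/\1_\2/')
  bq load -F" " --quote "" your-project:your_dataset.pagecounts_$t $f language,title,requests:integer,content_size:integer
done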
I made this dataset public; you can find this table at https://bigquery.cloud.google.com/table/fh-bigquery:wikipedia.pagecounts_20140602_18.
I also left all the August 2013 pageviews at https://bigquery.cloud.google.com/table/fh-bigquery:wikipedia.wikipedia_views_201308 (53 billion pageviews if you SELECT SUM(requests) FROM [fh-bigquery:wikipedia.wikipedia_views_201308]).
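From the command line that sum is just a bq query away. The second query below is only a sketch, assuming the monthly table has the same language/title/requests columns as the hourly one and that 'en' is the language code you want:

# total August 2013 pageviews (the SUM from above)
bq query "SELECT SUM(requests) FROM [fh-bigquery:wikipedia.wikipedia_views_201308]"

# sketch: top 10 English titles for the month
bq query "SELECT title, SUM(requests) requests FROM [fh-bigquery:wikipedia.wikipedia_views_201308] WHERE language='en' GROUP BY title ORDER BY requests DESC LIMIT 10"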