The Dataproc page describing Druid support has no section on how to load data into the cluster. I've been trying to do this from Google Cloud Storage, but I don't know how to write a spec for it that works. I'd expect the "firehose" section to contain some Google-specific reference to a bucket, but there are no examples of how to do this.
What is the method to load data into Druid running on GCP Dataproc, straight out of the box?
I haven't used the Dataproc version of Druid, but I have a small cluster running on a Google Compute Engine VM. I ingest data into it from GCS using the Google Cloud Storage Druid extension: https://druid.apache.org/docs/latest/development/extensions-core/google.html
To enable the extension, add it to the list of extensions in your Druid common.runtime.properties file:
druid.extensions.loadList=["druid-google-extensions", "postgresql-metadata-storage"]
To ingest data from GCS, I send an HTTP POST request to http://druid-overlord-host:8081/druid/indexer/v1/task
The POST request body contains the JSON ingestion spec (see the ["ioConfig"]["firehose"] section):
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "daily_xport_test",
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "MONTH",
        "queryGranularity": "NONE",
        "rollup": false
      },
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "json",
          "timestampSpec": {
            "column": "dateday",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [
              {
                "type": "string",
                "name": "id",
                "createBitmapIndex": true
              },
              {
                "type": "long",
                "name": "clicks_count_total"
              },
              {
                "type": "long",
                "name": "ctr"
              },
              "deleted",
              "device_type",
              "target_url"
            ]
          }
        }
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "firehose": {
        "type": "static-google-blobstore",
        "blobs": [
          {
            "bucket": "data-test",
            "path": "/sample_data/daily_export_18092019/000000000000.json.gz"
          }
        ],
        "filter": "*.json.gz$"
      },
      "appendToExisting": false
    },
    "tuningConfig": {
      "type": "index_parallel",
      "maxNumSubTasks": 1,
      "maxRowsInMemory": 1000000,
      "pushTimeout": 0,
      "maxRetry": 3,
      "taskStatusCheckPeriodMs": 1000,
      "chatHandlerTimeout": "PT10S",
      "chatHandlerNumRetries": 5
    }
  }
}
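If an export is split across several objects, the "blobs" list in the firehose can carry one entry per object. Below is a minimal Python sketch of filling that list programmatically instead of editing the JSON by hand. The helper name set_blobs and the second object path are hypothetical and only for illustration; the bucket name and the first path are taken from the spec above.

```python
import copy
import json


def set_blobs(spec, bucket, paths):
    """Return a copy of an ingestion spec whose firehose lists the given GCS objects.

    Only the "blobs" list is replaced; every other firehose key is kept as-is.
    """
    updated = copy.deepcopy(spec)  # leave the caller's spec untouched
    firehose = updated["spec"]["ioConfig"]["firehose"]
    firehose["blobs"] = [{"bucket": bucket, "path": p} for p in paths]
    return updated


if __name__ == "__main__":
    # Cut-down spec containing only the part this helper touches.
    base_spec = {
        "type": "index_parallel",
        "spec": {
            "ioConfig": {
                "type": "index_parallel",
                "firehose": {"type": "static-google-blobstore", "blobs": []},
            }
        },
    }
    paths = [
        "/sample_data/daily_export_18092019/000000000000.json.gz",
        # Hypothetical second shard, not part of the original example.
        "/sample_data/daily_export_18092019/000000000001.json.gz",
    ]
    print(json.dumps(set_blobs(base_spec, "data-test", paths), indent=2))
```

The result can be written to spec.json and submitted in the same way as the hand-written spec.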
Example cURL command to start the ingestion task in Druid (spec.json contains the JSON from the previous section):
curl -X 'POST' -H 'Content-Type:application/json' -d @spec.json http://druid-overlord-host:8081/druid/indexer/v1/task
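The same request can be sent from a script, which also makes it easy to wait for the task to finish. The following is a rough Python sketch using only the standard library. It assumes the Overlord answers the POST with a JSON object containing a "task" ID and exposes the task state at /druid/indexer/v1/task/<taskId>/status; the function names and the polling interval are illustrative choices, not part of the original answer.

```python
import json
import time
import urllib.parse
import urllib.request

# Host and port as used in the cURL example; adjust for your cluster.
OVERLORD = "http://druid-overlord-host:8081"


def task_url(base):
    """URL that accepts new ingestion tasks."""
    return base.rstrip("/") + "/druid/indexer/v1/task"


def status_url(base, task_id):
    """URL that reports the state of one task (task ID is URL-encoded)."""
    return task_url(base) + "/" + urllib.parse.quote(task_id, safe="") + "/status"


def submit(spec_path, base=OVERLORD):
    """POST the spec file to the Overlord and return the task ID it reports."""
    with open(spec_path, "rb") as f:
        body = f.read()
    request = urllib.request.Request(
        task_url(base),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        # Assumed response shape: {"task": "<task id>"}
        return json.load(response)["task"]


def wait_for(task_id, base=OVERLORD, poll_seconds=10):
    """Poll the status endpoint until the task reports SUCCESS or FAILED."""
    while True:
        with urllib.request.urlopen(status_url(base, task_id)) as response:
            # Assumed response shape: {"status": {"status": "RUNNING", ...}}
            state = json.load(response)["status"]["status"]
        if state in ("SUCCESS", "FAILED"):
            return state
        time.sleep(poll_seconds)
```

Calling wait_for(submit("spec.json")) would then block until the ingestion task ends and return its final state.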