 

How to copy a file from a GCS bucket in Dataproc to HDFS using google cloud?

I have uploaded a data file to my project's GCS bucket in Dataproc. Now I want to copy that file into HDFS. How can I do that?

Asked Jan 29 '19 by DivyaMishra


1 Answer

For a single "small" file

You can copy a single file from Google Cloud Storage (GCS) to HDFS with the hdfs dfs -cp command. Note that you need to run this from a node within the cluster:

hdfs dfs -cp gs://<bucket>/<object> <hdfs path>
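
For example, with a hypothetical cluster named my-cluster and a hypothetical bucket my-bucket, a run could look like this (all names are placeholders; the -m suffix on the master node name is the usual Dataproc convention):

# SSH to the cluster's master node
gcloud compute ssh my-cluster-m --zone=us-central1-a

# Copy a single object from GCS into an HDFS home directory
hdfs dfs -cp gs://my-bucket/sales.csv /user/myuser/sales.csv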

This works because hdfs://<master node> is the default filesystem. You can explicitly specify the scheme and NameNode if desired:

hdfs dfs -cp gs://<bucket>/<object> hdfs://<master node>/<hdfs path>

Note that GCS objects use the gs: scheme. Paths should appear the same as they do when you use gsutil.
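
To double-check a path, you can list the same object with both tools; the URI is identical (bucket and object names are placeholders):

gsutil ls gs://my-bucket/sales.csv
hdfs dfs -ls gs://my-bucket/sales.csv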

For a "large" file or large directory of files

When you use hdfs dfs, data is piped through the machine you run the command on. If you have a large dataset to copy, you will likely want to do this in parallel on the cluster using DistCp:

hadoop distcp gs://<bucket>/<directory> <HDFS target directory>

Consult the DistCp documentation for details.
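
As a rough sketch (bucket, directory, and map count are placeholders), a typical run that caps the number of parallel copy tasks and copies only files that have changed might be:

hadoop distcp -m 20 -update gs://my-bucket/input-data hdfs:///user/myuser/input-data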

Consider leaving data on GCS

Finally, consider leaving your data on GCS. Because the GCS connector implements Hadoop's distributed filesystem interface, it can be used as a drop-in replacement for HDFS in most cases. Notable exceptions are when you rely on (most) atomic file/directory operations or want to use a latency-sensitive application like HBase. The Dataproc HDFS migration guide gives a good overview of data migration.
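
For example, because the GCS connector is preinstalled on Dataproc nodes, standard Hadoop filesystem commands accept gs:// paths directly (bucket name is a placeholder):

# Browse and read GCS data through the Hadoop filesystem layer, no copy to HDFS needed
hadoop fs -ls gs://my-bucket/input-data
hadoop fs -cat gs://my-bucket/input-data/part-00000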

Answered by Ben Sidhom