I uploaded a data file to the GCS bucket of my Dataproc project. Now I want to copy that file to HDFS. How can I do that?
You can copy a single file from Google Cloud Storage (GCS) to HDFS using the hdfs copy command. Note that you need to run this from a node within the cluster:
hdfs dfs -cp gs://<bucket>/<object> <hdfs path>
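For example, with a hypothetical bucket my-bucket, object data.csv, and HDFS target directory /user/myuser, the command would look like:
hdfs dfs -cp gs://my-bucket/data.csv /user/myuser/data.csv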
This works because hdfs://<master node> is the default filesystem. You can explicitly specify the scheme and NameNode if desired:
hdfs dfs -cp gs://<bucket>/<object> hdfs://<master node>/<hdfs path>
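On Dataproc the master node is typically named <cluster name>-m, so with a hypothetical cluster my-cluster the fully qualified form would be something like:
hdfs dfs -cp gs://my-bucket/data.csv hdfs://my-cluster-m/user/myuser/data.csv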
Note that GCS objects use the gs: scheme. Paths should appear the same as they do when you use gsutil.
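As a quick sanity check, you can list the same (hypothetical) object with both tools and confirm the gs:// path is identical:
gsutil ls gs://my-bucket/data.csv
hdfs dfs -ls gs://my-bucket/data.csv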
When you use hdfs dfs, data is piped through your local machine. If you have a large dataset to copy, you will likely want to do this in parallel on the cluster using DistCp:
hadoop distcp gs://<bucket>/<directory> <HDFS target directory>
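As an illustration, a parallel copy of a whole directory, capped at 20 map tasks with the -m flag (bucket and paths are made up):
hadoop distcp -m 20 gs://my-bucket/input-data /user/myuser/input-data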
Consult the DistCp documentation for details.
Finally, consider leaving your data on GCS. Because the GCS connector implements Hadoop's distributed filesystem interface, it can be used as a drop-in replacement for HDFS in most cases. Notable exceptions are when you rely on (most) atomic file/directory operations or want to use a latency-sensitive application like HBase. The Dataproc HDFS migration guide gives a good overview of data migration.
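For example, jobs on the cluster can read and write gs:// paths directly instead of HDFS paths; this is a sketch assuming a hypothetical bucket my-bucket and Spark script my_job.py:
hdfs dfs -ls gs://my-bucket/
spark-submit my_job.py gs://my-bucket/input gs://my-bucket/output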