How to save data in HDFS with Spark?

I want to use Spark Streaming to retrieve data from Kafka, and then save that data to a remote HDFS. I know that I have to use the function saveAsTextFile. However, I don't know precisely how to specify the path.

Is that correct if I write this:

myDStream.foreachRDD(frm->{
    frm.saveAsTextFile("hdfs://ip_addr:9000//home/hadoop/datanode/myNewFolder");
});

where ip_addr is the IP address of my remote HDFS server, /home/hadoop/datanode/ is the DataNode directory created when I installed Hadoop (I don't know if I have to specify this directory), and myNewFolder is the folder where I want to save my data.

Thanks in advance.

Yassir

asked Sep 15 '25 by Yassir S

1 Answer

The path has to be a directory in HDFS.

For example, if you want to save the files inside a folder named myNewFolder under the root path / in HDFS, the path to use would be hdfs://namenode_ip:port/myNewFolder/.

On execution of the Spark job, the directory myNewFolder will be created.
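A minimal sketch of how such an output path could be assembled, assuming a NameNode reachable at a placeholder host namenode_ip on port 9000 (substitute the fs.defaultFS value from your cluster's core-site.xml):

```java
public class HdfsOutputPath {
    // Builds an HDFS output path from the NameNode address and a folder
    // *inside* HDFS. "namenode_ip" and port 9000 in main() below are
    // placeholders, not real values.
    static String outputPath(String namenodeHost, int port, String folder) {
        return "hdfs://" + namenodeHost + ":" + port + "/" + folder + "/";
    }

    public static void main(String[] args) {
        String path = outputPath("namenode_ip", 9000, "myNewFolder");
        System.out.println(path);
        // In the streaming job, this string would be the argument to
        // rdd.saveAsTextFile(path) inside foreachRDD.
    }
}
```

Note that the path after the host:port part is a location within the HDFS namespace, not a local filesystem path on any node.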

The datanode data directory, which is set via dfs.datanode.data.dir in hdfs-site.xml, is where the DataNode stores the blocks of the files you put into HDFS; it should not be referenced as an HDFS directory path.
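For reference, this is roughly what that setting looks like in hdfs-site.xml (the local path shown is an example matching the asker's /home/hadoop/datanode layout):

```xml
<!-- hdfs-site.xml: local filesystem path where the DataNode keeps
     block data. This is a LOCAL path on the DataNode machine, not a
     path inside HDFS, so it never appears in saveAsTextFile paths. -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/home/hadoop/datanode</value>
</property>
```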

answered Sep 18 '25 by franklinsijo