HDFS is the heart of Hadoop, I get that. But what if I don't want to store my data on HDFS? Instead, I want to analyze and run Hadoop jobs on data that lives on a remote server accessible via the NFS protocol. How do I do that?
For example, I want to run TeraGen against a path on the NFS server, like below:
hadoop jar hadoop-mapreduce-examples.jar teragen 1000000000 nfs://IP/some/path
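(For what it's worth, I realize that if the export were mounted at the same mount point on every node, I could probably already go through the local file system, roughly like the command below, where /mnt/nfs is just a placeholder for the mount point. But I would like to address the server directly with an nfs:// style URI, or at least hear better ideas.)

hadoop jar hadoop-mapreduce-examples.jar teragen 1000000000 file:///mnt/nfs/some/path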
I am just looking for ideas on how to do this, and I do understand the trade-offs involved (HDFS vs. NFS). So while I appreciate anyone telling me it's a bad idea, I still want to do it for an experiment I am running.
I could probably code something to make this happen, but any pointers on where to start would be helpful and much appreciated. I also don't want to reinvent the wheel, so if something like this already exists and I am simply unaware of it, please comment and let me know. Anything I build will be made open source so that others can benefit as well.
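For context, here is the kind of skeleton I imagine I would have to start from: a custom org.apache.hadoop.fs.FileSystem implementation for an nfs:// scheme. Everything below (the class name, the scheme, and the TODO stubs) is just a placeholder sketch of the methods Hadoop requires, not working code:

import java.io.FileNotFoundException;
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;
import org.apache.hadoop.util.Progressable;

/**
 * Placeholder skeleton for a Hadoop FileSystem that would talk to an NFS
 * server directly. Every stub below would need a real implementation on
 * top of an NFS client library.
 */
public class NfsFileSystem extends FileSystem {

  private URI uri;
  private Path workingDir = new Path("/");

  @Override
  public void initialize(URI name, Configuration conf) throws IOException {
    super.initialize(name, conf);   // sets up statistics etc.
    this.uri = name;                // e.g. nfs://IP/some/path
    // TODO: open a connection to the NFS server here
  }

  @Override
  public String getScheme() {
    return "nfs";                   // makes nfs:// URIs resolve to this class
  }

  @Override
  public URI getUri() {
    return uri;
  }

  // The methods below are what Hadoop actually calls during a job.

  @Override
  public FSDataInputStream open(Path f, int bufferSize) throws IOException {
    throw new UnsupportedOperationException("TODO: read a file over NFS");
  }

  @Override
  public FSDataOutputStream create(Path f, FsPermission permission,
      boolean overwrite, int bufferSize, short replication, long blockSize,
      Progressable progress) throws IOException {
    throw new UnsupportedOperationException("TODO: create/write a file over NFS");
  }

  @Override
  public FSDataOutputStream append(Path f, int bufferSize, Progressable progress)
      throws IOException {
    throw new UnsupportedOperationException("TODO: append over NFS");
  }

  @Override
  public boolean rename(Path src, Path dst) throws IOException {
    throw new UnsupportedOperationException("TODO: rename over NFS");
  }

  @Override
  public boolean delete(Path f, boolean recursive) throws IOException {
    throw new UnsupportedOperationException("TODO: delete over NFS");
  }

  @Override
  public FileStatus[] listStatus(Path f) throws FileNotFoundException, IOException {
    throw new UnsupportedOperationException("TODO: list a directory over NFS");
  }

  @Override
  public void setWorkingDirectory(Path newDir) {
    workingDir = newDir;
  }

  @Override
  public Path getWorkingDirectory() {
    return workingDir;
  }

  @Override
  public boolean mkdirs(Path f, FsPermission permission) throws IOException {
    throw new UnsupportedOperationException("TODO: mkdirs over NFS");
  }

  @Override
  public FileStatus getFileStatus(Path f) throws IOException {
    throw new UnsupportedOperationException("TODO: stat a path over NFS");
  }
}

As far as I understand, once such a class is implemented and on the cluster classpath, it can be registered in core-site.xml via fs.nfs.impl so that nfs:// URIs resolve to it. If I am starting from the wrong end here, corrections are welcome.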
Do you know this site: https://blog.netapp.com/blogs/run-big-data-analytics-natively-on-nfs-data/
It looks like you can swap out HDFS for NFS at the bottom, while at the higher abstraction layers everything keeps working as before, since MapReduce/YARN take care of everything for you.
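If that is right, the swap itself should come down to a couple of properties in core-site.xml, something along these lines (the values below are placeholders from my side, not verified settings; the connector's documentation will have the real ones):

fs.nfs.impl = <fully qualified FileSystem class shipped by the NFS connector>
fs.defaultFS = nfs://IP:2049/   (only if NFS is to replace HDFS as the default file system)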
I cannot say yet whether or not this works, as we are currently preparing to set up such a "native NFS Hadoop" ourselves. I will come back to you with more details in a few months.