
Read a text file from HDFS line by line in mapper

Tags: java, hadoop, hdfs

Is the following code correct for reading a text file from HDFS inside a Mapper? And if it is:

  1. What happens if two mappers in different nodes try to open the file at almost the same time?
  2. Isn't there a need to close the InputStreamReader? If so, how to do it without closing the filesystem?

My code is:

Path pt=new Path("hdfs://pathTofile");
FileSystem fs = FileSystem.get(context.getConfiguration());
BufferedReader br=new BufferedReader(new InputStreamReader(fs.open(pt)));
String line;
line=br.readLine();
while (line != null){
System.out.println(line);
asked Jan 28 '13 by nik686



1 Answer

This will work, with some amendments; I assume the code you've pasted is just truncated:

Path pt=new Path("hdfs://pathTofile");
FileSystem fs = FileSystem.get(context.getConfiguration());
BufferedReader br=new BufferedReader(new InputStreamReader(fs.open(pt)));
try {
  String line;
  line=br.readLine();
  while (line != null){
    System.out.println(line);

    // be sure to read the next line otherwise you'll get an infinite loop
    line = br.readLine();
  }
} finally {
  // you should close out the BufferedReader
  br.close();
}
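
As a side note (not part of the original answer), if Java 7 or later is available, the same read can be written with try-with-resources so the reader is closed automatically even when an exception is thrown. The FileSystem returned by FileSystem.get() is a cached instance shared across the JVM, which is why you close the reader but not the filesystem:

Path pt = new Path("hdfs://pathTofile");
FileSystem fs = FileSystem.get(context.getConfiguration());

// try-with-resources closes the BufferedReader (and the underlying
// HDFS input stream) automatically, even if an exception is thrown
try (BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(pt)))) {
  String line;
  while ((line = br.readLine()) != null) {
    System.out.println(line);
  }
}
// deliberately no fs.close(): FileSystem.get() returns a shared,
// cached instance that the rest of the task may still be using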

You can have more than one mapper reading the same file, but beyond a certain point it makes more sense to use the Distributed Cache: it not only reduces the load on the data nodes that host the file's blocks, it is also more efficient when a job has many more tasks than task nodes. A sketch of that approach follows.
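
For illustration, here is a minimal sketch of the Distributed Cache approach using the Hadoop 2.x mapreduce API (Job.addCacheFile on the driver side and context.getCacheFiles() in the mapper; the 1.x releases used the DistributedCache class instead). The class name and the setup() handling are placeholders, not part of the original answer:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// driver side, when configuring the job:
//   job.addCacheFile(new URI("hdfs://pathTofile"));

public class CacheFileMapper extends Mapper<LongWritable, Text, Text, Text> {

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    URI[] cacheFiles = context.getCacheFiles();
    // the framework copies the file to each task node and symlinks it
    // into the task's working directory under its base name
    String localName = new Path(cacheFiles[0].getPath()).getName();
    try (BufferedReader br = new BufferedReader(new FileReader(localName))) {
      String line;
      while ((line = br.readLine()) != null) {
        // parse the line and keep it in a field for use in map()
      }
    }
  }
}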

answered Nov 11 '22 by Chris White