How does Hadoop read an input file?

Tags: csv, hadoop

I have a CSV file to analyze with Hadoop MapReduce. I am wondering whether Hadoop will parse it line by line. If so, I want to split each line by comma to get the fields I want to analyze. Or is there a better method of parsing CSV and feeding it into Hadoop? The file is 10 GB, comma delimited, and I want to use Java with Hadoop. Does the parameter "value" of Text type in the map() method below contain each line that is parsed in by MapReduce? This is where I'm most confused.

this is my code:

public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    try {
        String[] tokens = value.toString().split(",");

        String crimeType = tokens[5].trim();
        int year = Integer.parseInt(tokens[17].trim());

        context.write(crimeType, year);

    } catch (Exception e) {...}
}
asked Oct 19 '13 by TonyGW

2 Answers

Yes, by default Hadoop uses TextInputFormat, whose reader feeds the mapper line by line from the input file. The key passed to the mapper is the byte offset of the line within the file. Be careful with CSV files, though: a single column/field can contain a line break. You might want to look for a CSV-aware input reader like this one: https://github.com/mvallebr/CSVInputFormat/blob/master/src/main/java/org/apache/hadoop/mapreduce/lib/input/CSVNLineInputFormat.java
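To make the default concrete, here is a minimal, map-only driver sketch (class names like CrimeJob and CrimeMapper are illustrative, not from the question). TextInputFormat is already the default, so setting it explicitly is only for readability; this is also the place where you would swap in a CSV-aware InputFormat such as the one linked above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CrimeJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "crime-by-year");
        job.setJarByClass(CrimeJob.class);

        // Default behavior: each record is one line; key = byte offset
        // (LongWritable), value = the line's contents (Text).
        job.setInputFormatClass(TextInputFormat.class);

        job.setMapperClass(CrimeMapper.class); // hypothetical mapper class
        job.setNumReduceTasks(0);              // map-only, to keep the sketch small
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}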

answered by pumuckl


  • Does the parameter "value" of Text type in the map() method below contain each line that is parsed in by MapReduce? This is where I'm most confused.

    Yes (assuming you are using the default InputFormat, which is TextInputFormat). The process is a bit more involved, though. It is actually the RecordReader that decides how exactly the InputSplit created by the InputFormat will be presented to the mapper as records (key/value pairs). TextInputFormat uses a LineRecordReader, and each entire line is treated as a record. Remember, the mapper doesn't process the entire InputSplit all at once; it is rather a discrete process wherein the InputSplit is fed to the mapper one record at a time.

  • I am wondering if Hadoop will parse it line by line? If yes, I want to use string split by comma to get the fields I want to analyze.

    I don't find anything wrong with your approach. This is how folks usually process CSV files: read in the lines as Text values, convert them to String, and use split(). One minor suggestion, though: convert the Java types into the appropriate Hadoop Writable types before you emit them via Context.write(), e.g. wrap crimeType in a Text and year in an IntWritable, as shown in the sketch below this list.
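
Putting that together, here is a minimal sketch of the mapper with the outputs wrapped in Writable types. The column indices 5 and 17 are carried over from the question, so they are assumptions about your CSV layout:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CrimeMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Reused across records to avoid allocating a new object per line.
    private final Text crimeType = new Text();
    private final IntWritable year = new IntWritable();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key   = byte offset of this line in the file
        // value = one line of the CSV file
        String[] tokens = value.toString().split(",");

        // Indices 5 and 17 come from the question; adjust for your schema.
        if (tokens.length > 17) {
            try {
                crimeType.set(tokens[5].trim());
                year.set(Integer.parseInt(tokens[17].trim()));
                context.write(crimeType, year);
            } catch (NumberFormatException e) {
                // skip malformed rows rather than failing the task
            }
        }
    }
}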

Is this what you need?

answered by Tariq