A sqoop query generates a java file that contains a class that contains the code to get access in mapreduce to the columns data for each row. (the Sqoop import was done in text without the --as-sequencefile option, and with 1 line per record and commas between the columns) But how do we actually use it?
I found a public method parse() in this class that takes Text as an input and populates all the members of the class , so to practice I modified the wordcount application to convert a line of text from the TextInputFormat in the mapper into an instnace of the class generated by sqoop. But that causes an "unreported exception.com.cloudera.sqoop.lib.RecordParser.ParseError; must be caught or declared to be thrown" when I call the parse() method.
Can it be done this way or is a custom InputFormat necessary to populate the class with the data from each record ?
Ok this seems obvious once you find out but as a java beginner this can take time.
First configure your project: just add the sqoop generated .java file in your source folder. I use eclipse to import it in my class source folder.
Then just make sure you configured your project's java build path correctly:
Add the following jar files in the project's properties/java build path/libraries/add external jar: (for hadoop cdh4+) :
/usr/lib/hadoop/hadoop-common.jar
/usr/lib/hadoop-[version]-mapreduce/hadoop-core.jar
/usr/lib/sqoop/sqoop-[sqoop-version]-cdh[cdh-version].jar
Then adapt your mapreduce source code: First configure it:
public int run(String [] args) throws exception
{
Job job = new Job(getConf());
job.setJarByClass(YourClass.class);
job.setMapperClass(SqoopImportMap.class);
job.setReducerClass(SqoopImprtReduce.class);
FileInputFormat.addInputPath((job,"hdfs_path_to_your_sqoop_imported_file"));
FileOutputFormat.setOutputPath((job,"hdfs_output_path"));
// I simply use text as output for the mapper but it can be any class you designed
// as long as you implement it as a Writable
job.setMapOutputKeyClass(Text.Class);
job.setMapOutputValueClass(Text.Class);
job.setOutputKeyClass(Text.Class);
job.setOutputValueClass(Text.Class);
...
Now configure your mapper class. Let's assume your sqoop imported java file is called Sqimp.java: and the table you imported had the following columns: id, name, age your mapper class should look like this:
public static class SqoopImportMap
extends Mapper<LongWritable, Text, Text, Text>
{
public void map(LongWritable k, Text v, Context context)
{
Sqimp s = new Sqimp();
try
{
// this is where the code generated by sqoop is used.
// it automatically casts one line of the imported data into an instance of the generated class,
// to let you access the data inside the columns easily
s.parse(v);
}
catch(ParseError pe) {// do something if there is an error.}
try
{
// now the imported data is accessible:
// e.g
if (s.age>30)
{
// submit the selected data to the mapper's output as a key value pair.
context.write(new Text(s.age),new Text(s.id));
}
}
catch(Exception ex)
{//do something about the error}
}
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With