Parsing JSON input in Hadoop (Java)

Tags: java, hadoop

My input data is in HDFS. I am simply trying to do a word count, but with a slight difference: the data is in JSON format, so each line of data looks like:

{"author":"foo", "text": "hello"}
{"author":"foo123", "text": "hello world"}
{"author":"foo234", "text": "hello this world"}

I only want to count the words in the "text" part.

How do I do this?

Here is what I have tried so far:

// Imports needed at the top of the enclosing source file:
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.json.JSONException;
import org.json.JSONObject;

public static class TokenCounterMapper
    extends Mapper<Object, Text, Text, IntWritable> {
    private static final Log log = LogFactory.getLog(TokenCounterMapper.class);
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
        try {
            // Each input line is one JSON object; pull out its "text" field.
            JSONObject jsn = new JSONObject(value.toString());
            String text = (String) jsn.get("text");
            log.info("Logging data");
            log.info(text);

            // Tokenize the "text" value and emit (word, 1) for each token.
            StringTokenizer itr = new StringTokenizer(text);
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        } catch (JSONException e) {
            // Malformed JSON line: print the stack trace and skip the record.
            e.printStackTrace();
        }
    }
}

But I am getting this error:

Error: java.lang.ClassNotFoundException: org.json.JSONException
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:247)
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:820)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:865)
    at org.apache.hadoop.mapreduce.JobContext.getMapperClass(JobContext.java:199)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:719)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
asked May 29 '13 by frazman

2 Answers

It seems you forgot to embed the JSON library in your Hadoop job JAR. You can have a look here to see how to build your job with the library included: http://tikalk.com/build-your-first-hadoop-project-maven
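
For example, with Maven you could declare the org.json dependency and use the maven-shade-plugin so the library gets bundled into a single job JAR. This is only a sketch; the version number is an assumption, so match it to whatever you actually compile against:

<!-- pom.xml fragment: bundle org.json into the job JAR -->
<dependencies>
  <dependency>
    <groupId>org.json</groupId>
    <artifactId>json</artifactId>
    <version>20090211</version> <!-- assumed version; use your own -->
  </dependency>
</dependencies>
<build>
  <plugins>
    <!-- the shade plugin repackages dependencies into one runnable jar -->
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>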

answered Sep 29 '22 by clement

There are several ways to use external JARs with your map-reduce code:

  1. Include the referenced JAR in the lib subdirectory of the submittable JAR: the job will unpack the JAR from this lib subdirectory into the jobcache on the respective TaskTracker nodes and point your tasks to this directory to make the JAR available to your code. If the JARs are small, change often, and are job-specific, this is the preferred method. This is what @clement suggested in his answer.

  2. Install the JAR on the cluster nodes. The easiest way is to place the JAR in the $HADOOP_HOME/lib directory, since everything in that directory is put on the classpath when a Hadoop daemon starts. Note that a restart of the daemons is needed for this to take effect.

  3. The TaskTrackers will use the external JAR, so you can provide it by modifying the HADOOP_TASKTRACKER_OPTS option in the hadoop-env.sh configuration file and making it point to the JAR. The JAR needs to be present at the same path on all nodes where a TaskTracker runs.

  4. Include the JAR in the “-libjars” command line option of the hadoop jar … command. The JAR will be placed in the distributed cache and made available to all of the job’s task attempts. Your map-reduce code must use GenericOptionsParser (see the driver sketch after this list). For more details read this blog post.
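
For illustration, here is a minimal driver sketch that goes through ToolRunner, which runs GenericOptionsParser for you so that -libjars is honored. The class name JsonWordCount is hypothetical; it reuses the TokenCounterMapper from the question:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class JsonWordCount extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // getConf() already reflects what GenericOptionsParser consumed,
        // including -libjars, so the extra JARs end up in the distributed cache.
        Job job = new Job(getConf(), "json wordcount");
        job.setJarByClass(JsonWordCount.class);
        job.setMapperClass(TokenCounterMapper.class);
        // Reducer setup (a simple sum reducer) omitted to keep the sketch short.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner invokes GenericOptionsParser before calling run().
        System.exit(ToolRunner.run(new JsonWordCount(), args));
    }
}

Submitting would then look something like hadoop jar wordcount.jar JsonWordCount -libjars /path/to/json.jar /input /output (the jar name and paths are made up).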

Comparison:

  • #1 is a legacy method and is discouraged because it has a large negative performance cost.

  • #2 and #3 work for private clusters, but they are poor practice, as you cannot expect end users to do that.

  • #4 is the most recommended option.

Read the main post from Cloudera.

answered Sep 29 '22 by Tejas Patil