Word Count program in Hive

Tags:

hive

mapreduce

I'm trying to learn Hive. Surprisingly, I can't find an example of how to write a simple word count job. Is the following correct?

Let's say I have an input file input.tsv:

hello, world
this is an example input file

I create a splitter in Python to turn each line into words:

import sys

# Emit one word per line; Hive reads each line of stdout as one output row.
for line in sys.stdin:
    for word in line.split():
        print(word)
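One way to sanity-check the splitter before wiring it into Hive is to run the same loop over a small in-memory sample. The sketch below is only an illustration: the helper name split_words is hypothetical, and io.StringIO stands in for the rows that Hive would pipe to the script on stdin.

```python
import io


def split_words(lines):
    """Yield each whitespace-separated word from an iterable of lines."""
    for line in lines:
        for word in line.split():
            yield word


# Stand-in for stdin: the two lines of input.tsv from the question.
sample = io.StringIO("hello, world\nthis is an example input file\n")
words = list(split_words(sample))
print(words)
```

Note that line.split() with no argument splits on any run of whitespace, so punctuation such as the comma in "hello," stays attached to the word.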

And then I have the following in my Hive script:

CREATE TABLE input (line STRING);
LOAD DATA LOCAL INPATH 'input.tsv' OVERWRITE INTO TABLE input;

-- temporary table to hold words...
CREATE TABLE words (word STRING);

add file splitter.py;

INSERT OVERWRITE TABLE words 
  SELECT TRANSFORM(line) 
    USING 'python splitter.py' 
    AS word
  FROM input;

SELECT word, count(*) AS count FROM words GROUP BY word;

I'm not sure if I'm missing something, or if it really is this complicated. (In particular, do I need the temporary words table, and do I need to write the external splitter function?)

asked Apr 06 '12 06:04

1 Answer

If you want a simpler version, try the following:

SELECT word, COUNT(*) FROM input LATERAL VIEW explode(split(line, ' ')) lTable AS word GROUP BY word;

I use a lateral view so that I can apply a table-valued function (explode), which takes the array returned by split and outputs a new row for every element. In practice, I use a UDF that wraps IBM's ICU4J word breaker. I generally avoid transform scripts and use UDFs for everything. You don't need the temporary words table.
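To make the row-level behavior concrete, here is a rough pure-Python emulation of what the query does: split each line on a single space, emit one row per element (the explode step), then count per word (the GROUP BY step). It is a sketch of the semantics only, not of how Hive executes the query. The helper name explode_split is hypothetical, and the third sample row is made up so that one count exceeds 1.

```python
from collections import Counter


def explode_split(rows, sep=" "):
    """Emit one output item per element of each line split on sep."""
    for line in rows:
        for word in line.split(sep):
            yield word


rows = [
    "hello, world",                   # from input.tsv in the question
    "this is an example input file",  # from input.tsv in the question
    "hello, again",                   # hypothetical extra row
]

# Counter plays the role of GROUP BY word with COUNT(*).
counts = Counter(explode_split(rows))
for word, n in sorted(counts.items()):
    print(word, n)
```

Splitting on a single space leaves punctuation attached, so "hello," is counted as its own token. That is one reason a proper word breaker, as mentioned above, can be preferable for real text.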

answered Dec 21 '22 05:12

Steve Severance

Question asked by grautur
