Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Word Count program in Hive

Tags:

hive

mapreduce

I'm trying to learn Hive. Surprisingly, I can't find an example of how to write a simple word count job. Is the following correct?

Let's say I have an input file input.tsv:

hello, world
this is an example input file

I create a splitter in Python to turn each line into words:

import sys

for line in sys.stdin:
 for word in line.split():
   print word

And then I have the following in my Hive script:

CREATE TABLE input (line STRING);
LOAD DATA LOCAL INPATH 'input.tsv' OVERWRITE INTO TABLE input;

-- temporary table to hold words...
CREATE TABLE words (word STRING);

add file splitter.py;

INSERT OVERWRITE TABLE words 
  SELECT TRANSFORM(text) 
    USING 'python splitter.py' 
    AS word
  FROM input;

SELECT word, count(*) AS count FROM words GROUP BY word;

I'm not sure if I'm missing something, or if it really is this complicated. (In particular, do I need the temporary words table, and do I need to write the external splitter function?)

like image 436
grautur Avatar asked Apr 06 '12 06:04

grautur


People also ask

How to perform word count in Hive?

If you want a simple one see the following: SELECT word, COUNT(*) FROM input LATERAL VIEW explode(split(text, ' ')) lTable as word GROUP BY word; I use a lateral view to enable the use of a table valued function (explode) which takes the list that comes out of split function and outputs a new row for every value.

How do I count strings in hive?

You can split string by '(^|\\|)REG:' - this regex means 'REG:' at the beginning of the string or '|REG:' , then take size of array -1. Using this method you do not need to explode array and aggregate count.

Does Count work with hive?

Simple Examples In order to count the number of rows in a table: SELECT COUNT(*) FROM table2; Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).

How do I count the number of columns in hive?

At the end of the SHOW COLUMNS command, it shows the number of rows returned, which indicates the number of columns, so this answer is correct. Even 'Describe db_name. table_name;' will give the count in similar way.


1 Answers

If you want a simple one see the following:

SELECT word, COUNT(*) FROM input LATERAL VIEW explode(split(text, ' ')) lTable as word GROUP BY word;

I use a lateral view to enable the use of a table valued function (explode) which takes the list that comes out of split function and outputs a new row for every value. In practice I use a UDF that wraps IBM's ICU4J word breaker. I generally don't use transform scripts and use UDFs for everything. You don't need a temporary words table.

like image 88
Steve Severance Avatar answered Dec 21 '22 05:12

Steve Severance