Apache Pig: Load a file that shows fine using hadoop fs -text

I have files named part-r-000[0-9][0-9] that contain tab-separated fields. I can view them using hadoop fs -text part-r-00000, but I can't get them to load in Pig.

What I've tried:

x = load 'part-r-00000';
dump x;
x = load 'part-r-00000' using TextLoader();
dump x;

but that only gives me garbage. How can I view the file using Pig?

What might be of relevance: my HDFS is still using CDH-2 at the moment. Furthermore, if I download the file locally and run file part-r-00000, it reports part-r-00000: data; I don't know how to unzip it locally.

asked Sep 05 '12 by exic



2 Answers

According to the HDFS documentation, hadoop fs -text <file> can be used on "zip and TextRecordInputStream" data, so your data may be in one of those formats.

If the file were compressed, Hadoop would normally add the extension when writing it to HDFS, but since that's missing here, you could test by downloading the file and trying to unzip/gunzip/bunzip2/etc. it locally. Pig should do this decompression automatically, but it may require the file extension to be present (e.g. part-r-00000.zip) -- more info.
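For example, if the file turns out to be gzip-compressed tab-separated text, renaming it with the matching extension should let Pig decompress it transparently. A minimal sketch (the path, field names, and types are placeholders, not taken from your data):

-- hypothetical: first rename the file so the codec is recognized, e.g.
--   hadoop fs -mv mydir/part-r-00000 mydir/part-r-00000.gz
x = LOAD 'mydir/part-r-00000.gz' USING PigStorage('\t')
    AS (field1:chararray, field2:chararray);  -- placeholder schema
DUMP x;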

I'm not too sure about TextRecordInputStream; it sounds like it would just be Pig's default method, but I could be wrong. A quick Google search didn't turn up any mention of LOADing this kind of data via Pig.

Update: Since you've discovered it is a sequence file, here's how you can load it using PiggyBank:

-- using the Cloudera directory structure:
REGISTER /usr/lib/pig/contrib/piggybank/java/piggybank.jar;
-- REGISTER /home/hadoop/lib/pig/piggybank.jar;
DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();


-- Sample job: grab counts of tweets by day
A = LOAD 'mydir/part-r-000*'  -- Hadoop globs accept * and {a,b} alternation, not {00..99} ranges
    USING SequenceFileLoader AS (key:long, val:long);  -- adjust the schema to match your data
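As a hedged continuation of the sample job named in the comment above (assuming key holds a day identifier and val a count, which the original data doesn't confirm), the aggregation could look like this:

-- hypothetical follow-on: total tweet counts per day
B = GROUP A BY key;
C = FOREACH B GENERATE group AS day, SUM(A.val) AS tweet_count;
DUMP C;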
answered Nov 16 '22 by Dolan Antenucci


If you want to manipulate (read/write) sequence files with Pig, you can also give Twitter's Elephant-Bird a try.

You can find examples of how to read and write them here.
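As a rough sketch of what reading looks like (the jar path is an assumption, and the converters must match your actual key/value Writable types; Text keys and LongWritable values are assumed here):

REGISTER /path_to_jars/elephant-bird-pig.jar;  -- assumed jar name and location
A = LOAD 'mydir/part-r-00000'
    USING com.twitter.elephantbird.pig.load.SequenceFileLoader(
        '-c com.twitter.elephantbird.pig.util.TextConverter',
        '-c com.twitter.elephantbird.pig.util.LongWritableConverter')
    AS (key:chararray, value:long);
DUMP A;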

If you use custom Writables in your sequence files, you can implement a custom converter by extending AbstractWritableConverter.

Note that Elephant-Bird requires Thrift to be installed on your machine. Before building it, make sure it is using the Thrift version you have, and provide the correct path to the Thrift executable in its pom.xml:

<plugin>
  <groupId>org.apache.thrift.tools</groupId>
  <artifactId>maven-thrift-plugin</artifactId>
  <version>0.1.10</version>
  <configuration>
    <thriftExecutable>/path_to_thrift/thrift</thriftExecutable>
  </configuration>
</plugin>
answered Nov 16 '22 by Lorand Bendig