I have files that are named part-r-000[0-9][0-9] and that contain tab separated fields. I can view them using hadoop fs -text part-r-00000
but can't get them loaded using pig.
What I've tried:
x = load 'part-r-00000';
dump x;
x = load 'part-r-00000' using TextLoader();
dump x;
but that only gives me garbage. How can I view the file using pig?
What might be of relevance is that my hdfs is still using CDH-2 at the moment.
Furthermore, if I download the file to local and run file part-r-00000
it says part-r-00000: data
, I don't know how to unzip it locally.
Now load the data from the file student_data. txt into Pig by executing the following Pig Latin statement in the Grunt shell. grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',') as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );
How do users interact with the shell in Apache Pig? Using Grunt i.e. Apache Pig's interactive shell, users can interact with HDFS or the local file system. To start Grunt, users should use pig –x local command . This command will prompt Grunt shell.
According to HDFS Documentation, hadoop fs -text <file>
can be used on "zip and TextRecordInputStream" data, so your data may be in one of these formats.
If the file was compressed, normally Hadoop would add the extension when outputting to HDFS, but if this was missing, you could try testing by unzipping/ungzipping/unbzip2ing/etc locally. It appears Pig should do this decompressing automatically, but may require the file extension be present (e.g. part-r-00000.zip) -- more info.
I'm not too sure on the TextRecordInputStream.. it sounds like it would just be the default method of Pig, but I could be wrong. I didn't see any mention of LOAD'ing this data via Pig when I did a quick Google.
Update: Since you've discovered it is a sequence file, here's how you can load it using PiggyBank:
-- using Cloudera directory structure:
REGISTER /usr/lib/pig/contrib/piggybank/java/piggybank.jar
--REGISTER /home/hadoop/lib/pig/piggybank.jar
DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();
-- Sample job: grab counts of tweets by day
A = LOAD 'mydir/part-r-000{00..99}' # not sure if pig likes the {00..99} syntax, but worth a shot
USING SequenceFileLoader AS (key:long, val:long, etc.);
If you want to manipulate (read/write) sequence files with Pig
then you can give a try to Twitter's Elephant-Bird as well.
You can find here examples how to read/write them.
If you use custom Writables in you sequence file then you can implement a custom converter by extending AbstractWritableConverter .
Note, that Elephant-Bird
needs to have an installed Thrift in your machine.
Before building it, make sure that it is using the correct Thrift version you have and also provide the correct path of the Thrift executable in its pom.xml:
<plugin>
<groupId>org.apache.thrift.tools</groupId>
<artifactId>maven-thrift-plugin</artifactId>
<version>0.1.10</version>
<configuration>
<thriftExecutable>/path_to_thrift/thrift</thriftExecutable>
</configuration>
</plugin>
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With