Apache Pig: Load a file that shows fine using hadoop fs -text

I have files named part-r-000[0-9][0-9] that contain tab-separated fields. I can view them with hadoop fs -text part-r-00000, but I can't load them in Pig.

What I've tried:

x = load 'part-r-00000';
dump x;
x = load 'part-r-00000' using TextLoader();
dump x;

but both attempts only give me garbage. How can I view the file using Pig?

It may be relevant that my HDFS is still on CDH-2 at the moment. Furthermore, if I download the file locally and run file part-r-00000, it reports part-r-00000: data, and I don't know how to unzip it locally.

asked Sep 05 '12 17:09

exic

2 Answers

According to the HDFS documentation, hadoop fs -text <file> can be used on "zip and TextRecordInputStream" data, so your data may be in one of those formats.

If the file were compressed, Hadoop would normally add the extension when writing to HDFS. If the extension is missing, you could test by decompressing the file locally with unzip, gunzip, bunzip2, and so on. Pig appears to decompress automatically, but it may require the file extension to be present (e.g. part-r-00000.gz).
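
Since file only reports "data", one quick local test is to look at the first few bytes of the downloaded file. The Python sketch below compares them against a handful of well-known magic prefixes: gzip, bzip2, zip, and the "SEQ" marker that opens a Hadoop SequenceFile. The function names are made up for illustration and the list of prefixes is not exhaustive.

```python
# Well-known magic-byte prefixes; extend the list as needed.
MAGIC = [
    (b"SEQ", "hadoop-sequence-file"),
    (b"\x1f\x8b", "gzip"),
    (b"BZh", "bzip2"),
    (b"PK\x03\x04", "zip"),
]


def sniff_format(header: bytes) -> str:
    """Guess a container format from the first bytes of a file."""
    for prefix, name in MAGIC:
        if header.startswith(prefix):
            return name
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first few bytes of a local file and guess its format."""
    with open(path, "rb") as f:
        return sniff_format(f.read(8))


if __name__ == "__main__":
    import sys

    # Usage: python sniff.py part-r-00000
    for p in sys.argv[1:]:
        print(p, sniff_file(p))
```

If the result is "unknown", the file may still be compressed or encoded in a format this sketch does not cover.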

I'm not too sure about TextRecordInputStream. It sounds like it would just be Pig's default method, but I could be wrong. A quick Google search didn't turn up any mention of LOADing this data via Pig.

Update: Since you've discovered it is a sequence file, here's how you can load it using PiggyBank:

-- using Cloudera directory structure:
REGISTER /usr/lib/pig/contrib/piggybank/java/piggybank.jar;
--REGISTER /home/hadoop/lib/pig/piggybank.jar;
DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();

-- Sample job: grab counts of tweets by day
-- Hadoop path globs support [0-9] classes and {a,b} alternation,
-- not bash-style {00..99} ranges, so match the part files with character classes.
-- The loader yields the key and the value; adjust the types to match your file.
A = LOAD 'mydir/part-r-000[0-9][0-9]'
    USING SequenceFileLoader AS (key:long, val:long);
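
To pick sensible types for the AS clause, it can help to check which key and value classes the file declares. As far as I know, a SequenceFile header starts with "SEQ", a version byte, and then the key and value class names as length-prefixed strings. The Python sketch below reads those two names from a local copy of the file. It assumes header version 4 or later and class names shorter than 128 bytes (so the length fits in one byte); the function names are hypothetical.

```python
import io


def read_text_string(stream) -> str:
    """Read a length-prefixed UTF-8 string.

    Assumption: the length fits in a single byte (0..127), which holds
    for typical Java class names.
    """
    first = stream.read(1)
    if len(first) != 1:
        raise ValueError("truncated header")
    length = first[0]
    if length > 127:
        raise ValueError("multi-byte length not handled in this sketch")
    data = stream.read(length)
    if len(data) != length:
        raise ValueError("truncated header")
    return data.decode("utf-8")


def sequence_file_classes(header: bytes):
    """Return (version, key_class, value_class) from the start of a SequenceFile."""
    stream = io.BytesIO(header)
    if stream.read(3) != b"SEQ":
        raise ValueError("not a SequenceFile")
    version_byte = stream.read(1)
    if len(version_byte) != 1:
        raise ValueError("truncated header")
    version = version_byte[0]
    if version < 4:
        raise ValueError("older header layout not handled in this sketch")
    key_class = read_text_string(stream)
    value_class = read_text_string(stream)
    return version, key_class, value_class


if __name__ == "__main__":
    import sys

    # Usage: python seqheader.py part-r-00000
    with open(sys.argv[1], "rb") as f:
        print(sequence_file_classes(f.read(512)))
```

If the value class turns out to be a text type rather than a numeric one, for example, the val:long in the schema above would need to change accordingly.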

answered Nov 16 '22 02:11

Dolan Antenucci

If you want to read or write sequence files with Pig, you can also try Twitter's Elephant-Bird.

The project provides examples of how to read and write them.

If you use custom Writables in your sequence file, you can implement a custom converter by extending AbstractWritableConverter.

Note that Elephant-Bird requires Thrift to be installed on your machine. Before building it, make sure it uses the Thrift version you have installed, and set the correct path to the Thrift executable in its pom.xml:

<plugin>
  <groupId>org.apache.thrift.tools</groupId>
  <artifactId>maven-thrift-plugin</artifactId>
  <version>0.1.10</version>
  <configuration>
    <thriftExecutable>/path_to_thrift/thrift</thriftExecutable>
  </configuration>
</plugin>

answered Nov 16 '22 03:11

Lorand Bendig

Tags: linux, hadoop, apache-pig, cloudera
