I'm trying to get Apache Pig up and running on my Hadoop cluster and am running into a permissions problem. Pig itself launches and connects to the cluster just fine: from within the Pig shell, I can ls through and around my HDFS directories. However, when I try to actually load data and run Pig commands, I get permissions-related errors:
grunt> A = load 'all_annotated.txt' USING PigStorage() AS (id:long, text:chararray, lang:chararray);
grunt> DUMP A;
2011-08-24 18:11:40,961 [main] ERROR org.apache.pig.tools.grunt.Grunt - You don't have permission to perform the operation. Error from the server: org.apache.hadoop.security.AccessControlException: Permission denied: user=steven, access=WRITE, inode="":hadoop:supergroup:r-xr-xr-x
2011-08-24 18:11:40,977 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias A
Details at logfile: /Users/steven/Desktop/Hacking/hadoop/pig/pig-0.9.0/pig_1314230681326.log
grunt>
In this case, all_annotated.txt is a file in my HDFS home directory that I created and most definitely have permissions to; the same problem occurs no matter what file I try to load. However, I don't think the file itself is the problem, as the error indicates that Pig is trying to write somewhere. Googling around, I found a few mailing list posts suggesting that certain Pig Latin statements (order, etc.) need write access to a temporary directory on the HDFS file system whose location is controlled by the hadoop.tmp.dir property in hdfs-site.xml. I don't think load falls into that category, but just to be sure, I changed hadoop.tmp.dir to point to a directory within my HDFS home directory, and the problem persisted.
So, anybody out there have any ideas as to what might be going on?
It's probably your pig.temp.dir setting. It defaults to /tmp on HDFS, and Pig writes its temporary results there. If you don't have write permission on /tmp, Pig will complain. Try overriding it with -Dpig.temp.dir.
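As a rough sketch of that override, something like the following might work. The directory name pig_tmp and the /user/<name> home layout are assumptions for illustration, not something taken from the question. The commands are printed rather than executed so they can be checked against your own cluster first.

```shell
# Hypothetical scratch directory inside the user's HDFS home directory.
# Both the /user/<name> layout and the name "pig_tmp" are assumptions.
PIG_TMP="/user/${USER:-steven}/pig_tmp"

# Pig invocation with pig.temp.dir overridden as a Java system property.
PIG_CMD="pig -Dpig.temp.dir=${PIG_TMP}"

# Print the commands for review instead of running them directly.
echo "hadoop fs -mkdir ${PIG_TMP}"   # create the scratch directory on HDFS
echo "${PIG_CMD}"                    # then start Pig pointing at it
```

If the printed commands look right, run them in that order; the load and DUMP from the question should then write their temporary output under a directory you own instead of /tmp.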