
How can I debug a Pig script?

Suppose a simple GROUP BY script in Pig, running over terabytes of data, gets stuck at, say, 70%. What can be done to diagnose the problem?

Manish asked May 12 '15 18:05



2 Answers

There are several ways to debug a Pig script. The simplest is to execute it step by step, one relation at a time, and verify each result. The following commands are useful for this:

DUMP - Use the DUMP operator to run (execute) Pig Latin statements and display the results to your screen.

ILLUSTRATE - Use the ILLUSTRATE operator to review how data is transformed through a sequence of Pig Latin statements. ILLUSTRATE allows you to test your programs on small datasets and get faster turnaround times.

EXPLAIN - Use the EXPLAIN operator to review the logical, physical, and MapReduce execution plans that are used to compute the specified relation.

DESCRIBE - Use the DESCRIBE operator to view the schema of a relation. You can view outer relations as well as relations defined in a nested FOREACH statement.
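
As a rough sketch of how the four operators fit together on a group-by script, something like the following could be run step by step in the Grunt shell. The input path, the field names, and the aliases (logs, grouped, totals, sample_totals) are assumptions made up for illustration, not taken from the question.

```pig
-- Hypothetical input: tab-separated (user, bytes) records.
-- The path and schema are assumptions for illustration only.
logs = LOAD 'input/logs.tsv' USING PigStorage('\t')
       AS (user:chararray, bytes:long);

-- Check the schema before doing any work.
DESCRIBE logs;

grouped = GROUP logs BY user;
totals  = FOREACH grouped GENERATE group AS user,
                                   SUM(logs.bytes) AS total_bytes;

-- Confirm the schema produced by the GROUP/FOREACH steps.
DESCRIBE totals;

-- Show a few example rows flowing through each step.
ILLUSTRATE totals;

-- Print the logical, physical, and MapReduce plans for this relation.
EXPLAIN totals;

-- DUMP runs the job; LIMIT only caps how many rows are printed.
sample_totals = LIMIT totals 10;
DUMP sample_totals;
```

Pointing the LOAD statement at a small sample of the data first is usually the quicker way to find a logic error, since DUMP still runs the whole pipeline over whatever input it is given.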

More detail about these commands is available at this link. Also refer to "Developing and testing a Pig script" for more detail.

If you want to debug the whole script during execution, add the code below at the top of your script:

-- turn debug mode on
SET debug 'on';
-- give the job a name
SET job.name 'my job';

This runs your script in debug mode. More detail about the SET command is available at this link.

Sandeep Singh answered Nov 17 '22 02:11



When you say the script is stuck at 70%, I assume you mean the MR job is 70% complete.

It's best to look at the MR and YARN logs (and, if needed, the HDFS logs) at that point for more information about what MR/YARN is doing. In clusters managed by Cloudera Manager, logs can typically be found under /var/log/hadoop-mapreduce and /var/log/hadoop-hdfs. You may need to examine logs from multiple nodes in the cluster where YARN NodeManagers are running.

If your script is stuck on a Pig issue (i.e., an issue in Pig code, not MR/HDFS code), it is useful to increase the log4j logging level in Pig. For example, the command-line option pig -d DEBUG sets the logging level to DEBUG.

user3730028 answered Nov 17 '22 00:11
