
How to print an RDD in Python in Spark

I have two files on HDFS and I just want to join these two files on a column, say employee id.

I am trying to simply print the files to make sure we are reading them correctly from HDFS.

lines = sc.textFile("hdfs://ip:8020/emp.txt")
print lines.count()

I have tried foreach and println functions as well and I am not able to display file data. I am working in python and totally new to both python and spark as well.

asked Oct 09 '15 by yguw


1 Answer

This is really easy: just do a collect. You must be sure that all the data fits in memory on your master:

my_rdd = sc.parallelize(xrange(10000000))
print my_rdd.collect()

If that is not the case, you can just take a sample using the take method:

# I use an exaggerated number to remind you it is very large and won't fit
# in memory on your master, so collect wouldn't work
my_rdd = sc.parallelize(xrange(100000000000000000))
print my_rdd.take(100)

(A further example run in an .ipynb notebook was shown as a screenshot.)

answered Oct 08 '22 by Alberto Bonsanto