
How to print an RDD in Python in Spark

I have two files on HDFS and I just want to join these two files on a column, say employee id.

I am trying to simply print the files to make sure we are reading them correctly from HDFS.

lines = sc.textFile("hdfs://ip:8020/emp.txt")
print lines.count()

I have tried foreach and println functions as well and I am not able to display file data. I am working in python and totally new to both python and spark as well.

asked Oct 09 '15 by yguw


1 Answer

This is really easy: just do a collect. You must be sure that all the data fits in memory on your master:

my_rdd = sc.parallelize(xrange(10000000))
print my_rdd.collect()

If that is not the case, you can just take a sample using the take method:

# I use an exaggerated number to remind you it is very large and won't fit
# in memory on your master, so collect wouldn't work
my_rdd = sc.parallelize(xrange(100000000000000000))
print my_rdd.take(100)

(A further example run in an .ipynb notebook was shown as a screenshot.)

answered Oct 08 '22 by Alberto Bonsanto