 

Sampling a large distributed data set using pyspark / spark

I have a file in hdfs which is distributed across the nodes in the cluster.

I'm trying to get a random sample of 10 lines from this file.

In the pyspark shell, I read the file into an RDD using:

>>> textFile = sc.textFile("/user/data/myfiles/*")

Then I simply want to take a sample. The nice thing about Spark is that there are commands like takeSample; unfortunately, I think I'm doing something wrong, because the following takes a really long time:

>>> textFile.takeSample(False, 10, 12345)

So I tried creating a partition on each node and then instructing each node to sample its partition using the following command:

>>> textFile.partitionBy(4).mapPartitions(lambda blockOfLines: blockOfLines.takeSample(False, 10, 1234)).first()

but this gives the error ValueError: too many values to unpack:

org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/cloudera/parcels/CDH-5.0.2-1.cdh5.0.2.p0.13/lib/spark/python/pyspark/worker.py", line 77, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/opt/cloudera/parcels/CDH-5.0.2-1.cdh5.0.2.p0.13/lib/spark/python/pyspark/serializers.py", line 117, in dump_stream
    for obj in iterator:
  File "/opt/cloudera/parcels/CDH-5.0.2-1.cdh5.0.2.p0.13/lib/spark/python/pyspark/rdd.py", line 821, in add_shuffle_key
    for (k, v) in iterator:
ValueError: too many values to unpack

How can I sample 10 lines from a large distributed data set using spark or pyspark?

asked Jul 17 '14 by mgoldwasser

People also ask

Can PySpark handle big data?

Yes. Using the Spark Python API, PySpark, you can leverage parallel computation on large datasets and build toward high-performance machine learning. From cleaning data to creating features and implementing machine learning models, you can execute end-to-end workflows with Spark.

How do you process terabytes of data in Spark?

This is defined by the Hadoop FileSystem used to read the files, and it applies to files from other filesystems that Spark reads through Hadoop. It also depends on any repartitioning you do: you can call repartition(N) or coalesce(N) on an RDD or a DataFrame to change the number of partitions and thus their size.
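A rough sketch of those two calls in the pyspark shell; the partition counts 200 and 50 are made-up values for illustration:

>>> rdd = sc.textFile("/user/data/myfiles/*")   # partition count comes from the Hadoop input splits
>>> rdd.getNumPartitions()                      # inspect how many partitions were created
>>> wider = rdd.repartition(200)                # full shuffle into 200 partitions
>>> narrower = rdd.coalesce(50)                 # merge down to 50 partitions without a full shuffle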

How do you take a sample of data in PySpark?

You can do stratified sampling in PySpark without replacement by using the sampleBy() method. It takes a sampling fraction for each stratum; if a stratum is not specified, its fraction defaults to zero. The fractions argument is a dictionary mapping each stratum key to its sampling fraction.
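A hedged sketch of what that might look like in a newer pyspark shell (Spark 2.0+, where the built-in spark session is available); the "label" column and the 0.1 / 0.2 fractions are made-up values for illustration:

>>> from pyspark.sql.functions import col
>>> df = spark.range(0, 1000).withColumn("label", col("id") % 2)           # toy DataFrame with a stratum column
>>> sampled = df.sampleBy("label", fractions={0: 0.1, 1: 0.2}, seed=42)    # ~10% of label 0, ~20% of label 1
>>> sampled.show(5)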

Is PySpark good for data science?

PySpark is a big data solution for real-time streaming with the Python programming language, and it provides an efficient way to perform all kinds of calculations and computations.


1 Answer

Try using textFile.sample(False, fraction, seed) instead. takeSample will generally be very slow because it calls count() on the RDD. It needs to do this because otherwise it wouldn't sample evenly from each partition: it uses the count, together with the sample size you asked for, to compute a fraction and then calls sample internally. sample is fast because it just uses a random Boolean generator that returns True fraction percent of the time, so it doesn't need to call count.
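For the original question, a minimal sketch in the pyspark shell, assuming the file was read as above; the 0.0001 fraction is a made-up value you would tune to your data size:

>>> textFile = sc.textFile("/user/data/myfiles/*")
>>> sampledRDD = textFile.sample(False, 0.0001, 12345)   # withReplacement=False, fraction, seed -- no count() needed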

In addition (I don't think this is happening to you), if the sample returned is not big enough, takeSample calls sample again, which can obviously slow it down further. Since you should have some idea of the size of your data, I would recommend calling sample and then cutting the sample down to size yourself, since you know more about your data than Spark does.
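A sketch of "cut the sample down to size yourself", assuming the file holds very roughly 100 million lines (a made-up figure): oversample by about 100x so the sample almost surely contains at least 10 lines, then trim on the driver:

>>> approxTotal = 100000000                       # your rough estimate of the line count
>>> fraction = 10 * 100.0 / approxTotal           # aim for ~1000 sampled lines
>>> tenLines = textFile.sample(False, fraction, 12345).take(10)   # trim to exactly 10 lines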

answered Sep 16 '22 by aaronman