I have a bucket on S3 that contains 1000 files, each about 1 GB. I would like to read a random sample of these files, say 5% of them. This is how I do it:
fileDF = sqlContext.jsonRDD(self.sc.textFile(self.path).sample(withReplacement=False, fraction=0.05, seed=42).repartition(160))
But it seems the above code will read all the files and then sample the lines, while I want to sample the files first and read only those. Could somebody help?
Use your favorite method to list the files under the path, take a sample of the names, and then use RDD union:
import pyspark
import random

sc = pyspark.SparkContext(appName="Sampler")

# List all files under the path, then randomly pick 5% of the file names
file_list = list_files(path)
desired_pct = 5
file_sample = random.sample(file_list, int(len(file_list) * desired_pct / 100))

# Union the per-file RDDs into a single RDD containing only the sampled files
file_sample_rdd = sc.emptyRDD()
for f in file_sample:
    file_sample_rdd = file_sample_rdd.union(sc.textFile(f))
sample_data_rdd = file_sample_rdd.repartition(160)
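As a side note, sc.textFile also accepts a comma-separated list of paths, so the union loop can be collapsed into a single call. A minimal sketch, assuming file_sample holds full s3:// URIs:

# Alternative: read all sampled files in one call instead of building a union loop
sample_data_rdd = sc.textFile(",".join(file_sample)).repartition(160)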
Here's one possible quick-and-dirty implementation of list_files that lists the files under a "directory" on S3:
import os

def list_files(path, profile=None):
    # Shell out to the AWS CLI and parse its listing output
    if not path.endswith("/"):
        raise Exception("not handled...")
    command = 'aws s3 ls %s' % path
    if profile is not None:
        command = 'aws --profile %s s3 ls %s' % (profile, path)
    result = os.popen(command)
    _r = result.read().strip().split('\n')
    # The file name is the last whitespace-separated field of each line;
    # prepend the original path to get a full URI
    _r = [path + i.strip().split(' ')[-1] for i in _r]
    return _r
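If you would rather not shell out to the AWS CLI, the same listing can be done with boto3. This is a minimal sketch, not part of the original answer; the function name list_files_boto3 is hypothetical, and it assumes you pass the bucket name and key prefix separately and that your credentials are already configured:

import boto3

def list_files_boto3(bucket, prefix):
    # List object keys under the prefix and return them as full s3:// URIs
    s3 = boto3.client("s3")
    keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            keys.append("s3://%s/%s" % (bucket, obj["Key"]))
    return keys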