I have a bucket on S3 that contains 1000 files, each about 1 GB. I would like to read a random sample of these files, say 5% of them. This is how I do it:
fileDF = sqlContext.jsonRDD(self.sc.textFile(self.path).sample(withReplacement=False, fraction=0.05, seed=42).repartition(160))
But it seems the above code will read all the files and then sample the lines, while I want to sample the files first and read only those. Could somebody help?
Use your favorite method to list the files under the path, take a sample of the names, and then use RDD union:
import pyspark
import random

sc = pyspark.SparkContext(appName="Sampler")

# List all files under the path, then randomly pick 5% of the file names
file_list = list_files(path)
desired_pct = 5
file_sample = random.sample(file_list, int(len(file_list) * desired_pct / 100))

# Union the per-file RDDs into a single RDD containing only the sampled files
file_sample_rdd = sc.emptyRDD()
for f in file_sample:
    file_sample_rdd = file_sample_rdd.union(sc.textFile(f))
sample_data_rdd = file_sample_rdd.repartition(160)
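As a side note, sc.textFile also accepts a comma-separated list of paths, so the union loop can be collapsed into a single call. A minimal sketch, assuming file_sample holds full s3:// URIs:

# Alternative: read all sampled files in one call instead of building a union loop
sample_data_rdd = sc.textFile(",".join(file_sample)).repartition(160)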
Here's one possible quick-and-dirty implementation of list_files that lists the files under a "directory" on S3:
import os

def list_files(path, profile=None):
    # Shell out to the AWS CLI and parse its listing output
    if not path.endswith("/"):
        raise Exception("not handled...")
    command = 'aws s3 ls %s' % path
    if profile is not None:
        command = 'aws --profile %s s3 ls %s' % (profile, path)
    result = os.popen(command)
    _r = result.read().strip().split('\n')
    # The file name is the last whitespace-separated field of each line;
    # prepend the original path to get a full URI
    _r = [path + i.strip().split(' ')[-1] for i in _r]
    return _r
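If you would rather not shell out to the AWS CLI, the same listing can be done with boto3. This is a minimal sketch, not part of the original answer; the function name list_files_boto3 is hypothetical, and it assumes you pass the bucket name and key prefix separately and that your credentials are already configured:

import boto3

def list_files_boto3(bucket, prefix):
    # List object keys under the prefix and return them as full s3:// URIs
    s3 = boto3.client("s3")
    keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            keys.append("s3://%s/%s" % (bucket, obj["Key"]))
    return keys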