How can I load a bunch of files from an S3 bucket into a single PySpark dataframe? I'm running on an EMR instance. If the files are local, I can use the SparkContext textFile method. But when the files are on S3, how can I use boto3 to load multiple files of various types (CSV, JSON, ...) into a single dataframe for processing?
Spark reads from S3 natively through the Hadoop filesystem APIs, so you don't need boto3 for this. Also, textFile returns an RDD, not a DataFrame. Finally, don't try to load two different formats into a single DataFrame, as you won't be able to parse them consistently.
I would suggest reading each format into its own DataFrame:
csvDf = spark.read.csv("s3a://path/to/files/*.csv")
jsonDf = spark.read.json("s3a://path/to/files/*.json")
From there, you can filter and join the DataFrames using Spark SQL.
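A minimal sketch of that step (the view names and the id join column are placeholders; adjust them to your actual schema):

# Register the DataFrames as temporary views so they can be queried with SQL
csvDf.createOrReplaceTempView("csv_data")
jsonDf.createOrReplaceTempView("json_data")

# Hypothetical join on a shared "id" column, keeping one extra JSON field
joinedDf = spark.sql("""
    SELECT c.*, j.event_type
    FROM csv_data c
    JOIN json_data j
      ON c.id = j.id
    WHERE c.id IS NOT NULL
""")
joinedDf.show()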
Note: by default, spark.read.json expects JSON Lines format, i.e. each file contains one JSON object per line.
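If your files instead hold pretty-printed, multi-line JSON documents, Spark 2.2+ supports the multiLine reader option:

jsonDf = spark.read.option("multiLine", True).json("s3a://path/to/files/*.json")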