I have data that I would like to do a lot of analytic queries on and I'm trying to figure out if there is a mechanism I can use to store it so that Spark can efficiently do joins on it. I have a solution using RedShift, but would ideally prefer to have something that is based on files in S3 instead of having a whole RedShift cluster up 24/7.
This is a simplified example. We have 2 initial CSV files.
The two tables are linked via the person_id field. person_id is unique in the Person table. Events have a many-to-one relationship with person.
I'd like to understand how to set up the data so I can efficiently perform the following query. I will need to perform many queries like this (all queries are evaluated on a per person basis):
The query is to produce a data frame with 4 columns, and 1 row for every person.
All current solutions I have with Spark to this problem involve reshuffling all the data, which makes the process slow for large volumes (hundreds of millions of people). I am happy with a solution that requires me to reshuffle the data and write it to a different format once, if that then speeds up later queries.
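For reference, here is roughly what such a one-time rewrite could look like in Spark using bucketed Parquet tables (a sketch only; the S3 paths, the bucket count of 200, and the table names are placeholders, not part of my actual setup). Bucketing both tables by person_id with the same number of buckets is what lets later joins on person_id avoid a full shuffle:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("one-time-bucketed-write").getOrCreate()

// One-time cost: read the raw CSVs and rewrite them bucketed and sorted by person_id.
val person = spark.read.option("header", "true").csv("s3://my-bucket/person.csv")
val events = spark.read.option("header", "true").csv("s3://my-bucket/events.csv")

person.write
  .bucketBy(200, "person_id")
  .sortBy("person_id")
  .format("parquet")
  .saveAsTable("person_bucketed") // bucketBy requires saveAsTable (a metastore-backed table)

events.write
  .bucketBy(200, "person_id")
  .sortBy("person_id")
  .format("parquet")
  .saveAsTable("events_bucketed")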
I can accomplish this solution using RedShift in a fairly straightforward way:
Both files are loaded in as RedShift tables, with DISTKEY person_id and SORTKEY person_id. This distributes the data so that all the data for a person is on a single node. The following query will produce the desired data frame:
select person_id, age, e.cost
from person
left join (select person_id, sum(cost) as cost
           from events
           where date between '2013-06-01' and '2013-06-30'
           group by person_id) as e using (person_id)
I have thought of several potential ways to handle this in Spark, but none accomplishes what I need. My ideas and the issues are listed below:
For my use cases, Spark has advantages over RedShift which aren't obvious in this simple example, so I'd prefer to do this with Spark. Please let me know if I am missing something and there is a good approach to this.
Edited per comment.
Assumptions:
Here's what I would try:
val eventAgg = spark.sql("""select person_id, sum(cost) as cost
from events
where date between '2013-06-01' and '2013-06-30'
group by person_id""")
eventAgg.cache.count
val personDF = spark.sql("""SELECT person_id, age from person""")
personDF.cache.count // cache is less important here, so feel free to omit
eventAgg.join(personDF, "person_id", "left")
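The spark.sql calls above assume that events and person are already registered as temp views. A minimal way to do that, assuming the data sits in Parquet on S3 (the paths here are placeholders):

// Register the source data as temp views so the SQL above can reference them.
// Paths are placeholders; point them at wherever the data actually lives.
spark.read.parquet("s3://my-bucket/events/").createOrReplaceTempView("events")
spark.read.parquet("s3://my-bucket/person/").createOrReplaceTempView("person")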
I just did this with some of my data and here's how it went (9-node cluster, 140 vCPUs, ~600 GB RAM):
27,000,000,000 "events" (aggregated to 14,331,487 "people")
64,000,000 "people" (~20 columns)
aggregated events building and caching took ~3 min
people caching took ~30 seconds (pulling from network, not parquet)
left joining took several seconds
Not caching the "people" led to the join taking a few seconds longer. Then forcing spark to broadcast the couple hundred MB aggregated events made the join take under 1 second.