
How do explicit table partitions in Databricks affect write performance?

We have the following scenario:

  • We have an existing table containing approx. 15 billion records. It was not explicitly partitioned on creation.
  • We are creating a copy of this table with partitions, hoping for faster read time on certain types of queries.
  • Our tables are on Databricks Cloud, and we use Databricks Delta.
  • We commonly filter by two columns, one of which is the ID of an entity (350k distinct values) and one of which is the date at which an event occurred (31 distinct values so far, but increasing every day!).

So, in creating our new table, we ran a query like this:

CREATE TABLE the_new_table
USING DELTA
PARTITIONED BY (entity_id, date)
AS SELECT
  entity_id,
  another_id,
  from_unixtime(timestamp) AS timestamp,
  CAST(from_unixtime(timestamp) AS DATE) AS date
FROM the_old_table

This query has run for 48 hours and counting. We know that it is making progress, because we have found around 250k prefixes corresponding to the first partition key in the relevant S3 prefix, and there are certainly some big files in the prefixes that exist.

However, we're having some difficulty monitoring exactly how much progress has been made, and how much longer we can expect this to take.

While we waited, we tried out a query like this:

CREATE TABLE a_test_table (
  entity_id STRING,
  another_id STRING,
  timestamp TIMESTAMP,
  date DATE
)
USING DELTA
PARTITIONED BY (date);

INSERT INTO a_test_table
SELECT
  entity_id,
  another_id,
  from_unixtime(timestamp) AS timestamp,
  CAST(from_unixtime(timestamp) AS DATE) AS date
FROM the_old_table
  WHERE CAST(from_unixtime(timestamp) AS DATE) = '2018-12-01'

Notice that the main difference in the new table's schema is that we partitioned only on date, not on entity ID. The date we chose contains almost exactly four percent of the old table's data, which I want to point out because it is much more than 1/31. Of course, since we are selecting a single value of the very column we partitioned on, we are in effect writing only one partition, versus the hundred thousand or so that the full query probably writes.

The creation of this test table took 16 minutes using the same number of worker-nodes, so we would expect (based on this) that the creation of a table 25x larger would only take around 7 hours.
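The extrapolation above can be reproduced in a few lines of Python. This is only a sketch: it assumes write time scales linearly with the amount of data, which is exactly the assumption the 48-hour run calls into question. The variable names are illustrative.

```python
# Figures taken from the test described above.
test_minutes = 16        # time to write the single-date test table
test_fraction = 0.04     # share of the old table's rows in that date

# Assumption: write time grows linearly with the amount of data written.
scale_factor = 1 / test_fraction
estimated_hours = test_minutes * scale_factor / 60

print(f"scale factor: {scale_factor:.0f}x")
print(f"estimated full write: {estimated_hours:.1f} hours")
```

The result is a little under 7 hours, which matches the estimate in the text.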

This answer appears to partially acknowledge that using too many partitions can cause the problem, but the underlying causes appear to have greatly changed in the last couple of years, so we seek to understand what the current issues might be; the Databricks docs have not been especially illuminating.

Based on the posted request rate guidelines for S3, it seems like increasing the number of partitions (key prefixes) should improve performance. That more partitions would be detrimental seems counter-intuitive.

In summary: we are expecting to write many thousands of records into each of many thousands of partitions. It appears that reducing the number of partitions dramatically reduces the amount of time it takes to write the table data. Why would this be true? Are there any general guidelines on the number of partitions that should be created for data of a certain size?
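For scale, here is a back-of-the-envelope calculation using the figures given earlier in the question. It is a sketch of an upper bound only: it assumes every entity has at least one event on every date, which is probably not true, so the real partition count is likely lower.

```python
entity_ids = 350_000          # distinct entity_id values (from the question)
dates = 31                    # distinct dates so far (from the question)
total_rows = 15_000_000_000   # approximate size of the old table

# Upper bound: every entity has at least one event on every date.
max_partitions = entity_ids * dates
rows_per_partition = total_rows / max_partitions

print(f"at most {max_partitions:,} partitions")
print(f"about {rows_per_partition:,.0f} rows per partition at that bound")
```

At that bound each (entity_id, date) partition would hold only a little over a thousand rows, and the partition count keeps growing with every new date.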

Jesse Amano asked Feb 23 '19 02:02


1 Answer

You should partition your data by date because it sounds like you are continually adding data as time passes chronologically. This is the generally accepted approach to partitioning time series data. It means that you will be writing to one date partition each day, and your previous date partitions are not updated again (a good thing).

You can of course use a secondary partition key if your use case benefits from it (e.g. PARTITIONED BY (date, entity_id)).

Partitioning by date means that, to get the best performance, your queries should always filter by date as well. If this is not your use case, then you would have to clarify your question.

How many partitions?

No one can give you a definitive answer on how many partitions you should use, because every data set (and processing cluster) is different. What you do want to avoid is "data skew", where one worker has to process a huge amount of data while other workers sit idle. In your case that would happen if, for example, a single entity ID accounted for 20% of your data set. Partitioning by date assumes that each day has roughly the same amount of data, so that each worker is kept equally busy.
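One rough way to spot skew is to compare the largest partition with the average partition. The sketch below uses invented row counts purely for illustration; `skew_ratio` is a hypothetical helper, not part of any Spark or Databricks API.

```python
from statistics import mean

def skew_ratio(rows_per_key):
    """Return the largest partition's size divided by the mean partition size."""
    sizes = list(rows_per_key.values())
    return max(sizes) / mean(sizes)

# Hypothetical row counts per date partition, invented for illustration.
balanced = {"2018-12-01": 100, "2018-12-02": 110, "2018-12-03": 90}
skewed = {"2018-12-01": 100, "2018-12-02": 900, "2018-12-03": 50}

print(skew_ratio(balanced))
print(skew_ratio(skewed))
```

A ratio close to 1 suggests the work is spread evenly. A ratio well above 1 suggests that whichever worker handles the largest partition will still be busy after the others have finished.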

I don't know specifically how Databricks writes to disk, but on Hadoop I would want to see each worker node writing its own file part, so that your write performance is parallelized at this level.
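To illustrate why the number of file parts matters, here is a toy model in Python. It assumes that each write task produces one file for every partition value present in its share of the rows. That matches my understanding of how Spark generally handles partitioned writes, but whether Databricks Delta behaves this way is an assumption here. The task and key counts are made up.

```python
def count_output_files(rows):
    """Count distinct (task, partition key) pairs, assuming one file per pair."""
    return len({(task, key) for task, key in rows})

NUM_TASKS = 4   # hypothetical number of parallel write tasks
NUM_KEYS = 5    # hypothetical number of partition values
NUM_ROWS = 40

# Case 1: rows reach tasks in arbitrary order, so every task sees every key.
scattered = [(i % NUM_TASKS, i % NUM_KEYS) for i in range(NUM_ROWS)]

# Case 2: rows are grouped by key first, so each key goes to exactly one task.
grouped = [((i % NUM_KEYS) % NUM_TASKS, i % NUM_KEYS) for i in range(NUM_ROWS)]

print(count_output_files(scattered))
print(count_output_files(grouped))
```

In this model the scattered case produces tasks times keys files, while the grouped case produces one file per key. With hundreds of thousands of partition values, the scattered case would mean a very large number of small files, which is one plausible reason a heavily partitioned write is slow.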

Brad answered Sep 18 '22 14:09
