Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient approach for mass update of a hierarchical table

I have a database table that represents a hierarchy of files and directories, with the following structure (simplified):

ItemId        int
Path          text
Type          int        (0 for files, 1 for directories)
ParentId      int
BackupTime    datetime

Currently the BackupTime column is only used for files, it is set to null for directories.

Now I need to fill this column for directories as well: it must be the minimum BackupTime of all descendants (files and directories).

This (naive and inefficient) query illustrates what I want to do:

update Items i
set BackupTime = (select min(BackupTime)
                  from Items d
                  where d.Path like i.Path || '%'
                  and d.Type = 0)
where i.Type = 1

My problem is that I can't seem to find an efficient approach. The query above takes much too long on large volumes of data (this table often contains more than 100K rows)

It would probably be faster to search the min(BackupTime) only on direct children:

update Items i
set BackupTime = (select min(BackupTime)
                  from Items d
                  where d.ParentId = i.ItemId)
where i.Type = 1

But for this to work, I must ensure that descendants will be updated before their ancestors, so I must walk the hierarchy recursively from bottom up. The problem is that I have no easy way of knowing which items are the deepest in the hierarchy. I'm using SQLite, so I can't use hierarchical queries.

Any idea on how to do this efficiently?

Ideally, I'd prefer to be able to do it in a single UPDATE query, but if it's not possible I'm open to other options, as long as they're efficient

like image 455
Thomas Levesque Avatar asked Nov 14 '22 06:11

Thomas Levesque


1 Answers

This is a shot in the dark, but it might work. It's an attempt to handle the bottom-up issue manually. (I don't know sqlite's limitations, but this is probably standard SQL-92 and hopefully ok.)

Step 1: Decide how you want to handle empty directories. I think the solution here works only if there are no empty directories or if empty directories are initially updated so they have an artificial non-NULL BackupTime. (What that artificial BackupTime should be may be important, depending on how you maintain the BackupDate column when there are changes to your data. Using the current date or an artificial future date might work, but you should think about it.)

Step 2. Execute the following query repeatedly until no more rows are affected:

  update Items i set
    BackupTime = (
      select min(BackupTime)
      from Items d
      where d.ParentId = i.ItemId
    )
  where i.Type = 1
  and i.BackupTime is null
  and not exists (
    select *
    from Items d
    where d.ParentId = i.ItemId
    and d.Type = 1
    and d.BackupTime is null
  )

In other words, update the BackupTime for directories when you need to and also have all the information: when their BackupTime is null and they contain no subdirectories whose BackupTime value is also null.

So the first time you run this, it will set the BackupTime for all directories that contain no subdirectories, only files. The second time, it will set the BackupTime for directories that contain subdirectories, but no sub-subdirectories.

You might be able to handle the empty directory problem by setting BackupTime to coalesce((select...),current_timestamp).

like image 138
Steve Kass Avatar answered Dec 10 '22 07:12

Steve Kass