I have a database table that represents a hierarchy of files and directories, with the following structure (simplified):
ItemId int Path text Type int (0 for files, 1 for directories) ParentId int BackupTime datetime
Currently the BackupTime
column is only used for files, it is set to null for directories.
Now I need to fill this column for directories as well: it must be the minimum BackupTime
of all descendants (files and directories).
This (naive and inefficient) query illustrates what I want to do:
update Items i
set BackupTime = (select min(BackupTime)
from Items d
where d.Path like i.Path || '%'
and d.Type = 0)
where i.Type = 1
My problem is that I can't seem to find an efficient approach. The query above takes much too long on large volumes of data (this table often contains more than 100K rows)
It would probably be faster to search the min(BackupTime)
only on direct children:
update Items i
set BackupTime = (select min(BackupTime)
from Items d
where d.ParentId = i.ItemId)
where i.Type = 1
But for this to work, I must ensure that descendants will be updated before their ancestors, so I must walk the hierarchy recursively from bottom up. The problem is that I have no easy way of knowing which items are the deepest in the hierarchy. I'm using SQLite, so I can't use hierarchical queries.
Any idea on how to do this efficiently?
Ideally, I'd prefer to be able to do it in a single UPDATE query, but if it's not possible I'm open to other options, as long as they're efficient
This is a shot in the dark, but it might work. It's an attempt to handle the bottom-up issue manually. (I don't know sqlite's limitations, but this is probably standard SQL-92 and hopefully ok.)
Step 1: Decide how you want to handle empty directories. I think the solution here works only if there are no empty directories or if empty directories are initially updated so they have an artificial non-NULL BackupTime. (What that artificial BackupTime should be may be important, depending on how you maintain the BackupDate column when there are changes to your data. Using the current date or an artificial future date might work, but you should think about it.)
Step 2. Execute the following query repeatedly until no more rows are affected:
update Items i set
BackupTime = (
select min(BackupTime)
from Items d
where d.ParentId = i.ItemId
)
where i.Type = 1
and i.BackupTime is null
and not exists (
select *
from Items d
where d.ParentId = i.ItemId
and d.Type = 1
and d.BackupTime is null
)
In other words, update the BackupTime for directories when you need to and also have all the information: when their BackupTime is null and they contain no subdirectories whose BackupTime value is also null.
So the first time you run this, it will set the BackupTime for all directories that contain no subdirectories, only files. The second time, it will set the BackupTime for directories that contain subdirectories, but no sub-subdirectories.
You might be able to handle the empty directory problem by setting BackupTime to coalesce((select...),current_timestamp).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With