I have a data set of books and authors, with a many-to-many relationship.
There are about 10^6 books and 10^5 authors, with an average of 10 authors per book.
I need to perform a series of operations on the data set, such as counting the number of books by each author or deleting all books by a certain author from the set.
What would be a good data structure that will allow fast handling?
I'm hoping for some ready made module that can provide methods along the lines of:
obj.books.add(book1)
# linking
obj.books[n].author = author1
obj.authors[m].author = book1
# deleting
obj.remove(author1) # should automatically remove all links to the books by author1, but not the linked books
I should clarify that I prefer not to use a database for this, but to do it all in memory.
Thanks
sqlite3 (or any other good relational DB, but sqlite
comes with Python and is handier for such a reasonably small set of data) seems the right approach for your task. If you'd rather not learn SQL, SQLAlchemy is a popular "wrapper" over relational DBs, so to speak, that allows you to deal with them at any of several different abstraction levels of your choice.
And "doing it all in memory" is no problem at all (it's silly, mind you, since you'll needlessly pay the overhead of reading in all the data from somewhere more persistent on each and every run of your program, while keeping the DB on a disk file would save you that overhead -- but, that's a different issue;-). Just open your sqlite database as ':memory:'
and there you are -- a fresh, new relational DB living entirely in memory (for the duration of your process only), no disk involved in the procedure at all. So, why not?-)
Personally, I'd use SQL directly for this task -- it gives me excellent control of exactly what's going on, and easily lets me add or remove indices to tweak performance, etc. You'd use three tables: a Books
table (primary key ID, other fields such as Title &c), an Authors
table (primary key ID, other fields such as Name &c), and a "many-to-many relationship table", say BookAuthors
, with just two fields, BookID
and AuthorID
, and one record per author-book connection.
The two fields of the BookAuthors
table are what's known as "foreign keys", referring respectively to the ID fields of Books and Authors, and you can define them with an ON DELETE CASCADE
so that records referring to a book or author that gets deleted are automatically dropped in turn -- an example of the high semantic level at which even "bare" SQL lets you work, which no other existing data structure can come close to matching.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With