Right now I have a log parser reading through 515 MB of plain-text files (a file for each day over the past 4 years). My code currently looks like this: http://gist.github.com/12978. I've used psyco (as seen in the code) and I'm also compiling it and using the compiled version. It's doing about 100 lines every 0.3 seconds. The machine is a standard 15" MacBook Pro (2.4 GHz Core 2 Duo, 2 GB RAM).
Is it possible for this to go faster, or is that a limitation of the language/database?
PostgreSQL tries to do much of its work in memory and to spread its disk writes out over time to minimize bottlenecks, but on an overloaded system with heavy writing, heavy reads and writes can easily slow the whole system down as it catches up on the backlog.
Because it is implemented in C, Psycopg2 is very fast and efficient. You can use it to fetch one or more rows from the database with a SQL query, and to insert data as well, with options for both single-row and batch inserts.
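A minimal sketch of both operations, assuming psycopg2, a placeholder connection string, and a hypothetical people table:

import psycopg2

conn = psycopg2.connect("dbname=logs")
cur = conn.cursor()

# Fetch one or more rows from a query.
cur.execute("SELECT id, name FROM people WHERE name = %(name)s", {'name': 'alice'})
row = cur.fetchone()  # or cur.fetchall() for every matching row

# Single insert with bound parameters.
cur.execute("INSERT INTO people (name) VALUES (%(name)s)", {'name': 'bob'})

# Batch insert: the same statement run against a sequence of parameter sets.
cur.executemany("INSERT INTO people (name) VALUES (%(name)s)",
                [{'name': 'carol'}, {'name': 'dave'}])

conn.commit()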
Ultimately, speed will depend on the way you're using the database. PostgreSQL tends to be faster for massive data sets, complicated queries, and mixed read-write workloads, while MySQL tends to be faster for read-only workloads.
Don't waste time profiling. The time is always in the database operations. Do as few as possible. Just the minimum number of inserts.
Three Things.
One. Don't SELECT over and over again to conform the Date, Hostname and Person dimensions. Fetch all the data ONCE into a Python dictionary and use it in memory. Don't do repeated singleton selects. Use Python.
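For example, a sketch of loading the dimension tables into dictionaries once at startup (psycopg2, with assumed table and column names; adapt to your schema):

import psycopg2

conn = psycopg2.connect("dbname=logs")
cur = conn.cursor()

# One query per dimension table, run once, cached as {natural key: id}.
cur.execute("SELECT name, id FROM hostnames")
hostnames = dict(cur.fetchall())

cur.execute("SELECT name, id FROM people")
people = dict(cur.fetchall())

cur.execute("SELECT day, id FROM dates")
dates = dict(cur.fetchall())

# In the parsing loop, resolve surrogate keys from memory instead of SELECTing:
# person_id = people[person_name]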
Two. Don't Update.
Specifically, do not do this. It's bad code for two reasons.
cursor.execute("UPDATE people SET chats_count = chats_count + 1 WHERE id = '%s'" % person_id)
It can be replaced with a simple SELECT COUNT(*) FROM ... . Never update to increment a count. Just count the rows that are there with a SELECT statement. [If you can't do this with a simple SELECT COUNT or SELECT COUNT(DISTINCT), you're missing some data -- your data model should always provide correct complete counts. Never update.]
And. Never build SQL using string substitution. Completely dumb.
If, for some reason, the SELECT COUNT(*) isn't fast enough (benchmark first, before doing anything lame), you can cache the result of the count in another table, AFTER all of the loads. Do a SELECT COUNT(*) FROM whatever GROUP BY whatever and insert this into a table of counts. Don't Update. Ever.
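A sketch of that post-load step (cur/conn as in the sketch above; the chats and chat_counts tables and their columns are assumed, not taken from the original code):

# After all the loads: derive the counts in one statement instead of
# maintaining them row by row with UPDATEs.
cur.execute("""
    INSERT INTO chat_counts (person_id, chats_count)
    SELECT person_id, COUNT(*)
    FROM chats
    GROUP BY person_id
""")
conn.commit()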
Three. Use Bind Variables. Always.
cursor.execute( "INSERT INTO ... VALUES( %(x)s, %(y)s, %(z)s )", {'x':person_id, 'y':time_to_string(time), 'z':channel,} )
The SQL never changes. The values bound in change, but the SQL never changes. This is MUCH faster. Never build SQL statements dynamically. Never.
In the for loop, you're inserting into the 'chats' table repeatedly, so you only need a single SQL statement with bind variables, executed with different values. So you could put this before the for loop:
insert_statement="""
INSERT INTO chats(person_id, message_type, created_at, channel)
VALUES(:person_id,:message_type,:created_at,:channel)
"""
Then, in place of each SQL statement you currently execute, put this:
cursor.execute(insert_statement, person_id='person',message_type='msg',created_at=some_date, channel=3)
This will make things run faster because the statement only needs to be prepared once; only the bound values change from one execution to the next.
Note: The bind variable syntax I used is Oracle-specific. You'll have to check the psycopg2 library's documentation for the exact syntax.
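For reference, psycopg2 uses the %(name)s placeholder style (its paramstyle is 'pyformat'), as in the %(x)s example earlier in this thread. A sketch of the same loop in that style, with illustrative row values:

import psycopg2

conn = psycopg2.connect("dbname=logs")
cur = conn.cursor()

# The statement text never changes; only the bound values do.
insert_statement = """
    INSERT INTO chats (person_id, message_type, created_at, channel)
    VALUES (%(person_id)s, %(message_type)s, %(created_at)s, %(channel)s)
"""

rows = [
    {'person_id': 1, 'message_type': 'msg', 'created_at': '2008-09-25 12:00:00', 'channel': 3},
    {'person_id': 2, 'message_type': 'msg', 'created_at': '2008-09-25 12:00:05', 'channel': 3},
]

for row in rows:
    cur.execute(insert_statement, row)

# Or hand the whole batch to the driver at once:
# cur.executemany(insert_statement, rows)

conn.commit()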
Other optimizations:
As Mark suggested, use bind variables. The database only has to prepare each statement once, then "fill in the blanks" for each execution. As a nice side effect, it automatically takes care of string-quoting issues (which your program isn't handling).
Turn transactions on (if they aren't already) and do a single commit at the end of the program. The database won't have to write anything to disk until all the data needs to be committed. And if your program encounters an error, none of the rows will be committed, allowing you to simply re-run the program once the problem has been corrected.
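A sketch of the single-commit pattern with psycopg2, which opens a transaction implicitly (autocommit is off by default); load_everything() is a placeholder for your parsing and insert loop:

import psycopg2

conn = psycopg2.connect("dbname=logs")
cur = conn.cursor()

try:
    load_everything(cur)  # placeholder for all the INSERTs in the run
    conn.commit()         # one commit at the very end
except Exception:
    conn.rollback()       # nothing is persisted if anything went wrong
    raise
finally:
    conn.close()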
Your log_hostname, log_person, and log_date functions are doing needless SELECTs on the tables. Make the appropriate table attributes PRIMARY KEY or UNIQUE. Then, instead of checking for the presence of the key before you INSERT, just do the INSERT. If the person/date/hostname already exists, the INSERT will fail from the constraint violation. (This won't work if you use a transaction with a single commit, as suggested above.)
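A sketch of the just-INSERT approach (assumed hostnames table with a UNIQUE name column); wrapping each INSERT in a savepoint is one way to keep a duplicate from aborting the rest of the transaction if you do combine it with a single final commit:

import psycopg2

conn = psycopg2.connect("dbname=logs")
cur = conn.cursor()

def insert_hostname(name):
    cur.execute("SAVEPOINT before_insert")
    try:
        cur.execute("INSERT INTO hostnames (name) VALUES (%(name)s)",
                    {'name': name})
    except psycopg2.IntegrityError:
        # Already present; undo just this statement and keep going.
        cur.execute("ROLLBACK TO SAVEPOINT before_insert")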
Alternatively, if you know you're the only one INSERTing into the tables while your program is running, then create parallel data structures in memory and maintain them in memory while you do your INSERTs. For example, read all the hostnames from the table into an associative array at the start of the program. When you want to know whether to do an INSERT, just do an array lookup. If no entry is found, do the INSERT and update the array appropriately. (This suggestion is compatible with transactions and a single commit, but requires more programming. It'll be wickedly faster, though.)
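A sketch of that in-memory cache (table and column names assumed; PostgreSQL's RETURNING clause supplies the new id without a second query):

import psycopg2

conn = psycopg2.connect("dbname=logs")
cur = conn.cursor()

# Read the existing hostnames once at startup.
cur.execute("SELECT name, id FROM hostnames")
hostname_ids = dict(cur.fetchall())

def hostname_id(name):
    # Dictionary lookup instead of a SELECT before every INSERT.
    if name not in hostname_ids:
        cur.execute("INSERT INTO hostnames (name) VALUES (%(name)s) RETURNING id",
                    {'name': name})
        hostname_ids[name] = cur.fetchone()[0]
    return hostname_ids[name]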