How do you make Python / PostgreSQL faster?

Right now I have a log parser reading through 515 MB of plain-text files (one file for each day over the past 4 years). My code currently stands at this: http://gist.github.com/12978. I've used Psyco (as seen in the code) and I'm also compiling it and using the compiled version. It's processing about 100 lines every 0.3 seconds. The machine is a standard 15" MacBook Pro (2.4 GHz Core 2 Duo, 2 GB RAM).

Is it possible for this to go faster or is that a limitation on the language/database?

Ryan Bigg asked Sep 25 '08


People also ask

Why is PostgreSQL so slow?

PostgreSQL attempts to do much of its work in memory and to spread disk writes out over time to minimize bottlenecks, but on an overloaded system with heavy write activity, reads and writes can easily cause the whole system to slow down as it catches up on the demand.

Is Psycopg2 fast?

Due to its C implementation, Psycopg2 is very fast and efficient. You can use Psycopg2 to fetch one or more rows from the database based on a SQL query. If you want to insert some data into the database, that's also possible with this library — with multiple options for single or batch inserting.

Is PostgreSQL faster?

Ultimately, speed will depend on the way you're using the database. PostgreSQL is known to be faster for handling massive data sets, complicated queries, and read-write operations. Meanwhile, MySQL is known to be faster with read-only commands.


3 Answers

Don't waste time profiling. The time is always in the database operations. Do as few as possible. Just the minimum number of inserts.

Three Things.

One. Don't SELECT over and over again to conform the Date, Hostname and Person dimensions. Fetch all the data ONCE into a Python dictionary and use it in memory. Don't do repeated singleton selects. Use Python.
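For illustration, a minimal sketch of that approach, assuming psycopg2 and guessing at the table and column names from the code in the question:

import psycopg2

conn = psycopg2.connect("dbname=logs")  # connection details assumed
cursor = conn.cursor()

# Fetch each small dimension table into memory once, up front.
cursor.execute("SELECT name, id FROM people")
people = dict(cursor.fetchall())       # name -> id

cursor.execute("SELECT name, id FROM hostnames")
hostnames = dict(cursor.fetchall())    # name -> id

# Inside the parsing loop, a dictionary lookup replaces a round trip:
person_id = people.get(person_name)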

Two. Don't Update.

Specifically, do not do this. It's bad code for two reasons.

cursor.execute("UPDATE people SET chats_count = chats_count + 1 WHERE id = '%s'" % person_id)

It can be replaced with a simple SELECT COUNT(*) FROM ... . Never update to increment a count. Just count the rows that are there with a SELECT statement. [If you can't do this with a simple SELECT COUNT or SELECT COUNT(DISTINCT), you're missing some data -- your data model should always provide correct complete counts. Never update.]

And. Never build SQL using string substitution. Completely dumb.

If, for some reason, the SELECT COUNT(*) isn't fast enough (benchmark first, before doing anything lame), you can cache the result of the count in another table, AFTER all of the loads. Do a SELECT COUNT(*) FROM whatever GROUP BY whatever and insert the result into a table of counts. Don't Update. Ever.
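As a sketch, assuming a separate counts table exists (the table and column names here are illustrative):

# After all loads: derive the counts once, in the database, in one pass.
cursor.execute("""
    INSERT INTO chat_counts (person_id, chats_count)
    SELECT person_id, COUNT(*)
    FROM chats
    GROUP BY person_id
""")
conn.commit()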

Three. Use Bind Variables. Always.

cursor.execute( "INSERT INTO ... VALUES( %(x)s, %(y)s, %(z)s )", {'x':person_id, 'y':time_to_string(time), 'z':channel,} )

The SQL never changes. The values bound in change, but the SQL never changes. This is MUCH faster. Never build SQL statements dynamically. Never.

S.Lott answered Oct 24 '22


In the for loop, you're inserting into the 'chats' table repeatedly, so you only need a single SQL statement with bind variables, executed with different values. So you could put this before the for loop:

insert_statement="""
    INSERT INTO chats(person_id, message_type, created_at, channel)
    VALUES(:person_id,:message_type,:created_at,:channel)
"""

Then, in place of each SQL statement you currently execute, use this:

cursor.execute(insert_statement, person_id='person', message_type='msg', created_at=some_date, channel=3)

This will make things run faster because:

  1. The cursor object won't have to reparse the statement each time.
  2. The db server won't have to generate a new execution plan, as it can use the one it created previously.
  3. You won't have to call sanitize(), as special characters in the bind variables won't be part of the SQL statement that gets executed.

Note: The bind variable syntax I used is Oracle-specific. You'll have to check the psycopg2 library's documentation for the exact syntax.
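For comparison, psycopg2 uses %(name)s named placeholders rather than Oracle's :name style, and its executemany() can push a whole batch of rows through one statement. A sketch (rows is an assumed list of parameter dictionaries built during parsing):

insert_statement = """
    INSERT INTO chats (person_id, message_type, created_at, channel)
    VALUES (%(person_id)s, %(message_type)s, %(created_at)s, %(channel)s)
"""

# One row at a time...
cursor.execute(insert_statement, {
    'person_id': person_id,
    'message_type': 'msg',
    'created_at': some_date,
    'channel': 3,
})

# ...or an entire batch at once:
cursor.executemany(insert_statement, rows)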

Other optimizations:

  1. You're incrementing with "UPDATE people SET chats_count" after each loop iteration. Instead, keep a dictionary mapping each user to a chat count, and then execute one UPDATE per user with the total you've seen (see the sketch after this list). This will be faster than hitting the db after every record.
  2. Use bind variables on ALL your queries, not just the insert statement; I chose that as an example.
  3. Change all the find_*() functions that do db lookups to cache their results so they don't have to hit the db every time.
  4. Psyco optimizes Python programs that perform a large number of numeric operations. This script is I/O-bound, not CPU-bound, so I wouldn't expect it to give you much optimization, if any.
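A sketch of points 1 and 3 combined, assuming the cursor from earlier and guessing at column names:

from collections import defaultdict

chat_counts = defaultdict(int)   # person_id -> chats seen this run
person_cache = {}                # name -> person_id

def find_person(name):
    # Point 3: each distinct name hits the database at most once.
    if name not in person_cache:
        cursor.execute("SELECT id FROM people WHERE name = %s", (name,))
        row = cursor.fetchone()
        person_cache[name] = row[0] if row else None
    return person_cache[name]

# Point 1: inside the parsing loop, count in memory...
chat_counts[person_id] += 1

# ...and after the loop, issue one UPDATE per person, not per log line.
for pid, count in chat_counts.items():
    cursor.execute(
        "UPDATE people SET chats_count = chats_count + %s WHERE id = %s",
        (count, pid),
    )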
Mark Roddy answered Oct 25 '22


As Mark suggested, use bind variables. The database only has to prepare each statement once, then "fill in the blanks" for each execution. As a nice side effect, it will automatically take care of string-quoting issues (which your program isn't handling).

Turn transactions on (if they aren't already) and do a single commit at the end of the program. The database won't have to write anything to disk until all the data needs to be committed. And if your program encounters an error, none of the rows will be committed, allowing you to simply re-run the program once the problem has been corrected.
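With psycopg2 this is the default: the connection opens a transaction implicitly, so it's enough to withhold commit() until the very end. A rough sketch (the logfile path and the process() helper are stand-ins for the script's own parsing code):

conn = psycopg2.connect("dbname=logs")  # connection details assumed
cursor = conn.cursor()

try:
    for line in open(logfile):   # the existing parsing loop goes here
        process(line, cursor)    # all INSERTs share the one transaction
    conn.commit()                # a single commit, written to disk once
except Exception:
    conn.rollback()              # on error, no partial rows survive
    raise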

Your log_hostname, log_person, and log_date functions are doing needless SELECTs on the tables. Make the appropriate table attributes PRIMARY KEY or UNIQUE. Then, instead of checking for the presence of the key before you INSERT, just do the INSERT. If the person/date/hostname already exists, the INSERT will fail from the constraint violation. (This won't work if you use a transaction with a single commit, as suggested above.)
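A sketch of that pattern, assuming a UNIQUE constraint on hostnames.name and autocommit mode (needed because, as noted, a failed INSERT inside an open PostgreSQL transaction aborts the whole transaction):

conn.autocommit = True  # incompatible with the single-commit approach

try:
    cursor.execute("INSERT INTO hostnames (name) VALUES (%s)", (hostname,))
except psycopg2.IntegrityError:
    pass  # the row already existed; the constraint rejected the duplicate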

Alternatively, if you know you're the only one INSERTing into the tables while your program is running, then create parallel data structures in memory and maintain them while you do your INSERTs. For example, read all the hostnames from the table into an associative array at the start of the program. When you want to know whether to do an INSERT, just do an array lookup. If no entry is found, do the INSERT and update the array appropriately. (This suggestion is compatible with transactions and a single commit, but requires more programming. It'll be wickedly faster, though.)
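A sketch of that variant for the hostnames table (table and column names are assumptions); unlike the constraint-violation trick above, this stays compatible with one big transaction:

# At program start: read every existing hostname into a set.
cursor.execute("SELECT name FROM hostnames")
known_hostnames = {row[0] for row in cursor.fetchall()}

def log_hostname(name):
    # A set lookup replaces the SELECT-before-INSERT round trip.
    if name not in known_hostnames:
        cursor.execute("INSERT INTO hostnames (name) VALUES (%s)", (name,))
        known_hostnames.add(name)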

Barry Brown answered Oct 24 '22