Right now I have a log parser reading through 515 MB of plain-text files (a file for each day over the past 4 years). My code currently looks like this: http://gist.github.com/12978. I've used psyco (as seen in the code) and I'm also compiling it and using the compiled version. It's doing about 100 lines every 0.3 seconds. The machine is a standard 15" MacBook Pro (2.4 GHz Core 2 Duo, 2 GB RAM).
Is it possible for this to go faster, or is that a limitation of the language/database?
PostgreSQL tries to do much of its work in memory and to spread its disk writes out over time to minimize bottlenecks, but on an overloaded system with heavy writing, heavy reads and writes can easily slow the whole system down as it catches up on the backlog.
Because it is implemented in C, Psycopg2 is very fast and efficient. You can use it to fetch one or more rows from the database with a SQL query, and to insert data as well, with options for both single-row and batch inserts.
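A minimal sketch of both operations, assuming psycopg2, a placeholder connection string, and a hypothetical people table:

import psycopg2

conn = psycopg2.connect("dbname=logs")
cur = conn.cursor()

# Fetch one or more rows from a query.
cur.execute("SELECT id, name FROM people WHERE name = %(name)s", {'name': 'alice'})
row = cur.fetchone()  # or cur.fetchall() for every matching row

# Single insert with bound parameters.
cur.execute("INSERT INTO people (name) VALUES (%(name)s)", {'name': 'bob'})

# Batch insert: the same statement run against a sequence of parameter sets.
cur.executemany("INSERT INTO people (name) VALUES (%(name)s)",
                [{'name': 'carol'}, {'name': 'dave'}])

conn.commit()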
Ultimately, speed will depend on the way you're using the database. PostgreSQL tends to be faster for massive data sets, complicated queries, and mixed read-write workloads, while MySQL tends to be faster for read-only workloads.
Don't waste time profiling. The time is always in the database operations. Do as few as possible. Just the minimum number of inserts.
Three Things.
One. Don't SELECT over and over again to conform the Date, Hostname and Person dimensions. Fetch all the data ONCE into a Python dictionary and use it in memory. Don't do repeated singleton selects. Use Python.
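For example, a sketch of loading the dimension tables into dictionaries once at startup (psycopg2, with assumed table and column names; adapt to your schema):

import psycopg2

conn = psycopg2.connect("dbname=logs")
cur = conn.cursor()

# One query per dimension table, run once, cached as {natural key: id}.
cur.execute("SELECT name, id FROM hostnames")
hostnames = dict(cur.fetchall())

cur.execute("SELECT name, id FROM people")
people = dict(cur.fetchall())

cur.execute("SELECT day, id FROM dates")
dates = dict(cur.fetchall())

# In the parsing loop, resolve surrogate keys from memory instead of SELECTing:
# person_id = people[person_name]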
Two. Don't Update.
Specifically, do not do this. It's bad code for two reasons.
cursor.execute("UPDATE people SET chats_count = chats_count + 1 WHERE id = '%s'" % person_id)
It can be replaced with a simple SELECT COUNT(*) FROM ... . Never update to increment a count. Just count the rows that are there with a SELECT statement. [If you can't do this with a simple SELECT COUNT or SELECT COUNT(DISTINCT), you're missing some data -- your data model should always provide correct complete counts. Never update.]
And. Never build SQL using string substitution. Completely dumb.
If, for some reason, the SELECT COUNT(*) isn't fast enough (benchmark first, before doing anything lame), you can cache the result of the count in another table, AFTER all of the loads. Do a SELECT COUNT(*) FROM whatever GROUP BY whatever and insert this into a table of counts. Don't Update. Ever.
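A sketch of that post-load step (cur/conn as in the sketch above; the chats and chat_counts tables and their columns are assumed, not taken from the original code):

# After all the loads: derive the counts in one statement instead of
# maintaining them row by row with UPDATEs.
cur.execute("""
    INSERT INTO chat_counts (person_id, chats_count)
    SELECT person_id, COUNT(*)
    FROM chats
    GROUP BY person_id
""")
conn.commit()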
Three. Use Bind Variables. Always.
cursor.execute( "INSERT INTO ... VALUES( %(x)s, %(y)s, %(z)s )", {'x':person_id, 'y':time_to_string(time), 'z':channel,} )
The SQL never changes. The values bound in change, but the SQL never changes. This is MUCH faster. Never build SQL statements dynamically. Never.
In the for loop, you're inserting into the 'chats' table repeatedly, so you only need a single SQL statement with bind variables, executed with different values. So you could put this before the for loop:
insert_statement="""
INSERT INTO chats(person_id, message_type, created_at, channel)
VALUES(:person_id,:message_type,:created_at,:channel)
"""
Then, in place of each SQL statement you currently execute, put this:
cursor.execute(insert_statement, person_id='person',message_type='msg',created_at=some_date, channel=3)
This will make things run faster because the statement only needs to be prepared once; only the bound values change from one execution to the next.
Note: The bind variable syntax I used is Oracle-specific. You'll have to check the psycopg2 library's documentation for the exact syntax.
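For reference, psycopg2 uses the %(name)s placeholder style (its paramstyle is 'pyformat'), as in the %(x)s example earlier in this thread. A sketch of the same loop in that style, with illustrative row values:

import psycopg2

conn = psycopg2.connect("dbname=logs")
cur = conn.cursor()

# The statement text never changes; only the bound values do.
insert_statement = """
    INSERT INTO chats (person_id, message_type, created_at, channel)
    VALUES (%(person_id)s, %(message_type)s, %(created_at)s, %(channel)s)
"""

rows = [
    {'person_id': 1, 'message_type': 'msg', 'created_at': '2008-09-25 12:00:00', 'channel': 3},
    {'person_id': 2, 'message_type': 'msg', 'created_at': '2008-09-25 12:00:05', 'channel': 3},
]

for row in rows:
    cur.execute(insert_statement, row)

# Or hand the whole batch to the driver at once:
# cur.executemany(insert_statement, rows)

conn.commit()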
Other optimizations:
As Mark suggested, use bind variables. The database only has to prepare each statement once, then "fill in the blanks" for each execution. As a nice side effect, it automatically takes care of string-quoting issues (which your program isn't handling).
Turn transactions on (if they aren't already) and do a single commit at the end of the program. The database won't have to write anything to disk until all the data needs to be committed. And if your program encounters an error, none of the rows will be committed, allowing you to simply re-run the program once the problem has been corrected.
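A sketch of the single-commit pattern with psycopg2, which opens a transaction implicitly (autocommit is off by default); load_everything() is a placeholder for your parsing and insert loop:

import psycopg2

conn = psycopg2.connect("dbname=logs")
cur = conn.cursor()

try:
    load_everything(cur)  # placeholder for all the INSERTs in the run
    conn.commit()         # one commit at the very end
except Exception:
    conn.rollback()       # nothing is persisted if anything went wrong
    raise
finally:
    conn.close()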
Your log_hostname, log_person, and log_date functions are doing needless SELECTs on the tables. Make the appropriate table attributes PRIMARY KEY or UNIQUE. Then, instead of checking for the presence of the key before you INSERT, just do the INSERT. If the person/date/hostname already exists, the INSERT will fail from the constraint violation. (This won't work if you use a transaction with a single commit, as suggested above.)
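A sketch of the just-INSERT approach (assumed hostnames table with a UNIQUE name column); wrapping each INSERT in a savepoint is one way to keep a duplicate from aborting the rest of the transaction if you do combine it with a single final commit:

import psycopg2

conn = psycopg2.connect("dbname=logs")
cur = conn.cursor()

def insert_hostname(name):
    cur.execute("SAVEPOINT before_insert")
    try:
        cur.execute("INSERT INTO hostnames (name) VALUES (%(name)s)",
                    {'name': name})
    except psycopg2.IntegrityError:
        # Already present; undo just this statement and keep going.
        cur.execute("ROLLBACK TO SAVEPOINT before_insert")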
Alternatively, if you know you're the only one INSERTing into the tables while your program is running, then create parallel data structures in memory and maintain them in memory while you do your INSERTs. For example, read all the hostnames from the table into an associative array at the start of the program. When you want to know whether to do an INSERT, just do an array lookup. If no entry is found, do the INSERT and update the array appropriately. (This suggestion is compatible with transactions and a single commit, but requires more programming. It'll be wickedly faster, though.)
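A sketch of that in-memory cache (table and column names assumed; PostgreSQL's RETURNING clause supplies the new id without a second query):

import psycopg2

conn = psycopg2.connect("dbname=logs")
cur = conn.cursor()

# Read the existing hostnames once at startup.
cur.execute("SELECT name, id FROM hostnames")
hostname_ids = dict(cur.fetchall())

def hostname_id(name):
    # Dictionary lookup instead of a SELECT before every INSERT.
    if name not in hostname_ids:
        cur.execute("INSERT INTO hostnames (name) VALUES (%(name)s) RETURNING id",
                    {'name': name})
        hostname_ids[name] = cur.fetchone()[0]
    return hostname_ids[name]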