Python CSV to SQLite

I am "converting" a large (~1.6GB) CSV file and inserting specific fields of the CSV into a SQLite database. Essentially my code looks like:

    import csv, sqlite3

    conn = sqlite3.connect("path/to/file.db")
    conn.text_factory = str  #bugger 8-bit bytestrings
    cur = conn.cursor()
    cur.execute('CREATE TABLE IF NOT EXISTS mytable (field2 VARCHAR, field4 VARCHAR)')

    # Python 3: the csv module wants a text-mode file opened with newline=""
    reader = csv.reader(open("filecsv.txt", newline=""))
    for field1, field2, field3, field4, field5 in reader:
        cur.execute('INSERT OR IGNORE INTO mytable (field2, field4) VALUES (?,?)', (field2, field4))

Everything works as I expect it to with the exception... IT TAKES AN INCREDIBLE AMOUNT OF TIME TO PROCESS. Am I coding it incorrectly? Is there a better way to achieve a higher performance and accomplish what I'm needing (simply convert a few fields of a CSV into SQLite table)?

**EDIT:** I tried directly importing the CSV into SQLite as suggested, but it turns out my file has commas inside fields (e.g. "My title, comma"). That's creating errors with the import. There appear to be too many of those occurrences to edit the file manually... any other thoughts?

asked May 09 '11 20:05

user735304

2 Answers

Chris is right - use transactions; divide the data into chunks and then store it.

"... Unless already in a transaction, each SQL statement has a new transaction started for it. This is very expensive, since it requires reopening, writing to, and closing the journal file for each statement. This can be avoided by wrapping sequences of SQL statements with BEGIN TRANSACTION; and END TRANSACTION; statements. This speedup is also obtained for statements which don't alter the database." - Source: http://web.utk.edu/~jplyon/sqlite/SQLite_optimization_FAQ.html

"... there is another trick you can use to speed up SQLite: transactions. Whenever you have to do multiple database writes, put them inside a transaction. Instead of writing to (and locking) the file each and every time a write query is issued, the write will only happen once when the transaction completes." - Source: How Scalable is SQLite?

    import csv, sqlite3, time
    from itertools import islice

    def chunks(data, rows=10000):
        """ Yields lists of up to `rows` rows each """
        # csv.reader is an iterator: it has no len() and cannot be sliced,
        # so rows are pulled off it with islice instead.
        data = iter(data)
        while True:
            chunk = list(islice(data, rows))
            if not chunk:
                break
            yield chunk

    if __name__ == "__main__":

        t = time.time()

        conn = sqlite3.connect("path/to/file.db")
        conn.text_factory = str  #bugger 8-bit bytestrings
        cur = conn.cursor()
        cur.execute('CREATE TABLE IF NOT EXISTS mytable (field2 VARCHAR, field4 VARCHAR)')

        # Python 3: the csv module wants a text-mode file opened with newline=""
        csvData = csv.reader(open("filecsv.txt", newline=""))

        divData = chunks(csvData) # divide into 10000 rows each

        for chunk in divData:
            cur.execute('BEGIN TRANSACTION')

            for field1, field2, field3, field4, field5 in chunk:
                cur.execute('INSERT OR IGNORE INTO mytable (field2, field4) VALUES (?,?)', (field2, field4))

            cur.execute('COMMIT')

        print("\n Time Taken: %.3f sec" % (time.time()-t))
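The same chunked-transaction idea can also be sketched with `executemany`, which hands SQLite a whole batch per call. This is only a sketch, assuming Python 3; `batches` and `load_csv` are names made up for the example, and the in-memory database and sample rows stand in for the real file.

```python
import csv
import io
import sqlite3
from itertools import islice


def batches(rows, size=10000):
    """Yield lists of up to `size` items from any iterator (no len() needed)."""
    rows = iter(rows)
    while True:
        batch = list(islice(rows, size))
        if not batch:
            return
        yield batch


def load_csv(conn, csv_file, size=10000):
    """Insert columns 2 and 4 of each CSV row, committing once per batch."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS mytable (field2 VARCHAR, field4 VARCHAR)"
    )
    # Keep only the two wanted fields from each five-column row.
    wanted = ((row[1], row[3]) for row in csv.reader(csv_file))
    total = 0
    for batch in batches(wanted, size):
        with conn:  # commits on success, rolls back on an exception
            conn.executemany(
                "INSERT OR IGNORE INTO mytable (field2, field4) VALUES (?, ?)",
                batch,
            )
        total += len(batch)
    return total


# Tiny in-memory demonstration; a real run would pass an open file instead,
# e.g. open("filecsv.txt", newline="").
sample = io.StringIO("a,b,c,d,e\n1,2,3,4,5\n6,7,8,9,10\n")
conn = sqlite3.connect(":memory:")
inserted = load_csv(conn, sample, size=2)
rows = conn.execute(
    "SELECT field2, field4 FROM mytable ORDER BY field2"
).fetchall()
print(inserted, rows)
```

The batch size is a knob to experiment with: larger batches mean fewer commits, at the cost of holding more rows in memory at once.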

answered Jan 05 '23 05:01

Sam

It's possible to import the CSV directly:

    sqlite> .separator ","
    sqlite> .import filecsv.txt mytable

http://www.sqlite.org/cvstrac/wiki?p=ImportingFiles
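If the plain comma separator trips over quoted fields such as "My title, comma", as the question's edit reports, parsing the file with Python's `csv` module may be the easier route, since `csv.reader` follows the usual quoting rules. A small sketch, assuming Python 3 and a made-up sample line:

```python
import csv
import io

# Hypothetical sample row with a comma inside a quoted field.
line = 'id1,"My title, comma",x,y,z\n'

# Naive splitting treats the comma inside the quotes as a separator.
naive = line.rstrip("\n").split(",")

# csv.reader honours the quoting and keeps the field whole.
parsed = next(csv.reader(io.StringIO(line)))

print(len(naive), len(parsed))
print(parsed[1])
```

Newer `sqlite3` shells also offer a `.mode csv` setting that is meant to make `.import` respect quoted fields; whether it is available depends on the installed version, so it is worth checking before relying on it.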

answered Jan 05 '23 04:01

fengb
