How to transform a huge CSV into SQLite using Pandas?

I have a huge table (about 60 GB) in the form of a gzip-compressed CSV file. I want to transform it into an SQLite file.

What I do at the moment is the following:

import pandas
import sqlite3
cnx = sqlite3.connect('db.sqlite')
df = pandas.read_csv('db.gz', compression='gzip')
df.to_sql('table_name', cnx)

It works fine for smaller files, but with huge files I run into memory problems. The problem is that pandas reads the whole table into memory (RAM) and only then writes it into the SQLite file.

Is there an elegant solution to this problem?

asked Jan 08 '16 by Roman



2 Answers

I haven't done any work with CSVs of that size, but it sounds like the kind of thing Odo might solve quickly.

I did a cursory check of the docs, and they cover larger-than-memory CSV parsing into SQL databases, specifically calling out SQLite3 as a destination.

Here's the example they publish for parsing a 33 GB text file.

In [1]: dshape = discover(resource('all.csv'))

In [2]: %time t = odo('all.no.header.csv', 'sqlite:///db.db::nyc',
   ...:               dshape=dshape)
CPU times: user 3.09 s, sys: 819 ms, total: 3.91 s
Wall time: 57min 31s
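
Adapted to the file names in your question, a rough sketch might look like this (untested; it assumes odo is installed and recognizes the gzipped CSV by its file extension, so you may need to rename the archive to something like db.csv.gz or decompress it first):

from odo import odo, discover, resource

# Sample the file to infer column names and types without loading it all.
dshape = discover(resource('db.csv.gz'))

# Stream the rows into a table called table_name inside db.sqlite.
odo('db.csv.gz', 'sqlite:///db.sqlite::table_name', dshape=dshape)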
answered by cwcobb


This is going to be problematic with pandas due to its size. Is there any reason you can't use the csv module and just iterate through the file?

Basic idea (untested):

import gzip
import csv
import sqlite3

# Open the gzipped CSV in text mode so csv.reader gets strings, not bytes.
with gzip.open('db.gz', 'rt', newline='') as f, sqlite3.connect('db.sqlite') as cnx:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    c = cnx.cursor()
    # One '?' per column; the target table must already exist.
    c.executemany('insert into table_name values (?,?,...)', reader)
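
Since the number of columns in your file isn't given, here is a hedged variant (untested, file and table names taken from the question) that derives both the CREATE TABLE statement and the placeholders from the header row:

import gzip
import csv
import sqlite3

with gzip.open('db.gz', 'rt', newline='') as f, sqlite3.connect('db.sqlite') as cnx:
    reader = csv.reader(f)
    header = next(reader)  # first row holds the column names
    cols = ', '.join('"%s"' % name for name in header)
    placeholders = ', '.join('?' * len(header))
    cnx.execute('create table if not exists table_name (%s)' % cols)
    # executemany consumes the reader lazily, so rows are streamed rather than loaded at once.
    cnx.executemany('insert into table_name values (%s)' % placeholders, reader)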
answered by AChampion