
Python Pandas - Using to_sql to write large data frames in chunks

I'm using Pandas' to_sql function to write to MySQL, which is timing out due to large frame size (1M rows, 20 columns).

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html

Is there a more official way to chunk through the data and write rows in blocks? I've written my own code, which seems to work. I'd prefer an official solution though. Thanks!

import pandas as pd
import sqlalchemy

def write_to_db(engine, frame, table_name, chunk_size):
    start_index = 0
    end_index = chunk_size if chunk_size < len(frame) else len(frame)

    # Replace NaN with None so MySQL stores NULL rather than the string 'nan'
    frame = frame.where(pd.notnull(frame), None)
    if_exists_param = 'replace'

    while start_index != end_index:
        print("Writing rows %s through %s" % (start_index, end_index))
        frame.iloc[start_index:end_index, :].to_sql(con=engine, name=table_name, if_exists=if_exists_param)
        # The first chunk replaces any existing table; later chunks append
        if_exists_param = 'append'

        start_index = min(start_index + chunk_size, len(frame))
        end_index = min(end_index + chunk_size, len(frame))

engine = sqlalchemy.create_engine('mysql://...')  # database details omitted
write_to_db(engine, frame, 'retail_pendingcustomers', 20000)
asked Jun 03 '14 by Krishan Gupta


People also ask

Is pandas efficient for large data sets?

The default pandas data types are not the most memory efficient. This is especially true for text data columns with relatively few unique values (commonly referred to as “low-cardinality” data). By using more efficient data types, you can store larger datasets in memory.
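
For example, a minimal sketch of the idea, assuming a made-up low-cardinality column (the data and names here are invented for illustration):

    import pandas as pd

    # Hypothetical frame with a low-cardinality text column
    df = pd.DataFrame({'state': ['CA', 'NY', 'CA', 'TX'] * 250000})

    print(df['state'].memory_usage(deep=True))  # object dtype: tens of MB
    df['state'] = df['state'].astype('category')
    print(df['state'].memory_usage(deep=True))  # category dtype: roughly 1 MB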

How big a data frame can pandas handle?

One reported upper limit for a pandas DataFrame was 100 GB, bounded by the free disk space on the machine available for swap. When the operating system needs memory, it pushes data that isn't currently being used into a swap file for temporary storage.

What is chunking in pandas?

Technically, the number of rows pandas reads from a file at a time is referred to as the chunksize. If the chunksize is 100, pandas will load the first 100 rows. The object returned is not a data frame but a TextFileReader, which must be iterated over to get the data.
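
As a minimal sketch (the file name and the process function are placeholders, not part of the original text):

    import pandas as pd

    # read_csv with chunksize returns a TextFileReader, not a DataFrame
    reader = pd.read_csv('large_file.csv', chunksize=100)
    for chunk in reader:    # each chunk is a DataFrame of up to 100 rows
        process(chunk)      # placeholder for your own per-chunk logic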

What is the syntax for a pandas DataFrame?

data takes various forms such as an ndarray, Series, map, list, dict, constants, or another DataFrame. For the row labels, the index to be used for the resulting frame is optional and defaults to np.arange(n) if no index is passed. For the column labels, the optional default is likewise np.arange(n) if no columns are passed.
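
A minimal sketch of the constructor (the values here are invented for illustration):

    import numpy as np
    import pandas as pd

    # dict of lists: keys become column labels, default index is np.arange(n)
    df = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})

    # ndarray with explicit row and column labels
    df2 = pd.DataFrame(np.zeros((3, 2)), index=['x', 'y', 'z'], columns=['c1', 'c2'])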


2 Answers

Update: this functionality has been merged into pandas master and will be released in 0.15 (probably end of September), thanks to @artemyk! See https://github.com/pydata/pandas/pull/8062

So starting from 0.15, you can specify a chunksize argument and, for example, simply do:

df.to_sql('table', engine, chunksize=20000)
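
A slightly fuller sketch of how this might look end to end (the connection string and table name are placeholders, not part of the original answer):

    import sqlalchemy

    engine = sqlalchemy.create_engine('mysql://user:password@host/db')  # placeholder credentials
    # if_exists applies to the table as a whole, not per chunk
    df.to_sql('retail_pendingcustomers', engine, if_exists='replace', chunksize=20000)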
answered Oct 03 '22 by joris


There is a beautiful, idiomatic chunks function provided in an answer to this question.

In your case, you can use it like this:

def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l.iloc[i:i+n]

def write_to_db(engine, frame, table_name, chunk_size):
    for idx, chunk in enumerate(chunks(frame, chunk_size)):
        # The first chunk replaces any existing table; later chunks append
        if idx == 0:
            if_exists_param = 'replace'
        else:
            if_exists_param = 'append'
        chunk.to_sql(con=engine, name=table_name, if_exists=if_exists_param)

The only drawback is that it doesn't support slicing on the second index inside the iloc call.
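
If you also need to restrict the columns, a hypothetical variant (the cols parameter is my own addition, not part of the original answer) might look like:

    def chunks_with_cols(frame, n, cols=None):
        """Yield successive n-sized row chunks, optionally restricted to cols."""
        cols = frame.columns if cols is None else cols
        for i in range(0, len(frame), n):
            yield frame.iloc[i:i+n][cols]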

answered Oct 03 '22 by nes