I'm trying to import some XML data into my MySQL database after processing it with Python. To keep things simple, I'm doing it all from one script that uses SQLAlchemy to access my database.
The XML file has about 80,000 entries and I process it using xml.etree.cElementTree
's iterparse
method and delete nodes after I use them to keep my memory usage at around 20mb.
Once I include SQLAlchemy and start adding things into the database, my memory usage rises by about 10mb per second until the script exhausts all of my memory and the OS kills it.
Here's basically what my code looks like:
index = 0
for element in iterate_xml():
...
index += 1
session.add(Model(**data))
if index % 1000 == 0:
session.flush()
session.commit()
I'm not sure what else to try. A periodic .flush()
and .commit()
do help a little bit, but they don't fix the problem.
Is SQLAlchemy not the right tool for this task?
I setup SQLAlchemy like this:
Base = declarative_base()
engine = create_engine(config.SQLALCHEMY_DATABASE_URI, echo=False)
Session = sessionmaker(bind=engine, autoflush=False, expire_on_commit=False)
session = Session()
And my table looks like this:
columns = []
for name, datatype in structure.iteritems():
if isinstance(datatype, int):
datatype = String(datatype or 20)
column = Column(name, datatype)
columns.append(column)
metadata = MetaData(bind=engine)
table = Table('table_name', metadata,
Column('id', Integer, primary_key=True),
*columns
)
metadata.drop_all(engine)
metadata.create_all(engine)
class MyTable(Base):
__tablename__ = 'table_name'
__table_args__ = {
'autoload': True,
'autoload_with': engine
}
structure
is a dictionary that maps column names to data types (it's generated from the XML):
structure = {
'column_name': SQLAlchemyDataType,
...
}
Here's a SQLAlchemy-only version of your code. Testing in 0.7 and 0.8, it doesn't leak any memory, which is not a surprise to me because we have over a dozen tests under continuous integration to ensure nothing leaks under many scenarios. So first step is to confirm this script doesn't leak for you, then try to figure out what changes between this script and yours to produce a test case that actually shows the leaking memory.
from sqlalchemy import Column, String, Integer, create_engine
from sqlalchemy.orm import Session
from sqlalchemy.ext.declarative import declarative_base
Base = declarative_base()
class Model(Base):
__tablename__ = "a"
id = Column(Integer, primary_key=True)
data = Column(String)
e = create_engine("sqlite:///somefile.db")
Base.metadata.create_all(e)
session = Session(e)
for index in xrange(10000000):
session.add(Model(data="data %d" % index))
if index % 1000 == 0:
print "flushing... %d" % index
session.flush()
session.commit()
It's important to note of course, those issues where SQLAlchemy has leaked memory in the past. Here's a recent history of leaks fixed:
0.7.8 - the most recent. The leak fixed here only occurred when using: 1. the C extensions, 2. the pyodbc driver, during certain result fetch operations (not all of them)
0.6.6 - the "Decimal" result processor in the C extensions had a leak.
0.6.6 - the SQLSoup extension was identified as having a potential leak if used to select rows in certain ways (SQLSoup is now it's own project)
0.5.5 - fixed potential memory leak when objects would be unpickled and placed back into a Session
0.5.4 - major improvements to the Session's memory usage were made. You definitely want to be well past this version.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With