I am using Scrapy to scrape data from a web forum. I am storing this data in a PostgreSQL database using SQLAlchemy. The table and columns create fine, however, I am not able to have SQLAlchemy create an index on one of the columns. I am trying to create a trigram index (pg_trgm) using gin.
The Postgresql code that would create this index is:
CREATE INDEX description_idx ON table USING gin (description gin_trgm_ops);
The SQLAlchemy code I have added to my models.py file is:
desc_idx = Index('description_idx', text("description gin_trgm_ops"), postgresql_using='gin')
I have added this line to my models.py but when I check in postgresql, the index was never created.
Below are my full models.py and pipelines.py files. Am I going about this all wrong??
Any help would be greatly appreciated!!
from sqlalchemy import create_engine, Column, Integer, String, DateTime, Index, text
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.engine.url import URL
import settings
DeclarativeBase = declarative_base()
def db_connect():
return create_engine(URL(**settings.DATABASE))
def create_forum_table(engine):
DeclarativeBase.metadata.create_all(engine)
class forumDB(DeclarativeBase):
__tablename__ = "table"
id = Column(Integer, primary_key=True)
title = Column('title', String)
desc = Column('description', String, nullable=True)
desc_idx = Index('description_idx', text("description gin_trgm_ops"), postgresql_using='gin')
from scrapy.exceptions import DropItem
from sqlalchemy.orm import sessionmaker
from models import forumDB, db_connect, create_forum_table
class ScrapeforumToDB(object):
def __init__(self):
engine = db_connect()
create_forum_table(engine)
self.Session = sessionmaker(bind=engine)
def process_item(self, item, spider):
session = self.Session()
forumitem = forumDB(**item)
try:
session.add(forumitem)
session.commit()
except:
session.rollback()
raise
finally:
session.close()
return item
The proper way to reference an Operator Class in SQLAlchemy (such as gin_trgm_ops
) is to use the postgresql_ops
parameter. This will also allow tools like alembic to understand how use it when auto-generating migrations.
Index('description_idx',
'description', postgresql_using='gin',
postgresql_ops={
'description': 'gin_trgm_ops',
})
Since the Index
definition uses text
expression it has no references to the Table
"table", which has been implicitly created by the declarative class forumDB
. Compare that to using a Column
as expression, or some derivative of it, like this:
Index('some_index_idx', forumDB.title)
In the above definition the index will know about the table and the other way around.
What this means in your case is that the Table
"table" has no idea that such an index exists. Adding it as an attribute of the declarative class is the wrong way to do it. It should be passed to the implicitly created Table
instance. The attribute __table_args__
is just for that:
class forumDB(DeclarativeBase):
__tablename__ = "table"
# Note: This used to use `text('description gin_trgm_ops')` instead of the
# `postgresql_ops` parameter, which should be used.
__table_args__ = (
Index('description_idx', "description",
postgresql_ops={"description": "gin_trgm_ops"},
postgresql_using='gin'),
)
id = Column(Integer, primary_key=True)
title = Column('title', String)
desc = Column('description', String, nullable=True)
With the modification in place, a call to create_forum_table(engine)
resulted in:
> \d "table"
Table "public.table"
Column | Type | Modifiers
-------------+-------------------+----------------------------------------------------
id | integer | not null default nextval('table_id_seq'::regclass)
title | character varying |
description | character varying |
Indexes:
"table_pkey" PRIMARY KEY, btree (id)
"description_idx" gin (description gin_trgm_ops)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With