Django Model Choices: IntegerField vs CharField

Tags: sql-server, indexing, django, django-models

TL;DR: I have a table with millions of instances and I'm wondering how I should index it.

I have a Django project that uses SQL Server as the database backend.

Once one model reached around 14 million instances in the production environment, I started seeing performance issues:

class UserEvent(models.Model):

    A_EVENT = 'A'
    B_EVENT = 'B'

    types = (
        (A_EVENT, 'Event A'),
        (B_EVENT, 'Event B')
    )

    event_type = models.CharField(max_length=1, choices=types)

    contract = models.ForeignKey(Contract)

    # field_x = (...)
    # field_y = (...)

Many of my queries filter on event_type, and they are highly inefficient because the field isn't indexed. Filtering the model by this field alone takes almost 7 seconds, while filtering by an indexed foreign key is fast:

UserEvent.objects.filter(event_type=UserEvent.B_EVENT).count()
# elapsed time: 0:00:06.921287

UserEvent.objects.filter(contract_id=62).count()
# elapsed time: 0:00:00.344261

When I realized this, I also asked myself: "Shouldn't this field be a SmallIntegerField? I only have a small set of choices, and queries on integer fields are more efficient than queries on text/varchar fields."

So, from what I understand, I have two options*:

*I realize that a third option may exist, since indexing a low-cardinality field may not bring a significant improvement. But my values have a [1%-99%] distribution (and I'm looking for the 1% part), so indexing this field seems to be a valid option.
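The effect of indexing a skewed, low-cardinality column can be checked on a small scale before touching production. The following is a minimal sketch that uses SQLite from the Python standard library as a stand-in; it is not SQL Server, whose planner may behave differently, and the table and index names are hypothetical. It builds a roughly 1%/99% distribution and compares the reported query plan before and after creating the index.

```python
import sqlite3

# SQLite is only a stand-in for SQL Server here; names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_event ("
    "id INTEGER PRIMARY KEY, event_type TEXT, contract_id INTEGER)"
)

# Skewed distribution: 1 row in 100 is 'B', the rest are 'A'.
rows = [("B" if i % 100 == 0 else "A", i % 500) for i in range(100_000)]
conn.executemany(
    "INSERT INTO user_event (event_type, contract_id) VALUES (?, ?)", rows
)


def plan(sql, params=()):
    """Return the 'detail' column of SQLite's EXPLAIN QUERY PLAN output."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params)]


count_sql = "SELECT COUNT(*) FROM user_event WHERE event_type = ?"

plan_before = plan(count_sql, ("B",))  # no index yet: full table scan
conn.execute("CREATE INDEX user_event_event_type_idx ON user_event (event_type)")
plan_after = plan(count_sql, ("B",))   # the planner can now use the index

b_count = conn.execute(count_sql, ("B",)).fetchone()[0]
print(plan_before)
print(plan_after)
print(b_count)
```

On SQL Server the equivalent check would be to compare the actual execution plans for the same COUNT query before and after adding the index.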

A) Simply index this field, and leave it as a CharField.

A_EVENT = 'A'
B_EVENT = 'B'

types = (
    (A_EVENT, 'Event A'),
    (B_EVENT, 'Event B')
)

event_type = models.CharField(max_length=1, choices=types, db_index=True)

B) Perform a migration to turn this field into a SmallIntegerField (I don't want a BooleanField, since more options may be added to the field later), and then index the field.
```
A_EVENT = 1
B_EVENT = 2

types = (
    (A_EVENT, 'Event A'),
    (B_EVENT, 'Event B')
)

event_type = models.SmallIntegerField(choices=types, db_index=True)
```

Option A

Pros: Simplicity

Cons: CharField-based indexes are less efficient than integer-based indexes

Option B

Pros: Integer-based indexes are more efficient than CharField-based indexes

Cons: I have to perform a complex operation:

1. Schema migration to create a new SmallIntegerField.
2. Data migration copying (and transforming) the millions of instances from the old field to the new field.
3. Update the project code to use the new field, or perform another schema migration to rename the new field to the old name.
4. Delete the old field.
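The first two steps can be sketched outside Django to get a feel for the data migration. This is a minimal sketch under stated assumptions: SQLite stands in for SQL Server, the table name, the new column name event_type_int, the CODE_MAP mapping, and the batch size are all hypothetical, and the rename and drop steps are left out. The backfill walks the primary key in ranges so that each transaction touches a bounded number of rows.

```python
import sqlite3

# Hypothetical mapping from the old one-character codes to integer codes.
CODE_MAP = {"A": 1, "B": 2}


def migrate_event_type(conn, batch_size=1000):
    """Add an integer column and backfill it in primary-key batches."""
    # Step 1: schema change, a new nullable column next to the old one.
    conn.execute("ALTER TABLE user_event ADD COLUMN event_type_int INTEGER")

    # Step 2: data migration, one bounded id range per transaction.
    max_id = conn.execute(
        "SELECT COALESCE(MAX(id), 0) FROM user_event"
    ).fetchone()[0]
    for start in range(1, max_id + 1, batch_size):
        end = start + batch_size - 1
        for old_code, new_code in CODE_MAP.items():
            conn.execute(
                "UPDATE user_event SET event_type_int = ? "
                "WHERE id BETWEEN ? AND ? AND event_type = ?",
                (new_code, start, end, old_code),
            )
        conn.commit()


# Small demo table with the same skew as in the question (about 1% 'B').
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_event (id INTEGER PRIMARY KEY, event_type TEXT)")
conn.executemany(
    "INSERT INTO user_event (event_type) VALUES (?)",
    [("B" if i % 100 == 0 else "A",) for i in range(5000)],
)
conn.commit()

migrate_event_type(conn)

unmigrated = conn.execute(
    "SELECT COUNT(*) FROM user_event WHERE event_type_int IS NULL"
).fetchone()[0]
converted_b = conn.execute(
    "SELECT COUNT(*) FROM user_event WHERE event_type_int = 2"
).fetchone()[0]
print(unmigrated, converted_b)
```

In a Django project the same backfill would normally live in a data migration; batching by primary key keeps locks and transaction log growth bounded on a 14-million-row table.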

Summing up, the real question here is:

Is the performance improvement I would get from migrating the field to a SmallIntegerField worth the risk?

I'm inclined to try option A first and check whether the performance improvement is adequate.

I also brought this question to StackOverflow because a more general question arose:

Is there any situation where using a CharField with Django choices is a better option than using a Boolean/Integer/SmallIntegerField?

This came up because, when defining the project models, I followed this code snippet from the Django documentation:

YEAR_IN_SCHOOL_CHOICES = (
     ('FR', 'Freshman'),
     ('SO', 'Sophomore'),
     ('JR', 'Junior'),
     ('SR', 'Senior'),
)

year_in_school = models.CharField(max_length=2,
                                  choices=YEAR_IN_SCHOOL_CHOICES,
                                  default='FR')

Why are they using chars when they could be using integers, since it is just an internal value that should never be displayed?

asked Apr 18 '16 19:04

JCJS

1 Answer

Speed of Count queries.

UserEvent.objects.filter(event_type=UserEvent.B_EVENT).count()
# elapsed time: 0:00:06.921287

Queries of this nature will, unfortunately, always be slow when the table has a large number of entries.

MySQL optimizes count queries by looking at the index, provided the indexed columns are numeric. So that would be a good reason to use SmallIntegerField instead of CharField if you were on MySQL, but apparently you are not. Your mileage will vary with other databases. I am not an expert on SQL Server, but my understanding is that it's particularly poor at using indexes on COUNT(*) queries.

Partitioning

You might be able to improve the overall performance of queries involving event_type by partitioning the data. Because the cardinality of the current index is poor, it's often better for the planner to do a full table scan. If the data were partitioned by event_type, only the relevant partition would need to be scanned.

Char or Smallint

Which takes up more space, char(2) or smallint? The answer is that it depends on your character set. If the character set requires only one byte per character, smallint and char(2) take up the same amount of space (2 bytes). Since the field is going to have very low cardinality, using char or smallint will not make any significant difference in this case.
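The byte counts behind that comparison can be illustrated in Python. This is a rough sketch only: the encodings are stand-ins for a single-byte and a two-byte-per-character database character set, and real on-disk sizes also include per-row and per-index overhead that is not modelled here.

```python
import struct

# A 16-bit signed integer, the usual size of a SQL smallint.
smallint_bytes = struct.calcsize("<h")

# Two characters in a single-byte character set (latin-1 as a stand-in).
char2_single_byte = len("FR".encode("latin-1"))

# The same two characters at two bytes per character (UTF-16 as a stand-in).
char2_two_byte = len("FR".encode("utf-16-le"))

# The question's field is max_length=1, so a single character.
char1_single_byte = len("A".encode("latin-1"))

print(smallint_bytes, char2_single_byte, char2_two_byte, char1_single_byte)
# prints: 2 2 4 1
```

With a single-byte character set the one-character code from the question is no larger than a smallint, which supports the point that the choice of type matters little here.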

answered Sep 18 '22 14:09

e4c5
