TL;DR: I have a table with millions of instances and I'm wondering how should I index it.
I have a Django project that uses SQL Server as the database backend.
After having a model with around 14 million instances in the Production environment, I realized that I was getting performance issues:
class UserEvent(models.Model)
A_EVENT = 'A'
B_EVENT = 'B'
types = (
(A_EVENT, 'Event A'),
(B_EVENT, 'Event B')
)
event_type = models.CharField(max_length=1, choices=types)
contract = models.ForeignKey(Contract)
# field_x = (...)
# field_y = (...)
I use a lot of queries based in this field, and it is being highly inefficient, since the field isn't indexed. Filtering the model using only by this field takes almost 7 seconds, while querying by an indexed foreign key doesn't carry performance issues:
UserEvent.objects.filter(event_type=UserEvent.B_EVENT).count()
# elapsed time: 0:00:06.921287
UserEvent.objects.filter(contract_id=62).count()
# elapsed time: 0:00:00.344261
When I realized this, I also made a question to myself: "Shouldn't this field be a SmallIntegerField? Since I only have a small set of choices, and queries based in integer fields are more efficient than text/varchar based queries."
So, from what I understand, I have two options*:
*I realize that a third option may exist, since indexing fields with low cardinality may not cause severe improvements, but since my values have a [1%-99%] distribution (and I'm looking for the 1% part), indexing this field seems to be a valid option.
A) Simply index this field, and leave it as a CharField.
A_EVENT = 'A'
B_EVENT = 'B'
types = (
(A_EVENT, 'Event A'),
(B_EVENT, 'Event B')
)
event_type = models.CharField(max_length=1, choices=types, db_index=True)
B) Perform a migration to transform this field in a SmallIntegerField (I don't want it to be a BooleanField, since it may be possible to add more options to the field), and then index the field.
A_EVENT = 1
B_EVENT = 2
types = (
(A_EVENT, 'Event A'),
(B_EVENT, 'Event B')
)
event_type = models.SmallIntegerField(choices=types, db_index=True)
Pros: Simplicity
Cons: CharField based indexes are less efficient than Integer based indexes
Pros: Integer based indexes are more efficient than CharField based indexes
Cons: I have to perform a complex operation:
Summing up, the real question here is:
The performance improvement I get from migrating the field to a SmallIntegerField worths the risk?
I'm leaned to try option A, and check if the performance improvements are adequate.
I also brought up this question to StackOverflow because a more generic question arised:
This situation was originated because when defining the project models I was inspired by Django documentation code snippet:
YEAR_IN_SCHOOL_CHOICES = (
('FR', 'Freshman'),
('SO', 'Sophomore'),
('JR', 'Junior'),
('SR', 'Senior'),
)
year_in_school = models.CharField(max_length=2,
choices=YEAR_IN_SCHOOL_CHOICES,
default=FRESHMAN)
Why are they using chars when they could be using integers, since it is just a value representation that shouldn't never be displayed?
CharField is a string field, for small- to large-sized strings. It is like a string field in C/C+++. CharField is generally used for storing small strings like first name, last name, etc. To store larger text TextField is used. The default form widget for this field is TextInput.
The Django DB Docs will tell you, that max_length=255 is guaranteed to work always. If you need something of the amount you've specified in your question, I'd suggest to use a TextField.
verbose_name is a human-readable name for the field. If the verbose name isn't given, Django will automatically create it using the field's attribute name, converting underscores to spaces. This attribute in general changes the field name in admin interface.
ForeignKey is a Django ORM field-to-column mapping for creating and working with relationships between tables in relational databases. ForeignKey is defined within the django. db. models. related module but is typically referenced from django.
Speed of Count queries.
UserEvent.objects.filter(event_type=UserEvent.B_EVENT).count()
# elapsed time: 0:00:06.921287
Queries of this nature, unfortunately will always be slow in databases when the table has a large number of entries.
Mysql optimizes count queries by looking at the index provided the indexed columns are numeric. So that's a good reason to use SmallIntegeField instead of Charfield if you were on mysql but apparently you are not. Your mileage varies with other databases. I am not an expert on SQL server but my understanding is that it's particularly poor at using indexes on COUNT(*) queries.
Partitioning
You might be able to improve overall performance of queries involving event_type by partitioning the data. Because the cardinality of the current index is poor it's often better for the planner to do a full table scan. If the data was partitioned, only that particular partition would need to be scanned.
Char or Smallint
Which takes up more space char(2) or small int? The answer is that it depends on your character set. If the character set requires only one byte per character small integer and char(2) would take up the same amount of space. Since the field is going to have very low cardinality, using char or smallint will not make any significant difference in this case.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With