Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Django Model Choices: IntegerField vs CharField

TL;DR: I have a table with millions of instances and I'm wondering how should I index it.

I have a Django project that uses SQL Server as the database backend.

After having a model with around 14 million instances in the Production environment, I realized that I was getting performance issues:

class UserEvent(models.Model)

    A_EVENT = 'A'
    B_EVENT = 'B'

    types = (
        (A_EVENT, 'Event A'),
        (B_EVENT, 'Event B')
    )

    event_type = models.CharField(max_length=1, choices=types)

    contract = models.ForeignKey(Contract)

    # field_x = (...)
    # field_y = (...)

I use a lot of queries based in this field, and it is being highly inefficient, since the field isn't indexed. Filtering the model using only by this field takes almost 7 seconds, while querying by an indexed foreign key doesn't carry performance issues:

UserEvent.objects.filter(event_type=UserEvent.B_EVENT).count()
# elapsed time: 0:00:06.921287

UserEvent.objects.filter(contract_id=62).count()
# elapsed time: 0:00:00.344261

When I realized this, I also made a question to myself: "Shouldn't this field be a SmallIntegerField? Since I only have a small set of choices, and queries based in integer fields are more efficient than text/varchar based queries."

So, from what I understand, I have two options*:

*I realize that a third option may exist, since indexing fields with low cardinality may not cause severe improvements, but since my values have a [1%-99%] distribution (and I'm looking for the 1% part), indexing this field seems to be a valid option.

  • A) Simply index this field, and leave it as a CharField.

    A_EVENT = 'A'
    B_EVENT = 'B'
    
    types = (
        (A_EVENT, 'Event A'),
        (B_EVENT, 'Event B')
    )
    
    event_type = models.CharField(max_length=1, choices=types, db_index=True)
    
  • B) Perform a migration to transform this field in a SmallIntegerField (I don't want it to be a BooleanField, since it may be possible to add more options to the field), and then index the field.

    A_EVENT = 1
    B_EVENT = 2
    
    types = (
        (A_EVENT, 'Event A'),
        (B_EVENT, 'Event B')
    )
    
    event_type = models.SmallIntegerField(choices=types, db_index=True)
    

Option A

Pros: Simplicity

Cons: CharField based indexes are less efficient than Integer based indexes

Option B

Pros: Integer based indexes are more efficient than CharField based indexes

Cons: I have to perform a complex operation:

  1. Schema migration to create a new SmallIntegerField
  2. Data migration copying (and transforming) the millions of instances from the old field to the new field.
  3. Update the project code to use the new field or perform another schema migration to rename the new field as the previous one.
  4. Delete the old field.

Summing up, the real question here is:

The performance improvement I get from migrating the field to a SmallIntegerField worths the risk?

I'm leaned to try option A, and check if the performance improvements are adequate.


I also brought up this question to StackOverflow because a more generic question arised:

  • Is there any situation where using CharFields along the Django choices is a better option than using Boolean/Integer/SmallIntegerField?

This situation was originated because when defining the project models I was inspired by Django documentation code snippet:

YEAR_IN_SCHOOL_CHOICES = (
     ('FR', 'Freshman'),
     ('SO', 'Sophomore'),
     ('JR', 'Junior'),
     ('SR', 'Senior'),
)

year_in_school = models.CharField(max_length=2,
                                  choices=YEAR_IN_SCHOOL_CHOICES,
                                  default=FRESHMAN)

Why are they using chars when they could be using integers, since it is just a value representation that shouldn't never be displayed?

like image 553
JCJS Avatar asked Apr 18 '16 19:04

JCJS


People also ask

What does CharField mean in Django?

CharField is a string field, for small- to large-sized strings. It is like a string field in C/C+++. CharField is generally used for storing small strings like first name, last name, etc. To store larger text TextField is used. The default form widget for this field is TextInput.

What is Max_length in Django?

The Django DB Docs will tell you, that max_length=255 is guaranteed to work always. If you need something of the amount you've specified in your question, I'd suggest to use a TextField.

What is Verbose_name in Django?

verbose_name is a human-readable name for the field. If the verbose name isn't given, Django will automatically create it using the field's attribute name, converting underscores to spaces. This attribute in general changes the field name in admin interface.

What is Django ForeignKey model?

ForeignKey is a Django ORM field-to-column mapping for creating and working with relationships between tables in relational databases. ForeignKey is defined within the django. db. models. related module but is typically referenced from django.


1 Answers

Speed of Count queries.

UserEvent.objects.filter(event_type=UserEvent.B_EVENT).count()
# elapsed time: 0:00:06.921287

Queries of this nature, unfortunately will always be slow in databases when the table has a large number of entries.

Mysql optimizes count queries by looking at the index provided the indexed columns are numeric. So that's a good reason to use SmallIntegeField instead of Charfield if you were on mysql but apparently you are not. Your mileage varies with other databases. I am not an expert on SQL server but my understanding is that it's particularly poor at using indexes on COUNT(*) queries.

Partitioning

You might be able to improve overall performance of queries involving event_type by partitioning the data. Because the cardinality of the current index is poor it's often better for the planner to do a full table scan. If the data was partitioned, only that particular partition would need to be scanned.

Char or Smallint

Which takes up more space char(2) or small int? The answer is that it depends on your character set. If the character set requires only one byte per character small integer and char(2) would take up the same amount of space. Since the field is going to have very low cardinality, using char or smallint will not make any significant difference in this case.

like image 124
e4c5 Avatar answered Sep 18 '22 14:09

e4c5