Five Common Django Mistakes
Table of Contents
Introduction
Django is a fantastic framework for building web applications and takes a batteries included approach. With this approach, comes complexity. When you’re starting out with Django, you can introduce subtle bugs due to lack of knowledge. I wrote this post partially for myself to reference in my professional development, but also to help illuminate some mistakes that I see. As with anything, there are tradeoffs with this approach. Django contains a lot of great code that you would have to write for every web application and reinvent the wheel, but when you don’t fully understand the framework, you can write code with unintended effects. For this post, we’ll develop an example Django application that handles employee management for various organizations.
Example Code
from django.contrib.auth import get_user_model
from django.core.exceptions import ValidationError
from django.db import models
User = get_user_model()
class Organization(models.Model):
name = models.CharField(max_length=100)
datetime_created = models.DateTimeField(auto_now_add=True, editable=False)
is_active = models.BooleanField(default=True)
class Employee(models.Model):
user = models.ForeignKey(
User, on_delete=models.CASCADE, related_name="employees"
)
organization = models.ForeignKey(
Organization, on_delete=models.CASCADE, related_name="employees"
)
is_currently_employed = models.BooleanField(default=True)
reference_id = models.CharField(null=True, blank=True, max_length=255)
last_clock_in = models.DateTimeField(null=True, blank=True)
datetime_created = models.DateTimeField(auto_now_add=True, editable=False)
def clean(self):
try:
if self.last_clock_in < self.datetime_created:
raise ValidationError(
"Last clock in must occur after the employee entered"
" the system."
)
except TypeError:
# Raises TypeError if there is no last_clock_in because
# you cant compare None to datetime
pass
Not using select_related and prefetch_related
Let’s say we write some code that loops over every active organization’s employees.
for org in Organization.objects.filter(is_active=True):
for emp in org.employees.all():
if emp.is_currently_employed:
do_something(org, emp)
This loop results in a query for every employee in our database. This could lead to tens of thousands of queries, which
slows down our application. However, if we add a prefetch_related
to the organization query, we minimize the amount
of queries.
for org in Organization.objects.filter(is_active=True).prefetch_related(
"employees"
):
Adding these methods results in large performance improvements without much work, but adding them can be easy to forget.
For a ForeignKey
or OneToOneField
, use
select_related. For a reverse
ForeignKey
or a ManyToManyField
, use prefetch_related.
We could make this more efficient by starting at the employee table and using the database to filter out results. We still
need to add select_related
since the function do_something
uses the employee’s organization. If we don’t, the loop
can result in thousands of queries to the organization table.
for emp in Employee.objects.filter(
organization__is_active=True, is_currently_employed=True
).select_related("organization"):
do_something(emp.organization, emp)
Adding null to a CharField or TextField
Django’s documentation recommends not adding
null=True
to a CharField
. Looking at our example code, the employee’s reference id contains null=True
. In our
example application, we optionally integrate with our customer’s employee tracking system, and we use reference_id
as
the integrated system’s id.
reference_id = models.CharField(null=True, blank=True, max_length=255)
Adding null=True
means the field has two “no data” values, null and the empty string. Conventionally, Django uses the
empty string to represent no data. By also having null as a “no data” value, we can introduce subtle bugs. Let’s say we
need to write some code to fetch data from our customer’s system.
if employee.reference_id is not None:
fetch_employee_record(employee)
Ideally, you write the if statement to handle any “no data” values by using if employee.reference_id:
, but I’ve
found this does not happen in practice. Since a reference_id
could be either null or the empty string, we’ve created
a bug here where the system attempts to fetch the employee record if the reference_id
is the empty string. Clearly,
this doesn’t work and will cause an error in our system. According to Django’s documentation, one exception exists for
adding null=True
to a CharField
. When you need to add both blank=True
and unique=True
to a CharField
,
null=True
is required.
Descending versus ascending with order_by or latest
Django’s order_by defaults to ascending order.
By adding a -
in front of your keywords, you tell Django to provide descending order. Let’s look at an example.
oldest_organization_first = Organization.objects.order_by("datetime_created")
newest_organization_first = Organization.objects.order_by("-datetime_created")
With the minus in front of datetime_created
, Django gives us the newest organization first. Conversely without the minus,
we get the oldest organization first. Mistaking the default ascending order can result in very subtle bugs. Django
querysets also come with latest, which gives us
the latest object in a table based on the passed keyword fields. The latest
method defaults to descending order as
opposed to order_by
which defaults to ascending order.
oldest_organization_first = Organization.objects.latest("-datetime_created")
newest_organization_first = Organization.objects.latest("datetime_created")
In multiple projects, I have introduced bugs because of the different defaults between latest
and order_by
. Write
order_by
and latest
queries with caution. Let’s look at equivalent queries using latest
and order_by
.
>>> oldest_org = Organization.objects.order_by("datetime_created")[:1][0]
>>> oldest_other_org = Organization.objects.latest("-datetime_created")
>>> oldest_org == oldest_other_org
True
>>> newest_org = Organization.objects.order_by("-datetime_created")[:1][0]
>>> newest_other_org = Organization.objects.latest("datetime_created")
>>> newest_org == newest_other_org
True
Forgetting that clean is not called on save
According to Django’s documentation,
model validation methods, such as clean
, validate_unique
, and clean_fields
, do not get called automatically by a
model’s save
method. In our example code, the employee model contains a clean method that says the last_clock_in
shouldn’t occur before the employee entered the system.
def clean(self):
try:
if self.last_clock_in < self.datetime_created:
raise ValidationError(
"Last clock in must occur after the employee entered"
" the system."
)
except TypeError:
# Raises TypeError if there is no last_clock_in because
# you cant compare None to datetime
pass
Let’s say we have a view that updates the last_clock_in
time for our employee, and as a part of that view, we update
the employee by calling save
.
from django.http import HttpResponse
from django.shortcuts import get_object_or_404
from django.views.decorators.http import require_http_methods
from example_project.helpers import parse_request
from example_project.models import Employee
@require_http_methods(["POST"])
def update_employee_last_clock_in(request, employee_pk):
clock_in_datetime = parse_request(request)
employee = get_object_or_404(Employee, pk=employee_pk)
employee.last_clock_in = clock_in_datetime
employee.save()
return HttpResponse(status=200)
In our example view, we call save
without calling clean
or full_clean
, meaning that the clock_in_datetime
passed
into our view could occur before the employee’s datetime_created
and still be saved into the database. This results in
invalid data entering our database. Let’s fix our bug.
employee.last_clock_in = clock_in_datetime
employee.full_clean()
employee.save()
Now if the clock_in_datetime
comes before the employee’s datetime_created
, full_clean
raises a ValidationError
,
which prevents the invalid data from entering our database.
Not including update_fields on save
Django Model’s save method includes a
keyword argument called update_fields.
In a typical production set up for Django, people use gunicorn to run multiple
Django server processes on the same machine and use celery to run background
processes. When you call save
without update_fields
, the entire model updates with the values in memory. Let’s look
at the actual SQL to illustrate.
>>> user = User.objects.get(id=1)
>>> user.first_name = "Steven"
>>> user.save()
UPDATE "users_user"
SET "password" = 'some_hash',
"last_login" = '2021-02-25T22:43:41.033881+00:00'::timestamptz,
"is_superuser" = false,
"username" = 'stevenapate',
"first_name" = 'Steven',
"last_name" = '',
"email" = 'steven@laac.dev',
"is_staff" = false,
"is_active" = true,
"date_joined" = '2021-02-19T21:08:50.885795+00:00'::timestamptz,
WHERE "users_user"."id" = 1
>>> user.first_name = "NotSteven"
>>> user.save(update_fields=["first_name"])
UPDATE "users_user"
SET "first_name" = 'NotSteven'
WHERE "users_user"."id" = 1
The first call to save
without update_fields
results in every field on the user model being saved. With
update_fields
, only first_name
updates. In a production environment with frequent writes, calling save
without
update_fields
can result in a race condition. Let’s say we have two processes running, a gunicorn worker running our
Django server and a celery worker. On a set schedule, the celery worker queries an external API and potentially updates
the user’s is_active
.
from celery import task
from django.contrib.auth import get_user_model
from example_project.external_api import get_user_status
User = get_user_model()
@task
def update_user_status(user_pk):
user = User.objects.get(pk=user_pk)
user_status = get_user_status(user)
if user_status == "inactive":
user.is_active = False
user.save()
The celery worker starts the task, loads the entire user object into memory, and queries the external API,
but the external API takes longer than expected. While the celery worker waits for the external API, the same user
connects to our gunicorn worker and submits an update to their email, changing it from steven@laac.dev to
steven@stevenapate.com. After the email update commits to the database, the external API responds, and the celery
worker updates the user’s is_active
to False
.
In this case, the celery worker overwrites the email update because the worker loaded the entire user object into memory
before the email update committed. When the celery worker loaded the user into memory, the user’s email is
steven@laac.dev. This email sits in memory until the external API responds and overwrites the email update. In the end,
the row representing the user inside the database contains the old email address, steven@laac.dev, and
is_active=False
. Let’s change the code to prevent this race condition.
if user_status == "inactive":
user.is_active = False
user.save(update_fields=["is_active"])
If the previous scenario occurred with the updated code, the user’s email remains steven@stevenapate.com after the
celery worker updates is_active
because the update only writes the is_active
field. Only on rare occasions, such as
creating a new object, should you call save
without update_fields
. For more reading on this topic, I recommend
this ticket on the Django bug tracker and
this pull request for Django REST Framework.
Important Note
While you can solve this problem in your code base by not calling a plain save
method, third party Django packages might
contain this issue. For instance, Django REST Framework
does not use update_fields
on PATCH requests. Django REST Framework is a fantastic package that I love to use, but it
does not handle this issue. Keep this in mind when adding third party packages to your Django project.
Final Thoughts
I’ve made all of these mistakes, some multiple times. I hope this post sheds light on potential bugs and helps you prevent these mistakes. I enjoy using Django, and I think it’s a great framework for building web applications. However, with any big framework, the complexity can obfuscate, and until you know what to look out for, mistakes can be made.