Transcript: Migrating from Python 2 to 3 at EdX - David Ormsbee & Nimisha Asthagiri

00:00.0

Hello, and welcome to another episode of Django Chat, a weekly podcast on the Django Web Framework.

00:10.8

I'm Will Vincent, joined by Carlton Gibson. Hi, Carlton.

00:13.6

Hello, Will.

00:14.7

And this week, we have two guests from edX, David Ormsbee and Nimisha Asthagiri. Welcome,

00:20.8

both of you.

00:21.3

Thank you.

00:21.9

Thank you. Happy to be here.

00:23.9

Yeah, we're thrilled to have you. So we connected because we met at the Django Boston meetup,

00:28.4

And edX is both one of the largest educational sites in the world, and I believe has used Django from the beginning. So I wonder, we could talk about, do either of you know why Django originally? And then how has that been? Because you've been through a lot of these changes of Django, Python 2 to 3. And so hopefully we can dive into all of that.

00:47.7

The choice for Django, I think, primarily came out of the choice for Python, because edX originally started as MITx, which, you know, it came out of MIT, and so Python was kind of the default choice, right?

01:06.1

The intro CS course teaches Python.

01:09.4

All the UROP interns that are going to work on it know Python.

01:15.3

And the fact that the first course was an electrical engineering course means that having NumPy, SciPy, that sort of set of tools was very convenient.

01:26.2

And so, by the way, these early decisions were made by Piotr Mitros, who created the original prototype for what became edX Platform and was the chief scientist for edX for, I don't know, five or six years, for a while.

01:44.7

What was the timing? Because I believe this goes back quite a ways, right? Before Python was standard at the undergrad level for a lot of places.

01:50.6

This would have been, the first prototype work would have started in, I want to say, October or November of 2011.

01:58.1

So sometime around Django 1.3, I guess.

02:02.9

And so, yeah, and so Piotr also made the decision for Django.

02:08.1

But I mean, once you've gotten it down to Python, then, you know, Django was like one of the obvious choices.

02:17.0

and can i just ask because maybe our listeners don't know could you just give an overview of

02:21.1

what the edX platform oh yeah is and does good call so um let me show you on there oh sure so

02:27.5

yeah edX is um an online education platform it was founded by Harvard and MIT as Dave mentioned in

02:35.0

2011 we do specialize in higher education courses with courses and content from you know one of the

02:43.6

That's to worldwide partners from universities.

02:47.5

And what's the sort of scale that y'all are at right now?

02:50.3

Because it's in terms of, I guess, number of courses and users,

02:54.3

because it's in the millions, right?

02:56.0

Of users anyways?

02:57.1

Yeah, it's tens of millions of users, thousands of courses,

03:01.2

like probably over a million lines of code,

03:04.0

if you include all the Python code, if you include the different repos.

03:08.8

So yeah, it's at about that level.

03:11.1

And the other thing is that we are an open source platform as well. So one of the great things about edX is we have edX.org, which is our website, but then we provide our source code to the community. And the open edX community has, you know, thousands of instances of our source code where they are, you know, providing courses for regional content as well as even national platforms.

03:37.1

Wow. Yeah, we'll link to those in the show notes. Well, so one issue is since you started on 1.3 is migrating from Python 2 to 3. Is that something that has happened? Is it going to happen? How do you think about doing that at such a large site, either one of you?

03:55.8

Everyone's dealing with this right now, by the way.

03:57.3

Yeah, it's something that's still in progress.

04:04.7

We are hopefully weeks away from getting it.

04:08.1

So there have been a number of efforts that we've done to try to bridge that.

04:15.7

One is the, what is it called?

04:19.8

We have INCR tickets, which are basically a lot of,

04:24.2

Because so much of the Python 2 to 3 conversion is, you know, you can run the tooling for it and you just kind of have to sanity check that, you know, it's doing the right thing.

04:34.2

And like 90 some percent of the time it is.

04:37.0

But then, you know, so there was a very conscious effort

04:41.2

to be able to try to leverage our community better

04:46.3

and try to give them, like, really small incremental tasks

04:50.3

that don't require you to know the full stack in order to help.

04:55.5

And so that effort, Jeremy Bowman sort of really helped push that through.

05:02.2

And we did get a lot of contributions.

05:05.3

Like, and right now I think Feanil and company and some other folks are sort of driving the last

05:11.6

bit. Like, once it's working enough so that the tests are running on both environments, like,

05:17.1

it's been this, uh, weeks of just, you know, find the breaking thing, watch 300 more tests

05:24.4

pass, find the next breaking thing. And then there's the, and it's interesting because there's the,

05:30.9

like, there's the easy part, relatively easy, which is just, well, there's the strings, right? And then

05:36.9

there's the, like, code sniffs on previous stuff, and then there's the real stuff. Yeah. Yeah, there's,

05:42.4

like, rounding behavior is a little bit different. And, you know, if we were to change how grade

05:47.6

rounding worked anywhere in the stack, there would be consequences. And then even in terms of

05:54.0

backward compatibility, if we, you know, chose to UTF-8 encode the wrong way or whatnot, it's

06:00.5

possible that our hash values are not the same as before, and then we might have run into, you know,

06:06.6

backward compatibility issues once we actually launch this in production. Um, yeah. And also, just

06:12.0

to give an idea, like, in our code base we do have a monolith with, you know, close to a million lines

06:18.3

of code, depending on which, you know, which code you count. Um, but then we also have a few, like,

06:23.6

satellite microservices that run, you know, around our monolith. And so each

06:30.4

service needed to be upgraded. And as Dave mentioned, like, there was an

06:35.8

effort to try to do some of this initially with the community, but, you

06:38.9

know, because we're running against the deadline, we've actually taken a lot

06:41.9

of this resourcing in-house to try to make this happen. So, and, you know what,

06:48.3

there are definitely some things that we could do in terms of, like, regex

06:52.1

replacement, you know, in the strings. There's definitely, you know, just converting, because

06:57.1

response.content, you know, is now being returned as bytes. You know, just,

07:03.0

there was a lot of code that just said assertIn in our test code, searching, you know,

07:07.8

within response.content. So replacing that to be assertContains or assertNotContains,

07:12.1

doing that in one bulk, you know, regex replacement in your IDE and whatnot, that definitely helps,

07:17.5

you know, get us a long way, but there's a lot of these intricacies that we find as we are going

07:23.9

through. We're like, okay, what's the right way of making this change? And then are there any

07:28.3

principles that we can follow? For instance, doing the conversion from bytes or conversion from

07:34.6

strings to bytes, can that be closest to the perimeter, right? So, you know, throughout our

07:40.1

code base, as things are being passed, right, from one method to another and whatnot, you know,

07:43.9

pass them as strings as they were before, because that's, you know, that's what the business logic

07:48.2

is. But then only when you need to serialize it, let's say, persisting it in a CSV, or sending it

07:54.0

over the wire, or things like that, that's when you actually then do the conversion. Because one

07:58.0

of the concerns that I've always had with some of the Python three upgrade was, I felt there was so

08:03.1

much there. I wanted to minimize the amount of code that has to do with strings, so that the

08:09.7

business logic continues to scream back at you, right?
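The "convert at the perimeter" principle described here can be sketched in a few lines. This is only an illustration under stated assumptions, not code from edX Platform: the function names (build_report_rows, serialize_csv, parse_username) and the sample data are hypothetical. The business logic handles only text, and bytes appear in exactly one place on the way out and one place on the way in.

```python
import csv
import io


def build_report_rows(enrollments):
    """Business logic (hypothetical): works only with text (str), never bytes."""
    rows = [("username", "course")]
    for username, course in enrollments:
        rows.append((username.strip(), course.upper()))
    return rows


def serialize_csv(rows, encoding="utf-8"):
    """Perimeter, outbound: the single place where text becomes bytes."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue().encode(encoding)


def parse_username(raw_bytes, encoding="utf-8"):
    """Perimeter, inbound: decode once, then pass str everywhere else."""
    return raw_bytes.decode(encoding)


# Text flows through the business logic; bytes exist only at the edges.
payload = serialize_csv(build_report_rows([("zoë ", "cs101")]))
```

Keeping the encode and decode calls in two small functions means a reader of build_report_rows sees only the domain rules, which is the readability goal stated above.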

08:12.6

You don't want, you don't want your code to be,

08:16.2

you know, you want your code to be as readable as possible.

08:18.4

You want the business logic to be the one

08:20.4

that's screaming back.

08:21.5

You don't want it to be, you don't want to get too caught up

08:24.1

on okay, what's happening in the details

08:25.8

with the string conversions.

08:27.2

So the more that we could try to keep that logic separated,

08:30.7

the better.
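Beyond string handling, the rounding and hashing concerns raised earlier in this answer can be made concrete with a small sketch. It assumes nothing about how edX actually computes grades or hashes; round_half_up and stable_digest are hypothetical helper names. Python 3's built-in round() resolves ties to the nearest even number, whereas Python 2's round() resolved ties away from zero, and hashlib in Python 3 accepts only bytes, so the choice of encoding changes the digest.

```python
import hashlib
from decimal import Decimal, ROUND_HALF_UP


def round_half_up(value, places=0):
    """Round ties away from zero, the tie-breaking rule Python 2's round() used.

    The value goes through its decimal string form, so this is an
    approximation of the old behavior, not a bit-for-bit reproduction.
    """
    quantum = Decimal(1).scaleb(-places)  # places=2 gives Decimal("0.01")
    return float(Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_UP))


def stable_digest(text, encoding="utf-8"):
    """Hash text with an explicit encoding so stored digests stay comparable."""
    return hashlib.sha256(text.encode(encoding)).hexdigest()


# Python 3 rounds ties to even; the helper keeps the older tie-breaking rule.
py3_result = round(2.5)
legacy_result = round_half_up(2.5)

# The same text hashed under two encodings yields two different digests.
utf8_digest = stable_digest("café")
latin1_digest = stable_digest("café", encoding="latin-1")
```

A grade that sits exactly on a boundary can therefore land on a different side after the upgrade, and a digest computed with a different encoding will not match one stored before it.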

08:31.9

So that was, you know, that was one of the few principles

08:35.1

that I wanted to see, make sure that happened

08:39.0

doing this conversion. Well, it does seem that's the hierarchy of any conversion from two to three,

08:44.0

where you start off with strings, and those are pretty easy. And then you get to the asserts.

08:47.5

And that's relatively easy, but takes a lot of time. And yeah, and then it gets real with all

08:52.6

these other issues. Well, and on business logic, so this is actually, I wanted to highlight this.

08:58.0

We talked about this two weeks ago at Django Boston, because you had a really interesting

09:02.7

take on where to put the business logic. Because this is a question that Carlton and I get.

09:08.0

And for a lot of people, once you get to a decent-sized Django project, it's hard to

09:13.2

cram it into the models or the managers or the views.

09:16.6

Yes.

09:17.1

And David and I, with other architects here at edX, definitely have gone back and forth

09:23.2

on this.

09:25.5

And so there's, you know, to step back a little bit, like the reason why we're driving for

09:31.4

this, right, is we're trying to scale development of our platform.

09:34.9

And over time, it's definitely grown organically in terms of the code that's there and whatnot.

09:40.0

But now in order to allow our code base to be approachable by new developers as well as existing developers and for us to continue to maintain it, how can we do that in a way that, once again, it's easier to understand what the business logic is?

09:55.6

And so we've actually had a book club here with, you know, Dave was participating and others here as well for domain driven design and domain driven design, right, really talks about being very, very focused on what is the domain that you are building for and having those concepts and having the boundary, the bounded context and those boundaries being explicit.

10:22.8

So using that as a mindset, we were thinking, okay, well, given that we have a monolith, monoliths don't necessarily always have to be split into separate services.

10:32.7

Can you still have great abstractions in place while you're in the monolith?

10:36.5

And then if we choose to extract them as a secondary step for other reasons, that's definitely a thing that can still happen.

10:44.0

However, even within the monolith, what are those bounded contexts?

10:47.4

How do we make our code base such that it's understandable of what those interactions are?

10:54.0

Otherwise, there's a lot of tight coupling, and it's really hard to understand.

10:58.0

So, yeah, I can go.

11:00.2

So, Dave, you want to add anything to that?

11:01.8

Well, I guess one of the other things that I would add is that edX platform started as an application, right?

11:09.5

Like it started as an LMS and studio setup for a particular set of courses at edX.org.

11:17.4

And it was made into, you know, an open source project.

11:21.8

And it was intended to be that, but as a sort of, you know, the name is, I mean, the name is edX Platform, right?

11:29.1

The idea is people are supposed to be able to build extensions on top of it.

11:33.8

And so, which a lot of times you have a much stronger need for these, like, kind of internal APIs between the apps.

11:41.1

Because, you know, you might want to take edX Platform, but you want to add your own thing that accesses enrollment data, for instance, or like because you have your particular special feature that you're kind of adding on to the side of it.

11:58.1

Yeah, so and having a sort of strong abstraction layer that you can like a sort of stable API interface within the application would be very, very helpful for that.

12:10.2

And we've tried to provide that in various places, but it's, you know, it's, it's a hard problem in a lot of senses. Like there, there are a lot of things that if you try to go into the sort of, if you try to move your logic outside of the, outside of the models, like you cut against the grain of Django in a bunch of, in a bunch of ways, right?

12:32.1

Like, you know, where are you going to do your validation if you want to have the same set of business logic that powers, like, you know, a REST framework like Django REST framework and your views there?

12:45.6

But you also want to have not, like, repeat yourself for, like, internal APIs.

12:51.0

Like, where does that logic go?

12:52.9

Like, your validation's in the serializer here, but, you know, are you going to call the serializer explicitly from your in-process APIs?

13:00.6

And, you know, you want to pass back query sets probably in case people want to modify them because, you know, it's often like, hey, get me these enrollments plus this data that I've keyed off of enrollments because I've added my own table here.

13:18.3

Like, how do you do that in a performant way that doesn't encourage everyone to do like N plus one queries?

13:22.8

So, you know, there's all these parts where you try to, yeah, it's been challenging for us to do that.
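The N+1 concern raised here can be illustrated with a minimal sketch outside of Django. It uses Python's built-in sqlite3 module; the enrollment and extra_data tables, the sample rows, and the function names are all hypothetical and are not taken from edX Platform.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE enrollment (id INTEGER PRIMARY KEY, username TEXT, course TEXT);
    CREATE TABLE extra_data (enrollment_id INTEGER, note TEXT);
    INSERT INTO enrollment VALUES (1, 'ana', 'cs101');
    INSERT INTO enrollment VALUES (2, 'ben', 'cs101');
    INSERT INTO enrollment VALUES (3, 'cy', 'ee201');
    INSERT INTO extra_data VALUES (1, 'honor');
    INSERT INTO extra_data VALUES (2, 'verified');
    INSERT INTO extra_data VALUES (3, 'audit');
    """
)

executed = []  # every SQL statement issued, so the two approaches can be compared


def run(sql, params=()):
    """Execute a query and record it."""
    executed.append(sql)
    return conn.execute(sql, params).fetchall()


def notes_n_plus_one():
    """One query for the enrollments, then one more query per enrollment."""
    result = {}
    for enrollment_id, username in run("SELECT id, username FROM enrollment ORDER BY id"):
        rows = run("SELECT note FROM extra_data WHERE enrollment_id = ?", (enrollment_id,))
        result[username] = rows[0][0]
    return result


def notes_single_query():
    """The same data fetched with a single JOIN."""
    rows = run(
        "SELECT e.username, x.note FROM enrollment e "
        "JOIN extra_data x ON x.enrollment_id = e.id ORDER BY e.id"
    )
    return dict(rows)
```

With three enrollments the first function issues four queries and the second issues one, and the gap grows with the table. In Django the usual tools for this are QuerySet.select_related and QuerySet.prefetch_related, which is one reason an in-process API might hand back a query set rather than a plain list.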

13:30.6

And I was going to say, and then to scale up to from a single application all the way up to like a million lines of code, there must have been teething points and difficult areas.

13:43.7

But then the goal, I guess, and I guess the question would be, do you think at a million lines of code that the general structure is still comprehensible?

13:51.7

Right, no, so you would, this is why it's, at that scale of the code base, it's more

13:58.2

important, I believe, to be able to understand the historical traces, because unless we

14:04.0

do put a lot of resources to make everything consistent, you will see a heterogeneous code

14:12.3

base, right, with some that haven't yet been updated to the latest direction and some that

14:16.8

are still a couple of years behind in mindset and so forth.

14:21.4

So we do have this concept of,

14:23.4

architectural decision records definitely you know this was also inspired by Michael Nygard who

14:29.1

from you know had and ThoughtWorks had promoted this as well but architectural decision records

14:33.5

are very easy, very lightweight documentation. They're immutable documentation records of a decision

14:41.1

that you've made and that goes into the docs itself it goes into the code base exactly in

14:46.0

a decisions a separate decisions directory and you know you have a context the decisions and

14:51.6

the consequences in each ADR and so for instance what the directory structure should be of our

14:59.2

monolith code base or you know that that could that could be an ADR and then this way people

15:04.4

know when exactly that was that was decided and then if they do see code that's not yet up to

15:09.9

date they can realize okay that was written earlier and if I'm now going to be working on

15:14.1

that part of the code base let me bring it up to the latest you know code standard so but yes so

15:20.4

in terms of maintenance of a large code base we need to you know develop processes such as this

15:26.3

to just sort of keep everyone up to date on because you're going to see broken windows everywhere so

15:30.5

which is which is the right window to follow and which ones are broken so i really like that idea

15:36.3

of the architectural decision record it's like leaving breadcrumbs so look here's the path yeah

15:41.3

and and one thing that we also found with going back to the where the business logic should be

15:45.8

kept, I mean, we were thinking about Django and initially when, so Dave had had, Dave was the

15:51.1

founding architect here. So Dave knows a lot about the initial war stories and how that happened. I

15:56.8

joined five, about five and a half years ago. So by then some of the architectural decisions were

16:02.6

already made and there was a lot of organic, organically made changes. So it was, it's actually

16:10.2

a great partnership I believe because I'm able to like have a little bit more fresher eyes and you

16:16.2

know Dave gives a lot more historical context but when I when I approached a code base I saw a lot

16:23.3

of divisions by technical concerns and this is a thing about Django's out of the box right where

16:29.5

you're thinking about views and models and urls.py and they are in my mind like they're very much

16:35.5

technical concerns of as an engineer like how to think about how to construct something but

16:41.6

if you want to figure out what are the business concerns what is it that this app is actually

16:47.5

trying to do you don't see that right away when you see these files everyone's you know just

16:52.7

models and urls and views or whatever so um so i think you know having a way to make that come

16:59.1

across much better is what we're looking towards and then one one great example of where this

17:05.5

applies is right now when we're now moving towards a different technical technology stack base when

17:12.5

we're thinking about moving our front end logic to be in react micro front ends right so now whatever

17:20.2

business logic happened to be within views.py or even in a template or you know it shouldn't have

17:27.0

been in the template in the first place but anyway some of our code was so like some of those

17:31.0

things were in models.py and views.py and whatnot so like how do we if we had better abstractions

17:37.7

in place it would have been easier for us to do this replatforming effort and so I think from the

17:44.6

get-go if we could start our Django projects and apps in a way that it's clear where the business

17:51.0

logic is and let the web framework be a detail and this is what like i'm really talking you know

17:58.8

echoing Uncle Bob here, but, um, you know, the web is a detail. Whether it's exposing via REST API is a

18:05.7

detail whether it's being exposed via channels and um eventing models that's also a detail but

18:11.8

the business logic can stay more stable and more core to the app so that's how i've been thinking

18:17.6

about it. And I think Dave's bringing in the perspective also about the scalability and

18:22.0

performance impact, right? So like, if you do have these great, nice abstractions, Nimisha,

18:26.6

well, what are we going to do about those queries and, you know, the scalability of making sure that,

18:31.9

you know, when we don't, we're not always translating to APIs when, you know, like we

18:38.0

actually, if we do need to go directly to the models, we can in order to be performant.

18:43.0

I mean, I guess a couple of things I do want to add to that is just one, like Nimisha has been at edX for like five and a half, six years, the last couple of which she's been chief architect. So she's, she's got plenty of context. I do, I do have some of the like old stories.

19:00.8

But another thing is that, to be clear here, a lot of the early work on edX Platform, what became edX Platform, was not very intentional in a lot of cases.

19:14.3

You take a group of people, most of us had done Python work before.

19:20.3

Some of us hadn't.

19:21.1

You can actually tell parts of Studio that are written in kind of a Java accent.

19:26.5

But we had never done Django.

19:30.7

We had never done large Django.

19:32.6

And while Django docs are great for many things in starting out,

19:37.4

there was not as much guidance for,

19:40.8

Like, here's how you build this enormous application or the foundations of this enormous application or project, you know, that is going to grow to this size.

19:50.4

And even if there were, we probably wouldn't have had time to read it because we were in like frantic scramble mode.

19:57.6

Those entire buds are this hazy, sleep deprived thing that I just.

20:02.8

So so so certainly like things were put together very quickly.

20:08.1

And now we are trying to address some of those things, but as Nimisha said, it represents different generations of code and philosophy and expertise and understanding, frankly, of Django itself.

20:25.4

And so I can look at a piece of code completely out of context and tell you approximately what year it was written in, just because of the idioms that shift over time.

20:36.7

Right. Well, there's also been, I mean, Nimisha, what you were saying about React and the logic from views to the front end. I mean, there's also been that shift in the last four or five years where, I mean, state in general and logic has moved to the front end. So I don't see how you could have architected for that five years ago because, I mean, was it 2013 when React came out or something like that?

20:58.7

I mean, that's also just the case of being flexible.

21:02.0

And I liked your idea of just having the Django piece, the architecture be as kind of simple as it is so that you can go back and forth as really as the web changes with how to do state and how to show things.

21:15.0

I mean, who knows if that's going to switch back.

21:16.8

Right. So for us, so what we started prototyping and assessing, and hopefully we'll solidify into a stronger best practice, is to actually have a separate module. Right now we're calling it api.py, but essentially it's basically the domain_logic.py, you know, where your business logic is.

21:40.9

And then the views.py file within a Django app would, you know, consume the functions or, you know, classes that are within those api.py files.

21:54.5

And that layer, let's just call it domain_logic.py, that layer then is also an abstraction above your models.py.

22:04.5

And other apps that may exist in your Django project, they cannot go directly to your models.

22:11.0

They won't access your views or anything else directly, but they would go through this interface right above your domain_logic.py.

22:18.5

So that's how we're thinking about it.

22:20.4

And so this way, if we want to then have Django signals or Django channels integration or some other eventing, if we want to have a Kafka layer later.

22:31.0

And, you know, these are all ways of communicating out.

22:35.2

REST APIs are just one way of communicating out, right?

22:38.0

And so that's why views.py then becomes much thinner, in that its responsibility would be more around the authentication layer.

22:52.2

Like other things that you might want to do, like response HTTP, converting things to proper HTTP response codes and formats and things like that.

23:00.2

So that's what it becomes: it has this very separate separation of concern from where

23:07.7

the business logic is. And models.py's concern, you know, its responsibility, would

23:13.4

be more around the data. Um, and so that's where I'm thinking we're going to try

23:19.0

some of our Django apps, and trying to implement it in that way. And we're hoping that will then

23:23.4

allow us to evolve as the web evolves and as, you know, all of our industry evolves. So,

23:29.9

Yes, but you're completely right that there are going to be things that we cannot anticipate.

23:34.2

So we'll always need to figure out how to refactor as we go.

23:37.0

But like currently, that's one way that we're thinking about it.

23:40.8

Carlton, how does that ring for you?

23:41.9

Because I was so struck by this conversation we had right before DjangoCon that at DjangoCon, basically the conversation I had with everyone was, where do you put the logic?

23:49.9

Where do you put the logic?

23:50.6

You know, and folks at really large sites and it's basically always somewhere else.

23:55.7

you know they call it something else but it's basically they yeah somewhat abstracted from

24:00.9

the traditional django hierarchy but carlton what what were you going to say right so so the sort of

24:06.2

um basic that you know to much simpler django example is let's say you um have your uh your

24:13.7

business logic your model validation in your save method right you've got just well fine okay so but

24:19.2

then what happens you've got a model form on top of that and then you your view validates the model

24:23.7

form it says yeah this is valid data and then so instead of returning 400 saying this is a bad

24:29.2

response this is a bad request you've got invalid data it says no this is okay let's go to save and

24:34.6

then you end up raising an error at the save point which turns into a 500 that's a server error

24:39.4

right so you don't want your validation logic in save i mean you might you might want to use it in

24:45.8

save but you also want it available to your form or to your serializer or wherever so the view can

24:50.5

use it, so as to error earlier and say to the user, hey, what you've given me isn't correct, can

24:55.9

you try again yeah and i think that's one thing that we also keep running into and honestly i'm

25:01.2

not completely happy with the sort of trade-offs even where we're landing it's just that it feels

25:07.3

like you're fighting the framework right in in a lot of ways like this is not like the like the

25:13.6

the primitive pieces that you're given don't plug in like don't connect to each other in quite the

25:18.6

way you want to to make this sort of abstraction work and you can do it but you know on either

25:25.3

you get kind of you can get kind of clunky code and like or like duplicated code and kind of

25:31.4

repeat that the validation call like somewhere else explicitly so yeah i i yeah i don't know but

25:37.6

one of the big issues that we find on uh passing models around for instance and sort of not having

25:46.2

that layer is that one, those things can change, right?

25:53.0

Like, you know, we can add stuff to it or whatever.

25:55.6

Two, it's like the model, passing around models is like this huge implicit interface, right?

26:00.8

That you're just kind of just throwing around everywhere because anyone can say, hey, I've

26:06.0

got this.

26:07.1

Now, let me like just grab that class, do a query and, you know, sort by this unindexed

26:14.9

field um you know on the field that on the table that has billions of items in it and like great

26:22.7

like you know and and even scarier it's like maybe that does that's fine on your machine

26:27.9

that's fine for your system because your system you've got you know two classes and 50 students

26:33.9

and hey like it works okay and then you know you bring it over and you try to merge it upstream

26:39.0

and it's like no like this is yeah yeah no if you could scale up you need to have more strict

26:44.3

rules and and sort of having an explicit layer where we say like these are the things you're

26:50.9

allowed to call and these are the things that we can make some kind of performance guarantees

26:55.1

around uh is is really it's really important for us and and also to clarify i mean there's

27:03.1

definitely a maturity model right like i don't i think what we are where we are arriving is

27:08.9

because of the scale that we are at exactly you know for a company that's starting up fresh and

27:13.8

new and with Django there's so many great things that Django comes out of the box and

27:18.3

actually its defaults are probably good enough for what you need at that time.

27:24.8

But I think depending then as you scale, perhaps by users, by developers or whatnot, then you'll

27:31.2

have different concerns and different requirements.

27:34.2

So for us, for instance, even this, what we talked about, even for edX at scale, we may

27:40.6

not apply this design pattern that we're talking about, like domain_logic.py, whatever, to the

27:46.5

entire code base right i mean this might be more for in domain driven design they have these terms

27:51.8

called core supporting and generic and core is the one that's more of your core value proposition it

27:57.3

has more of your domain concepts and so at that layer within your core you might want to have

28:03.3

these this way and this design pattern but the things that are perhaps more more volatile and

28:09.2

more in the periphery and you do want to change those more quickly and experiment you know you're

28:14.4

going to just use DRF right off the bat and use the serializers and models that apply directly. I

28:18.4

mean fine you know that's quick you're trying to but then if after some time after a couple years

28:23.3

or or months you realize oh no this is a great core concept then you might want to figure out

28:28.7

how to stabilize it more so there's definitely a maturity model of the company of the code base

28:34.3

and then even of a feature but also the return on investment there's no point doing this kind

28:38.7

of super engineered high scale thing for a proof of concept.

28:43.6

You do a proof of concept, test the concept, does it work, then we'll

28:46.9

put the extra resources in and even for us like our monoliths the way that we're going about and

28:52.6

thinking about it is um are there parts of the monolith that are core to the business and core

28:58.0

to the platform and other things that are more extensions can we build them as plugins and one

29:04.4

thing that edX has developed and we'd love to contribute back to actually the Django community

29:08.9

once we extract it out of the monolith because right now it's still within it but there's a

29:13.6

something that we're calling Django app plugins. And it's built on Python's Stevedore technology

29:20.0

that will allow one to basically, you know, import their own extensions or plugins to a monolith.

29:30.7

The monolith provides some interface perhaps, but the plugin, its own view, its own urls.py and

29:38.4

even installing it, all of those things can be automatically detected via Stevedore. So anyway,

29:45.8

it's a it's a great technology we found it to be very valuable our our open source community

29:51.4

really appreciates it because they don't if they need to make a change or want to add something

29:57.0

they don't need to fork the entire monolith they can just create their own plugin and have it

30:01.8

automatically be detected by the monolith. So, um, it very much goes with the SOLID design principles,

30:08.4

with dependency inversion and and things like that so anyway that's that's where is also an

30:14.1

idea you know this concept of what's core and what's not core making that more explicit

30:18.8

um, and the things that are not core, how can we incorporate them, um, with appropriate

30:24.4

bounded contexts and boundaries yeah and this goes back to what david was saying about

30:28.4

um giving people a set of apis they're allowed to call that you know are safe right well one thing

30:34.2

uh, I wanted to go over is, uh, feature toggles, right? Because I believe you use django-waffle now.

30:40.9

because this is another uh again i was having this at django con with so many people when you're at

30:45.1

scale rolling out new features um you don't always want to just turn it on so i guess how um what did

30:52.4

you do before django waffle and how do you think about turning things on given the size of your

30:57.3

community? Dave, you want me to take this, or... Yeah, you wrote that OEP, I think, so,

31:03.0

I'm sorry, the OEP, by the way, is our sort of proposal process. So we have architecture

31:09.3

decision records that are specific to a given repo they're kind of local decisions by the team

31:14.6

on their particular like this this repo um but if you have something that affects that has

31:21.0

implications across all repos, like for org-wide engineering, then we have the Open edX Proposal

31:27.1

process, where we have sort of more general guidelines, and Nimisha was the one who wrote

31:32.4

up the one on sort of feature toggles and such. Well, it's like Django has its DEPs. I mean, it's a

31:38.3

very similar... And OEPs were, so Kale at edX, like, he came up, I think, with that term, but

31:45.6

and this process, where, yes, it was inspired by PEPs. So we have, you know, OEP-1, OEP-2,

31:52.0

whatever. And so this is Open edX Proposals, similar to Python Enhancement Proposals.

31:57.0

but yeah so there's one on feature toggles and definitely the reason we we decide we need to

32:02.5

have some good guidelines and best practices for it is because um we do when you're when you're

32:09.3

now we're talking about scaling um and even the deployment of features right and we don't want

32:17.0

We want to be able to allow teams to try features and whatnot in a way that it is more controllable.

32:24.9

We want to decouple enablement of a feature from deployment of the code.

32:31.7

So this way, teams have their own autonomy on when exactly they enable a feature toggle and how it is rolled out to the user base.

32:44.1

Perhaps they want to have it in beta testing and then eventually roll it out to a larger user base and so forth.

32:51.7

So anyway, we can share the OEP in the podcast notes if you want.

32:58.2

Yeah, it goes through a bunch of different like use cases.

33:01.5

So, you know, like I said, beta testing is one.

33:05.5

Some of the other ones might just be that we want to for operational reasons.

33:09.9

We want to just roll out gradually in case we are concerned there might be a performance issue or scalability issue or even functional issues and whatnot.

33:18.6

And so it allows a lot more control over that.
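
A minimal sketch of the gradual rollout use case, assuming each user is assigned a stable bucket from a hash so the same user keeps the same answer as the percentage grows. The function names, the toggle name, and the beta-tester set are all hypothetical; this is not Django Waffle's implementation, only an illustration of the idea.

```python
import hashlib


def bucket_for(toggle_name, user_id):
    """Map a (toggle, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{toggle_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(toggle_name, user_id, rollout_percent, beta_testers=frozenset()):
    """Beta testers always get the feature; everyone else gets it by bucket."""
    if user_id in beta_testers:
        return True
    return bucket_for(toggle_name, user_id) < rollout_percent
```

Because the bucket depends only on the toggle name and the user, raising `rollout_percent` from 10 to 50 only adds users; nobody who already had the feature loses it.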

33:22.3

And the thing about this, though, is this, like feature toggles are great and it gives you that control, but then also increases the permutations in your code base.

33:32.4

Yeah, yeah, exactly.

33:33.7

You know, which set of permutations are your tests actually going to cover?

33:39.7

And so there's this also process of how do we then make sure that feature toggles that were created and all of those code branches are then deleted once they are no longer needed?

33:52.1

Yeah, exactly.

33:53.0

The cleanup is always the hard part.

33:55.0

Yes, yes.

33:55.9

And so making teams accountable and reminding them of that.

33:59.0

So this OEP covers a little bit of that process as well and allowing us to have a tool and a reporting mechanism for understanding exactly when was the feature toggle created, when is the expiration time for it, what was the use case for it, and so forth.

34:14.1

So that allows us to then monitor those toggles.
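
A minimal sketch of that kind of report, assuming each toggle is annotated with its use case, creation date, and expiration date. The `ToggleRecord` fields, the toggle names, and the dates are hypothetical illustrations, not the format the OEP actually specifies.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ToggleRecord:
    name: str
    use_case: str
    created: date
    expires: date


def expired_toggles(records, today):
    """Return names of toggles whose expiration date has passed, oldest first."""
    overdue = [r for r in records if r.expires < today]
    return [r.name for r in sorted(overdue, key=lambda r: r.expires)]


records = [
    ToggleRecord("new_progress_page", "beta testing", date(2019, 1, 10), date(2019, 6, 1)),
    ToggleRecord("async_grades", "gradual rollout", date(2019, 3, 1), date(2019, 12, 1)),
    ToggleRecord("old_experiment", "performance check", date(2018, 5, 1), date(2018, 11, 1)),
]

overdue = expired_toggles(records, today=date(2019, 9, 1))
```

A list like `overdue` is what a team could be reminded about when it is time to delete the toggle and its dead code branches.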

34:20.2

And yes, Will, as you said, we are using Django Waffle.

34:23.2

Django Waffle is great. It allows us to, you know, specify whether toggles are on or off and then also which subset of users we want to turn them on for and things like that.

34:33.0

Yeah, it does seem to be the default that I can tell anecdotally that companies are using for this right now.

34:39.7

Yeah.

34:41.0

So another thing we've talked about I wanted to highlight.

34:45.2

So Django Celery usage.

34:47.1

And this is particularly relevant because at DjangoCon, there's a lot of talk around async starting to be rolled out.

34:53.5

And so both how you use Django Celery and I guess the huge question is if and when Django is fully async, do you see any use cases at edX or is it more of a side thing?

35:05.2

Because that's like the two-parter kind of for a lot of folks.

35:08.7

Like, do I actually need it when Celery and queues work pretty well?

35:13.4

So that's a lot of questions, take whatever you want to answer.

35:17.1

Yeah, I'll start with a few principles and design concepts that I have, and then Dave can talk a little more of the details. That sounds good.

35:26.1

Okay, so, I mean, for us, because we were thinking about running at an edX scale, right, you know, we want to keep the response time back to the users, you know, within one to two seconds.

35:40.6

And if there are going to be some tasks or some requests that are going to take longer, either because we need to recompute your grades or we need to, you know, do some extra work in the background and whatnot,

35:54.4

it's very important for us to separate those into an asynchronous task and one

35:59.4

of the things that we found was that being intentional about perhaps even

36:05.9

separating our reads from our writes is very valuable. And so when

36:12.1

someone is so basically don't do too many side effects and don't do too many

36:16.4

costly operations within a request especially if it's a user facing request

36:21.1

and the user wants a response quickly.

36:23.0

And so you need to be able to put some tasks

36:26.6

into an asynchronous operation.

36:31.6

And so Django Celery was definitely one

36:33.8

that is a technology that we've used.

36:36.1

And initially we were doing it on RabbitMQ,

36:38.8

then we converted to Redis for scalability reasons

36:41.1

and other reasons.

36:41.7

But that definitely has been a way for us

36:46.3

to scale out our infrastructure.

36:48.2

I would love, though, to move towards another model

36:51.0

Because one of the downsides with Celery is that as the task that – basically, it's not a fully PubSub model.

37:01.5

So you do need to – if you want to be able to have the Celery task run on a separate service, you do need to know what – you know, like the – it's not very easy to have an API that is more like a – the subscriber needs to know what the publisher's API and vice versa.

37:20.0

right you know i mean like it's you're you're too tightly coupled well instead in celery you import

37:26.2

your entire django project right so you have to import the monolith into the celery instance to

37:30.5

run the queue and it's like it would be nice if if tasks could be sort of separate if they didn't

37:36.2

have to know exactly exactly and so um so anyway so that that's where i would like to lead to

37:42.7

eventually and i was thinking about more of an eventing architecture but but maybe there's

37:46.3

something else and i need to learn more about async and what it provides and there might be

37:50.2

some things out of the box now once we upgrade to django 2 in a couple of months but yeah well

37:55.4

Django 3 is when it starts to get real, and even then, 3.0 has the ASGI server, and then the

38:02.3

plan, I believe, Carlton, is 3.1 will start with views, and then the ORM thereafter. More stuff

38:10.7

thereafter we tend to hug the long-term long-term support releases oh don't get carlton started on

38:16.6

this yeah yeah well well because um you know because most people do most people do yeah but

38:23.4

because also we need something to hopefully say well like when we put out a version of uh a release

38:28.7

of Open edX for people to use, because we do have sort of... like, we named them after trees.

38:34.6

But so, you know, we put out Ironwood and universities that use it are not going to upgrade until like maybe the next, you know, the following summer or something when school's not in session.

38:47.5

And so, yeah, we're not going to see async on the big repos.

38:54.0

So the smaller repos tend to go up faster because there's just less inertia to move.

39:01.5

And also fewer groups run the e-commerce service that we have as opposed to, you know, everyone runs edx-platform because it's where the LMS and the authoring environment are.

39:14.2

right um so yeah on in terms of the async though like there are features i would love to

39:20.1

to play with. Yeah, like what? I mean, Channels is great. I played around with version one

39:27.6

of Channels before it became... I can't play around with version two on edx-platform because,

39:31.9

you know... Well, Carlton's in charge of maintaining Channels now. Okay, that's... No, it's really cool

39:38.5

stuff. And frankly, there's a lot of stuff around, like, WebSockets and such that we'd love to use.

39:43.2

you know having async support for like service service calls that are not blocking can make you

39:49.4

know, better use of resources, is great. Although I think edx-platform is a bit

39:56.0

unusual for a Django project in that a lot of our performance issues or performance challenges are

40:02.2

actually CPU- and cache-related. Really? Yeah, I know, that's... So there,

40:10.2

so okay what are you processing like what's because okay because as we've been talking

40:17.3

I've imagined that you've got database scaling issues, you've got a lot of content, you've got

40:20.9

videos you've got this course content you've got other loading sites uh loading pages but

40:25.9

where are the calculation issues? That's kind of interesting. Okay, so this is, um, okay, so this

40:32.8

This is, yay, history stuff.

40:36.1

So when edX started, right, we were, like, again, like, imagine, so it started with a course, a single course that was being offered to both MIT students and to, and as a MOOC version that ran sort of two weeks behind the MIT cohort.

40:54.2

And, you know, like, okay, when Coursera, if, like, those early days, if, like, Coursera or Udacity, they can say, oh, you know, we're going to push back the start of this course by two weeks because, you know, whatever to help iron out.

41:11.7

And they're, you know, non-paying students are like, okay, that's cool, whatever.

41:16.0

You don't tell MIT you have to delay the semester for two weeks, right?

41:20.0

Like, that's just not going to happen.

41:23.8

So a lot of those things were, like, really, really

41:30.8

sort of very quickly put together, and one of the things that was put together was that the course format in those early days for course

41:36.2

teams was XML. It's like a giant XML file. In fact, it was a giant XML file that was a Mako template.

41:43.0

so

41:44.2

And so the entire like definition of what the course is all the sequences all the problems were this big file that that Django read

41:51.3

on startup. And it's really easy to create a set of objects like this very quickly.

42:00.8

And that's fine for the prototype. But obviously, the prototype days are over and we are in the

42:07.3

world now where everything's in databases like you would expect. But a lot of the access patterns,

42:13.7

if you look in those early days, when you're trying to prototype something, you're basically

42:18.1

You're going for maximum power with minimum code.

42:22.2

And you can create, it is easier to create really powerful interfaces than it is to create performant data models, right?

42:31.8

As an example, if you look at the grading code, the original grading code for edx-platform was basically, okay, here's my tree of content, everything in the course.

42:43.0

I'm going to crawl through and go through the whole tree

42:47.4

by checking my children.

42:48.9

And for each node that can be graded

42:51.0

as a gradable thing, like a problem,

42:53.2

I'm going to ask that pluggable interface,

42:55.8

hey, how many points are you worth?

42:58.5

And how many points did the student earn

43:01.1

based on their current state of your problem?

43:04.3

Because we had this notion of XModules back then,

43:06.9

and videos were XModules,

43:08.7

problems were XModules, everything.

43:10.3

So you had some common set of interfaces you could ask.
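
A minimal sketch of the crawl described here, assuming every node exposes the same two questions: how many points it is worth and how many the student earned. The `Block` class and `grade_course` are hypothetical stand-ins, not the original XModule interface.

```python
class Block:
    """A node in the course tree; leaves such as problems may be gradable."""

    def __init__(self, name, children=(), max_score=None, earned=0):
        self.name = name
        self.children = list(children)
        self._max_score = max_score
        self._earned = earned

    def max_score(self):
        # None means "this node is not gradable" (for example, a video).
        return self._max_score

    def get_score(self):
        return self._earned


def grade_course(root):
    """Walk the whole tree, asking every gradable node for its score."""
    earned = possible = visited = 0
    stack = [root]
    while stack:
        node = stack.pop()
        visited += 1
        if node.max_score() is not None:
            possible += node.max_score()
            earned += node.get_score()
        stack.extend(node.children)
    return earned, possible, visited


course = Block("course", children=[
    Block("chapter", children=[
        Block("video"),
        Block("problem_1", max_score=3, earned=2),
        Block("problem_2", max_score=1, earned=1),
    ]),
])

result = grade_course(course)
```

The caller has no way to know what `max_score()` or `get_score()` costs on any given node, and every node is visited on every page load, which is the performance problem discussed next.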

43:12.8

And that gave you a great deal of flexibility in terms of power.

43:16.2

But at the same time, how long does it take to show the progress page

43:21.4

when you have to ask this question of every node in the course?

43:25.4

And the answer is, you have no idea.

43:29.3

Because you've created this interface where it's like max score and get score, whatever.

43:36.4

And you've pushed off having to think about the data model.

43:39.8

But in return for that, you've lost all ability to reason about the performance of the system, right?

43:46.0

So we had XModules that started sandbox processes, like in Python, and did RPC calls to sort of get information because you don't want untrusted code to be running.

43:58.3

We had ones that would parse their internal XML, and depending on how many response types you had inside would change their score accordingly.

44:06.8

we had ones that would make HTTP calls to another system entirely to get back what the latest score for that person was

44:15.5

because the first version of peer grading happened on a different service.

44:21.7

And so if you wanted to make an XModule that would return you a different max possible grade for usernames starting with C on Tuesday afternoons,

44:34.6

Like, there's nothing in the contract that explicitly forbids that.

44:39.1

I'm seeing a theme here with the architecture.

44:41.6

Yeah, exactly, right?

44:42.6

But once you sort of have that out there, like, you know, that's fine for the prototype and for an exercise.

44:49.2

That's not cool when you have millions of users and, you know, you're trying to run a course at scale.

44:55.0

And so you find yourself trying to claw that back in a bunch of different ways.

45:01.0

And so eventually what happens is you sort of have this sort of shift that goes from your courseware being these like smart objects that you can just make requests for and leave it to them to figure out how to implement it.

45:16.6

And you sort of shift that relationship to be, okay, I'm going to have this, I'm going to have a grading system.

45:22.5

The grading system has a data model.

45:25.6

And I know that the grading system, I can query it.

45:29.4

I can ask for things like what is the student's grade in the course and get a quick reply.

45:35.7

And then I'm going to make those, the XModules and XBlocks, the courseware, like individual leaf nodes, they're going to push data into that data model so that I can query that efficiently.

45:46.4

Right. So, but you have this kind of like shift and that applies to a bunch of things.

45:51.6

We have to do that for grading. We're doing that for more scheduling related things. But it's hard to do that in a sort of backwards compatible way because we have a lot of course teams expending a lot of effort on a lot of course content.

46:08.0

And so I think one of the things that Nimisha was sort of pointing out was one of the ways we do this is to kind of shift it, like use async methods like Celery tasks, not async as in the Python async, but use Celery tasks in order to kind of shift the burden of processing.

46:26.6

So, for instance, right now what happens is that there is a grading system and it will hold your scores whenever they change.

46:36.9

But there's a set of very complicated permissions calculations that came about because, hey, we have this whole thing loaded in memory already, right?

46:46.0

Right, guys?

46:47.0

Right, yeah, yeah.

46:49.0

And so now what we do is when you change a score, all those crazy computations still happen, but they happen in a Celery task that runs, and then it puts your calculated score into this grading system that has a data model that you can actually make guarantees about.
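
A minimal sketch of that bridge, using the standard library's `queue` as a stand-in for Celery and a broker so it runs without extra services. All names here are hypothetical; the real computation is far more involved than a sum.

```python
import queue

grade_store = {}              # read-optimized grades: {(user, course): total}
score_events = queue.Queue()  # stand-in for the task broker


def expensive_course_grade(scores):
    """Placeholder for the complicated computation done off the request path."""
    return sum(scores.values())


def on_score_changed(user, course, scores):
    """Called inside the request: only enqueue the work, then return."""
    score_events.put((user, course, dict(scores)))


def run_worker_once():
    """Process every queued event, as a background worker would."""
    processed = 0
    while True:
        try:
            user, course, scores = score_events.get_nowait()
        except queue.Empty:
            return processed
        grade_store[(user, course)] = expensive_course_grade(scores)
        processed += 1


def read_grade(user, course):
    """Cheap lookup used when rendering a page; no tree crawl involved."""
    return grade_store.get((user, course))


on_score_changed("ana", "cs101", {"p1": 2, "p2": 1})
before = read_grade("ana", "cs101")  # nothing stored until the worker runs
run_worker_once()
after = read_grade("ana", "cs101")
```

The trade-off is that a read can briefly lag behind the latest write, in exchange for a read path whose cost is predictable.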

47:08.1

And so that's been sort of our bridge in a lot of ways to take things from this smart object world to the discrete services world.

47:19.0

But yeah, that's the reason why we're CPU bound in so many things.

47:23.1

Can I just ask, if I was writing a new module, a new plugin, I'd just go straight to the new data store?

47:31.3

I wouldn't go via the old mechanism?

47:34.1

Yeah, I mean, probably.

47:36.1

I don't know.

47:36.9

Yeah, definitely.

47:38.2

I mean, there is a grade system now.

47:41.2

And if you want to ask questions about a student's grade, yeah, you would go to the new thing.

47:46.8

You won't want to mess up in that one.

47:49.5

Nimisha, you were going to say.

47:50.4

Yeah, I was just going to say that one of the design principles that, you know, when we're thinking about when we implemented the grading system and, you know, we're talking about persistent grades before grades were computed on the fly and now we're able to read them from the database.

48:04.9

But in order to make that happen, because we had so much flexibility with those XModules and XBlocks and what they could do and giving us data in the runtime and all that stuff, we were inspired by, it's like the reverse ETL. You know, with ETL, you're reading the data, then you're transforming it, and then you're writing it into a new form.

48:26.4

It's reversed in the sense that basically we allowed these plugins and these XBlocks, right, to basically give us a lot of that content.

48:33.7

And they were able to then write and push that into a form that then they were able to then transform however they want to for read optimizations.

48:42.8

So basically, that's where that separation of reads versus writes comes in, where like we want to have very fast read optimization views of this data and do it in a.

48:54.1

So we implemented this very simple interface where basically allowing anyone to collect the data and then allowing anyone to transform the data.

49:03.2

And then we automatically look through all the transformers that are registered in our system, run them through sort of like an ETL type of thing there, except it's LTE in that, you know, it's allowing people to write.

49:16.9

And then when it comes to responding to the user, we're able to do that very quickly because it's already transformed for read optimization.
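
A minimal sketch of the collect-then-transform idea, assuming each registered transformer offers a `collect` step that gathers raw data at write time and a `transform` step that shapes it for fast reads. The registry, the class name, and the method signatures are hypothetical illustrations, not the actual edX interface.

```python
TRANSFORMERS = []


def register(cls):
    """Class decorator that adds a transformer to the registry."""
    TRANSFORMERS.append(cls)
    return cls


@register
class MaxScoreTransformer:
    @staticmethod
    def collect(raw_blocks):
        # Gather only what this transformer needs from the raw content.
        return {b["id"]: b.get("max_score", 0) for b in raw_blocks}

    @staticmethod
    def transform(collected):
        # Shape the collected data for cheap reads later.
        return {"course_max": sum(collected.values())}


def rebuild_read_model(raw_blocks):
    """Run every registered transformer and store its read-optimized output."""
    return {t.__name__: t.transform(t.collect(raw_blocks)) for t in TRANSFORMERS}


read_model = rebuild_read_model([
    {"id": "p1", "max_score": 3},
    {"id": "v1"},
    {"id": "p2", "max_score": 1},
])
```

Serving a request then means looking up a value in `read_model` rather than recomputing it.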

49:25.2

And then the response times are now once again improved.

49:29.3

So it's tricks like this that have allowed us to figure out how can we take what we had legacy-wise.

49:35.8

They were a lot more generic interfaces.

49:39.5

We're going to be more intentional going forward with what our APIs and interfaces are, but we still needed to support the old.

49:45.8

so how do we do that in a way that's performant but you know now with optimizations in mind and

49:51.0

actually there's one other thing i want to in defense of the people who who who made the that

49:55.4

system originally like it is it is really powerful right like like there there are um if you are in

50:03.8

a situation where you don't know what the interface really is for like what what you should and

50:09.5

shouldn't allow for grading and also where there are bugs all over the place right we're building

50:14.0

the platform as the first course is running

50:16.1

on it, and, oh, this

50:17.7

bug, like, you know, you detect

50:20.2

this bug is

50:21.3

mis-scoring someone or, like, you know,

50:23.8

is interpreting the score wrong, then

50:26.1

if you store it in a persistent

50:28.2

state, then, okay, you have

50:29.4

tasks to, like, regrade them and

50:32.0

re-store and modify their scores and

50:34.1

whatever. If it's

50:36.2

always dynamically computed on the fly, then

50:38.3

like, yeah, just reload the progress page.

50:40.5

Boom. Your

50:41.2

score is fixed.

50:44.0

And so it was one of those things where, like, you know, and even the sort of query anything, like, one of the sort of painful things about moving from, like, query the smart object to, like, separate systems is that you do lose some power, right?

51:00.9

Like, the very original version of the prototype that we had had A/B tests running in a totally hacky, crazy way, but they were running.

51:08.8

And it took us years to get that functionality back

51:11.2

in a sort of more predictable and performant way

51:14.3

just because like you, yeah, it's...

51:17.0

So there were trade-offs

51:18.9

and I definitely like,

51:22.6

I have suffered from the grading system more than most,

51:26.7

so has Nimisha.

51:28.4

And so definitely I'm not going to say

51:30.7

that that's the way you should architect it,

51:32.0

but I guess I feel like it's important to note

51:34.2

that there are trade-offs,

51:35.3

especially for something that as young and quick

51:37.3

as we were doing back then.

51:38.5

When you're prototyping, to just write a Python class and keep everything in memory, and, you know, pickle it if you have to,

51:46.4

That's a great development environment.

51:48.5

And that's a great way of finding out what your requirements are when people can't specify them for you.

51:53.3

You can say, oh, well, you know, what does your grading system involve?

51:55.7

And it's like, well, A to F?

51:57.3

No, it's loads more than that.

51:58.6

What does it actually involve?

51:59.5

Well, we don't know.

52:00.6

Well, let's build something, let you use it.

52:03.2

And then when we've exposed what the actual requirements are, then we can build something

52:08.0

which scales up.

52:09.1

As we wrap up, one thing, if we can, I'd love to talk about the automated discovery of CSRF

52:13.8

issues, because this is something, Nimisha, you and I briefly talked about at Django Boston.

52:19.6

You'd mentioned just generally, there was a number of security things that edX had come

52:23.6

up with to help with Django, but that weren't part of Django core.

52:27.9

Yes, yes.

52:29.1

So, I mean, security is definitely very important to us.

52:33.2

I think Django's definitely improved, at least in the last five years that I've seen, like, over time in terms of having more secure defaults.

52:41.1

I think for us, we do have a security working group within edX that triages issues that come in.

52:49.4

And so over time, there have been some issues that are more prevalent than others.

52:54.4

The automated discovery of CSRF issues, that was something that I had written a while back.

52:59.5

And sorry, I'm trying to just context switch what that was in exactly.

53:02.7

But I think what that was, was, you know, basically side-effecting GET requests, right?

53:08.8

Whenever someone makes a GET request, we want to make sure that there are no modifications to a model or, you know, to any data, right, in your system.

53:21.6

On a GET request?

53:23.4

Yeah, so in CourseWare, this actually, like, was the case for various student tracking things.

53:30.0

Okay.

53:30.2

Yes, and CourseWare, yes, might have a legitimate issue when it did that way.

53:35.3

But we found, so basically I had created this middleware.

53:38.6

It was a Django middleware, I believe.

53:41.1

And it detected whenever there was a GET request being made.

53:46.9

And then what it did was, actually, I think it used Django signals to track any post-save model changes.

53:56.6

And so it was able to detect if any Django model changed whenever a GET request was made, as opposed to POST request, right?

54:05.7

And then basically we reported this, and then you would be able to realize that this is a potential CSRF issue that needs to be fixed.
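
A framework-free sketch of that detection idea, assuming a save hook that fires on every model write. The tiny `Signal` class stands in for Django's `post_save` signal, and requests are plain dictionaries; the middleware name and report format are hypothetical, not the edX implementation.

```python
class Signal:
    """Tiny stand-in for a framework signal such as a post-save hook."""

    def __init__(self):
        self._receivers = []

    def connect(self, receiver):
        self._receivers.append(receiver)

    def disconnect(self, receiver):
        self._receivers.remove(receiver)

    def send(self, **kwargs):
        for receiver in list(self._receivers):
            receiver(**kwargs)


post_save = Signal()


def side_effect_detector(get_response, report):
    """Middleware-style wrapper that reports saves made during GET requests."""

    def middleware(request):
        if request["method"] != "GET":
            return get_response(request)
        saved = []

        def record(sender, **kwargs):
            saved.append(sender)

        post_save.connect(record)
        try:
            response = get_response(request)
        finally:
            post_save.disconnect(record)
        if saved:
            report(request["path"], saved)
        return response

    return middleware


def tracking_view(request):
    # A view that writes to a model regardless of the HTTP method.
    post_save.send(sender="StudentTracking")
    return "ok"


reports = []
handler = side_effect_detector(tracking_view, lambda path, models: reports.append((path, models)))
handler({"method": "GET", "path": "/courseware/"})
handler({"method": "POST", "path": "/courseware/"})
```

Only the GET request ends up in `reports`. This sketch ignores concurrency: with a process-wide signal and several requests in flight, a real version would need to attribute each save to the right request.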

54:17.4

And I think there was other things, too, that I detected.

54:20.1

I may have detected if there were Django Celery tasks that were also being initiated whenever a

54:25.3

GET request was made. So anything that could have potentially had a side effect that could have

54:33.5

resulted. So that's something that we had created. I think what happened was we found a few CSRF

54:40.8

issues within our system and so we had then been fixing them over time. And so at some point we

54:47.3

we wanted to make it a public application others could also use, but we haven't gotten around to

54:53.8

that yet. So we'd love to be able to contribute that as well. There's also like Robert's CSRF,

55:00.1

wasn't there like template level CSRF stuff that runs in our Jenkins builds now? Yeah,

55:04.9

those are XSS. Oh, I'm sorry. Robert has implemented an XSS linter as well. So he's

55:12.2

able to detect if there's any XSS issues in Django

55:16.2

templates and maybe other things as well. So Carlton, would Django be interested

55:20.4

in these things? Yeah, I mean, quite probably, yeah. I mean, if they're extracted nice and small

55:24.3

and easy to implement, yeah. I mean, just going back to the CSRF example, that's a

55:28.2

great use of signals, right? So people use signals as a way of decoupling their application

55:32.2

but they can get overused and they can lead to hard to understand and hard to follow

55:36.3

code. So the sort of maxim I always use is, do you know

55:40.5

at runtime who the receiver of this signal is going to be who's going to act on this signal

55:44.8

If you do, don't use a signal. But of course, when you're monitoring whether there are any saves

55:49.7

across any models in this request you don't know who's going to be sending it so you don't

55:54.0

it's perfect case for signals it's it's a lovely example right no signals can be very powerful and

55:59.2

there are actually good use cases for it so this is definitely one another one i will say is that

56:04.0

you know when we're thinking about the monolith and the core and then we have these django app

56:07.3

plugins because there too we don't know who the recipient is and it's not core to our platform

56:12.6

we'd actually prefer to use django signals as a way of communicating that out but then to our

56:17.8

earlier point you know that has then becomes an intentional interface yeah and you have to

56:22.0

document that and provide guarantees and support it and all that stuff. But Django signals does allow

56:27.8

asynchronous it allows decoupling um you know of your components and so but yeah i agree with you

56:34.0

that's a good rule of thumb: if you don't know the recipient, then it definitely makes sense.

56:37.9

i mean it's this is a possible interface that you could yeah yeah i mean i'm sure we could keep

56:43.1

talking. I know we've kind of run out of time. Thank you so much for sharing the realities of a

56:48.4

large scale uh code base because this is how it is for everyone there's always the trade-off between

56:53.8

prototype and and legacy and um you know i guess the thing when when beginners ask me sometimes

56:59.0

about these questions i say it's an honor to be where where you all are at you know as frustrating

57:03.6

as it is, it's like these are the problems you want to have, that you're at scale and you're

57:07.6

keeping up to date with Django. So it's fascinating to hear kind of under the hood

57:12.8

how it's all working and how you're thinking about it. And we'll link in the show notes to

57:16.4

a number of these, especially the OEP. That would be great. Yeah, really super interesting.

57:22.4

So again, thank you, David. Thank you, Nimisha. Yes, thank you both.

57:26.5

And for those listening, we're at DjangoChat.com, ChatDjango on Twitter,

57:29.9

and we'll see you all next time. Bye-bye. Join us next time. Bye-bye.