
Transcript: Datasette, LLMs, and Django - Simon Willison

Hi, welcome to another episode of Django Chat, a podcast on the Django Web Framework.

I'm Will Vincent, joined by Carlton Gibson.

Hello, Carlton.

Hello, Will.

And we're very pleased to welcome back Simon Willison. Welcome, Simon.

Hey, Will.

Hey, Carlton.

Hey, Simon.

Thank you for coming on.

Really excited to have you again.

So for those who don't know, Simon is one of the original co-creators of Django.

He's currently working on Datasette. He writes a lot about AI, LLMs, and much, much more, so we'll get into all that. But I'd start off with, actually, so you were at the most recent DjangoCon US, I guess last year, but day to day you don't do, I don't think, a lot of Django. So I'm curious: how do you see Django 20 years in, as someone who is familiar with it but isn't maybe as in the weeds as some other folks? How do you assess its strengths and weaknesses in the web framework landscape as it is now?

So the thing I love about Django today is that Django qualifies as boring technology. There's this incredible essay that Dan McKinley, online name mcfunley, put out a few years ago about how you should pick boring technology. What he means is that anytime you're building something, there are things that you want to innovate on, where you want to build something new and exciting and solve problems that have never been solved before. And then there's everything else, and for everything else you should pick the most obvious boring technology you can, so that you're not constantly trying to figure out, oh, how do I do CSRF protection in this framework I'm using, or whatever. Just make sure your defaults are boring. And I love that Django absolutely qualifies now, right? I never in my wildest imagination dreamed that Django would be the boring default choice for building things, but it is. And so actually I'm building Datasette Cloud right now. It's SaaS hosting for my Datasette project.

The core of that is a Django app.

I've got a Postgres and Django app, which manages user accounts and manages signups and all of that kind of thing.

And then it launches Docker containers on Fly.io, which run Datasette and all of that.

So all of the exciting stuff I'm getting to innovate on in the corner, but the sort of bog standard bits that make the whole thing run, it's Django.

And that's great.

So, yeah, I love that.

I love that Django is now the safe default choice for building a web application.

Lovely.

Well, so you mentioned user accounts, I have to ask.

So Carlton's had some thoughts on, you know, maybe 20 years on changing some of the defaults.

Carlton, do you want to give your quick pitch and we'll see what Simon thinks?

Okay, yeah.

So my kind of take is that we've kind of got a leaky battery with the user model because we ask users to create this custom user.

And it's a whole world of complexity for that central auth model, which is, like, for every single request, the identity of this user is X. Not the profile data, which obviously you want custom per app. But we sort of have this custom user model, which we forget to set up. And there are all these warnings in the docs about how you should use it, but don't migrate to it, because that's too hard. And I think we made a mistake there. I think what we should have done is trimmed off all the non-identity stuff from that user model and then locked up django.contrib.auth really tight.

Couldn't agree more.

There are a few flaws in the default user model.

Firstly, it expects everyone to have an email address, which doesn't work in 2024.

It makes people pick a username, which is very archaic.

And it expects your name to split into first name and last name, which for many cultures doesn't work.

So, yeah, I'm very with you that the user model has not dated well, unfortunately.

Yeah. So, I mean, what I'd kind of like to do is cut it down and trim it, you know, find a way slowly. It's obviously over time, because Django is very stable and we have the migration policy, but just reopen that debate about whether we can trim off the bits and somehow do it. I think we gave up a little too early on that.

So I've been experimenting in that domain.

Yeah, I love that.

I mean, I use the user model as a key that other things key onto, and then I have a separate table of Google accounts that have been associated with an individual.

I don't make people pick a first name and last name.

Yeah, all of that kind of stuff.
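A minimal sketch of the pattern described here, where the user row is nothing more than a key and the Google identities live in a separate table that points at it. This is an illustration only, written against plain SQLite so it is self-contained rather than against Django's ORM. The table names, column names, and helper functions are all hypothetical and are not taken from Simon's code.

```python
import sqlite3


def build_schema(conn):
    """Create a bare identity table plus a table of linked Google accounts."""
    conn.executescript(
        """
        -- The user row carries no username, email, or name: it is only a key.
        CREATE TABLE app_user (
            id INTEGER PRIMARY KEY
        );
        -- Hypothetical table: each Google account points back at one user.
        CREATE TABLE google_account (
            id INTEGER PRIMARY KEY,
            user_id INTEGER NOT NULL REFERENCES app_user(id),
            google_sub TEXT NOT NULL UNIQUE,  -- assumed stable Google identifier
            email TEXT                        -- optional, never required
        );
        """
    )


def link_google_account(conn, user_id, google_sub, email=None):
    """Associate a Google account with an existing user."""
    conn.execute(
        "INSERT INTO google_account (user_id, google_sub, email) VALUES (?, ?, ?)",
        (user_id, google_sub, email),
    )


def user_for_google_sub(conn, google_sub):
    """Return the user id linked to a Google account, or None if unknown."""
    row = conn.execute(
        "SELECT user_id FROM google_account WHERE google_sub = ?",
        (google_sub,),
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
build_schema(conn)

# Create a user without asking for any profile fields at all.
user_id = conn.execute("INSERT INTO app_user DEFAULT VALUES").lastrowid
link_google_account(conn, user_id, "google-sub-123", "person@example.com")
```

In a Django project the same shape would more likely be a model with a foreign key to the user model. The point of the sketch is only that identity and profile or login details are kept in separate tables.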

Yeah, I know.

Okay, good.

Do you use django-allauth, or do you write your own then, it sounds like, to manage social authentication?

I mean, I use, yeah, so how am I doing social authentication?

I think I rolled my own Google OAuth thing, which keys against Django.

I think I've probably got code for that lying around somewhere.

And I've done that before in the past.

I tend to use the default user model, partly just for the admin. It's the most convenient way to get the admin up and running. I mean, the admin is such a key feature for me to quickly iterate on what I'm doing and build out internal tooling and so forth.

And yeah, I get it. When you're starting a new project as well, the last thing you want is to stop and go, oh, I need half a day of planning my needs for a user model. You just want to start.

And then, yeah, I'm impressed you implemented your own auth.

I've done it so many times at this point.

Right, okay. So it's nice, yeah.

I mean, so my blog runs on Django, simonwillison.net, and that's open source. It's not a very complicated application, but it's all sat there on GitHub. I tweak it about once every three or four months. I'll go in and I'll tweak something about it. It's always fun. It's also managed by Dependabot, so it magically upgraded itself to Django 5 a few weeks ago. It just did it, which was great. I didn't have to think about it. I've got just enough automated tests that I trust that the thing's going to work after I apply updates. And yeah, that's a nice way of staying connected with what's going on in a very low-risk environment as well.

Yeah, I saw you put out a post a little while ago, just on the blog topic, about how to build a blog in Django. It was really good. It was like a kind of checklist that you could run through of how to build a Django blog. And I see everybody struggling with Hugo sites and this static site generator or that static site generator, and I sometimes think, no, just run your own Django app, because it's great to have that playground. And you've been doing it for like 20 years with the same Django application, just evolving it. It's lovely, really.

Yeah, pretty much. I think the very first version of my blog was PHP running on my university's shared hosting with flat files. Was it even the PHP equivalent of pickle? I think it might have been PHP's pickle equivalent, just a big array of posts and stuff. And then I flipped over to Django, actually only about 2000 and... no, it's on my blog somewhere, when I first ported it to Django. And then I did a major upgrade in 2017, when I came back after not blogging for like seven years, and did the Python 3 upgrade and stuff. And I've just been iterating on that ever since. It's great.

But also Jacob Kaplan-Moss, his blog is built on Django, but also if you go to his GitHub repo, it says forked from Simon Willison.

I'd forgotten that. Oh, that's brilliant.

Actually, I stole a feature off of him a few years ago. He has this idea of a series of posts around a certain topic, and so I added series into my blog, inspired by what he'd been doing.

Didn't he open a pull request to get it merged into the upstream branch?

He didn't, but that wouldn't have surprised me if he had.

That's what Carlton would have done.

Yeah, I don't know. I don't know. Anyway, we'll carry on.

Oh, I guess just one more. Putting your kind of old man hat on with Django, I've heard you mention the fact that Flask can be a single file. I don't know if you kept up with this, but Carlton did a talk in 2019 on single-file Django, and then at the most recent DjangoCon US, Paolo Melchiorre has, like, there's a whole repo of, I think it's six lines, kind of proving you can do Django in a couple of lines. And I guess I wonder, I think about this as teaching, because my brother-in-law is going through a coding boot camp and I'm like, hey, I'm here, let me help you. And it's like, oh, we're doing Flask. I almost feel like, I don't know if it's worth showing a single-file blog on Django or something, just to make the point that, hey, it's possible. Because even in Flask, no one does it that way. You could, but no one would do it that way.

I do. I love the single file thing. I actually built my own Django single-file thing like 10 years ago, something called djng. And yeah, that was basically just trying to do a little thin shim that lets you do a Flask imitation on top of Django. Because I love that for just hacking out quick things, not having to bother about the directory structure and so forth. So yeah, I'm thrilled to hear people are still pushing ahead on that. It's a great idea. If we were to design Django today, I'm certain it would be capable of doing single file out of the box. That just makes sense to me.

Okay, one more and then I'll let you go, Carlton. This question comes from Eric Matthes, who wrote Python Crash Course, and he was asking, you sort of answered it, but what is your preferred way of building web apps today? I think specifically on the front end, having seen it go from server-side rendered to jQuery to SPAs and now, I guess, HTMX.

But where do you fall on that pendulum?

So I spent a few years trying to do the React thing because it was clearly the way it was going.

And I hated it so much.

The thing I hated, it's the build script.

I hate it when you have a front-end project which you work on every six months, and you come back in six months, and nothing works.

You have to re-spin up your webpack configuration, all of that kind of stuff.

And so a few years ago, I said, you know what?

I'm going to give myself permission to write JavaScript like it's 2008 again.

And so no libraries, no build scripts, no TypeScript, nothing like that.

Just like a little bit – because the thing is that we used to use jQuery because of the browser differences.

But the browser differences are gone, right?

Today, document.querySelectorAll and all of that stuff works exactly the same across everything. So you can build code like you're using jQuery but without using jQuery. You just write event handlers and so forth. And it was so liberating. Suddenly I enjoyed front-end development again, because I didn't find myself fighting webpack and Vite and whatever the new cool stuff is. And I could go back to projects I wrote like this two years ago, and I can drop in and maintain them and add new features to them. And on top of that, the language model stuff: ChatGPT is really, really good at all forms of JavaScript. So it's not like I ever find myself stuck trying to remember how a certain API works. If there's something which is going to be a bit tedious because the JavaScript is going to be 20 lines of boilerplate, it'll spit out the 20 lines of boilerplate. I can just let it go and get on with it. So yeah, I've got really into that.

I have played with HTMX on a couple of projects.

I really like it.

It fits my – I've always been into the sort of unobtrusive JavaScript, the idea of progressive enhancement. HTMX is so good for that kind of thing. And so I really like that. I love that that's getting popular, and I love the performance that you get from it, because you don't have to serve a megabyte JavaScript bundle just to show a contact form or whatever.

And then Datasette itself is very strictly, it's just HTML.

And when you click a link, it loads a new page.

But I've been playing with the Chrome view transition stuff recently, which is super, super interesting.

Like cutting edge Chrome, I think you might still have to turn on one of the experimental flags.

You can actually serve up CSS that says, and when the user navigates from this page to this page, keep this area of the page stable and sort of like blur update this other bit.

And it's like a couple of lines of CSS, and suddenly it feels like a SPA.

You click a link, and only part of the page updates and so forth.

But it's a real navigation.

There's no JavaScript involved.

That's thrilling.

I can't wait to see that roll out to other browsers as well.

I have to ask.

Well, sorry, Carlton.

Go on.

I promised the last one.

Just with bundling, because I just did a redesign of my main site, which is using Tailwind. And I like Tailwind, but it's a little disappointing. I now have to have Node and stuff running. It's almost like it's switched from JavaScript to CSS now, to have a build script for everything.

Yeah, this is one of the reasons I've not adopted the modern CSS stuff as well. The build scripts, they're fantastic for larger, more complex applications. The stuff I do, I always try and keep it small and simple enough that you don't necessarily need that. And then they just become friction. It's just something that prevents me from being able to... because I have so many projects on the go at once. I've got, what, a hundred, nearly 200. I've got some ridiculous number of actively maintained projects. And the only way to do that is to make it as easy as possible to drop into something that you've almost forgotten all of the details of and get it up and running again. I feel like with the front-end build stack, if you work on the same projects every day, it's completely fine. It gives you a huge productivity boost, and there's none of that friction, because you've constantly got that stuff sort of warm in your head. If you drop into a project every six months, it's completely different. And that's what I like to optimize for: being able to hop across hundreds of different projects and make small changes to them without getting stuck on the build.

That's the exact same point as the boring technology talk, right? If you focus on one or two or three technologies, then you're able to really get the most out of them, rather than spreading yourself thin over, you know, say half a dozen, and that slows you down. It's the sort of same...

But the secret to running lots of projects is they've all got to be as boring and similar as possible.

Like I've got 100 odd repos.

They're all Python pluggy plugins, Jinja templates, like Datasette plugins.

They're all the exact same shape.

They've all got GitHub Actions running workflows and so forth.

It just works.

Okay, good.

Interesting.

So you mentioned LLMs there and ChatGPT and things like that. But I wanted to ask you something before we talk about those in more depth, which is, as well as doing all this amazing work in open source, you're now on the board for the PSF.

Yes.

So can you tell us a little bit about what you're doing there and how you're finding it? Because you're new. This is your first year on the board.

It's my second year now.

I just hit the 12-month point, and it's interesting.

So the reason I'm on the board of the PSF is that I'd been hassling the PSF on a sort of low-grade basis every now and then. I'd go, I'm really annoyed that the PSF isn't doing more to help make Python easier for people to get into, like solving the horrors of the Python learning development environment, all of that kind of stuff. And also the fact that it's very difficult to distribute applications written in Python, because you don't want people to have to install Python to use your stuff. And I almost had a snap judgment one day. I was like, you know what, it's not reasonable for me to complain at the PSF and not offer to help and not try to do something. So I put myself up for election on the basis of, these are the problems that I think the PSF should be addressing. And I got elected, which was a little bit of a surprise. I mean, I think it is name recognition, because you show up on the list of names and people are like, oh, I recognize that person, or whatever. And of course, now that I'm in the PSF, I realize the PSF is not particularly well equipped to solve the problems that I was most interested in solving, because the PSF...

Nobody told you?

Well, it's always difficult to understand quite what these organizations are able to do. The PSF is basically about money that's raised and distributed around the Python community and initiatives. The PSF's focus is on the community and the health of the community. There's a huge amount of sponsorship of events, of initiatives like that, which is fantastic. The stuff I care about is not completely aligned with what the PSF is for, but it's not unaligned either. So what I'm having to learn is, okay, how do I align what the PSF can do with the things that I want to get done, in a way that supports the missions of the organization and so forth? So it's been a huge learning curve. You know, this is my first time on the board of a non-profit. It's understanding what levers are available to pull and what priorities make sense and so forth. And yeah, so the first year I was mainly just trying to understand how this thing is shaped and what it can do. Now that I'm through that, I'm looking forward to maybe trying to tweak those levers a little bit myself.

To put all your weight on this one.

This is why the DSF, we just switched to two-year terms, exactly for this reason, because it basically takes a year to get up to speed. And during COVID we had less turnover, and I feel like we got a lot done because we had largely the same crew for two or three years. Because it takes a year to just understand how it works, right? Sorry, I interrupted though, you were gonna...

No, I was just making a joke about the thing. But when Simon was talking, I was like, this is exactly Will's experience with being on the DSF board. People think, oh, the DSF can do this, the DSF can do that. And what I hear from Will, from his experience on the board, is that actually the DSF can't do very much.

Well, I mean, I think it's interesting. We're gonna have Jacob Kaplan-Moss on in a couple of weeks, who just joined back the DSF after obviously working on Django and being, I think, the first president. But when I joined, I had a similar list of things I wanted to do, which, in hindsight, I guess I was lucky that they aligned with... they were all things that could be done around sponsorship and, God, I forget. I have a blog post on it. But I hadn't thought about the fact that maybe the things I wanted done didn't align directly with the mission. But you're right. Fundamentally, these organizations are about money and community and helping others. I mean, one thing the DSF is now doing is having working groups, which the PSF has had mixed success with, but at least some success, whereas historically it's just been everything goes to the DSF. And when you're on the board, that's kind of its own thing. It's unreasonable to be on the board and spearhead an initiative, from what I've seen. I imagine it's similar on the PSF. Or, I don't know, are you thinking of you actively doing it, or more that you can help spin up a group?

That's what I'm still trying to figure out, because the other thing is that the PSF has staff. The PSF is, like...

Unlike DSF.

And the staff do incredible projects.

I mean, PyPI, I think, is one of the most impactful things that the PSF does outside of the PyCon and event sponsorship and so forth.

And so the directors are not there to do the work.

The directors are there essentially to sort of help make those high-level decisions, help set strategy, and, yeah, make decisions about where the money goes to a certain extent.

Yeah, it's understanding, okay, also, what's ethical and responsible to do?

Like, if I throw all of my weight in trying to push the PSF in one direction, am I actually starving other important initiatives that the PSF are doing that just don't happen to align with my own personal interests?

Right.

Well, in a sense, pure is not the word, but quite a few people work for big tech companies, and so there's even more of a potential of, I don't know, not conflict, but it gets a little, you've got to watch what you're doing.

Yeah, I mean, and I think there are rules about how many people on the board of directors can work for the same company, as there should be. Because yeah, that's always a risk with these kinds of things. Yeah, me not being employed by a large company gives me an aspect of independence. But most of our board members are independent. I'm not unique.

Okay. I think maybe when Jeff Triplett was on the board, I think maybe it was the allocation was a little bit different.

But yeah, I mean, we just had new board members join a couple of months ago.

So I think we've had quite a reshaping just recently.

OK, just before we move on.

One last one, and I promise.

So, again, as this Django person, but a little bit of an outsider, I can ask you all the questions that, you know, I want someone to weigh in on.

So an executive director, we're going to have Deb Nicholson on the podcast in a couple of weeks.

There has been talk of Django potentially having one.

What has your experience been seeing an executive director at work in one of these organizations?

Can you imagine one of these organizations without one?

So is the DSF considering having a paid full-time person as an executive director?

It's something Chaim, the president, and others have... it's been discussed because... and I'll give my two cents.

I think a lot of this stuff won't happen absent someone full-time to do it.

Absolutely, yeah.

That completely makes sense to me. This is one of the problems with boards of directors: if everyone's just a volunteer who's investing a few hours of their time a week, or maybe a month, it's very difficult to make progress on things. You'll have a meeting, and not much will have happened since the last meeting. And once you've got an executive director and staff, that completely changes. You know, there's constant forward motion. I keep on telling people, I think the best thing about the Django Software Foundation has always been the fellows. This is something I say: I don't understand why other community-driven open source projects aren't trying to imitate this exactly, because it works so well. The PSF now has, I think, at least two fellows inspired by the Django Software Foundation, and those are incredibly impactful. You know, the work that Seth's been doing around security is a sort of relatively new addition. Absolutely extraordinary how much impact you can have with that. So yeah, I'm very, very keen on the idea of these non-profit open source supporting foundations that actually have staff that can just keep on making progress on things.

But just my experience, having been the fellow, is that these other tasks, these non-fellow tasks, would arrive, and it'd be like, okay, well, I've got a bit of time in the week, I can do that. But it wasn't really the fellow role, and there wasn't enough capacity to make any sort of significant progress on, for instance, reworking the djangoproject.com website. Okay, I can do a little bit of work on it, but it's literally an hour or two here or there, and not the massive months-long project that's going on now to actually do a proper assessment: what does it need, and how do we refresh it in a professional way, to a 2024 kind of standard, rather than just, oh yeah, can you make a tweak here?

It turns out there's a lot to be said for having somebody whose job it is to get specific things done.

Yes, exactly.

Yeah, so I think that sounds very, very sensible to me.

Yeah, I'm biased. I think it was Anna, the past DSF president, and I had a private call with Deb Nicholson, the new one, and she sort of went through what one does in that position, and we were just like, oh my god, we so need that. So yeah, I put that out there, but I'm not on the board now. But Carlton, yes.

Can I nudge us on? So I've been using Copilot and whatnot, and I think it's awesome. You mentioned JavaScript earlier. My JavaScript's come on so much, because it's a bit like, how do I filter this array to get the one value that I need? And previously that would take me 10 minutes of looking up, because it's not something I do, you know, I do it once every six months. But now I can just ask the LLM. It's got it. And it's not groundbreaking code. It doesn't have any value other than it saved me 10 minutes. So I guess my question is: how can I continue to leverage that, and your tooling, how can I install that, and can I get something equivalent to the closed source that I can use that's open source?

Wow, that's a whole bunch of things to talk about.

Yeah, yes, but that's kind of... let's get into it.

Yeah. I'm with you. The thing that excites me about LLMs is I love them as sort of teaching assistants, right? I can ask the dumbest question in the world at three in the morning and I'll get an answer, and it doesn't judge me.

It's not knowing how to do a for loop in Bash or whatever it is. You don't want to post it on the Django forum, like, how do I do this?

Exactly, exactly. I love that. And I love that it lets me be so much more ambitious with the projects that I take on. Like, a great example: I shipped code in Go, because I needed a little high-performance network proxy router thing. And I ended up writing it in Go because I don't know Go, but ChatGPT, GPT-4, knows Go thoroughly.

And I know Go just well enough to read the code and be able to tell if it's doing the right thing.

And I can get it to write tests.

So I ended up building this, like, 100-line little custom Go server thing with comprehensive unit tests and GitHub Actions running continuous integration.

I got continuous deployment running, all of the things that I consider to be important for, like, robust projects.

And I shipped it, and it's great.

Like, last month I had to make a change to it.

And I fired up GPT-4, and I worked with it, and we figured out what to do.

and i absolutely i mean that was extraordinary because normally i would never write something

in go because i'd be fine tinkering with it but i'm not going to write production code in language

that i'm not completely fluent in in this case i'm i feel like me plus gpt4 is fluent enough

that i'm willing to deploy code written in a language i'm unfamiliar with i've written code

in apple script apple script is notoriously a read-only language like you can read it and see

what it does it's the there's like a continuum there's apple script on one end a pearl on the

other like read only write only absolutely but yeah i'm i'm using apple script for things i'm

using all of these weird little domain specific languages i use jq all the time now because jq

is really powerful but i can never remember the syntax so i love that i love it as a sort of um

an accelerator for me doing lots of things i'm taking on more projects which is terrifying

because i already had too many projects and i'm like oh i mean me plus chat gpt i can probably

get something working in 20 minutes and of course it takes two hours but still at the end of that

two hours i've got something that works and is interesting that i wouldn't have built otherwise

um but it's that first 20 minutes that you wouldn't have put in that gets you to the two hours

i do so much coding on walks with my dog now because i can be walking the dog and i can on

my phone i can just like prompt it to write me some code that does this i can use the code

interpreter mode where it actually runs the python code it generates so i can get back from an hour

long walk with the dog and i've got 50 loads of python that i know works because it actually

ran the code found the bugs fixed them all of that kind of thing it's incredible like you can

even turn on voice mode i can literally talk to it while i'm on a walk with the dog and it writes

code for me that's utterly surreal that that's even possible so yeah i love i love that aspect

of it um and yet but the as you mentioned the problem with chat gpt is it's a it's for a company

called open ai it could not be more closed right it's this proprietary hosted model they change it

all the time without telling you what they've changed so people keep on complaining that it's

got weaker it's worse at x and so forth i never know if that's actually true because it's basically

random number generator so it's very easy to assume that it's changed when it hasn't

but that's really frustrating and then but the great news is that in the past like 12 months

we've had so many new options for running these things ourselves these openly licensed models

that you can run on your own hardware and they're beginning to get pretty good like i don't use any

of them on a daily basis because gpt4 is so good so it's sort of my default but i'm constantly

experimented with them my favorite at the moment um my two favorites are these mistral models

there's mistral 7b which literally runs on my telephone like there's an app that runs it on

my phone and it's not awful like i was on a plane and i was using it to to do the kinds of things i

might have looked up on wikipedia and okay it'll probably hallucinate stuff so don't depend on it

telling you the truth but it's still useful for sort of getting things getting just starting to

explore different ideas and then the other one is this new one called mixtral which is a mistral

a mixture of experts model they just released that um a month ago and that runs on my laptop

and is feeling it the quality begins to feel like chat gpt 3.5 like it's very very good so if you've

been resisting using these things because you don't want to use some weird hosted model by some

like closed open company mixtral is something you can run on your laptop right now it's a it's

apache licensed that the whole thing is apache licensed although whether it's truly open source

is up for debate because they won't release the training data that was trained on which right is

i think that's the source code right i think for these models the the raw training data is the

source code that was used to compile the model because you can't open source that training data

because you ripped it all off it's full of copyright data and you can't just slap an apache

license on someone else's copyrighted works but yeah so this stuff is really exciting it's really

interesting so i want to come back to your tooling but you've just mentioned the the copyrighted

training data thing and so there's this um these lots of cases where the the llm will reproduce

its training data almost exactly um in in cases so the the new york times um we've got this um

lawsuit perhaps you can explain the thing i could i kind of see it though and i'm like oh wow yeah

it is actually reproducing you know you type in underwater sponge and you get a sponge called

bob squarepants come out from one of the image generators for instance this is so fascinating the

ethics of this entire space could not be more murky like every aspect of this space you're like

wow is that okay and the answer is maybe not it's all very bad and that so a lot of people have

ethical qualms against this and i agree with everything that they're saying.

The New York Times thing is – so the most recent thing is the New York Times filed a very big lawsuit against OpenAI a few weeks ago.

It was against OpenAI and Microsoft, and it was complaining about three different things.

It was complaining that, firstly, you took all of our work without permission, and it's copyrighted work, and you used it to train your model.

And I don't think anyone is disputing that that is what OpenAI did.

They used New York Times data as part of a vast amount of training data that went into these models.

It's effectively, you could look at it as it's a crawl of a sizable chunk of the Internet that was used to train these things.

But that included New York Times data.

The New York Times say that OpenAI put more weight on the New York Times data than they did on other data they trained on because of the high quality of that training data.

So I don't know if that's conclusively proved or not. I think the GPT-2 paper a few years ago did explicitly list that the New York Times data was being used like that. So they might be assuming that that's still true. It probably is still true. But this is one of the things I'm excited about this lawsuit is I want discovery because I want to know how GPT-4 was trained because they haven't told us. So, you know, if that comes out of this, that would be useful.

So complaint number one, they trained without permission. Complaint number two is that the models can spit out exact copies of New York Times articles. And this was news to me. I thought that the act of training muddled the stuff up to the point that it won't spit out exact copies.

it turns out if you set the temperature to zero and then feed it the first two paragraphs of a

new york times article it can often spit out the next four paragraphs and sometimes there are very

slight differences like one word will be changed but effectively it's it's it's memorized and it's

regurgitating the same thing but if you tried to publish that that would be a clear

violation of copyright it would be exactly and then so the question is well are open ai publishing

that just by having an interface where people can see it.
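The "temperature" setting mentioned here can be illustrated with a small sketch. This is the textbook idea only, not a description of how OpenAI's models are actually served: the function name `sample_token` and the example scores are made up for illustration. At temperature zero the most likely next token is always chosen, so the same prompt tends to give the same continuation, which is why memorized text can come back almost word for word.

```python
import math
import random


def sample_token(logits, temperature, rng=random):
    """Pick a token index from raw model scores (logits).

    Hypothetical helper for illustration; real inference code is more involved.
    """
    if temperature == 0:
        # Greedy decoding: always take the single highest-scoring token,
        # so repeated runs on the same input give the same answer.
        return max(range(len(logits)), key=lambda i: logits[i])

    # Higher temperature flattens the distribution, lower sharpens it.
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


# Made-up scores for three candidate next tokens.
example_logits = [2.0, 1.5, 0.1]
greedy_picks = {sample_token(example_logits, 0) for _ in range(100)}
print(greedy_picks)
```

With a temperature above zero the pick becomes random, weighted by the scores, which is where the usual variation between runs comes from.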

And that's, I mean, so many of these things, I don't think there's a, obviously,

legal i'm not a lawyer at all but there's a reason this is going to go to court because

these are legal questions that are very blurry and unanswered so complaint number two is it

can regurgitate their content and they've said um this means that people bypass our paywall by

getting the model to spit out articles which is a bit of a loose claim because you've got to have

the first three paragraphs of the article anyway but they did have a really interesting thing where

they talked about wirecutter right where wirecutter is a new york times company it does

product recommendations if you ask chat gpt for product recommendations it will often spit out

wirecutter's picks but it won't give you the referral link that's wirecutter's business

model and this is the the definition of fair use in american law specifically talks about um whether

the thing is competitive with the thing that it ripped off and so the new york times case the main

thing they're trying to demonstrate is this competes with us this is harming us financially

because you can bypass that paywall you can like rip off wirecutter recommendations all of that

kind of stuff so that's argument number two complaint number three is actually about retrieval

augmented generation it's about the thing that um microsoft bing does and uh chat gpt browse does

where you can ask it a question it goes and does a search on the internet and it'll find the

new york times article about something read bits of it and then like summarize that and give you

the summary back again and so then the new york times is saying well look you're clearly subverting

our paywall you're profiting from content that's derived from us now that one

it's almost the one that worries me the most in terms of i think they've got a completely

fair point in complaining about this but summarizing stuff is my favorite use of llms

like if we end up with legal precedent that you can't even copy and paste

data into an llm to get a summary back out again that would be very harmful for for the sort of

the ways that these tools are most useful but that's the problem is that i read the 69 page

lawsuit and it's very clean it's very well argued and i think like oh like i said not a lawyer but

all of these points feel to me like points that are worth putting in front of a judge and jury

and and trying to get answers about yeah i think i mean two things come to mind from what you've

just said one is um i know google has been um told it has to pay news publishers in various

countries at various times because it does exactly that if you google the news in the country in

australia i'd pick australia i don't know if it's applied in australia but it will go and you know

get the sydney morning herald and summarize that without you ever having to leave google.com and

they were you know the one thing about that is that those lawsuits they were just about the

headlines like the first few words of the story even and what generative

ai is doing is so much more than that google are

clearly going to be on the chopping block next after open ai and microsoft because

they've got a prototype um like an alpha version of their search page that does exactly that it

just adds generative ai and it spits out a generated answer to your question at the top

they've been doing this with their like little content snippet boxes and so forth as well over

the past few years and it's super worrying right if you've got a web where nobody ever clicks a

link from a search result because they just get their answers right there in search what point

is there in trying to like build a profitable web business anymore you know so all of these

ethical complaints are very very legitimate here's a meta question for you so we know now that

llms are being used to generate a lot of the content on the internet how do you see this

going forward if the llms are going to be really trained on themselves do you think that like

is 2021 the you know the high point or is there a way out of that because it seems a

bit like a vicious circle it seems like an ouroboros situation doesn't it and it's

people have been talking about this for a couple of years now and um at one point i heard that

open ai the reason they hadn't updated their training data like there was a training cutoff

of what september 2021 i think and the reason they hadn't updated it is that after that point there

was enough usage of these tools the internet was beginning to fill up with llm generated text and

they didn't want to train llms on llm generated text because of the ouroboros effect at the same

time in the openly licensed language model community almost all of the really good ones

are actually trained on gpt4 output like the way you the way you build a really useful um like chat

tuned language model is you need to give it 20,000 examples of good conversations and the easiest way

to get those is to get gpt4 to spit them out and then you train your model on that output and so if

that was such a bad thing we wouldn't be seeing models that were trained almost exclusively like

that show up at the top of the leaderboards so i think this is all i mean this is all part of the

larger problem that we really have very little insight into how these things work they are giant

like 16 gigabyte blobs of floating point numbers and we're still trying to figure out

just the basics of how you sort of poke around inside that weird matrix brain and figure out

how it's working and what it's doing and so yeah maybe the fears of llms training on llm output

don't actually work out maybe it's okay maybe it's complete catastrophe we have no idea

and it's funny that we had no idea six months ago and it feels like we still have no idea now

So despite the rate at which this technology is improving, the rate at which our understanding of it improves is very sort of dubious in terms of how much we can figure out.

So it really is a new world.

It is. And as a computer scientist, it's infuriating, right?

Because I like computers that do exactly what you tell them to do.

And you can write tests and you can fire up a debugger and everything is repeatable and understandable.

And these are not that at all.

It's like a completely sort of weird, blurry alternative world in which everything's based on vibes.

You come up with, you pick a model and you poke around with it and you see if the vibes feel right.

And then you tweak your prompts.

And does that seem better?

I mean, it kind of does, but it's awful.

It's really difficult to do sort of responsible development on top of it.

It does seem like the closed LLMs, like, you know, like if I'm a hospital or if I have billing records or like very niche-y things,

LLMs are fantastic. And especially like I'm in Boston, there's a lot of research places. They're like, we can't use an open LLM thing, but these closed things are definitely being sold and used on whatever industry company has huge amounts of their own data. I would say I almost feel like that's got more promise than this, like the entire web being, you know, stolen approach in the long run.

Well, the flip side of that is, so Bloomberg built their own, they trained their own language model on the internal financial documents. It was supposed to be the best possible LLM for finance. And then it turned out that GPT-4 came out, and as a general purpose model, it was beating the Bloomberg one on financial tasks.

But this is one of the things that's so challenging right now is the rate of improvement of these

things such that if you've got a project that will take six months, you maybe shouldn't

do that project because you might spend six months on it and then GPT 4.5 comes out and

it solves the problem that you just spent six months trying to solve.

And so there's this interesting strategic problem where at what point do you actually

settle down and start building on this stuff as opposed to thinking, you know what would

be quicker is if I waited two months and then started building because I'd get a better

result than if I started building today. And that's absurd, but that's genuinely the position

that we find ourselves in. That's Zeno's paradox for the 21st century. Completely.

Well, have you ever read, there's this book, AI Superpowers, that's a couple of years old now

by a Chinese American. He works in China, a US researcher. And I read that, I think I read that

five years ago. He summed all this, this is before OpenAI came out, but he basically said,

you need three things. You need the algorithms, which finally, like we had at that time, you need

training data, and then you need processing power. And he argued with the cloud that basically it all

came down to data. This is back in the day, because we had the algorithms, they're basically

open source. We have the cloud computing. And so it's really all about training data. I think he

went on to say he thought China would surpass the US for that reason, because it has no privacy

controls but all of that is to say to you where do you do you see it as tweaks in them is there

more juice to squeeze out of these llm models do you think or is it really more about crap like

a data science thing where it's all about what you put in and trying to optimize that i'm trying to

pick between the two i'm very confident it's both um mainly because if you look at the open model

community over the past like since february people just keep on coming up with new

little tricks that make the models run faster and smaller like the fact that i can run a gpt 3.5

class model on my laptop now and i certainly couldn't do that a year ago because like the

models that were coming out and like the first versions of llama and stuff were much larger

required much more hardware much less optimized um so there are so many techniques that can be

used to make these things and i'd like smaller and faster right i want i want a model that works

on my phone and can do the things that i need to do i wanted to be able to summarize and extract

facts and call functions and all of that kind of stuff um but at the same time people keep on

finding that the higher quality the data the better like there really is so much to be said

especially when you're fine-tuning these models for just having super super high quality data

that you feed into them um if the new york times thing plays out one way we may find that it's no

longer possible to just steal the entire internet and train your models on it at which point

that raises some really interesting questions the thing that worries me most about

that is does that mean that llms then become incredibly expensive to build because of the

licensing costs to the point that you don't give them away for free and so does that mean that only

people who are very wealthy can afford to use these tools whereas today anyone who can afford

an internet connection has access to some of the best in class of these models so that really

scares me like despite the ethics around copyright

i mean there are very very real concerns here but at the same time a world in which only the

most wealthy have access to these tools that feels unfair to me as well yes and we can't

lock these tools up they are super useful like to take them away would be foolish like also if you

banned them i've got a usb stick with half a dozen models and you create a black market of people

it's very cyberpunk right people swapping usb sticks with like with the last version of

mistral that was released on them super so there was a paper a little while ago just

pick up what you said there about open ai saying we haven't got a moat or something like that um

there was a leaked memo from google it was somebody within google put this memo together

saying saying there is no moat for this technology um it's interesting to revisit that that i think

that was it was quite it came out in maybe march or april of last year and it's interesting to look

back at that now and say okay how much of this played out because uh one of the real challenges

with this stuff is um if it's all just driven by human language prompts the cost of switching to

another language model might be as simple as saying okay we'll run this against claude instead

of gpt4 and maybe that will give you the exact same effect right um or maybe it won't because

so much of the the prompting comes down to these very small tweaks that you make where you're like

oh okay if i capitalize the instructions to output in markdown maybe it'll actually listen to me this

time but that effect itself is kind of hurt by the fact that openai upgrade their own models so

just because that works now will it still work in a few months time it's kind of

uncertain. So that's part of it. There's also the fact that the closed model providers are up

against tens of thousands of researchers around the world collaborating together. That's something

I really like about the open model community is there's all of this sharing and this acceleration

that comes from just having tens of thousands of people worldwide all trying to solve these

problems. And OpenAI are an incredibly talented, experienced set of people, but I still don't like

their chances against tens of thousands of people around the world although of course when those

people around the world figure something new out cool open ai can just take that research and use

it themselves so so you can they can sort of keep up that way um but yeah it's and there's also

there's the compute right like it's we still don't know why gpt4 is so much better than everything

else um the most likely thing is that they ran it they trained it for longer and they trained it on

more data than anyone else has been able to do yet but still people are catching up now that there

is if you have a hundred million dollars maybe it's worth trying to funneling that into data

and training you know that it's not like there's a shortage of investor money floating around the

space at this point yeah i guess and the economics are so crazy because yeah it's a hundred million

dollars but then then it's just a file that you know anyone you can sell to anyone for virtually

nothing right when people people often complain about the environmental impact of language models

where they say well look training like training a language model takes this enormous amount of

carbon dioxide which is true at the same time it's about the same amount of carbon dioxide as flying

a boeing 747 across the atlantic twice you know which is a vast sum but i would argue it benefits

more people because your airline flight benefits the people on that plane the language model if

it's then used by a few million people over the course of six months it feels like you are getting

more value for your sort of carbon dioxide at that point yeah i have a

question about the um carbon the co2 usage so i my understanding of ml which is machine learning

which is quite limited but it was that the training was the hard bit but then once you what

you get out of um the training algorithm is a kind of vector operation which you can run almost

you know quite cheaply and then i saw though people complaining um and i didn't have the time

to follow up but that every time you generate an image with dall-e or whatever that uses so

much water or so much this because it's still computationally expensive to run the model not

just train the model is that true so this is an interesting question so like i said i run models

on my iphone i run models on my laptop i am not worried about their resource constraints um but

again i don't know what gpt4 is running on i'm pretty sure it's running on a full server rack of

of gpus so my hunch is that for the very large models yeah there's a lot of cost in the inference

i still think it's a fraction of what it costs to train them that's that's the intuition i've

i've gotten from this um and you know like the image like stable diffusion also runs on my phone

so there are versions of these models where the environmental impact of running them is no worse

than turning your laptop on that's but but i don't really have good insight into what the large

hosted models are doing so i had so we've talked all about llms i wanted to ask about your tool

because if i want to run this you've got the perfect tool for me to download and do so please

tell us about that because we we've talked about all the exciting things okay okay if i actually

want to do it what do i have to do so i built this tool in python called llm i got lucky llm was

still available on the package index. So you can pipx install llm, and you get a command line tool

for interacting with models. But what's really fun about it is that it's inspired by Datasette.

It's all based around plugins. So out of the box, you can give an OpenAI API key, and it will run

against OpenAI. And then there's about a dozen plugins you can install that will add additional

models, including models that run on your own machine. So you can essentially pip install

my tool, and then pip install a plugin

that adds a language model to it.

And now you've got a four gigabyte file

on your computer that you can start interacting with.

But crucially, the interface is the same

no matter what model you're using.
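The idea of one interface over many interchangeable models can be sketched in a few lines of Python. This is only an illustration of the plugin pattern, not the llm tool's actual plugin API: the names `register`, `run_prompt`, and the two fake models are hypothetical stand-ins.

```python
from typing import Callable, Dict

# Registry mapping a model name to a function that takes a prompt
# and returns text. A plugin would add its own entry here.
MODELS: Dict[str, Callable[[str], str]] = {}


def register(name: str, model_fn: Callable[[str], str]) -> None:
    """Make a model available under a short name (hypothetical helper)."""
    MODELS[name] = model_fn


def run_prompt(prompt: str, model: str = "hosted") -> str:
    """Send a prompt to the named model; callers never see model internals."""
    if model not in MODELS:
        raise KeyError(f"No model registered under {model!r}")
    return MODELS[model](prompt)


# Fake models standing in for a hosted API and a local model file.
def fake_hosted_model(prompt: str) -> str:
    return f"[hosted] {prompt.upper()}"


def fake_local_model(prompt: str) -> str:
    return f"[local] {prompt[::-1]}"


register("hosted", fake_hosted_model)
register("local", fake_local_model)

print(run_prompt("hello"))                 # uses the default model
print(run_prompt("hello", model="local"))  # same call, different model
```

The calling code stays identical when the model changes, which is the property being described: only the name passed in differs.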

So it's llm, then your prompt in double quotes: llm "your prompt".

Or you can pipe things into it as well: you can do cat myfile.txt | llm.

By default it'll use your default model. If you stick -m claude on the end

and you've got the Claude plugin, it'll run it against Claude, and so forth. And of course

everything it does is logged to SQLite, because I do everything with SQLite. So one of the great

things about using this tool is that it's a way of building a sort of database of all of your

experiments across all of the different models so i just use it on a like daily basis for all

sorts of different bits and pieces and i've accumulated like a few thousand prompts and

responses in my sqlite database of things that i've tried out maybe at some point i'll do some

analysis on that and try and start comparing models that way but really the fun thing about

it is i'm trying to make it so whenever there's an interesting new model you can install a plugin

and start playing with that model and that works for hosted models and it works for local models

as well um and yeah it's it's really really fun to hack on one of the things i've realized from

playing with it one of the original ideas is um the unix philosophy the unix command line of piping

things to other things is an amazingly good fit for language models because the language models

it's a function you you pipe it a prompt and it gives you a response and so one of the things that

i use my tool for is um it's um it it ties into this concept of system prompts which is something

that open ai did originally and other models have started picking up where you've sort of got a

second prompt that gives you instructions about what to do with your other data so a great example

is i can take a file and i can say cat myfile.py | llm --system "write me some unit tests"

and then the model gets the system prompt write me some unit tests and it gets a bunch of

python code and it'll spit out a bunch of unit tests and of course they won't be exactly what

you need but it's that skeleton that you can start hacking on it's really good at explaining

code i pipe it code and say explain what this thing does um i use it for uh release notes not

to publish i kind of feel like it's rude to just straight up publish something that an llm wrote for

you because i mean what are you doing right like it's fine to

publish something which you're willing to sign your name to because you at the very least reviewed

it extensively and hopefully revised it and tidied it up but there are lots of projects out there

that don't bother writing good release notes and what you can do is you can check out their git

repository and you can do git diff between this version and that version, piped to llm --system "write release notes"

and gpt4 can understand a diff format it'll read it and it'll spit out

release notes which in my experience are about 90% correct and 10% slightly wrong or maybe there's

hallucination in there and that's fine right that's good enough for my purposes just saying

okay what have they done in this release that they didn't bother writing release notes for

so yeah i i i recommend trying this thing out partly because it's fun to play with models

and something i'll say about the models you can run on your own laptop is they are kind of crap

like they are they are very very weak compared to gpt4 but that's a feature because it's easier

to build a mental model of how they work when you work with the weak ones like gpt4 because it's so

good you can use it for a few days without really seeing the weaknesses and the flaws in it because

it gets most things right but it's still just you know guessing what word should come next it's

doing the same kind of thing the little ones will hallucinate wildly which is so useful for getting

a feeling for okay these things are not intelligences these things are dumb autocomplete

that's just been scaled up to be able to cope with lots of things um i love i use myself as a test

thing because i've been around on the internet for long enough that these things can answer

questions about me like i can ask for a bio and some of the models will get most of the details

right they might say i went to a different university or whatever and some of them will

just hallucinate wildly and so i've had models tell me that i i co-founded github and things

like that um and it's amusing but it's also quite good as a sort of like just an initial sniff test

to see, okay, how good is this model

when it comes to hallucination and that kind of thing?
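The habit described above of logging every prompt and response to SQLite, so that experiments across models pile up in one queryable place, might look roughly like this. It is a minimal sketch using Python's built-in sqlite3 module; the table name, columns, and helper functions are assumptions for illustration and are not the schema the llm tool actually uses.

```python
import sqlite3


def open_log(path=":memory:"):
    """Open (or create) a log database. ':memory:' keeps the demo throwaway."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS responses (
            id INTEGER PRIMARY KEY,
            model TEXT NOT NULL,
            system TEXT,          -- optional system prompt
            prompt TEXT NOT NULL,
            response TEXT NOT NULL
        )
        """
    )
    return conn


def log_response(conn, model, prompt, response, system=None):
    """Record one prompt/response pair; the 'with' block commits it."""
    with conn:
        conn.execute(
            "INSERT INTO responses (model, system, prompt, response) "
            "VALUES (?, ?, ?, ?)",
            (model, system, prompt, response),
        )


def counts_by_model(conn):
    """How many logged responses each model has, as a dict."""
    rows = conn.execute(
        "SELECT model, COUNT(*) FROM responses GROUP BY model ORDER BY model"
    ).fetchall()
    return dict(rows)


# Made-up entries standing in for real experiments.
log = open_log()
log_response(log, "model-a", "Write a bio of me", "placeholder bio text")
log_response(log, "model-b", "Write a bio of me", "different placeholder text")
log_response(log, "model-a", "def f(): pass", "placeholder tests",
             system="write me some unit tests")
print(counts_by_model(log))
```

Because it is plain SQL, comparing how different models answered the same prompt later is a single query over the prompt column.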

Okay, super.

We're coming up on time a little bit.

I wanted to add one positive note,

which I've heard about, you know,

we mentioned that these tools could further

increase the economic divide,

but they are democratizing a lot of things,

like in unexpected ways, at least to me.

Like, for example, someone I know

is an admissions director at UC Berkeley,

and a friend asked that person,

hey, you know, what is it now? What is it like with these college essays now that

ChatGPT exists? And he said, it's actually great because it's an equalizer because rich kids have

had private essay tutors for forever. And now everyone has, you know, 80%, 90% of it. I mean,

it probably makes them all sound kind of the same anyways, but it's a tool that people who don't

have these external resources, if they know how to use it, can, you know, up the, you know,

It's just like Grammarly and all these tools to help increase the writing.

And so I was pleased with that because I think it's very easy to get a little

doom and gloomy about it.

But it is for almost no money bringing these resources to so many people who didn't have

them before.

Oh, I couldn't agree more.

I feel like we always get very hung up on the many ethical flaws of this technology

and the harmful ways it can be used.

The positive ways they can be used are just enormous.

Like the reason I'm spending so much time with this tech is that I do believe that it's

genuinely useful and it does genuinely provide enormous amounts of value to enormous numbers

of people um if you have english as a second language this tool is phenomenal right you can

now you're no longer cut out of those things in your life those parts of society where you need

to be able to write like somebody who's a native speaker who's at a certain level of education

and that's that that has been completely flattened i am like people sometimes say oh it's not worth

learning to program anymore because chatgpt will just do it all i think that's complete

rubbish i think now is the best time it's ever been to program because anyone who's coached

somebody learning to program has seen that the first six months are just utterly horrific like

it's it's so frustrating because you try something and you get this obscure error message that

doesn't make sense to you and you can bang your head against it for two hours and maybe you give

up lots of people do give up they assume that they're not smart enough to learn to program

And it wasn't that they weren't smart enough. It's that they weren't patient enough. Nobody warned them how tedious and stupid this stuff is. And now we can give them a tool. We can say, look, if you get an error message, paste it into chat GPT, and nine times out of 10, it will tell you what to do next and how to get out of that condition. That's phenomenal, right?

the the flattening of that learning curve getting more people my my ideal end point of all of this

is i think every human being deserves the right to have computers automate stuff for them like

i can do this right i've got 20 years of programming experience if there's anything

tedious in my life that the computer can automate i can get it to automate that thing

but it's ridiculous that you need 20 years of experience to do that like that should be a

universal human ability, and I think this technology might get us there.

I feel like if we get to a point where people are able to get those tedious automated things just done for them

because they didn't have to learn to program first, that feels enormously valuable to me.

Yeah, absolutely. I'm just lighting up as you spell out that scenario.

One of the parallels to that is why hasn't technology pervaded more deeply throughout the clerical world, for instance?

It's like people are still doing with paper or sort of manually repeating a task on a computer.

Yes, it's in a spreadsheet, but it's not automated.

Why not? Because they never picked up those programming skills.

But all of a sudden, if these assistants are built into Excel or built into Word or built into the software they're using, it can be automated easily.

I heard a horrifying story the other day about a local fire chief, like the guy running a fire department who, due to some mess up, had to manually unsubscribe 2000 people from a mailing list.

And he spent a full day clicking the unsubscribe button over and over again in some horrible piece.

This is somebody who has a very real, very important job to do.

And I think this pattern plays out a lot.

A lot of people with a lot of important things in life

end up stuck for a day doing something tedious and manual

because we haven't given them the tooling

that lets them not have to do that.

So yeah, so I'm really excited about that.

I think as an educational assistant, it's amazing.

I think one thing that isn't necessarily talked about enough

is these things are actually very difficult to use effectively.

And they feel like they should be easy

because it's just a chat bot.

But actually, to really get the best out of them, you have to understand the prompting techniques.

You have to know what it can do, what it can't do, what are the things that it's going to break on.

I love that we've created computers that are bad at maths and can't look up facts for you, which are the two things that computers have always been best at.

So people sit down and check, well, it got maths wrong and I asked it for a fact and it couldn't tell me the answer because that's not what it's for.

But that's really not obvious, you know?

But it can now, right?

Isn't that the whole thing with Sam Altman?

One of the things is it can do math now, allegedly, the new version?

I mean, well, it can if you give it tools.

So ChatGPT, the paid version now has access to Bing search, so it can look up facts, and it has access to Code Interpreter, so it can run mathematics using Python, which on the one hand, it does fill those giant gaps.

On the other hand, it makes it even harder to use because now you have to know what Bing search is.

You have to understand bits of Python.

You have to know this is the kind of thing where it's got vision support so it can read documents.

But the interaction of all of these features is incredibly complicated.

A great example is sometimes I will give it like a photograph of a receipt and ask it to add up the numbers in the receipt.

And it will then write Python code that imports Tesseract and uses Tesseract OCR to pull out the numbers, and then it will try and add them up.

But of course, Tesseract isn't as good as GPT Vision, right?

If it had taken that image and used its built-in OCR to pull the numbers out

and then passed to Python, I'd have got a more reliable result.

How the heck am I supposed to explain that to anyone?

Like, I have to know what Tesseract is.

Kind of.

But you see what I mean?

As they add more features, the matrix of complexity of how the features interact

gets even more complicated.

So having expert skills to use this stuff gets harder.
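A minimal sketch of the receipt example above, assuming the text has already been pulled out of the image by whichever OCR step was used. The `receipt_text` sample and the `sum_receipt_amounts` helper are made up for illustration. The arithmetic in Python is the reliable part; the total is only as good as the numbers the extraction step hands over.

```python
import re
from decimal import Decimal


def sum_receipt_amounts(text):
    """Add up every price-like number (for example 4.50) found in the text."""
    # Assumption: amounts always appear with exactly two decimal places.
    amounts = re.findall(r"\d+\.\d{2}", text)
    # Decimal avoids floating point rounding surprises when adding money.
    return sum((Decimal(amount) for amount in amounts), Decimal("0"))


# Hypothetical OCR output for a three-item receipt.
receipt_text = """Coffee 3.50
Bagel 2.25
Juice 4.00"""

total = sum_receipt_amounts(receipt_text)
print(total)
```

If the OCR step misreads a digit, this code will still add the numbers up without complaint, which is why the choice of extraction method matters more than the sum itself.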

But I think you said this in a recent interview.

I mean, using a chatbot is one of the worst user interfaces.

It's like terminal, right?

We're not, like, no one's using terminal, right?

So it's, do you have any, and I guess the final question for me,

do you have any predictions on, you know,

the mouse equivalent of where this stuff goes?

Because we're not all going to be using chatbots forever, I don't think.

I certainly hope not.

Yeah, like, I mean, yeah, like you said,

the problem with chatbots is there's no discoverability.

Like, you've just got a blank box to start on.

And they might give you a couple of suggestions,

but it's a terrible user interface.

But that's the thing that's really exciting about the space right now is there is so much low hanging fruit.

Like you could sit down and just come up with an alternative UI for interacting with language models.

And right now, maybe you'll invent the thing that everyone will be using for the next six years, because we're so early in this process.

There is so much scope for innovation around how we use these, how we interact with them.

And that I find really exciting.

I love that now that people are beginning to understand this tech and what it

can do, like we need designers on this stuff.

We need user experience people.

We need, but it turns out machine learning nerds are the worst possible people

to actually make use of this technology.

Because they're thinking in terms of, you know, they're thinking in terms of,

okay, well, I've got to optimize my gradient descent or whatever.

You don't need to know what gradient descent is to innovate on top of language

models.

That's almost a distraction from what we're trying to,

what we can achieve with them.

Brilliant.

Brilliant.

Okay. So we are going over. I did want to ask you about Datasette. We're kind of running short.

So can you give us the 30-second version, what's new in Datasette? We've talked to you about it before, but what's hot?

The most exciting new feature in Datasette: I've been building this feature called enrichments, where the idea is that you've got, say, a CSV file with 10,000 addresses in, and you load that into Datasette and you want to see them on a map.

So you need to geocode those addresses.

With enrichments, you can have a plugin that lets you select the address column and say, geocode this, and it'll go and churn away against the geocoder of your choice and it'll populate latitude and longitude columns next to it.

But crucially, these things are all built as plugins.

So you can have a plugin that does geocoding, I've got a plugin that does just regular expression extraction of things, and I've got a GPT plugin, so you can say: take this database table, run this prompt against every single row, and then put the output of that prompt in this other column.

And there's one example of that. It can do the GPT Vision thing. So I actually fed it a database table with 100 URLs to images and told it to write me descriptions of those images. And I got back three or four paragraphs per image describing what was in the image, right there in my table.

And of course now I can search against that and do all of that sort of stuff.

So I'm really excited about that.

It means that Datasette is evolving into more of a data cleanup and manipulation tool, which is a departure. Originally it was about publishing and exploring data.

But I realized that the problem I most want to solve, especially around journalism, is: if somebody gives you a hundred thousand rows of data, what the heck are you supposed to do with that? Right?

Especially if it's slightly too big to put in Microsoft Excel,

but you can't afford to hire some programmers

to build you like a custom Django Postgres app

for this thing.

What do you do?

And if I can build plugin-based tools,

especially with Datasette Cloud now,

so I can host them for people,

where you can upload your CSV file,

click, click geocode,

wait a couple of minutes as the progress bar fills in.

Now it's all geocoded.

Now you can visualize it on a map.

That's really exciting.

And yeah, and the enrichments,

I tried to make it as easy as possible

to write additional enrichments as plugins.

So I'm hoping to see people building their own enrichments

for all sorts of other data transformations they might want to pull off.
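The enrichment idea described above (take one column, run something against every row, write the result into a new column next to it) can be sketched with nothing but the standard library. This is an illustration only and is not Datasette's actual enrichments plugin API; `enrich_column`, `extract_zip`, and the `places` table are hypothetical names, and the regular expression stands in for any transformation such as a geocoder or a prompt.

```python
import re
import sqlite3


def enrich_column(conn, table, source_column, target_column, transform):
    """Run transform over source_column in every row and store it in target_column."""
    # Assumption: table and column names come from a trusted source,
    # since identifiers cannot be passed as SQL parameters.
    conn.execute(f'ALTER TABLE "{table}" ADD COLUMN "{target_column}" TEXT')
    rows = conn.execute(f'SELECT rowid, "{source_column}" FROM "{table}"').fetchall()
    for rowid, value in rows:
        conn.execute(
            f'UPDATE "{table}" SET "{target_column}" = ? WHERE rowid = ?',
            (transform(value), rowid),
        )
    conn.commit()


def extract_zip(address):
    """Hypothetical regex enrichment: pull a 5-digit ZIP code out of an address."""
    match = re.search(r"\b\d{5}\b", address or "")
    return match.group(0) if match else None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE places (address TEXT)")
conn.executemany(
    "INSERT INTO places (address) VALUES (?)",
    [
        ("1600 Pennsylvania Ave NW, Washington, DC 20500",),
        ("No postcode here",),
    ],
)

enrich_column(conn, "places", "address", "zip", extract_zip)
results = conn.execute("SELECT address, zip FROM places ORDER BY rowid").fetchall()
print(results)
```

Swapping `extract_zip` for a function that calls a geocoder or a language model would follow the same shape, which is roughly why a plugin-per-transformation design is appealing.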

I have to ask just one more question then.

If you're doing it on Datasette Cloud and I've got an enrichment

and I'm going to give you some software,

did you find a solution to how you can run my software in a sort of trusted way?

Well, at the moment, I can review your software

and make sure it doesn't have any fatal holes in it.

And then I've actually – Datasette Cloud has a feature now

where I can basically say, for this customer,

pip install this additional package.

So I do have that now.

And also, Datasette Cloud, I built it on top of Fly.io

precisely because they offer secure containers.

So with Datasette Cloud, every customer gets a separate container.

So if you somehow manage to screw up the security in your container,

it's isolated.

That's a problem for you.

It's not a problem for other customers in the system,

which felt really important to me.

Yeah, okay, good, good.

Because I know you've been noodling on that problem for quite a long time.

Yeah, I still want to be able to run WebAssembly server-side reliably for untrusted code.

That's like my ultimate goal.

Because, yeah, I want users to be able to say, here is some Python code, run this against all of my data to transform it without risk of them breaking things or whatever.

And it feels like we're almost there with WebAssembly.

And that would be amazing.

If I can take untrusted Python code and run it in a WebAssembly sandbox that's locked

down and that can't do network access and can't reach the file system, that would be

amazing.

Okay, super.

We could go for another hour, but thank you for taking the time to talk about all these

things.

We're going to have links to everything: you know, Datasette, Datasette Cloud, enrichments.

Those are, I think, the three big things that fans of yours should go take a closer look at

if they're not already familiar.

Cool.

Yeah, this has been really fun.

Yeah, I'll put together some links to this as well.

Okay. Thanks, Simon. Thanks so much for coming. That was really awesome and illuminating and

filled in so many questions around a really hot topic. So super.

Thank you everyone for listening. We're at DjangoChat.com and we'll see you next time. Bye-bye.

Bye-bye.