Transcript: Datasette, LLMs, and Django - Simon Willison
Hi, welcome to another episode of Django Chat, a podcast on the Django Web Framework.
I'm Will Vincent, joined by Carlton Gibson.
Hello, Carlton.
Hello, Will.
And we're very pleased to welcome back Simon Willison. Welcome, Simon.
Hey, Will.
Hey, Carlton.
Hey, Simon.
Thank you for coming on.
Really excited to have you again.
So for those who don't know, Simon is one of the original co-creators of Django.
He's currently working on Datasette.
He writes a lot about AI, LLMs, and much, much more, so we'll get into all that.
But I'd start off with this: you were at the most recent DjangoCon US, I guess last year, but day to day I don't think you do a lot of Django.
So I'm curious, how do you see Django 20 years in, as someone who is familiar with it but maybe isn't as in the weeds as some other folks?
How do you assess its strengths and weaknesses in the web framework landscape as it is now?
So the thing I love about Django today is that Django qualifies as boring technology.
There's this incredible essay that Dan McKinley, online name mcfunley, put out a few years ago about how you should pick boring technology.
What he means is that any time you're building something, there are things that you want to innovate on, where you want to build something new and exciting and solve problems that have never been solved before. And then there's everything else.
For everything else, you should pick the most obvious, boring technology you can, so that you're not constantly trying to figure out, oh, how do I do CSRF protection in this framework I'm using, or whatever. Just make sure your defaults are boring.
And I love that Django absolutely qualifies now, right? I never in my wildest imagination dreamed that Django would be the boring default choice for building things, but it is.
And so actually I'm building Datasette Cloud right now. It's SaaS hosting for my Datasette project.
The core of that is a Django app.
I've got a Postgres and Django app, which manages user accounts and manages signups and all of that kind of thing.
And then it launches Docker containers on Fly.io, which run Datasette and all of that.
So all of the exciting stuff I'm getting to innovate on in the corner, but the sort of bog standard bits that make the whole thing run, it's Django.
And that's great.
So, yeah, I love that.
I love that Django is now the safe default choice for building a web application.
Lovely.
Well, so you mentioned user accounts, I have to ask.
So Carlton's had some thoughts on, you know, maybe 20 years on changing some of the defaults.
Carlton, do you want to give your quick pitch and we'll see what Simon thinks?
Okay, yeah.
So my kind of take is that we've kind of got a leaky battery with the user model because we ask users to create this custom user.
And it's a whole world of complexity for that central auth model, which is like, for every single request, the identity of this user is X.
Not the profile data, which obviously you want custom per app,
but we sort of have this custom user model,
which we forget to set up.
And there are all these warnings in the docs
about how you should use it, but don't migrate to it
because that's too hard.
And I think we made a mistake there.
I think what we should have done is trimmed off
all the non-identity stuff from that user model
and then locked up django.contrib.auth really tight.
Couldn't agree more.
There were four flaws in the default user model.
Firstly, it expects everyone to have an email address, which doesn't work in 2024.
It makes people pick a username, which is very archaic.
And it expects your name to split into first name and last name, which for many cultures doesn't work.
So, yeah, I'm very with you that the user model has not dated well, unfortunately.
Yeah. So, I mean, what I'd kind of like to do is cut it down and trim it, you know, find a way slowly.
It's obviously over time because Django is very stable
and we have the migration policy, but just reopen that debate
about whether we can trim off the bits and somehow do it.
I think we gave up a little too early on that.
So I've been experimenting in that domain.
Yeah, I love that.
I mean, I use the user model as a key that other things key onto,
and then I have a separate table of Google accounts
that have been associated with an individual.
I don't make people pick a first name and last name.
Yeah, all of that kind of stuff.
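As a rough illustration of the shape Simon describes here (a bare user row that other tables key onto, with linked Google accounts stored separately), here is a minimal sketch using Python's standard-library sqlite3 module rather than Django's ORM. The table names, column names, and helper functions are all hypothetical and are not taken from Simon's code.

```python
import sqlite3

# Hypothetical schema: every table and column name here is illustrative only.
SCHEMA = """
CREATE TABLE user (
    id INTEGER PRIMARY KEY            -- identity only: no username, name, or email
);
CREATE TABLE google_account (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES user(id),
    google_sub TEXT NOT NULL UNIQUE,  -- assumed: a stable identifier from Google
    email TEXT                        -- optional; lives here, not on the user row
);
"""


def connect():
    """Open an in-memory database and create the two tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def create_user(conn):
    """Insert a user row that carries nothing but its primary key."""
    return conn.execute("INSERT INTO user DEFAULT VALUES").lastrowid


def link_google_account(conn, user_id, google_sub, email=None):
    """Associate one more Google account with an existing user."""
    conn.execute(
        "INSERT INTO google_account (user_id, google_sub, email) VALUES (?, ?, ?)",
        (user_id, google_sub, email),
    )


def user_for_google_sub(conn, google_sub):
    """Return the user id linked to a Google account, or None if unknown."""
    row = conn.execute(
        "SELECT user_id FROM google_account WHERE google_sub = ?",
        (google_sub,),
    ).fetchone()
    return row[0] if row else None
```

In this sketch one user can have several linked accounts, and nothing about the user row assumes a first name, last name, or username.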
Yeah, I know.
Okay, good.
Do you use django-allauth, or do you write your own then, it sounds like, to manage social authentication?
I mean, I use, yeah, so how am I doing social authentication?
I think I rolled my own Google OAuth thing, which keys against Django.
I think I've probably got code for that lying around somewhere.
And I've done that before in the past.
I tend to use the default user model, partly just for the admin. It's the most convenient way to get the admin up and running, and the admin is such a key feature for me to quickly iterate on what I'm doing and build out internal tooling and so forth.
And yeah, I get it. When you're starting a new project as well, the last thing you want is to stop and say, oh, I need half a day of planning my needs for a user model. You just want to start.
And then, yeah, I'm impressed you implemented your own auth.
I've done it so many times at this point.
Right, okay. So it's nice.
Yeah. I mean, my blog runs on Django, simonwillison.net, and that's open source. It's not a very complicated application, but it's all sat there on GitHub. I tweak it about once every three or four months. I'll go in and tweak something about it, and it's always fun. It's also managed by Dependabot, so it magically upgraded itself to Django 5 a few weeks ago. It just did it, which was great. I didn't have to think about it. I've got just enough automated tests that I trust the thing's going to work after I apply updates. And that's a nice way of staying connected with what's going on in a very low-risk environment as well.
Yeah, I saw you put out a post
a little while ago, just on the blog topic, about how to build a blog in Django. It was really good. It was a kind of checklist that you could run through of how to build a Django blog. And I see everybody struggling with Hugo sites and this static site generator or that static site generator, and I sometimes think, no, just run your own Django app, because it's great to have that playground. And you've been doing it for like 20 years with the same Django application, just evolving it. It's lovely, really.
Yeah, pretty much. I think the very first version of my blog was PHP running on my university's shared hosting with flat files. Was it even the PHP equivalent of pickle? I think it might have been PHP's pickle equivalent of just a big array of posts and stuff. And then I flipped over to Django, actually only about 2000 and... no, it's on my blog somewhere, when I first ported it to Django. And then I did a major upgrade in 2017, when I came back after not blogging for like seven years, and did the Python 3 upgrade and stuff. And I've just been iterating on that ever since. It's great.
But also Jacob Kaplan-Moss, his blog is built on Django, and if you go to his GitHub repo, it says forked from Simon Willison.
I'd forgotten that.
Oh, that's brilliant.
Actually, I stole a feature off of him a few years ago. He has this idea of a series of posts around a certain topic, and so I added series to my blog, inspired by what he'd been doing.
Didn't he open a pull request to get it merged into the upstream branch?
He didn't, but it wouldn't have surprised me if he had.
That's what Carlton would have done.
Yeah, I don't know. I don't know. Anyway, we'll carry on.
Oh, I guess just one more, putting your kind of old man hat on with Django. I've heard you mention the fact that Flask can be a single file. I don't know if you kept up with this, but Carlton did a talk in 2019 on single-file Django, and then at the most recent DjangoCon US, Paolo Melchiorre has a whole repo of, I think it's six lines, kind of proving that you can do Django in a couple of lines. And I think about this for teaching, because my brother-in-law is going through a coding boot camp, and I'm like, hey, I'm here, let me help you. And he's like, oh, we're doing Flask. I almost feel like, I don't know if it's worth showing a single-file blog in Django or something, just to make the point that, hey, it's possible. Because even in Flask, no one does it that way. You could, but no one would do it that way.
I do. I love the single-file thing. I actually built my own Django single-file thing like 10 years ago, something called djng.
And yeah, that was basically just trying to do a little thin shim that lets you do a Flask
imitation on top of Django.
Because I love that for just hacking out quick
things, not having to bother
about the directory structure and so forth.
So yeah, I'm thrilled to hear people are still
pushing ahead on that. It's
a great idea. If we were to design
Django today, I'm certain it would be
capable of doing single file
out of the box. That just makes sense to me.
Okay, one more and then I'll let you go, Carlton.
This question comes from Eric Matthes,
who wrote Python Crash Course,
and he was asking, you sort of answered it, but:
What is your preferred way of building web apps today?
I think specifically on the front end, having seen it go from server-side rendered to jQuery to SPAs and now, I guess, HTMX.
But where do you fall on that pendulum?
So I spent a few years trying to do the React thing because it was clearly the way it was going.
And I hated it so much.
The thing I hated, it's the build script.
I hate it when you have a front-end project which you work on every six months, and you come back in six months, and nothing works.
You have to re-spin up your webpack configuration, all of that kind of stuff.
And so a few years ago, I said, you know what?
I'm going to give myself permission to write JavaScript like it's 2008 again.
And so no libraries, no build scripts, no TypeScript, nothing like that.
Just like a little bit – because the thing is that we used to use jQuery because of the browser differences.
But the browser differences are gone, right?
Today, document.querySelectorAll and all of that stuff works exactly the same across everything.
So you can build code like you're using jQuery, but without using jQuery. You just write event handlers and so forth.
And it was so liberating. Suddenly I enjoyed front-end development again, because I didn't find myself fighting webpack and Vite and whatever the new cool stuff is.
And I could go back to projects I wrote like this two years ago, and I can drop in and maintain them and add new features to them.
And on top of that, the language model stuff: ChatGPT is really, really good at all forms of JavaScript.
So it's not like I ever find myself stuck trying to remember how a certain API works. If there's something which is going to be a bit tedious because the JavaScript is going to be 20 lines of boilerplate, it'll spit out the 20 lines of boilerplate. I can just let it get on with it.
So yeah, I've got really into that.
I have played with HTMX on a couple of projects.
I really like it.
It fits my – I've always been into the sort of unobtrusive JavaScript,
the idea of progressive enhancement.
HTMX is so good for that kind of thing.
And so I really – I like that.
I love that that's getting popular, and I love the performance
that you get from it because you don't have to serve a megabyte
JavaScript bundle just to show a contact form or whatever.
And then Datasette itself is very strictly just HTML.
And when you click a link, it loads a new page.
But I've been playing with the Chrome view transition stuff recently, which is super, super interesting.
Like cutting edge Chrome, I think you might still have to turn on one of the experimental flags.
You can actually serve up CSS that says, and when the user navigates from this page to this page, keep this area of the page stable and sort of like blur update this other bit.
And it's like a couple of lines of CSS, and suddenly it feels like a SPA.
You click a link, and only part of the page updates and so forth.
But it's a real navigation.
There's no JavaScript involved.
That's thrilling.
I can't wait to see that roll out to other browsers as well.
I have to ask.
Well, sorry, Carlton.
Go on.
I promised the last one.
Just with bundling, because I just did a redesign of my main site, which is using Tailwind.
And I like Tailwind, but it's a little disappointing.
I now have to have like Node and stuff running to,
it's almost like it's switched from JavaScript to CSS now
to have a build script for everything.
Yeah, this is one of the reasons
I've not adopted the modern CSS stuff as well
is I just, the build scripts,
they're fantastic for larger, more complex applications.
The stuff I do, I always try and keep it small
and simple enough that you don't necessarily need that.
And then they just become friction.
Like it's just something that prevents me
from being able to,
because I have so many projects on the go at once.
I've got, what, a hundred, nearly 200,
I've got some ridiculous number of actively maintained projects.
And the only way to do that is to make it as easy as possible to drop into
something that you've almost forgotten all of the details of and get it up and
running again. I feel like with the front end build stack, if you do,
you work on the same projects every day, it's completely fine.
It gives you a huge productivity boost and that there's none of that friction
because you're, you've constantly got that stuff sort of warm in your head.
If you drop into a project every six months, it's completely different.
And that's what I like to optimize for being able to hop across hundreds of different projects
and make small changes to them without getting stuck on the building.
That's the exact same point as the boring technology talk, right?
Is if you focus on one or two or three technologies, then you're able to really get the most out
of them rather than spreading yourself thin over, you know, say half a dozen and that
slows you down.
It's the sort of same...
But the secret to running lots of projects is they've all got to be as boring and similar as possible.
Like I've got 100 odd repos.
They're all Python Pluggy plugins, Jinja templates, like Datasette plugins.
They're all the exact same shape.
They've all got GitHub Actions running workflows and so forth.
It just works.
Okay, good.
Interesting.
So you mentioned LLMs there and ChatGPT and things like that.
But I wanted to ask you something before we talk about those in more depth,
which is that, as well as doing all this amazing work in open source,
you're now on the board of the PSF.
Yes.
So can you tell us a little bit about what you're doing there
and how you're finding it because you're new.
This is your first year on the board.
It's my second year now.
I just hit the 12-month point, and it's interesting.
So the reason I'm on the board of the PSF is that I'd been hassling the PSF
on a sort of low-grade basis every now and then.
I'd go, I'm really annoyed that the PSF isn't doing more
to help make Python easier for people to get into, like solving the horrors of the Python development environment, all of that kind of stuff. And also the fact that it's very difficult to distribute applications written in Python, because you don't want people to have to install Python to use your stuff.
And I almost had a snap judgment one day. I was like, you know what, it's not reasonable for me to complain at the PSF and not offer to help and not try to do something. So I put myself up for election on the basis of, these are the problems that I think the PSF should be addressing. And I got elected, which was a little bit of a surprise. I mean, I think it is name recognition, because you show up on the list of names and people are like, oh, I recognize that person, or whatever.
And of course, now that I'm in the PSF, I realize the PSF is not particularly well equipped to solve the problems that I was most interested in solving.
Nobody told you?
Well, it's always difficult to understand quite what these organizations are able to do. The PSF is basically about money that's raised and distributed around the Python community and its initiatives. The PSF's focus is on the community and the health of the community. There's a huge amount of sponsorship of events and initiatives like that, which is fantastic. The stuff I care about is not completely
aligned with what the PSF is for, but it's not unaligned either.
So what I'm having to learn is, okay, how do I align what the PSF
can do with the things that I want to get done, in a way that supports
the missions of the organization and so forth?
So it's been a huge learning curve. This is my first time on the board of a non-profit. It's understanding what levers are available to pull and what priorities make sense and so forth. And yeah, the first year I was mainly just trying to understand how this thing is shaped and what it can do. Now that I'm through that, I'm looking forward to maybe trying to tweak those levers a little bit myself.
To put all your weight on this one.
This is why at the DSF we just switched to two-year terms, exactly for this reason, because it basically takes a year to get up to speed. During COVID we had less turnover, and I feel like we got a lot done because we had largely the same crew for two or three years. It takes a year just to understand how it works, right? Sorry, I interrupted though. You were going to...
No, I was just making a joke. But when Simon was talking, I was like, this is exactly Will's experience of being on the DSF board: people think, oh, the DSF can do this, the DSF can do that. And what I hear from Will, from his experience on the board, is that actually the DSF can't do very much.
Well, I mean, I think
it's interesting. I mean, we're going to have Jacob Kaplan-Moss on in a couple of weeks, who just rejoined the DSF after obviously working on Django and being, I think, the first president. But when I joined, I had a similar list of things I wanted to do, and in hindsight I guess I was lucky that they were all things that could be done, around sponsorship and, God, I forget. I have a blog post on it. But I hadn't thought about the fact that maybe the things I wanted done didn't align directly with the mission. But you're right. Fundamentally, these organizations are about money and community and helping others. I mean, one thing the DSF is now doing is having working groups, which the PSF has had mixed success with, but at least some success, whereas historically everything just goes to the DSF. And when you're on the board, that's kind of its own thing. It's unreasonable to be on the board and spearhead an initiative, from what I've seen. I imagine it's similar on the PSF, or I don't know. Are you thinking of actively doing it yourself, or more that you can help spin up a group?
That's what I'm still trying to figure out, because the other thing is that the PSF has staff.
Unlike the DSF.
And the staff do incredible projects.
I mean, PyPI, I think, is one of the most impactful things that the PSF does outside of the PyCon and event sponsorship and so forth.
And so the directors are not there to do the work.
The directors are there essentially to sort of help make those high-level decisions, help set strategy, and, yeah, make decisions about where the money goes to a certain extent.
Yeah, it's understanding, okay, also, what's ethical and responsible to do?
Like, if I throw all of my weight in trying to push the PSF in one direction, am I actually starving other important initiatives that the PSF are doing that just don't happen to align with my own personal interests?
Right.
Well, in a sense, pure is not the word, but quite a few people work for big tech companies, and so there's even more of a potential of, I don't know, not conflict, but it gets a little, you've got to watch what you're doing.
Yeah, I mean, I think there are rules about how many people on the board of directors can work for the same company, as there should be. Because yeah, that's always a risk with these kinds of things. My not being employed by a large company gives me an aspect of independence. But most of our board members are independent. I'm not unique.
Okay. I think maybe when Jeff Triplett was on the board, I think maybe it was the allocation was a little bit different.
But yeah, I mean, we just we just had new board members join a couple of months ago.
So I think we've had quite a reshaping just recently.
OK, just before we move on.
One last one, and I promise.
So, again, as this Django person, but a little bit of an outsider, I can ask you all the questions that, you know, I want someone to weigh in on.
So an executive director, we're going to have Deb Nicholson on the podcast in a couple of weeks.
There has been talk of Django potentially having one.
What has your experience been seeing an executive director at work in one of these organizations?
Can you imagine one of these organizations without one?
So is the DSF considering having a paid full-time person as an executive director?
It's something Chaim, the president, and others have discussed, and I'll give my two cents.
I think a lot of this stuff won't happen absent someone full-time to do it.
Absolutely, yeah.
That completely makes sense to me. This is one of the problems with boards of directors: if everyone's just a volunteer who's investing a few hours of their time a week, or maybe a month, it's very difficult to make progress on things. You'll have a meeting, and not much will have happened since the last meeting. Once you've got an executive director and staff, that completely changes. There's constant forward motion.
I keep telling people that I think the best thing about the Django Software Foundation has always been the fellows. I don't understand why other community-driven open source projects aren't trying to imitate this, because it works so well. The PSF now has, I think, at least two fellows, inspired by the Django Software Foundation, and those are incredibly impactful. The work that Seth's been doing around security is a relatively new addition, and it's absolutely extraordinary how much impact you can have with that. So yeah, I'm very, very keen on the idea of these non-profit, open source supporting foundations that actually have staff that can just keep on making progress on things.
But my experience, having been the fellow, is that these other tasks, these non-fellow tasks, would arrive, and it'd be like, okay, well, I've got a bit of time in the week, I can do that. But it wasn't really the fellow role, and there wasn't enough capacity to make any sort of significant progress on, for instance, reworking the djangoproject.com website. Okay, I can do a little bit of work on it, but it's literally an hour or two here or there, and not the months-long project that's going on now to actually do a proper assessment: what does it need, and how do we refresh it in a professional way, to a 2024 kind of standard, rather than just, oh yeah, can you make a tweak here?
It turns out there's a lot to be said for having somebody whose job it is to get specific things done.
Yes, exactly.
Yeah, so I think that sounds very, very sensible to me.
Yeah, I'm biased. I think it was Anna, the past DSF president, and I who had a private call with Deb Nicholson, the new one, and she went through what one does in that position, and we were just like, oh my God, we so need that. So yeah, I put that out there, but
I'm not on the board now. But Carlton?
Yes, can I nudge us on? So I've been using Copilot and whatnot, and I think it's awesome. You mentioned JavaScript earlier. My JavaScript's come on so much, because it's a bit like, how do I filter this array to get the one value that I need? Previously that would take me 10 minutes of looking up, because it's something I do once every six months. But now I can just ask the LLM. It's got it. And it's not groundbreaking code. It doesn't have any value other than it saved me 10 minutes. So I guess my question is, how can I continue to leverage that, and your tooling, how can I install that? And can I get something equivalent to the closed source that I can use that's open source?
Wow, that's a whole bunch of things to talk about.
Yeah, but let's get into it.
Yeah. I'm with you. The thing that excites me about LLMs is I love them as sort of teaching assistants, right? I can ask the dumbest question in the world at three in the morning, and I'll get an answer, and it doesn't judge me.
Not knowing how to do a for loop in Bash or whatever it is. You don't want to post it on the Django forum, like, how do I do this?
Exactly, exactly. I love that. And I love that it lets me be so much more ambitious with the projects that I take on. A great example: I shipped code in Go, because I needed a little high-performance network proxy router thing. I ended up writing it in Go, because I don't know Go, but ChatGPT, GPT-4, knows Go throughout.
And I know Go just well enough to read the code and be able to tell if it's doing the right thing.
And I can get it to write tests.
So I ended up building this, like, 100-line little custom Go server thing with comprehensive unit tests and GitHub Actions running continuous integration.
I got continuous deployment running, all of the things that I consider to be important for, like, robust projects.
And I shipped it, and it's great.
Like, last month I had to make a change to it.
And I fired up GPT-4, and I worked with it, and we figured out what to do.
And I mean, that was extraordinary, because normally I would never write something in Go. I'd be fine tinkering with it, but I'm not going to write production code in a language that I'm not completely fluent in. In this case, I feel like me plus GPT-4 is fluent enough that I'm willing to deploy code written in a language I'm unfamiliar with. I've written code in AppleScript. AppleScript is notoriously a read-only language: you can read it and see what it does.
There's like a continuum. There's AppleScript on one end and Perl on the other, read-only and write-only.
Absolutely. But yeah, I'm using AppleScript for things. I'm using all of these weird little domain-specific languages. I use jq all the time now, because jq is really powerful but I can never remember the syntax. So I love it as a sort of accelerator for me doing lots of things. I'm taking on more projects, which is terrifying, because I already had too many projects. And I'm like, oh, me plus ChatGPT, I can probably get something working in 20 minutes. And of course it takes two hours, but still, at the end of that two hours I've got something that works and is interesting that I wouldn't have built otherwise.
But it's that first 20 minutes that you wouldn't have put in that gets you to the two hours.
I do so much coding on walks with my dog now, because I can be walking the dog and on my phone I can just prompt it to write me some code that does this. I can use the Code Interpreter mode, where it actually runs the Python code it generates. So I can get back from an hour-long walk with the dog and I've got 50 lines of Python that I know works, because it actually ran the code, found the bugs, fixed them, all of that kind of thing. It's incredible. You can even turn on voice mode. I can literally talk to it while I'm on a walk with the dog and it writes code for me. That's utterly surreal, that that's even possible. So yeah, I love that aspect
of it. And yet, as you mentioned, the problem with ChatGPT is it's from a company called OpenAI that could not be more closed, right? It's this proprietary hosted model. They change it all the time without telling you what they've changed, so people keep on complaining that it's got weaker, it's worse at X, and so forth. I never know if that's actually true, because it's basically a random number generator, so it's very easy to assume that it's changed when it hasn't. But that's really frustrating.
The great news is that in the past 12 months we've had so many new options for running these things ourselves, these openly licensed models that you can run on your own hardware, and they're beginning to get pretty good. I don't use any of them on a daily basis, because GPT-4 is so good that it's sort of my default, but I'm constantly experimenting with them.
My two favorites at the moment are these Mistral models. There's Mistral 7B, which literally runs on my telephone. There's an app that runs it on my phone, and it's not awful. I was on a plane and I was using it to do the kinds of things I might have looked up on Wikipedia. And okay, it'll probably hallucinate stuff, so don't depend on it telling you the truth, but it's still useful for just starting to explore different ideas.
And then the other one is this new one called Mixtral, which is a Mistral mixture-of-experts model. They just released that a month ago, and that runs on my laptop, and the quality begins to feel like ChatGPT 3.5. It's very, very good. So if you've been resisting using these things because you don't want to use some weird hosted model by some closed "open" company, Mixtral is something you can run on your laptop right now. The whole thing is Apache licensed, although whether it's truly open source is up for debate, because they won't release the training data it was trained on.
Right.
I think that's the source code, right? I think for these models, the raw training data is the source code that was used to compile the model. Because you can't open source that training data, because you ripped it all off. It's full of copyrighted data, and you can't just slap an Apache license on someone else's copyrighted works. But yeah, this stuff is really exciting. It's really
interesting so i want to come back to your tooling but you've just mentioned the the copyrighted
training data thing and so there's this um these lots of cases where the the llm will reproduce
its training data almost exactly um in in cases so the the new york times um we've got this um
lawsuit perhaps you can explain the thing i could i kind of see it though and i'm like oh wow yeah
it is actually reproducing you know you type in underwater sponge and you get SpongeBob
SquarePants come out from one of the image generators for instance this is so fascinating the
ethics of this entire space could not be more murky like every aspect of this space you're like
wow is that okay and the answer is maybe not it's all very bad and that so a lot of people have
ethical qualms against this and i agree with everything that they're saying you know
The New York Times thing is – so the most recent thing is the New York Times filed a very big lawsuit against OpenAI a few weeks ago.
It was against OpenAI and Microsoft, and it was complaining about three different things.
It was complaining that, firstly, you took all of our work without permission, and it's copyrighted work, and you used it to train your model.
And I don't think anyone is disputing that that is what OpenAI did.
They used New York Times data as part of a vast amount of training data that went into these models.
It's effectively, you could look at it as it's a crawl of a sizable chunk of the Internet that was used to train these things.
But that included New York Times data.
The New York Times say that OpenAI put more weight on the New York Times data than they did on other data they trained on because of the high quality of that training data.
So I don't know if that's conclusively proved or not. I think the GPT-2 paper a few years ago did explicitly list that the New York Times data was being used like that. So they might be assuming that that's still true. It probably is still true. But this is one of the things I'm excited about this lawsuit is I want discovery because I want to know how GPT-4 was trained because they haven't told us. So, you know, if that comes out of this, that would be useful.
So complaint number one, they trained without permission. Complaint number two is that the models can spit out exact copies of New York Times articles. And this was news to me. I thought that the act of training muddled the stuff up to the point that it won't spit out exact copies.
it turns out if you set the temperature to zero and then feed it the first two paragraphs of a
new york times article it can often spit out the next four paragraphs and sometimes there are very
slight differences like one word will be changed but effectively it's it's it's memorized and it's
regurgitating the same thing but if you if you if you tried to publish that that would be clear
violation of copyright it would be exactly and then so the question is well are open ai publishing
that just by having an interface where people can see it.
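To illustrate the temperature point, here is a minimal sketch of how a sampling temperature turns per-token scores into a choice of next word. The toy scores and the function name `sample_next_token` are invented for illustration and do not describe how any particular hosted model is implemented. The idea it shows is that at temperature zero the sampler always picks the single highest-scoring token, so the same prompt tends to produce the same continuation, which is the setting under which memorized text is most likely to come back out.

```python
import math
import random


def sample_next_token(logits, temperature, rng=random):
    """Pick the next token from a dict of token -> score (hypothetical toy data)."""
    if temperature == 0:
        # Greedy decoding: no randomness, always the highest-scoring token.
        return max(logits, key=logits.get)
    # Divide scores by the temperature, then apply a numerically stable softmax.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens], k=1)[0]


# Invented scores for three candidate next words.
toy_logits = {"the": 2.0, "a": 1.5, "one": 0.3}
print(sample_next_token(toy_logits, temperature=0))    # deterministic choice
print(sample_next_token(toy_logits, temperature=1.0))  # may vary between runs
```

With a higher temperature the lower-scoring words get picked more often, which is why regurgitation shows up less reliably in ordinary chat settings.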
And that's, I mean, so many of these things, I don't think there's a, obviously,
legal i'm not a lawyer at all but there's a reason this is going to go to court because
these are legal questions that are very blurry and unanswered so complaint number two is it
can regurgitate their content and they've said um this means that people bypass our paywall by
getting the model to spit out articles which is a bit of a loose claim because you've got to have
the first three paragraphs of the article anyway but they did have a really interesting thing where
they talked about Wirecutter right where Wirecutter is a new york times company it does
product recommendations if you ask chat gpt for product recommendations it will often spit out
Wirecutter's picks but it won't give you the referral link that's Wirecutter's business
model and this is the the definition of fair use in american law specifically talks about um whether
the thing is competitive with the thing that it ripped off and so the new york times case the main
thing they're trying to demonstrate is this competes with us this is harming us financially
because you can bypass that paywall you can like rip off Wirecutter recommendations all of that
kind of stuff so that's argument number two complaint number three is actually about retrieval
augmented generation it's about the thing that um microsoft bing does and uh chat gpt browse does
where you can ask it a question it goes and does a search on the internet and it'll find the
new york times article about something read bits of it and then like summarize that and give you
the summary back again and so then the new york times is saying well look you're clearly subverting
our paywall you're profiting from content that's derived from us now that one
it's almost the one that worries me the most in terms of i think they've got a completely
fair point in complaining about this but summarizing stuff is my favorite use of llms
like if we come up if we end up with legal precedent that you can't even copy and paste
data into an llm to get a summary back out again that would be very harmful for for the sort of
the ways that these tools are most useful but that's the problem is that i read the 69 page
lawsuit and it's very clean it's very well argued and i think like oh like i said not a lawyer but
all of these points feel to me like points that are worth putting in front of a judge and jury
and and trying to get answers about yeah i think i mean two things come to mind from what you've
just said one is um i know google has been um told it has to pay news publishers in various
countries at various times because it does exactly that if you google the news in the country in
australia i'd pick australia i don't know if it's applied in australia but it will go and you know
get the sydney morning herald and summarize that without you ever having to leave google.com and
they were you know the one thing about that is that those lawsuits they were just about the
headlines the headlines like the first few words of the story even and what these what generative
ai is doing is so much more than that google are
clearly going to be on the chopping block next after openai and microsoft because
they've got a prototype um like an alpha version of their search page that does exactly that it
just adds generative ai and it spits out a generated answer to your question at the top
they've been doing this with their like little content snippet boxes and so forth as well over
the past few years and it's super worrying right if you've got a web where nobody ever clicks a
link from a search result because they just get their answers right there in search what point
is there in trying to like build a profitable web business anymore you know so all of these
ethical complaints are very very legitimate here's a meta question for you so we know now that
llms are being used to generate a lot of the content on the internet how do you see this
going forward if the llms are going to be really trained on themselves do you think that like is
this is 2021 the the you know the the high point or is there a way out of that because it seems a
bit like a vicious circle it seems like an ouroboros situation does doesn't it and it's
people have been talking about this for a couple of years now and um at one point i heard that
open ai the reason they hadn't updated their training data like there was a training cutoff
of what september 2021 i think and the reason they hadn't updated it is that after that point there
was enough usage of these tools the internet was beginning to fill up with llm generated text and
they didn't want to train llms on llm generated text because of the ouroboros effect at the same
time in the openly licensed language model community almost all of the really good ones
are actually trained on gpt4 output like the way you the way you build a really useful um like chat
tuned language model is you need to give it 20,000 examples of good conversations and the easiest way
to get those is to get gpt4 to spit them out and then you train your model on gpt4 and so if it if
that was such a bad thing we wouldn't be seeing models that were trained almost exclusively like
that show up at the top of the leaderboards so i think this is all i mean this is all part of the
larger problem that we really have very little insight into how these things work they are giant
like 16 gigabyte blobs of floating point numbers we're and and we're still trying to figure out
just the basics of how you sort of poke around inside that weird matrix brain and figure out
how it's working and what it's doing and so yeah maybe the fears of llms training on llm output
don't actually work out maybe it's okay maybe it's complete catastrophe we have no idea
and it's funny that we had no idea six months ago and it feels like we still have no idea now
So despite the rate at which this technology is improving, the rate at which our understanding of it is improving is very sort of dubious in terms of how much we can figure out.
So it really is a new world.
It is. And as a computer scientist, it's infuriating, right?
Because I like computers that do exactly what you tell them to do.
And you can write tests and you can fire up a debugger and everything is repeatable and understandable.
And these are not that at all.
It's like a completely sort of weird, blurry alternative world in which everything's based on vibes.
You come up with, you pick a model and you poke around with it and you see if the vibes feel right.
And then you tweak your prompts.
And does that seem better?
I mean, it kind of does, but it's awful.
It's really difficult to do sort of responsible development on top of it.
It does seem like the closed LLMs, like, you know, like if I'm a hospital or if I have billing records or like very niche-y things,
LLMs are fantastic. And especially like I'm in Boston, there's a lot of research places. They're like, we can't use an open LLM thing, but these closed things are definitely being sold and used on whatever industry company has huge amounts of their own data. I would say I almost feel like that's got more promise than this, like the entire web being, you know, stolen approach in the long run.
Well, the flip side of that is, so Bloomberg built their own, they trained their own language model on the internal financial documents. It was supposed to be the best possible LLM for finance. And then it turned out that GPT-4 came out, and as a general purpose model, it was beating the Bloomberg one on financial tasks.
But this is one of the things that's so challenging right now is the rate of improvement of these
things such that if you've got a project that will take six months, you maybe shouldn't
do that project because you might spend six months on it and then GPT 4.5 comes out and
it solves the problem that you just spent six months trying to solve.
And so there's this interesting strategic problem where at what point do you actually
settle down and start building on this stuff as opposed to thinking, you know what would
be quicker is if I waited two months and then started building because I'd get a better
result than if I started building today. And that's absurd, but that's genuinely the position
that we find ourselves in. That's Zeno's paradox for the 21st century. Completely.
Well, have you ever read, there's this book, AI Superpowers, that's a couple of years old now
by a Chinese American. He works in China, a US researcher. And I read that, I think I read that
five years ago. He summed all this, this is before OpenAI came out, but he basically said,
you need three things. You need the algorithms, which finally, like we had at that time, you need
training data, and then you need processing power. And he argued with the cloud that basically it all
came down to data. This is back in the day, because we had the algorithms, they're basically
open source. We have the cloud computing. And so it's really all about training data. I think he
went on to say he thought China would surpass the US for that reason, because it has no privacy
controls but all of that is to say to you where do you do you see it as tweaks in them is there
more juice to squeeze out of these llm models do you think or is it really more about crap like
a data science thing where it's all about what you put in and trying to optimize that i'm trying to
pick between the two i'm very confident it's both um mainly because if you look at the open model
community over the past like since since since february people just keep on coming up with new
little tricks that make the models run faster and smaller like the fact that i can run a gpt 3.5
class model on my laptop now and i certainly couldn't do that a year ago because like the
models that were coming out and like the first versions of llama and stuff were much larger
required much more hardware much less optimized um so there are so many techniques that can be
used to make these things like smaller and faster right i want a model that works
on my phone and can do the things that i need to do i want it to be able to summarize and extract
facts and call functions and all of that kind of stuff um but at the same time people keep on
finding that the higher quality the data the better like there really is so much to be said
especially when you're fine-tuning these models for just having super super high quality data
that you feed into them um if the new york times thing plays out one way we may find that it's no
longer possible to just steal the entire internet and train your models on it at which point
that raises some really interesting questions the thing that worries me most about
that is does that mean that llms then become incredibly expensive to build because of the
licensing costs to the point that you don't give them away for free and so does that mean that only
people who are very wealthy can afford to use these tools whereas today anyone who can afford
an internet connection has access to some of the the best in class of these models so that really
scares me like the the that that i feel despite the fact that the um the the ethics around copyright
i mean there are very very real concerns here but at the same time a world in which only the
the most wealthy have access to the to these tools that feels unfair to me as well yes and we can't
lock these tools up they are super useful like to take them away would be foolish like also if you
banned them i've got a usb stick with half a dozen models and you create a black market of people
it's very cyberpunk right people swapping usb sticks with like with the last version of
mistral that was released on them super so there was a paper a little while ago just
pick up what you said there about open ai saying we haven't got a moat or something like that um
there was a leaked memo from google it was somebody within google put this memo together
saying saying there is no moat for this technology um it's interesting to revisit that that i think
that was it was quite it came out in maybe march or april of last year and it's interesting to look
back at that now and say okay how much of this played out because uh one of the real challenges
with this stuff is um if it's all just driven by human language prompts the cost of switching to
another language model might be as simple as saying okay we'll run this against claude instead
of gpt4 and maybe that will give you the exact same effect right um or maybe it won't because
so much of the the prompting comes down to these very small tweaks that you make where you're like
oh okay if i capitalize the instructions to output in markdown maybe it'll actually listen to me this
time but that effect itself is kind of hurt by the fact that openai upgrade their own models so
just because that works now will it still work in a few months' time it's kind of
uncertain. So that's part of it. There's also the fact that the closed model providers are up
against tens of thousands of researchers around the world collaborating together. That's something
I really like about the open model community is there's all of this sharing and this acceleration
that comes from just having tens of thousands of people worldwide all trying to solve these
problems. And OpenAI are an incredibly talented, experienced set of people, but I still don't like
their chances against tens of thousands of people around the world although of course when those
people around the world figure something new out cool open ai can just take that research and use
it themselves so so you can they can sort of keep up that way um but yeah it's and there's also
there's the compute right like it's we still don't know why gpt4 is so much better than everything
else um the most likely thing is that they ran it they trained it for longer and they trained it on
more data than anyone else has been able to do yet but still people are catching up now that there
is if you have a hundred million dollars maybe it's worth funneling that into data
and training you know that it's not like there's a shortage of investor money floating around the
space at this point yeah i guess and the economics are so crazy because yeah it's a hundred million
dollars but then then it's just a file that you know anyone you can sell to anyone for virtually
nothing right when people people often complain about the environmental impact of language models
where they say well look training like training a language model takes this enormous amount of
carbon dioxide which is true at the same time it's about the same amount of carbon dioxide as flying
a boeing 747 across the atlantic twice you know which is a vast sum but i would argue it benefits
more people because your airline flight benefits the people on that plane the language model if
it's then used by a few million people over the course of six months it feels like you are getting
more value for your for your for your sort of carbon dioxide at that point yeah i have a
question about the um carbon the co2 usage so i my understanding of ml which is machine learning
which is quite limited but it was that the training was the hard bit but then once you what
you get out of um the training algorithm is a kind of vector operation which you can run almost
you know quite cheaply and then i saw though people complaining um and i didn't have the time
to follow up but that every time you generate an image with dall-e or whatever that uses so
much water or so much this because it's still computationally expensive to run the model not
just train the model is that true so this is an interesting question so like i said i run models
on my iphone i run models on my laptop i am not worried about their resource constraints um but
again i don't know what gpt4 is running on i'm pretty sure it's running on a full server rack of
of gpus so my hunch is that for the very large models yeah there's a lot of cost in the inference
i still think it's a fraction of what it costs to train them that's that's the intuition i've
i've gotten from this um and you know like the image like stable diffusion also runs on my phone
so there are versions of these models where the environmental impact of running them is no worse
than turning your laptop on that's but but i don't really have good insight into what the large
hosted models are doing so i had so we've talked all about llms i wanted to ask about your tool
because if i want to run this you've got the perfect tool for me to download and do so please
tell us about that because we we've talked about all the exciting things okay okay if i actually
want to do it what do i have to do so i built this tool in python called llm i got lucky llm was
still available on the package index. So you can pipx install llm, and you get a command line tool
for interacting with models. But what's really fun about it is that it's inspired by Datasette.
It's all based around plugins. So out of the box, you can give an OpenAI API key, and it will run
against OpenAI. And then there's about a dozen plugins you can install that will add additional
models, including models that run on your own machine. So you can essentially pip install
my tool, and then pip install a plugin
that adds a language model to it.
And now you've got a four gigabyte file
on your computer that you can start interacting with.
But crucially, the interface is the same
no matter what model you're using.
So it's LLM space, double quotes,
your prompt or you can pipe things into it as well you can do cat my file dot txt pipe llm
and then if you by default it'll use your default model if you stick dash m space claude on the end
and you've got the claude plugin it'll run it against claude and so forth and um of course
everything it does is logged to sqlite because i do everything with sqlite so one of the great
things about using this tool is that it's a way of building a sort of database of all of your
experiments across all of the different models so i just use it on a like daily basis for all
sorts of different bits and pieces and i've accumulated like a few thousand prompts and
responses in my sqlite database of things that i've tried out maybe at some point i'll do some
analysis on that and try and start comparing models that way but really the fun thing about
it is i'm trying to make it so whenever there's an interesting new model you can install a plugin
and start playing with that model and that works for hosted models and it works for local models
as well um and yeah it's it's really really fun to hack on one of the things i've realized from
playing with it one of the original ideas is um the unix philosophy the unix command line of piping
things to other things is an amazingly good fit for language models because the language models
it's a function you you pipe it a prompt and it gives you a response and so one of the things that
i use my tool for is um it's um it it ties into this concept of system prompts which is something
that open ai did originally and other models have started picking up where you've sort of got a
second prompt that gives you instructions about what to do with your other data so a great example
is i can take i can take a um file and i can say cat my file.py pipe llm dash dash system write me
some unit tests and then the model gets the prompt write some unit tests and it gets a bunch of
python code and it'll spit out a bunch of unit tests and of course they won't be exactly what
you need but it's that skeleton that you can start hacking on it's really good at explaining
code i pipe it code and say explain what this thing does um i use it for uh release notes not
to publish i kind of feel like it's rude to just straight up publish something that an llm wrote for
you because i mean what are you doing right like it's fine to
publish something which you're willing to sign your name to because you at the very least reviewed
it extensively and hopefully revised it and tidied it up but there are lots of projects out there
that don't bother writing good release notes and what you can do is you can check out their git
repository and you can do git diff between this version and this version pipe llm dash dash system
write release notes and gpt4 can understand a diff format it'll it'll read it and it'll spit out
release notes which in my experience are about 90% correct and 10% slightly wrong or maybe there's
hallucination in there and that's fine right that's good enough for my purposes just saying
okay what have they done in this release that they didn't bother writing release notes for
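The release notes pipeline described here can be sketched as a shell command assembled in Python. This is a minimal sketch that only builds and prints the command line and does not run git or the llm tool. The `-m` and `--system` options and the "write release notes" prompt are the ones mentioned in the conversation. The helper names `llm_command` and `release_notes_pipeline` and the version tags 1.0 and 1.1 are hypothetical, and whether a model alias such as claude is available depends on which plugins are installed.

```python
import shlex


def llm_command(system=None, model=None, prompt=None):
    """Build the argument list for one `llm` invocation (illustrative helper)."""
    argv = ["llm"]
    if model:
        argv += ["-m", model]          # pick a non-default model
    if system:
        argv += ["--system", system]   # system prompt: what to do with the input
    if prompt:
        argv.append(prompt)            # prompt passed directly as an argument
    return argv


def release_notes_pipeline(old_ref, new_ref, model=None):
    """Return a shell pipeline that feeds a git diff into `llm`."""
    diff = ["git", "diff", f"{old_ref}..{new_ref}"]
    summarize = llm_command(system="write release notes", model=model)
    # shlex.join quotes each argument so the result is safe to paste into a shell.
    return " | ".join(shlex.join(part) for part in (diff, summarize))


# Hypothetical version tags, for illustration only.
print(release_notes_pipeline("1.0", "1.1"))
# prints: git diff 1.0..1.1 | llm --system 'write release notes'
```

As noted in the conversation, the output is a starting point to review and revise, not something to publish as is.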
so yeah i i i recommend trying this thing out partly because it's fun to play with models
and something i'll say about the models you can run on your own laptop is they are kind of crap
like they are they are very very weak compared to gpt4 but that's a feature because it's easier
to build a mental model of how they work when you work with the weak ones like gpt4 because it's so
good you can use it for a few days without really seeing the weaknesses and the flaws in it because
it gets most things right but it's still just you know guessing what word should come next it's
doing the same kind of thing the little ones will hallucinate wildly which is so useful for getting
a feeling for okay these things are not intelligences these things are dumb autocomplete
that's just been scaled up to be able to cope with lots of things um i love i use myself as a test
thing because i've been around on the internet for long enough that these things can answer
questions about me like i can ask for a bio and some of the models will get most of the details
right they might say i went to a different university or whatever and some of them will
just hallucinate wildly and so i've had models tell me that i i co-founded github and things
like that um and it's amusing but it's also quite good as a sort of like just an initial sniff test
to see, okay, how good is this model
when it comes to hallucination and that kind of thing?
Okay, super.
We're coming up on time a little bit.
I wanted to add one positive note,
which I've heard about, you know,
we mentioned that these tools could further
increase the economic divide,
but they are democratizing a lot of things,
like in unexpected ways, at least to me.
Like, for example, someone I know
is an admissions director at UC Berkeley,
and a friend asked that person,
hey, you know, what is it now? What is it like with these college essays now that
ChatGPT exists? And he said, it's actually great because it's an equalizer because rich kids have
had private essay tutors for forever. And now everyone has, you know, 80%, 90% of it. I mean,
it probably makes them all sound kind of the same anyways, but it's a tool that people who don't
have these external resources, if they know how to use it, can, you know, up the, you know,
It's just like Grammarly and all these tools to help increase the writing.
And so I was pleased with that because I think it's very easy to get a little
doom and gloomy about it.
But it is for almost no money bringing these resources to so many people who didn't have
them before.
Oh, I couldn't agree more.
I feel like we always get very hung up on the many ethical flaws of this technology
and the harmful ways that it can be used.
The positive ways they can be used are just enormous.
Like the reason I'm spending so much time with this tech is that I do believe that it's
genuinely useful and it does genuinely provide enormous amounts of value to enormous numbers
of people um if you have english as a second language this tool is phenomenal right you can
now you're no longer cut out of those things in your life those parts of society where you need
to be able to write like somebody who's a native speaker who's at a certain level of education
and that has been completely flattened like people sometimes say oh it's not worth
learning to program anymore because chatgpt will just do it all i think that's complete
rubbish i think now is the best time it's ever been to program because anyone who's coached
somebody learning to program has seen that the first six months are just utterly horrific like
it's it's so frustrating because you try something and you get this obscure error message that
doesn't make sense to you and you can bang your head against it for two hours and maybe you give
up lots of people do give up they assume that they're not smart enough to learn to program
And it wasn't that they weren't smart enough. It's that they weren't patient enough. Nobody warned them how tedious and stupid this stuff is. And now we can give them a tool. We can say, look, if you get an error message, paste it into chat GPT, and nine times out of 10, it will tell you what to do next and how to get out of that condition. That's phenomenal, right?
the the flattening of that learning curve getting more people my my ideal end point of all of this
is i think every human being deserves the right to have computers automate stuff for them like
i can do this right i've got 20 years of programming experience if there's anything
tedious in my life that the computer can automate i can get it to automate that thing
but it's ridiculous that you need 20 years of experience to do that like that should be a
universal human ability, and I think this technology might get us there.
I feel like if we get to a point where people are able to get those tedious automated things just done for them
because they didn't have to learn to program first, that feels enormously valuable to me.
Yeah, absolutely. I'm just lighting up as you spell out that scenario.
One of the parallels to that is why hasn't technology pervaded more deeply throughout the clerical world, for instance?
It's like people are still doing with paper or sort of manually repeating a task on a computer.
Yes, it's in a spreadsheet, but it's not automated.
Why not? Because they never picked up those programming skills.
But all of a sudden, if these assistants are built into Excel or built into Word or built into the software they're using, it can be automated easily.
I heard a horrifying story the other day about a local fire chief, like the guy running a fire department who, due to some mess up, had to manually unsubscribe 2000 people from a mailing list.
And he spent a full day clicking the unsubscribe button over and over again in some horrible piece.
This is somebody who has a very real, very important job to do.
And I think this pattern plays out a lot.
A lot of people with a lot of important things in life
end up stuck for a day doing something tedious and manual
because we haven't given them the tooling
that lets them not have to do that.
So yeah, so I'm really excited about that.
I think as an educational assistant, it's amazing.
I think one thing that isn't necessarily talked about enough
is these things are actually very difficult to use effectively.
And they feel like they should be easy
because it's just a chat bot.
But actually, to really get the best out of them, you have to understand the prompting techniques.
You have to know what it can do, what it can't do, what are the things that it's going to break on.
I love that we've created computers that are bad at maths and can't look up facts for you, which are the two things that computers have always been best at.
So people sit down and check, well, it got maths wrong and I asked it for a fact and it couldn't tell me the answer because that's not what it's for.
But that's really not obvious, you know?
But it can now, right?
Isn't that the whole thing with Sam Altman?
One of the things is it can do math now, allegedly, the new version?
I mean, well, it can if you give it tools.
So ChatGPT, the paid version now has access to Bing search, so it can look up facts, and it has access to Code Interpreter, so it can run mathematics using Python, which on the one hand, it does fill those giant gaps.
On the other hand, it makes it even harder to use because now you have to know what Bing search is.
You have to understand bits of Python.
You have to know this is the kind of thing where it's got vision support so it can read documents.
But the interaction of all of these features is incredibly complicated.
A great example is sometimes I will give it like a photograph of a receipt and ask it to add up the numbers in the receipt.
And it will then write Python code that imports Tesseract and uses Tesseract OCR to pull out the numbers and then it will try and add them up.
But of course, Tesseract isn't as good as GPT Vision, right?
If it had taken that image and used its built-in OCR to pull the numbers out
and then passed to Python, I'd have got a more reliable result.
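For illustration, the "pass the numbers to Python" step might look something like the sketch below. It assumes the text of the receipt has already been read out by whichever OCR route, and it only does the arithmetic. The sample receipt, the pattern for spotting prices, and the function name `sum_receipt` are all invented for this example and are not what ChatGPT actually generates.

```python
import re
from decimal import Decimal

# Assumption: each item line ends with a price that has two decimal places.
AMOUNT = re.compile(r"(\d+\.\d{2})\s*$")


def sum_receipt(text):
    """Add up the trailing price on each line of already-extracted receipt text."""
    total = Decimal("0")
    for line in text.splitlines():
        match = AMOUNT.search(line)
        if match:
            # Decimal avoids the rounding surprises of binary floats with money.
            total += Decimal(match.group(1))
    return total


# Invented sample text standing in for the OCR output.
sample = """Coffee 3.50
Bagel 2.25
Juice 4.00"""
print(sum_receipt(sample))  # prints: 9.75
```

The arithmetic is the easy part. The reliability of the result depends on how accurately the numbers were read off the image in the first place, which is the point being made about Tesseract versus the model's own vision.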
How the heck am I supposed to explain that to anyone?
Like, I have to know what Tesseract is.
Kind of.
But you see what I mean?
As they add more features, the matrix of complexity of how the features interact
gets even more complicated.
So having expert skills to use this stuff gets harder.
But I think you said this in a recent interview.
I mean, using a chatbot is one of the worst user interfaces.
It's like terminal, right?
We're not, like, no one's using terminal, right?
So it's, do you have any, and I guess the final question for me,
do you have any predictions on, you know,
the mouse equivalent of where this stuff goes?
Because we're not all going to be using chatbots forever, I don't think.
I certainly hope not.
Yeah, like, I mean, yeah, like you said,
the problem with chatbots is there's no discoverability.
Like, you've just got a blank box to start on.
And they might give you a couple of suggestions,
but it's a terrible user interface.
But that's the thing that's really exciting about the space right now is there is so much low hanging fruit.
Like you could sit down and just come up with an alternative UI for interacting with language models.
And right now, maybe you'll invent the thing that everyone will be using for the next six years, because we're so early in this process.
There is so much scope for innovation around how we use these, how we interact with them.
And that I find really exciting.
I love that now that people are beginning to understand this tech and what it
can do, like we need designers on this stuff.
We need user experience people.
We need, but it turns out machine learning nerds are the worst possible people
to actually make use of this technology.
Because they're thinking in terms of, you know, they're thinking in terms of,
okay, well, I've got to optimize my gradient descent or whatever.
You don't need to know what gradient descent is to innovate on top of language
models.
That's almost a distraction from what we're trying to,
what we can achieve with them.
Brilliant.
Brilliant.
Okay. So we are going over. I did want to ask you about Datasette. So we're kind of running short. So
can you give us the 30 second, what's new in Datasette? We've talked to you about it before,
but what's hot? The most exciting new feature in Datasette, I've been building this feature
called enrichments, where the idea is that you've got, say, a CSV file with 10,000 addresses in,
and you load that into Datasette and you want to see them on a map. So you need to geocode those
addresses. With enrichments, you can have a plugin that lets you select the address column and say,
geocode this, and it'll go and churn away against the geocoder of your choice and it'll populate
latitude and longitude columns next to it. But crucially, these things are all built as plugins,
so you can have a plugin that does geocoding. I've got a plugin that does just regular expression
extraction of things, and I've got a GPT plugin, so you can say: take this database table,
run this prompt against every single row, and then put the output of that prompt in this other column.
And there's one example of that. It can do the GPT vision thing. So I actually fed it a database table with 100 URLs to images and told it to write me descriptions of those images. And I got back three or four paragraphs per image describing what was in the image right there in my table.
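The core idea described here (run some transformation against every row and write the result into another column) can be sketched with Python's standard `sqlite3` module. This is only an illustration under stated assumptions, not the actual Datasette enrichments plugin API: `enrich_column`, `extract_zip`, and the `places` table are hypothetical names, and a regular expression stands in for the geocoder or GPT prompt.

```python
import re
import sqlite3


def enrich_column(conn, table, source_col, target_col, transform):
    """Apply transform to source_col of every row and store it in target_col."""
    # Identifiers are interpolated into SQL here, so only pass trusted names.
    existing = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
    if target_col not in existing:
        conn.execute(f'ALTER TABLE "{table}" ADD COLUMN "{target_col}" TEXT')
    rows = conn.execute(f'SELECT rowid, "{source_col}" FROM "{table}"').fetchall()
    for rowid, value in rows:
        conn.execute(
            f'UPDATE "{table}" SET "{target_col}" = ? WHERE rowid = ?',
            (transform(value), rowid),
        )
    conn.commit()


def extract_zip(address):
    """Hypothetical stand-in enrichment: pull a 5-digit ZIP code out of text."""
    match = re.search(r"\b\d{5}\b", address or "")
    return match.group(0) if match else None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE places (address TEXT)")
conn.executemany(
    "INSERT INTO places VALUES (?)",
    [("1 Main St, Springfield 62701",), ("No zip here",)],
)
enrich_column(conn, "places", "address", "zip", extract_zip)
print(conn.execute("SELECT address, zip FROM places").fetchall())
```

Swapping `extract_zip` for a function that calls a geocoder or a language model would follow the same shape: one input column, one function, one output column.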
And of course now I can search against that and do all of that sort of stuff. So I'm really excited
about that. It means that Datasette is evolving into more of a data cleanup and manipulation tool,
which is a departure. Originally it was about publishing and exploring data, but I realized that
the problem I most want to solve, especially around journalism, is if somebody gives you a hundred
thousand rows of data, what the heck are you supposed to do with that? Right? Especially if
it's slightly too big to put in Microsoft Excel,
but you can't afford to hire some programmers
to build you like a custom Django Postgres app
for this thing.
What do you do?
And if I can build plugin-based tools,
especially with Datasette Cloud now,
so I can host them for people,
where you can upload your CSV file,
click, click geocode,
wait a couple of minutes as the progress bar fills in.
Now it's all geocoded.
Now you can visualize it on a map.
That's really exciting.
And yeah, and the enrichments,
I tried to make it as easy as possible
to write additional enrichments as plugins.
So I'm hoping to see people building their own enrichments
for all sorts of other data transformations they might want to pull off.
I have to ask just one more question then.
If you're doing it on Datasette Cloud and I've got an enrichment
and I'm going to give you some software,
did you find a solution to how you can run my software in a sort of trusted way?
Well, at the moment, I can review your software
and make sure it doesn't have any fatal holes in it.
And then Datasette Cloud actually has a feature now
where I can basically say for this customer,
pip install this additional package.
So I do have that now.
And also, Datasette Cloud, I built it on top of Fly.io
precisely because they offer secure containers.
So with Datasette Cloud, every customer gets a separate container.
So if you somehow manage to screw up the security in your container,
it's isolated.
That's a problem for you.
It's not a problem for other customers in the system,
which felt really important to me.
Yeah, okay, good, good.
Because I know you've been noodling on that problem for quite a long time.
Yeah, I still want to be able to run WebAssembly server-side reliably for untrusted code.
That's like my ultimate goal.
Because, yeah, I want users to be able to say, here is some Python code, run this against all of my data to transform it without risk of them breaking things or whatever.
And it feels like we're almost there with WebAssembly.
And that would be amazing.
If I can take untrusted Python code and run it in a WebAssembly sandbox that's locked
down and that can't do network access and can't reach the file system, that would be
amazing.
Okay, super.
We could go for another hour, but thank you for taking the time to talk about all these
things.
We're going to have links to everything and, you know, Datasette, Datasette Cloud, Enrichments.
Those are, I think, the three big things that fans of yours should go take a closer look at
if they're not already familiar.
Cool.
Yeah, this has been really fun.
Yeah, I'll put together some links to this as well.
Okay. Thanks, Simon. Thanks so much for coming. That was really awesome and illuminating and
filled in so many questions around a really hot topic. So super.
Thank you everyone for listening. We're at DjangoChat.com and we'll see you next time. Bye-bye.
Bye-bye.