Simon Willison's Weblog

Author: Simon Willison

Simon Willison's Weblog Supports Webmention

Claude Opus 4.5, and why evaluating new LLMs is increasingly difficult

Anthropic released Claude Opus 4.5 this morning, which they call "the best model in the world for coding, agents, and computer use". This is their attempt to retake the crown for best coding model after significant challenges from OpenAI's GPT-5.1-Codex-Max and Google's Gemini 3...

sqlite-utils 3.39

I got a report of a bug in sqlite-utils concerning plugin installation: if you installed the package using uv tool install, further attempts to install plugins with sqlite-utils install X would fail, because uv doesn't bundle pip by default. I had the same ...

sqlite-utils 4.0a1 has several (minor) backwards incompatible changes

I released a new alpha version of sqlite-utils last night - the 128th release of that package since I started building it back in 2018. sqlite-utils is two things in one package: a Python library for conveniently creating and manipulating SQLite databases and a CLI tool for ...

"Good engineering management" is a fad

Will Larson argues that the technology industry's idea of what makes a good engineering manager changes over time based on industry realities. ZIRP hypergrowth has given way to a more cautious approach today, and expectations of m...

Agent design is still hard

Armin Ronacher presents a cornucopia of lessons learned from building agents over the past few months. There are several agent abstraction libraries available now (my own LLM library is edging into that territory with its tools feature), but Armin h...

Olmo 3 is a fully open LLM

Olmo is the LLM series from Ai2, the Allen Institute for AI. Unlike most open weight models, these are notable for including the full training data, training process and checkpoints along with those releases. The new Olmo 3 claims to be "the best fully open 32B-scale thinkin...

We should all be using dependency cooldowns

William Woodruff gives a name to a sensible strategy for managing dependencies while reducing the chances of a surprise supply chain attack: dependency cooldowns. Supply chain attacks happen when an attacker compromises a widely us...
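
The cooldown idea lends itself to a very small filter. Here's a minimal sketch; the 14-day window, the function name and the version-to-upload-date mapping are illustrative assumptions, not from Woodruff's post:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical policy: ignore any release younger than 14 days.
COOLDOWN = timedelta(days=14)

def eligible_versions(releases, now):
    """Return only versions that have survived the cooldown period.

    releases maps version string -> upload datetime; you might build
    this from a package index API, but that wiring is out of scope here.
    """
    return {v: t for v, t in releases.items() if now - t >= COOLDOWN}

now = datetime(2025, 11, 24, tzinfo=timezone.utc)
releases = {
    "1.0.0": datetime(2025, 10, 1, tzinfo=timezone.utc),   # 54 days old
    "1.0.1": datetime(2025, 11, 20, tzinfo=timezone.utc),  # 4 days old
}
print(sorted(eligible_versions(releases, now)))  # ['1.0.0']
```

A freshly compromised release would spend its most dangerous first days invisible to installs that apply a filter like this.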

Nano Banana Pro aka gemini-3-pro-image-preview is the best available image generation model

Hot on the heels of Tuesday's Gemini 3 Pro release, today it's Nano Banana Pro, also known as Gemini 3 Pro Image. I've had a few days of preview access and this is an astonishingly capable image generation model. As is often the case, the most useful low-level details can be...

Quoting Nicholas Carlini

Previously, when malware developers wanted to go and monetize their exploits, they would do exactly one thing: encrypt every file on a person's computer and request a ransom to decrypt the files. In the future I think this will change. LLMs allow attackers to instead proce...

Building more with GPT-5.1-Codex-Max

Hot on the heels of yesterday's Gemini 3 Pro release comes a new model from OpenAI called GPT-5.1-Codex-Max. (Remember when GPT-5 was meant to bring in a new era of less confusing model names? That didn't last!) It's currently only availa...

How I automate my Substack newsletter with content from my blog

I sent out my weekly-ish Substack newsletter this morning and took the opportunity to record a YouTube video demonstrating my process and describing the different components that make it work. There's a lot of digital duct tape involved, taking the content from Django+Heroku...

Quoting Matthew Prince

Cloudflare's network began experiencing significant failures to deliver core network traffic [...] triggered by a change to one of our database systems' permissions which caused the database to output multiple entries into a “feature file” used by our Bot Management system. That feature file, in turn, doubled in size. The larger-than-expected feature file was then propagated to all the machines that make up our network. [...] The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail.

Matthew Prince, Cloudflare outage on November 18, 2025

Tags: scaling, postmortem, cloudflare
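
The failure mode Prince describes is easy to reproduce in miniature. This is an illustrative sketch only; the limit, the data shapes and the names are invented, not Cloudflare's actual code. A consumer with a hard-coded size limit treats an oversized input as fatal, so a file that doubles takes the process down:

```python
# Invented cap standing in for the Bot Management software's limit.
MAX_FEATURES = 200

def load_feature_file(entries):
    # The consumer assumed the file could never exceed the cap, so an
    # oversized file is treated as a fatal error rather than truncated.
    if len(entries) > MAX_FEATURES:
        raise RuntimeError(
            f"feature file has {len(entries)} entries, limit is {MAX_FEATURES}"
        )
    return entries

normal = [f"feature-{i}" for i in range(150)]
doubled = normal + normal  # duplicate rows from the permissions change

load_feature_file(normal)  # fine: 150 <= 200
try:
    load_feature_file(doubled)  # 300 > 200: the software fails
except RuntimeError as exc:
    print(exc)
```

The sketch also shows why propagating the file made things worse: every machine running the same check fails the same way on the same input.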

llm-gemini 0.27

New release of my LLM plugin for Google's Gemini models:
  • Support for nested schemas in Pydantic, thanks Bill Pugh. #107
  • Now tests against Python 3.14.
  • Support for YouTube URLs as attachments and the media_resolution option. Thanks, Duane Milne. #112
  • New model: gemini-3-pro-preview. #113
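
To illustrate what the nested-schema support covers, here's a minimal sketch of a Pydantic model that contains another model; the model names are invented for this example, and passing the class to the plugin follows llm's usual schema support:

```python
from pydantic import BaseModel

class Dimensions(BaseModel):
    width: int
    height: int

class Image(BaseModel):
    title: str
    dimensions: Dimensions  # a nested model, the case #107 addresses

# Pydantic emits the nested definition under $defs in the JSON schema
# that gets handed on to the Gemini API.
schema = Image.model_json_schema()
print(sorted(schema["$defs"]))  # ['Dimensions']
```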

The YouTube URL feature is particularly neat, taking advantage of this API feature. I used it against the Google Antigravity launch video:

llm -m gemini-3-pro-preview \
 -a 'https://www.youtube.com/watch?v=nTOVIGsqCuY' \
 'Summary, with detailed notes about what this thing is and how it differs from regular VS Code, then a complete detailed transcript with timestamps'

Here's the result. A spot-check of the timestamps against points in the video shows them to be exactly right.

Tags: projects, youtube, ai, generative-ai, llms, llm, gemini

MacWhisper has Automatic Speaker Recognition now

Inspired by this conversation on Hacker News I decided to upgrade MacWhisper to try out NVIDIA Parakeet and the new Automatic Speaker Recognition feature. It appears to work really well! Here's the result against this 39.7MB m4a file from my Gemini 3 Pro write-up this morning.

You can export the transcript with both timestamps and speaker names using the Share -> Segments -> .json menu item. Here's the resulting JSON.

Tags: whisper, nvidia, ai, speech-to-text, macwhisper

Google Antigravity

Google's other major release today, accompanying Gemini 3 Pro. At first glance Antigravity is yet another VS Code fork / Cursor clone: it's a desktop application you install that then signs in to your Google account and provides an IDE for agentic coding aga...

Quoting Ethan Mollick

Three years ago, we were impressed that a machine could write a poem about otters. Less than 1,000 days later, I am debating statistical methodology with an agent that built its own research environment. The era of the chatbot is turning into the era of the digital coworker. To be very clear, Gemini 3 isn’t perfect, and it still needs a manager who can guide and check it. But it suggests that “human in the loop” is evolving from “human who fixes AI mistakes” to “human who directs AI work.” And that may be the biggest change since the release of ChatGPT.

Ethan Mollick, Three Years from GPT-3 to Gemini 3

Tags: gemini, ethan-mollick, generative-ai, chatgpt, ai, llms, ai-agents

Trying out Gemini 3 Pro with audio transcription and a new pelican benchmark

Google released Gemini 3 Pro today. Here's the announcement from Sundar Pichai, Demis Hassabis, and Koray Kavukcuoglu, their developer blog announcement from Logan Kilpatrick, the Gemini 3 Pro Model Card, and their collection of 11 more articles. It's a big release! I had a ...

The fate of “small” open source

Nolan Lawson asks if LLM assistance means that the category of tiny open source libraries like his own blob-util is destined to fade away. Why take on additional supply chain risk by adding another dependency when an LLM can likely kick out the ...

Quoting Andrej Karpathy

With AI now, we are able to write new programs that we could never hope to write by hand before. We do it by specifying objectives (e.g. classification accuracy, reward functions), and we search the program space via gradient descent to find neural networks that work well a...

llm-anthropic 0.22

New release of my llm-anthropic plugin:

The plugin previously powered LLM schemas using this tool-call based workaround. That code is still used for Anthropic's older models.

I also figured out uv recipes for running the plugin's test suite in an isolated environment, which are now baked into the new Justfile.

Tags: projects, python, ai, generative-ai, llms, llm, anthropic, claude, uv

parakeet-mlx

Neat MLX project by Senstella bringing NVIDIA's Parakeet ASR (Automatic Speech Recognition, like Whisper) model to Apple's MLX framework.

It's packaged as a Python CLI tool, so you can run it like this:

uvx parakeet-mlx default_tc.mp3

The first time I ran this it downloaded a 2.5GB model file.

Once that was fetched it took 53 seconds to transcribe a 65MB 1hr 1m 28s podcast episode (this one) and produced this default_tc.srt file with a timestamped transcript of the audio I fed into it. The quality appears to be very high.
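
For scale, that transcription speed works out to roughly 70x faster than real time:

```python
audio_seconds = 1 * 3600 + 1 * 60 + 28   # the 1h 1m 28s episode
transcribe_seconds = 53
print(round(audio_seconds / transcribe_seconds, 1))  # 69.6
```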

Tags: python, ai, nvidia, uv, mlx, speech-to-text

GPT-5.1 Instant and GPT-5.1 Thinking System Card Addendum

I was confused about whether the new "adaptive thinking" feature of GPT-5.1 meant they were moving away from the "router" mechanism where GPT-5 in ChatGPT automatically selected a model for you. This page addresses th...

Introducing GPT-5.1 for developers

OpenAI announced GPT-5.1 yesterday, calling it "A smarter, more conversational ChatGPT". Today they've added it to their API. We actually got four new models today: gpt-5.1, gpt-5.1-chat-latest, gpt-5.1-codex and gpt-5.1-codex-mini. There are a lo...

Datasette 1.0a22

New Datasette 1.0 alpha, adding some small features we needed to properly integrate the new permissions system with Datasette Cloud:

Plus a developer experience improvement for plugin authors:

Tags: projects, datasette, datasette-cloud, annotated-release-notes

Nano Banana can be prompt engineered for extremely nuanced AI image generation

Max Woolf provides an exceptional deep dive into Google's Nano Banana, aka the Gemini 2.5 Flash Image model, still the best available image manipulation LLM tool three months after its initial release...

Quoting Nov 12th letter from OpenAI to Judge Ona T. Wang

On Monday, this Court entered an order requiring OpenAI to hand over to the New York Times and its co-plaintiffs 20 million ChatGPT user conversations [...] OpenAI is unaware of any court ordering wholesale production of personal information at this scale. This sets a dange...

What happens if AI labs train for pelicans riding bicycles?

Almost every time I share a new example of an SVG of a pelican riding a bicycle a variant of this question pops up: how do you know the labs aren't training for your benchmark? The strongest argument is that they would get caught. If a model finally comes out that produces a...

Quoting Steve Krouse

The fact that MCP is a different surface from your normal API allows you to ship MUCH faster to MCP. This has been unlocked by inference at runtime.

Normal APIs are promises to developers, because developers commit code that relies on those APIs, and then walk away. If you break the API, you break the promise, and you break that code. This means a developer gets woken up at 2am to fix the code.

But MCP servers are called by LLMs which dynamically read the spec every time, which allows us to constantly change the MCP server. It doesn't matter! We haven't made any promises. The LLM can figure it out afresh every time.

Steve Krouse

Tags: model-context-protocol, generative-ai, steve-krouse, apis, ai, llms

Fun-reliable side-channels for cross-container communication

Here's a very clever hack for communicating between different processes running in different containers on the same machine. It's based on clever abuse of POSIX advisory locks, which allow a process to create and de...
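
To get a feel for the primitive involved, here's a minimal single-machine sketch in Python (not the article's cross-container setup, which additionally depends on both containers seeing the same underlying file): one process signals a 1 bit by holding an exclusive POSIX advisory lock, and another reads it by probing with a non-blocking lock attempt.

```python
import fcntl
import multiprocessing
import os
import tempfile

def hold_lock(path, acquired, release):
    # Sender: signal a 1 bit by holding an exclusive advisory lock.
    with open(path, "wb") as f:
        fcntl.lockf(f, fcntl.LOCK_EX)
        acquired.set()   # tell the receiver the bit is now "set"
        release.wait()   # keep holding the lock until told to stop

def probe_bit(path):
    # Receiver: a failed non-blocking lock attempt means some other
    # process holds the lock, i.e. the bit is 1.
    with open(path, "wb") as f:
        try:
            fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return 0  # lock released with the file on close
        except BlockingIOError:
            return 1

def demo():
    fd, path = tempfile.mkstemp()
    os.close(fd)
    acquired = multiprocessing.Event()
    release = multiprocessing.Event()
    sender = multiprocessing.Process(
        target=hold_lock, args=(path, acquired, release)
    )
    sender.start()
    acquired.wait()
    bits = [probe_bit(path)]       # 1 while the sender holds the lock
    release.set()
    sender.join()
    bits.append(probe_bit(path))   # 0 once the lock is released
    os.unlink(path)
    return bits

if __name__ == "__main__":
    print(demo())
```

Toggling the lock over time turns this one readable bit into a (slow, fun-reliable) communication channel.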

Agentic Pelican on a Bicycle

Robert Glaser took my pelican riding a bicycle benchmark and applied an agentic loop to it, seeing if vision models could draw a better pelican if they got the chance to render their SVG to an image and then try again until they were happy with t...