AI Modelling

An honest look at the brains behind the filter, and why I keep swapping them out.

Why does the model matter?

Every story you see on this site has been read by an AI before a human ever lays eyes on it. The AI's job is to decide whether the article is actually positive, actually about the UK, and actually news rather than tabloid filler dressed up as news. Get the model right and the front page is a steady drip of genuinely good stuff. Get it wrong and the front page is full of lottery numbers, weather forecasts, and celebrities posting holiday snaps on Instagram.

This page is the running diary of which models I have tried, what they were like, and why I moved on. It is not marketing copy. If a model was rubbish, I will say so.

Round one: Google Gemini

The first version of the site used Google's Gemini. It was a sensible starting point: cheap, fast, well-documented, and honestly pretty decent at understanding tone. The judgement calls were mostly reasonable, the rewrites were clean, and it could be talked into following a JSON schema without too much sulking.

The catch was the free tier limits and the way Google's API kept moving the goalposts on quotas and pricing. For a hobby project running on a £3-a-month VPS, predictability matters more than peak quality. I started looking elsewhere.

Round two: Llama 4 Scout (and why it was shite)

Next up was Meta's Llama 4 Scout, served via Groq. On paper it ticked every box: generous free tier, blazingly fast inference, big context window, and a reputation for solid instruction-following. In practice it was the AI equivalent of a slightly hungover sub-editor who has stopped reading past the headline.

Scout had a fatal weakness for surface signals. If a story mentioned a UK place name and contained a vaguely positive word, it would happily wave it through with a score of 8 or 9. Lottery results were getting through. Estate agent listings dressed up as "inside this stunning 200-year-old home" stories were getting through. Celebrity Instagram updates were getting scored a 9 out of 10. A piece about the BBC introducing new guidelines after a racial slur incident at the BAFTAs sailed past the filter, because the *response* sounded positive even though the *cause* very much was not.

I tightened the prompt, added rules, gave it examples. Scout would obey for a day, then quietly drift back to its old habits. After three days of staring at the published list and quietly muttering, I pulled the plug.

Round three: Qwen3-32B

Next was Alibaba's Qwen3-32B, also served via Groq. It is a reasoning model, which means it is built to actually weigh the criteria you give it instead of pattern-matching the vibes. Benchmarks aside, my hope was a simple one: that it would hold the rubric in mind for longer than three sentences.

For a few days it did. Lottery results stopped getting through. Property porn calmed down. The queue moved more slowly because Qwen's free tier is stingy on tokens-per-minute, but that was a fair trade for a sober second opinion on each article.

The cracks showed when I started reading the front page properly. Qwen kept rubber-stamping a particular shape of story: compensation and refund pieces ("UK drivers could be owed £800"), corporate retail puff ("Asda completes revamp of long-vacant Folkestone building"), and bog-standard gig reviews ("Midge Ure took fans on a career-spanning journey"). All scored seven or eight out of ten. None of them were actually good news. The first is claims-management bait. The second exists because the building was an eyesore for years. The third is just a gig review. Qwen could not see it.

Round four: Gemini Flash Lite, with a Qwen safety net

So I went back to Google, but this time on the free tier and with the cheaper, smaller models. Article scoring now runs on Gemini 2.5 Flash Lite. The clean rewrite step (the one that produces the title and summary you see on the site) runs on Gemini 3.1 Flash Lite. They sit in separate quota buckets, so they do not fight each other for daily allowance.

Qwen has not been retired. It is still wired in, but as a fallback. If Gemini errors, returns garbled JSON, or burns through the day's free quota, the request quietly retries against Qwen and the pipeline keeps moving. Two models, one prompt, one consistent rubric, and no single point of failure.
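A minimal sketch of that fallback chain, for illustration only. The names `ModelError`, `parse_verdict` and `score_with_fallback`, the `score`/`reason` fields and the 0 to 10 scale are all assumptions, not the site's actual code, and the two stub functions stand in for the real Gemini and Groq calls.

```python
import json
from typing import Callable


class ModelError(Exception):
    """Hypothetical error type: API failure, exhausted quota, or unusable output."""


def parse_verdict(raw: str) -> dict:
    """Parse a model reply and reject anything that is not a usable verdict."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ModelError(f"garbled JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ModelError("verdict is not a JSON object")
    score = data.get("score")
    # Assumed schema: integer score from 0 to 10 plus a free-text reason.
    if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 10:
        raise ModelError(f"bad score: {score!r}")
    if not isinstance(data.get("reason"), str):
        raise ModelError("missing reason")
    return data


def score_with_fallback(prompt: str, models: list[tuple[str, Callable[[str], str]]]) -> dict:
    """Send the same prompt to each model in order; return the first usable verdict."""
    errors = []
    for name, call in models:
        try:
            verdict = parse_verdict(call(prompt))
        except ModelError as exc:
            errors.append(f"{name}: {exc}")
            continue  # fall through to the next model
        verdict["model"] = name
        return verdict
    raise ModelError("all models failed: " + "; ".join(errors))


# Stubs standing in for the real API clients.
def gemini_stub(prompt: str) -> str:
    raise ModelError("daily quota exhausted")


def qwen_stub(prompt: str) -> str:
    return '{"score": 3, "reason": "gig review"}'


verdict = score_with_fallback("rubric + article", [("gemini", gemini_stub), ("qwen", qwen_stub)])
```

Treating garbled JSON as just another failure keeps the retry logic in one place: the caller never needs to know which of the three problems sent the request to the second model.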

The prompt has had another tightening. I have spelled out the patterns Qwen kept fluffing: compensation and mis-selling stories, gig reviews, retail openings and revamps, and "could", "may", "set to" speculation about things that have not actually happened yet. All explicitly capped below the publishing threshold. I also added a carve-out so charity fundraisers do not get downgraded under the negative-hook rule, because charity exists in response to need by definition and that is a feature, not a bug.
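These rules live in the prompt, not in code. Purely to make the logic concrete, here is roughly what they would look like as a deterministic check. The category labels, the threshold value of 7, and the idea that the negative-hook rule works as a cap are all assumptions made for the sketch.

```python
PUBLISH_THRESHOLD = 7  # assumed value; the real threshold is not stated

# Hypothetical labels for the story shapes listed above.
CAPPED_CATEGORIES = {
    "compensation_or_misselling",
    "gig_review",
    "retail_opening_or_revamp",
    "speculation",  # "could", "may", "set to" stories
}


def apply_caps(score: int, category: str) -> int:
    """Hold capped categories just below the publishing threshold."""
    if category in CAPPED_CATEGORIES:
        return min(score, PUBLISH_THRESHOLD - 1)
    return score


def apply_negative_hook(score: int, has_negative_hook: bool, is_charity_fundraiser: bool) -> int:
    """Downgrade stories with a negative hook, except charity fundraisers."""
    if has_negative_hook and not is_charity_fundraiser:
        return min(score, PUBLISH_THRESHOLD - 1)
    return score
```

Under these assumptions a gig review scored 9 comes out at 6 and stays unpublished, while a charity fundraiser with a sad backstory keeps whatever score it earned.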

The one option that did not make the cut: I tried to use Google's Gemma 4 31B for the scoring step, on the strength of its generous daily quota. It thinks. A lot. So much chain-of-thought that most articles timed out before it produced an answer, and the ones it finished often misapplied the rules in spectacular fashion. I binned it within an afternoon. The dashboard shows several Gemma 3 models with even better quotas, but Google does not actually expose them on the API, which is a separate kind of frustrating.

The slow second-guess: re-evaluating published stories

Getting a story past the filter the first time is not the end of the conversation. Once a minute, a cron job picks one already-published story from the last thirty days and hands it back to Gemini with the current version of the rubric it was judged against at publish time. The model does not know its previous answer. It just reads the cleaned-up title and summary and scores it cold.

If the new score lands below my review threshold, the story is quietly demoted back to the pending queue with a reason attached, ready for me to take another look. If the score holds up, I just note the second opinion against the story and move on. Either way the story's last-reviewed timestamp is updated, so the same story does not get pestered twice in a row and the never-reviewed ones go to the front of the queue.
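The selection and demotion steps can be sketched as below. This is an illustration under assumptions, not the real cron job: the `Story` fields, the `REVIEW_THRESHOLD` value and the `score_fn` callback are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Optional

REVIEW_THRESHOLD = 6  # assumed value; the real threshold is not stated


@dataclass
class Story:
    id: int
    title: str
    summary: str
    published_at: datetime
    status: str = "published"
    last_reviewed_at: Optional[datetime] = None
    review_score: Optional[int] = None
    demotion_reason: Optional[str] = None


def pick_next(stories: list[Story], now: datetime) -> Optional[Story]:
    """Pick one published story from the last thirty days, never-reviewed first."""
    cutoff = now - timedelta(days=30)
    candidates = [s for s in stories if s.status == "published" and s.published_at >= cutoff]
    if not candidates:
        return None
    # datetime.min sorts before any real timestamp, so never-reviewed stories
    # come first, followed by the least recently reviewed.
    return min(candidates, key=lambda s: s.last_reviewed_at or datetime.min)


def re_evaluate(story: Story, score_fn: Callable[[str, str], dict], now: datetime) -> None:
    """Score the story cold and demote it if it falls below the review threshold."""
    verdict = score_fn(story.title, story.summary)  # no previous score is passed in
    story.review_score = verdict["score"]
    story.last_reviewed_at = now  # updated whether or not the story survives
    if verdict["score"] < REVIEW_THRESHOLD:
        story.status = "pending"
        story.demotion_reason = verdict["reason"]
```

Sorting on the last-reviewed timestamp is what gives both behaviours described above: a story that has just been checked drops to the back, and anything never checked jumps the queue.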

The point is not to catch the model lying to me, although it occasionally is. The point is drift. Prompts get tightened, the model itself gets silently updated on Google's end, and my own sense of what counts as "good news" sharpens the more front pages I read. A story that scraped through three weeks ago deserves to be asked the same question again today, by the same brain holding a slightly stricter rubric. If it still passes, brilliant. If it does not, it was probably borderline to begin with and I am happy to take another look.

This bit is working better than I dared hope. In the last twenty-four hours alone the re-evaluator has quietly cleaved more than a hundred posts off the published list. Puff pieces, social-media clickety shite, "celebrity posts a photo and the internet reacts" non-stories, the lot. Asked cold whether each one was actual news and whether it actually deserved to be an article, the model said no, and the story slid back to pending for me to look at properly. The published list is genuinely shrinking as the cron ticks, and what is left feels a lot more like news and a lot less like filler.

There is an Auto Editor page in the admin that shows what has been demoted, what survived, and how far the scores moved. It is mostly there so I can spot when a particular *kind* of story keeps getting downgraded, which is usually a hint that the next prompt tightening is already writing itself.
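Spotting a *kind* of story that keeps getting downgraded amounts to a tally. A small sketch, assuming each review record carries a hypothetical `kind` label alongside the old and new scores (the real admin page may store this quite differently):

```python
from collections import Counter, defaultdict


def summarise_reviews(reviews: list[dict]) -> list[tuple[str, int, float]]:
    """Return (kind, demotion count, average score drop), most-demoted kind first."""
    demoted = Counter()
    drops = defaultdict(list)
    for r in reviews:
        drops[r["kind"]].append(r["old_score"] - r["new_score"])
        if r["demoted"]:
            demoted[r["kind"]] += 1
    return [
        (kind, count, sum(drops[kind]) / len(drops[kind]))
        for kind, count in demoted.most_common()
    ]
```

Whatever sits at the top of that list is the obvious candidate for the next rule in the prompt.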

How I judge whether a model is working

The honest answer is: by reading the front page. Every day, I scroll through what got published and ask one question: would I tell a friend about this? If the answer is "no, this is filler", that story becomes evidence. The patterns of duff that slip through are the patterns I add to the filter prompt next.

I also keep an eye on the rejection log. A good model rejects with a clear reason. A bad model rejects good stories or accepts bad ones, and the reasons tend to be vague. Reading rejections is genuinely the fastest way to spot whether a model has the right instincts.

This is going to keep changing

Automating "good news" is not a problem you solve once. New models come out every few months, free tiers move around, and the kinds of clickbait that publishers invent get more sophisticated. The goal is not to find a perfect model and freeze it. The goal is to keep iterating until the front page reliably reflects the country at its best, and then keep iterating when the front page slips.

If you spot something on the site that should not be here, or notice the quality drifting, please do tell me. The filter only gets better when real readers point at real duff and say "this, this is the problem".

Spotted a story that should not have made it past the filter?
Every story page has a "report" link at the bottom. Use it. The reports go straight into the queue I review when tuning the next round of prompt and model changes.
