AI Modelling
An honest look at the brains behind the filter, and why I keep swapping them out.
Why does the model matter?
Every story you see on this site has been read by an AI before a human ever lays eyes on it. The AI's job is to decide whether the article is actually positive, actually about the UK, and actually news rather than tabloid filler dressed up as news. Get the model right and the front page is a steady drip of genuinely good stuff. Get it wrong and the front page is full of lottery numbers, weather forecasts, and celebrities posting holiday snaps on Instagram.
This page is the running diary of which models I have tried, what they were like, and why I moved on. It is not marketing copy. If a model was rubbish, I will say so.
Round one: Google Gemini
The first version of the site used Google's Gemini. It was a sensible starting point: cheap, fast, well-documented, and honestly pretty decent at understanding tone. The judgement calls were mostly reasonable, the rewrites were clean, and it could be talked into following a JSON schema without too much sulking.
The catch was the free-tier limits and the way Google's API kept moving the goalposts on quotas and pricing. For a hobby project running on a £3-a-month VPS, predictability matters more than peak quality. I started looking elsewhere.
Round two: Llama 4 Scout (and why it was shite)
Next up was Meta's Llama 4 Scout, served via Groq. On paper it ticked every box: generous free tier, blazingly fast inference, big context window, and a reputation for solid instruction-following. In practice it was the AI equivalent of a slightly hungover sub-editor who has stopped reading past the headline.
Scout had a fatal weakness for surface signals. If a story mentioned a UK place name and contained a vaguely positive word, it would happily wave it through with a score of 8 or 9. Lottery results were getting through. Estate agent listings dressed up as "inside this stunning 200-year-old home" stories were getting through. Celebrity Instagram updates were getting scored a 9 out of 10. A piece about the BBC introducing new guidelines after a racial slur incident at the BAFTAs sailed past the filter, because the *response* sounded positive even though the *cause* very much was not.
I tightened the prompt, added rules, gave it examples. Scout would obey for a day, then quietly drift back to its old habits. After three days of staring at the published list and quietly muttering, I pulled the plug.
Round three: Qwen3-32B
Next was Alibaba's Qwen3-32B, also served via Groq. It is a reasoning model, which means it is built to actually weigh the criteria you give it instead of pattern-matching the vibes. Benchmarks aside, my hope was a simple one: that it would hold the rubric in mind for longer than three sentences.
For a few days it did. Lottery results stopped getting through. Property porn calmed down. The queue moved more slowly because Qwen's free tier is stingy on tokens-per-minute, but that was a fair trade for a sober second opinion on each article.
The cracks showed when I started reading the front page properly. Qwen kept rubber-stamping a particular shape of story: compensation and refund pieces ("UK drivers could be owed £800"), corporate retail puff ("Asda completes revamp of long-vacant Folkestone building"), and bog-standard gig reviews ("Midge Ure took fans on a career-spanning journey"). All scored seven or eight out of ten. None of them were actually good news. The first is claims-management bait. The second exists because the building was an eyesore for years. The third is just a gig review. Qwen could not see it.
Round four: Gemini Flash Lite, with a Qwen safety net
So I went back to Google, but this time on the free tier and with the cheaper, smaller models. Article scoring now runs on Gemini 2.5 Flash Lite. The clean rewrite step (the one that produces the title and summary you see on the site) runs on Gemini 3.1 Flash Lite. They sit in separate quota buckets, so they do not fight each other for daily allowance.
Qwen has not been retired. It is still wired in, but as a fallback. If Gemini errors, returns garbled JSON, or burns through the day's free quota, the request quietly retries against Qwen and the pipeline keeps moving. Two models, one prompt, one consistent rubric, and no single point of failure.
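A minimal sketch of that fallback, not the site's actual code. It assumes each model is wrapped in a plain function that takes a prompt and returns raw text; the names score_with_fallback and QuotaExhausted are hypothetical, as is the "model" key added to the result.

```python
import json
from typing import Callable


class QuotaExhausted(Exception):
    """Hypothetical error a model wrapper raises when the day's free quota is gone."""


def score_with_fallback(
    prompt: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
) -> dict:
    """Send the same prompt to the primary model, retrying on the fallback if it fails."""
    try:
        result = json.loads(primary(prompt))
        if not isinstance(result, dict):
            raise ValueError("expected a JSON object")
        result["model"] = "primary"
        return result
    except (QuotaExhausted, ValueError, ConnectionError, TimeoutError):
        # json.JSONDecodeError is a subclass of ValueError, so garbled JSON
        # is caught here alongside quota and network failures.
        result = json.loads(fallback(prompt))
        result["model"] = "fallback"
        return result
```

Because both models receive the identical prompt string, the rubric stays consistent whichever one answers. If the fallback also fails, the exception propagates so the caller can leave the story in the queue.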
The prompt has had another tightening. I have spelled out the patterns Qwen kept fluffing: compensation and mis-selling stories, gig reviews, retail openings and revamps, and "could", "may", "set to" speculation about things that have not actually happened yet. All explicitly capped below the publishing threshold. I also added a carve-out so charity fundraisers do not get downgraded under the negative-hook rule, because charity exists in response to need by definition and that is a feature, not a bug.
The honest gap I did not cross: I tried to use Google's Gemma 4 31B for the scoring step, on the strength of its generous daily quota. It thinks. A lot. So much chain-of-thought that most articles timed out before it produced an answer, and the ones it finished often misapplied the rules in spectacular fashion. I binned it within an afternoon. The dashboard shows several Gemma 3 models with even better quotas, but Google does not actually expose them on the API, which is a separate kind of frustrating.
An own goal: rate-limiting myself into the red
A quick confession, because this page is meant to be honest. I opened Google's rate-limit dashboard the other day and found a cheerful red bar staring back at me. Gemini 2.5 Flash Lite, the model doing the scoring, had peaked at twelve requests a minute against a free-tier ceiling of ten. Red means you have gone over, and going over means Google starts handing back "429, slow down" errors instead of answers. Every one of those is a perfectly good story falling through to the Qwen safety net for no good reason.
The embarrassing part is how I got there. To avoid exactly this, the code throttles itself with its own rate limiter. I had set that limiter to twelve. The actual limit is ten. I had, in other words, carefully and deliberately held myself to a speed limit that was already over the speed limit. A proper own goal.
The fix came in layers. The code now enforces both requests-per-minute and rolling daily request caps before it calls Google, with a safety margin below the published free-tier limits. If a model returns a quota 429 anyway, it gets put on a local cooldown and the pipeline falls back to Qwen rather than repeatedly poking the same exhausted bucket.
The red bar is now a calm, untroubled green. No real magic, just arithmetic and the humility to admit I had set my own limit wrong, but it is a tidy reminder that on a free tier the constraints are half the engineering.
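The layered limiter described above could be sketched roughly like this. It is an illustration under assumptions, not the real code: the class name, the reserve sizes, the daily limit in the usage note, and the ten-minute cooldown are all made up. Only the ten-requests-a-minute ceiling comes from the story above.

```python
import time
from collections import deque

DAY_SECONDS = 24 * 60 * 60


class FreeTierLimiter:
    """Local throttle that stays a fixed number of requests below the published limits."""

    def __init__(self, rpm_limit, rpd_limit, rpm_reserve=2, rpd_reserve=50,
                 clock=time.monotonic):
        # The caps are deliberately lower than the published limits.
        self.rpm_cap = max(1, rpm_limit - rpm_reserve)
        self.rpd_cap = max(1, rpd_limit - rpd_reserve)
        self.clock = clock
        self.calls = deque()  # timestamps of calls made in the last 24 hours
        self.cooldown_until = float("-inf")

    def allow(self) -> bool:
        """Return True and record the call if it fits under both caps."""
        now = self.clock()
        # Forget calls that have aged out of the rolling 24-hour window.
        while self.calls and now - self.calls[0] >= DAY_SECONDS:
            self.calls.popleft()
        if now < self.cooldown_until:
            return False
        last_minute = sum(1 for t in self.calls if now - t < 60)
        if last_minute >= self.rpm_cap or len(self.calls) >= self.rpd_cap:
            return False
        self.calls.append(now)
        return True

    def note_quota_error(self, seconds=600):
        """Call this when the API returns a quota 429 despite the local caps."""
        self.cooldown_until = self.clock() + seconds
```

With FreeTierLimiter(rpm_limit=10, rpd_limit=1000), the ninth call inside a minute is refused locally, so the request can go to the fallback model without ever touching the exhausted bucket.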
The slow second-guess: re-evaluating published stories
Getting a story past the filter the first time is not the end of the conversation. A cron job periodically picks one already-published story from the last thirty days and hands it back to the AI with the same rubric I used at publish time. The model does not know its previous answer. It just reads the cleaned-up title and summary and scores it cold.
The point is not to catch the model lying to me, although it occasionally does. The point is drift. Prompts get tightened, the model itself gets silently updated on Google's end, and my own sense of what counts as "good news" sharpens the more front pages I read. A story that scraped through three weeks ago deserves to be asked the same question again today, by the same brain holding a slightly stricter rubric. If it still passes, brilliant. If it does not, it was probably borderline to begin with and I am happy to take another look.
This bit works better when it is paced sensibly. The cron runs every two hours, picks one story, and moves on. It still catches puff pieces, social-media clickety shite, "celebrity posts a photo and the internet reacts" non-stories, and the lot, but it does not try to burn through hundreds of model calls in a day just to tidy the archive faster.
There is an Auto Editor page in the admin that shows demotions, recent second opinions, and how far the scores moved. It is mostly there so I can spot when a particular kind of story keeps getting downgraded, which is usually a hint that the next prompt tightening is already writing itself.
Round five: teaching the model that nuance exists
The next problem was subtler, and therefore more annoying. The second-pass model started doing exactly what I had asked it to do, which is always where software becomes briefly unbearable. I had told it to be wary of "negative hooks" - stories where the headline sounds cheerful but the article only exists because something grim happened. Sensible rule. Necessary rule. Also, once applied with the emotional range of a parking meter, a deeply stupid rule.
It began marking genuinely decent stories as not-good-news because there was hardship somewhere in the background. A dog rescued by the RNLI after falling from a cliff? Negative hook. A man saved by CPR? Medical emergency, therefore bad. Curlew eggs rescued from a wildfire and successfully hatched? Wildfire mentioned, straight to jail. A new NHS treatment for advanced cancer? Cancer is sad, computer says no. At this point the model had stopped filtering good news and started acting like a Victorian undertaker with a JSON schema.
The fix was to teach it the difference between negative context and negative focus. Good news often starts with a problem. Rescues require danger. Medical breakthroughs require illness. Accessibility improvements require someone being excluded first. Conservation wins usually mean nature was in trouble. Charity exists because people need help. If the positive outcome is the actual news, the bad backdrop should not automatically sink it.
The guardrails now say that successful rescues, survival, recovery, lifesaving acts, medical breakthroughs, approved treatments, conservation rescues, accessibility improvements, worker benefits, community renewal, fundraising achievements, acts of kindness, historic firsts, records, and grassroots wins can all score highly. The filter still rejects unresolved scandals, complaints, sentencing, deaths, obituaries, empty PR, celebrity fluff, and speculative "could one day" pieces. The difference is that a good outcome is allowed to have a backstory. Shocking, I know.
I also took the axe away from the re-evaluator temporarily. It could still disagree, score a story lower, and tell me why - but it could not quietly shove a published story back into pending on its own. That turned out to be the right call while the dust settled, but not a permanent answer. If a story scores two out of ten twice in a row, it probably should not be on the front page regardless of my feelings about it.
The great demotion incident, and giving the axe back carefully
Before I had a chance to reinstate automatic demotion in a sensible form, it reinstated itself in a thoroughly insensible one. The re-evaluator, running under the old rules and without the guardrails I had been meaning to add, worked through the back-catalogue and demoted over three hundred published stories in a single run. Three hundred. Gone from the front page, shoved back to pending, and most of them by a model that had read a summary stripped of adjectives and warmth - the cleaned version the site actually shows - and concluded it was not good enough.
This is the thing about the re-evaluation pass that I had underweighted. The rewrite step is told to strip empty adjectives, avoid telling the reader how to feel, and let the facts carry the positivity. So it does. The summary becomes flat and factual. Then the scorer reads that flat, factual summary and decides the story lacks warmth. The model was, in a roundabout way, penalising stories for following my own style guide. A spectacular own goal, and this time I cannot blame the arithmetic.
I put the three hundred stories back and rebuilt the demotion logic with a few rules it cannot override. A story now has to score three or below on two consecutive re-evaluation passes before it gets demoted. One grumpy read just flags it in the admin and moves on. And no matter how productive the model is feeling, it cannot demote more than five stories in any rolling twenty-four hour window. If it tries to go beyond that, the extras get flagged for my review instead. The daily cap alone means a repeat of the three-hundred-story incident is arithmetically impossible.
The re-evaluator has its axe back. It is just a smaller axe, and there is a padlock on the cabinet.
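For the curious, the demotion rules above boil down to something like the following. This is a sketch, not the real code: the class name, the in-memory storage, and the return labels are assumptions, while the thresholds (three or below, twice in a row, five per rolling twenty-four hours) are the ones described above.

```python
from collections import deque

DEMOTE_AT_OR_BELOW = 3       # score that counts as a "grumpy read"
PASSES_REQUIRED = 2          # consecutive low scores before demotion
MAX_DEMOTIONS_PER_DAY = 5    # hard cap per rolling 24-hour window
DAY_SECONDS = 24 * 60 * 60


class DemotionGuard:
    def __init__(self):
        self.low_streak = {}      # story_id -> consecutive low scores so far
        self.demotions = deque()  # timestamps of demotions in the last 24 hours

    def record(self, story_id, score, now):
        """Return 'keep', 'flag', 'demote', or 'review' for one re-evaluation."""
        if score > DEMOTE_AT_OR_BELOW:
            self.low_streak.pop(story_id, None)  # a decent score resets the streak
            return "keep"
        streak = self.low_streak.get(story_id, 0) + 1
        self.low_streak[story_id] = streak
        if streak < PASSES_REQUIRED:
            return "flag"  # one low read only flags the story in the admin
        # Drop demotions that have aged out of the rolling window.
        while self.demotions and now - self.demotions[0] >= DAY_SECONDS:
            self.demotions.popleft()
        if len(self.demotions) >= MAX_DEMOTIONS_PER_DAY:
            return "review"  # over the cap: hand it to a human instead
        self.demotions.append(now)
        self.low_streak.pop(story_id, None)
        return "demote"
```

The cap is checked last and in code rather than in the prompt, so no amount of model enthusiasm can push more than five stories back to pending in a day.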
Region tagging, and what happens when the AI goes off-piste
Stories are tagged with a region when the article is specifically about a named place in the UK - a local charity, a county council decision, a community project in a particular town. The region slug gets stored against the story and shown on the story page as a location label. The list of valid slugs covers every county and unitary authority in England, Scotland, Wales, and Northern Ireland.
For a while this worked fine, except it did not. Once I looked closely, there were stories tagged as "liverpool" (not a valid slug - that should be "merseyside"), "north-east-england" (invented), "gateshead" (invented), and a handful of other places the AI had conjured from thin air rather than picking from the list it was given. Forty-six rows in total. The prompt said "use a slug from this list" but was not firm enough about it, and Qwen - the fallback model - appeared to be the main culprit. Qwen is a reasoning model that is generally good at following instructions, but its approach to region slugs seemed to be "I know what I mean, I'll just write it." Gemini was better behaved, but not blameless.
The original prompt also had the wrong default. It told the model to use "national" for UK-wide stories or when it could not pin a story to a specific county. That second clause is doing a lot of damage. Almost any story that happens in the UK has some geographic element, so the model was dutifully slapping "national" on everything it could not confidently locate elsewhere. Nearly two hundred stories ended up tagged as "national" that had no particular reason to carry a location label at all.
The fix was threefold. The prompt now says: only tag a story if geography is central to it, use "national" only for genuinely UK-wide stories like national statistics or government policy, and leave it empty if in doubt - an empty string is always better than a guess. The code now validates every slug the AI returns against a whitelist before it ever touches the database, and quietly nulls out anything that is not on the list. The re-evaluation cron also runs the same check on every story it visits, so any rogue slugs that crept in before the fix get cleaned up on the cron's normal thirty-day pass. The forty-six already in the database were cleared in one go.
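The whitelist check is the simplest part of the fix. A minimal sketch, assuming the valid slugs live in a set: the two slugs shown are the only ones named on this page, and the real list covers every county and unitary authority. The function name and the whitespace and case normalisation are assumptions too.

```python
from typing import Optional

# Illustrative subset only; the real list is much longer.
VALID_REGION_SLUGS = frozenset({"national", "merseyside"})


def clean_region_slug(raw) -> Optional[str]:
    """Return the slug if it is on the whitelist, otherwise None."""
    if not isinstance(raw, str):
        return None
    slug = raw.strip().lower()  # assumption: tolerate stray spaces and capitals
    return slug if slug in VALID_REGION_SLUGS else None
```

So clean_region_slug("liverpool") comes back as None and the story is stored with no region, which matches the "empty is better than a guess" rule the prompt now states.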
Rewriting the rewrite prompt
The re-evaluation incident also exposed something about the rewrite prompt itself. The headlines and summaries you read on the site are not the originals - they go through an editorial pass that rewrites them in a consistent voice. The prompt doing that job had accumulated rules over time but had one consistent weakness: models would summarise rather than frame. They would tell you what happened in two sentences and call it done. The "why it matters" part - the bit that turns a fact into a story worth sharing - kept getting left on the floor.
The prompt now says this explicitly: frame the story, do not just summarise it. It also includes a pair of before-and-after examples showing what the difference looks like in practice, because telling a language model what not to do is less effective than showing it the shape you actually want. The examples do more work than the instructions. Few-shot beats description almost every time.
The other fix was structural. The rewrite instructions had drifted into two separate copies in the codebase - one used at publish time, one used when a story is manually rewritten from the admin. They were supposed to be identical, and for a while they were, and then they were not. They are now a single function called from both places. One edit, one result, no quiet divergence.
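In outline, the single-source arrangement looks something like this. It is a sketch with hypothetical names (build_rewrite_prompt, rewrite_at_publish, rewrite_from_admin, call_model) and a heavily abbreviated rule list paraphrased from this page, not the real prompt.

```python
from typing import Callable

# Abbreviated, paraphrased rules; the real prompt is longer and includes examples.
REWRITE_RULES = (
    "Frame the story, do not just summarise it.",
    "Strip empty adjectives.",
    "Do not tell the reader how to feel; let the facts carry the positivity.",
)


def build_rewrite_prompt(title: str, body: str) -> str:
    """The one place the rewrite instructions are assembled."""
    rules = "\n".join(f"- {rule}" for rule in REWRITE_RULES)
    return (
        "Rewrite the headline and summary.\n"
        f"{rules}\n\nTitle: {title}\n\nArticle:\n{body}"
    )


def rewrite_at_publish(title: str, body: str, call_model: Callable[[str], str]) -> str:
    # Publish-time path: uses the shared builder.
    return call_model(build_rewrite_prompt(title, body))


def rewrite_from_admin(title: str, body: str, call_model: Callable[[str], str]) -> str:
    # Manual admin path: uses the same builder, so the two cannot drift apart.
    return call_model(build_rewrite_prompt(title, body))
```

Any edit to REWRITE_RULES now reaches both paths at once, which is the whole point.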
How I judge whether a model is working
The honest answer is: by reading the front page. Every day, I scroll through what got published and ask one question - would I tell a friend about this? If the answer is "no, this is filler", that story becomes evidence. The patterns of duff that slip through are the patterns I add to the filter prompt next.
I also keep an eye on the rejection log. A good model rejects with a clear reason. A bad model rejects good stories or accepts bad ones, and the reasons tend to be vague. Reading rejections is genuinely the fastest way to spot whether a model has the right instincts.
This is going to keep changing
Automating "good news" is not a problem you solve once. New models come out every few months, free tiers move around, and the kinds of clickbait that publishers invent get more sophisticated. The goal is not to find a perfect model and freeze it. The goal is to keep iterating until the front page reliably reflects the country at its best, and then keep iterating when the front page slips.
The economics matter too. For now the running cost is close to nothing: the AI lives entirely on free tiers, and the hosting is sponsored for the first six months by Thunder Lizard, a one-person UK personalised gift shop trying to make a go of it in a cut-throat industry. After that the VPS bill becomes my problem, which is a healthy incentive to keep those free tiers sweet and the request counts honest.
If you spot something on the site that should not be here, or notice the quality drifting, please do tell me. The filter only gets better when real readers point at real duff and say "this, this is the problem".
Every story page has a "report" link at the bottom. Use it. The reports go straight into the queue I review when tuning the next round of prompt and model changes.