One simple trick for breaking down AI guardrails

In an address to the Catholic Church yesterday, Pope Leo XIV said AI must be “disarmed” and “freed from logics that turn [it] into an instrument of domination, exclusion, and death.”

That sounds a little dramatic, but new research shows that Leo might have a point.

Driving the news: Recent tests conducted by the Financial Times and AI safety group Alice found that safety guardrails for major open-source AI models can be removed within minutes using readily available software. Models tested include Google’s Gemma 3 and Meta’s Llama 3.3.

The team was able to make the models discuss big no-nos like how to make code that can steal credit card info and how to disperse chlorine gas through a crowd.
The findings are similar to those from a Microsoft study earlier this year, which found a method that could break down the safeguards of 15 models with a single prompt.

How it works: The technique is called “abliteration.” The software feeds the model questions it will answer and questions it refuses. It then contrasts the model’s internal responses to the two sets and combs through the weights to figure out where the ‘I can’t answer that’ response lives. Finally, it edits the weights to remove that behaviour, and the guardrails with it.
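For the curious, the "contrast, locate, edit" idea can be sketched on made-up numbers. This is a toy under loud assumptions, not the software used in the tests: the "activations" are random vectors with a planted offset, the "weights" are a small random table, and every name here (fake_activation, remove_direction, planted) is hypothetical. No real model is involved.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_vector(rows):
    # Column-wise average of a list of equal-length vectors.
    return [sum(col) / len(rows) for col in zip(*rows)]

def unit(v):
    norm = dot(v, v) ** 0.5
    return [x / norm for x in v]

def remove_direction(rows, direction):
    # Subtract from each row its component along `direction` (a unit vector).
    return [[x - dot(row, direction) * d for x, d in zip(row, direction)]
            for row in rows]

random.seed(0)
DIM = 8
# Assumption: "refusing" shows up as a shift along one hidden axis.
planted = [1.0 if i == 2 else 0.0 for i in range(DIM)]

def fake_activation(refusing):
    # Hypothetical stand-in for a model's internal state on one prompt.
    shift = 3.0 if refusing else 0.0
    return [random.gauss(0, 1) + shift * p for p in planted]

answered = [fake_activation(False) for _ in range(50)]
refused = [fake_activation(True) for _ in range(50)]

# Contrast: the difference between the two averages points along the shift.
diff = [r - a for r, a in zip(mean_vector(refused), mean_vector(answered))]
direction = unit(diff)

# Edit: strip that direction out of a toy weight table.
weights = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(4)]
edited = remove_direction(weights, direction)

before = max(abs(dot(row, direction)) for row in weights)
after = max(abs(dot(row, direction)) for row in edited)
print(f"largest component along the direction: before={before:.3f}, after={after:.1e}")
```

The point of the toy is only that the edit is ordinary linear algebra: once a direction is identified, removing it from a table of numbers takes a few lines.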

This doesn’t really work on proprietary systems, whose weights are closely guarded secrets rather than files the public can download and edit.

Why it matters: These findings are especially concerning because they show you don’t need to be a super-hacker to upend these systems. No specialized hardware is required; no great technical know-how is needed; the only barrier to entry is a passing familiarity with widely used coding tools.—QH
