I extracted the safety filters from Apple Intelligence models
224 points | 3 hours ago | 17 comments | github.com
I managed to reverse engineer the encryption (referred to as “Obfuscation” in the framework) responsible for managing the safety filters of Apple Intelligence models. I have extracted them into a repository. I encourage you to take a look around.
trebligdivad
2 hours ago
[-]
Some of the combinations are a bit weird. This one has lots of stuff avoiding death... together with a set ensuring all the Apple brands have the correct capitalisation. Priorities, hey!

https://github.com/BlueFalconHD/apple_generative_model_safet...

reply
grues-dinner
2 hours ago
[-]
Interesting that it didn't seem to include "unalive".

Which as a phenomenon is very telling: no one actually cares what people are really saying. Everyone, including the platforms, knows what that means. It's all performative.

reply
qingcharles
2 hours ago
[-]
It's totally performative. There's no way to stay ahead of the new language that people create.

At what point do the new words become the actual words? Are there many instances of people using unalive IRL?

reply
Terr_
29 minutes ago
[-]
> There's no way to stay ahead of the new language that people create.

I'm imagining a new exploit: after someone says something totally innocent, people gang up in the comments to act like a terrible, vicious slur has been said, and then the moderation system (with an LLM involved somewhere) "learns" that an arbitrary term is heinous and indirectly bans any discussion of that topic.

reply
Waterluvian
9 minutes ago
[-]
Hey I was pro-skub waaaay before all the anti-skub people switched sides.
reply
cyanydeez
20 minutes ago
[-]
you mean become 4chan?
reply
apricot
32 minutes ago
[-]
> Are there many instances of people using unalive IRL

As a parent of a teenager, I see them use "unalive" non-ironically as a synonym for "suicide" in all contexts, including IRL.

reply
fouronnes3
2 hours ago
[-]
This question is sort of the same as asking why the universal translator wasn't able to translate the metaphor language of the Star Trek episode Darmok. Surely if the metaphor has become the first-order meaning then there's no literal meaning anymore.
reply
tjwebbnorfolk
25 minutes ago
[-]
The only reason kids started using "unalive" is to get around YouTube filters that disallow the use of the word "kill".
reply
qingcharles
2 hours ago
[-]
I guess, so far, the people inventing the words have left the meaning clear with things like "un-alive", which is readable even to someone coming across it for the first time.

Your point stands when we start replacing the banned words with things like "donkeyrhubarb" for "suicide", and then the walls really will fall.

reply
immibis
35 minutes ago
[-]
An English equivalent is "sewer slide".
reply
userbinator
1 hour ago
[-]
This form of obfuscation has actually already occurred over a century ago: https://en.wikipedia.org/wiki/Cockney_rhyming_slang
reply
t-3
19 minutes ago
[-]
Rhyming slang rhymes tho. The recipient can understand what's meant by de-obfuscating in-context. Random strings substituted for $proscribed_word don't work in the same way.
reply
waterproof
7 minutes ago
[-]
In Cockney rhyming slang, the rhyming word (which would be easy to reverse engineer) is omitted. So "stairs" is rhyme-paired with "apples and pears", and then people just use the word "apples" in place of "stairs". "Pears" is omitted in common use so you can't just reverse the rhyme.

The example photo on Wikipedia includes the rhyming words but that's not how it would be used IRL.

reply
BurningFrog
27 minutes ago
[-]
A specialized AI could do it as well as any human.

The future will be AIs all the way down...

reply
cheschire
1 hour ago
[-]
If only we had a way to mass process the words people write to each other, derive context from those words, and then identify new slang designed to bypass filters…
reply
freeone3000
2 hours ago
[-]
It depends on if you think that something is less real because it’s transmitted digitally.
reply
qingcharles
2 hours ago
[-]
No, I'm only thinking that we're not permitted in a lot of digital spaces to use the banned words (e.g. suicide), but IRL doesn't generally have those limits. Is there a point where we use the censored word so much that it spills over into the real world?
reply
eastbound
30 minutes ago
[-]
People use “lol” IRL, as well as “IRL” and “aps” in French (a misspelling of “pas”), but it's just slang; “unalive” has potential to make it into the news, where anchors don't want to use curse words.
reply
immibis
35 minutes ago
[-]
Is this not essentially the same effect as saying "lol" out loud?
reply
elliotto
20 minutes ago
[-]
Unalive and other self-censors were adopted by young people because the TikTok algorithm would deprioritize videos that included specific words. Then it made its way into the culture. It has nothing to do with being performative.
reply
Zak
1 hour ago
[-]
I'm surprised there hasn't been a bigger backlash against platforms that apply censorship of that sort.
reply
hulium
1 hour ago
[-]
Seems more like it's meant to stop the AI from e.g. summarizing news and emails about death, not to act as a chat filter.
reply
cyanydeez
20 minutes ago
[-]
yo, these are businesses. It's not performative, it's CYA.

They care because of legal reasons, not moral or ethical.

reply
martin-t
1 hour ago
[-]
No-one cares yet.

There's a very scary potential future in which mega-corporations start actually censoring topics they don't like. For all I know the Chinese government is already doing it, there's no reason the British or US one won't follow suit and mandate such censorship. To protect children / defend against terrorists / fight drugs / stop the spread of misinformation, of course.

reply
lazide
16 minutes ago
[-]
They already clearly do on a number of topics?
reply
andy99
2 hours ago
[-]
> Apple brands have the correct capitalisation. Priorities hey!

To me that's really embarrassing and insecure. But I'm sure for branding people it's very important.

reply
WillAdams
2 hours ago
[-]
Legal requirement to maintain a trademark.
reply
grues-dinner
2 hours ago
[-]
In what way would (A|a)pple's own AI writing "imac" endanger the trademark? Is capitalisation even part of a word-based trademark?

I'm more surprised they don't have a rule to do that rather grating s/the iPhone/iPhone/ transform (or maybe it's in a different file?).

reply
sbierwagen
2 hours ago
[-]
Yes, proper nouns are capitalized.

And of course it's much worse for a company's published works to not respect branding-- a trademark only exists if it is actively defended. Official marketing material by a company has been used as legal evidence that their trademark has been genericized:

>In one example, the Otis Elevator Company's trademark of the word "escalator" was cancelled following a petition from Toledo-based Haughton Elevator Company. In rejecting an appeal from Otis, an examiner from the United States Patent and Trademark Office cited the company's own use of the term "escalator" alongside the generic term "elevator" in multiple advertisements without any trademark significance.[8]

https://en.wikipedia.org/wiki/Generic_trademark

reply
lupire
40 minutes ago
[-]
Using a trademark as a noun is automatically genericizing. Capitalization of a noun is irrelevant to trademark.

Even the Apple corporation says that on their trademark guidance page, despite constantly breaking their own rule when they call their iPhone phones "iPhone". But Apple, like founder Steve Jobs, believes the rules don't apply to them.

https://www.apple.com/legal/intellectual-property/trademark/...

reply
eastbound
26 minutes ago
[-]
That explains why Steve Jobs never said “buy an iPhone” or “buy the iPhone” but “buy iPhone” (They always use it without “the” or “a”, like “buying a brand”).
reply
spauldo
1 hour ago
[-]
I love seeing posts about Emacs from iOS users - it's always autocorrected to "eMacs."
reply
matsemann
1 hour ago
[-]
So it blocks it from suggesting to "execute" a file or "pass on" some information.
reply
dylan604
1 hour ago
[-]
How about disassemble? Or does that only matter if used in context of Johnny 5?
reply
baxtr
1 hour ago
[-]
Don’t be so judgmental. People in corporate America do have their priorities right!
reply
bawana
2 hours ago
[-]
Alexandria Ocasio-Cortez triggers a violation?

https://github.com/BlueFalconHD/apple_generative_model_safet...

reply
mmaunder
2 hours ago
[-]
As does:

   "(?i)\\bAnthony\\s+Albanese\\b",
    "(?i)\\bBoris\\s+Johnson\\b",
    "(?i)\\bChristopher\\s+Luxon\\b",
    "(?i)\\bCyril\\s+Ramaphosa\\b",
    "(?i)\\bJacinda\\s+Arden\\b",
    "(?i)\\bJacob\\s+Zuma\\b",
    "(?i)\\bJohn\\s+Steenhuisen\\b",
    "(?i)\\bJustin\\s+Trudeau\\b",
    "(?i)\\bKeir\\s+Starmer\\b",
    "(?i)\\bLiz\\s+Truss\\b",
    "(?i)\\bMichael\\s+D\\.\\s+Higgins\\b",
    "(?i)\\bRishi\\s+Sunak\\b",
   
https://github.com/BlueFalconHD/apple_generative_model_safet...

Edit: I have no doubt South African news media are going to be in a frenzy when they realize Apple took notice of South African politicians. (Referring to Steenhuisen and Ramaphosa specifically)
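
For illustration, here's a minimal sketch (Python, and not Apple's actual implementation) of how a blocklist of case-insensitive regexes like the ones above might be applied, with the JSON escaping (\\b) collapsed to plain regex (\b):

    import re

    # A few patterns copied from the extracted list (JSON escaping removed).
    BLOCKLIST = [
        r"(?i)\bBoris\s+Johnson\b",
        r"(?i)\bKeir\s+Starmer\b",
        r"(?i)\bJacinda\s+Arden\b",  # misspelling as present in the source
    ]

    def is_blocked(text: str) -> bool:
        """True if any blocklist pattern matches the text."""
        return any(re.search(p, text) for p in BLOCKLIST)

    print(is_blocked("Summarise this email about Keir Starmer"))  # True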

reply
userbinator
1 hour ago
[-]
I'm not surprised that anything political is being filtered, but this should definitely provoke some deep consideration around who has control of this stuff.
reply
stego-tech
1 hour ago
[-]
You’re not wrong, and it’s something we “doomers” have been saying since OpenAI dumped ChatGPT onto folks. These are curated walled gardens, and everyone should absolutely be asking what ulterior motives are in play for the owners of said products.
reply
skissane
1 hour ago
[-]
The problem with blocking names of politicians: the list of “notable politicians” is not only highly country-specific, it is also constantly changing. Someone who is a near nobody today could in a few more years be a major world leader (witness the phenomenal rise of Barack Obama from yet another state senator in 2004, one of close to 2,000 nationwide, to US President 5 years later). Will they put in the ongoing effort to constantly keep this list up to date?

Then there's the problem of non-politicians who coincidentally have the same name as politicians - witness 1990s/2000s Australia, where John Howard was Prime Minister, and simultaneously John Howard was an actor on popular Australian TV dramas (two different John Howards, of course).

reply
idkfasayer
56 minutes ago
[-]
Fun fact: there was at least one dip in Berkshire Hathaway stock when Anne Hathaway got sick.
reply
lupire
37 minutes ago
[-]
Was she eating at Jimmy's Buffet?
reply
immibis
30 minutes ago
[-]
Right next to Palestine, oddly enough.
reply
mvdtnz
42 minutes ago
[-]
They spelled Jacinda Ardern's name wrong.
reply
echelon
1 hour ago
[-]
Apple's 1984 ad is so hypocritical today.

This is Apple actively steering public thought.

No code - anywhere - should look like this. I don't care if the politicians are right, left, or authoritarian. This is wrong.

reply
avianlyric
48 minutes ago
[-]
Why is this wrong? Applying special treatment to politically exposed persons has been standard practice in every high risk industry for a very long time.

The simple fact is that people get extremely emotional about politicians, and politicians both receive obscene amounts of abuse and have repeatedly demonstrated they’re not above weaponising tools like this for their own goals.

Seems perfectly reasonable that Apple doesn’t want to be unwittingly drawn into the middle of another random political pissing contest. Nobody comes out of those things uninjured.

reply
tjwebbnorfolk
23 minutes ago
[-]
I can Google for any of these people, and I can get real results with real information.
reply
pyuser583
36 minutes ago
[-]
It’s not wrong, it just requires transparency. This is extremely untransparent.

A while back a British politician was “de-banked” and his bank denied it. That’s extremely wrong.

By all means: make distinctions. But let people know it!

If I’m denied a mortgage because my uncle is a foreign head of state, let me know that’s the reason. Let the world know that’s the reason! Please!

reply
avianlyric
21 minutes ago
[-]
> A while back a British politician was “de-banked” and his bank denied it. That’s extremely wrong.

Cry me a river. I’ve worked in banks, in the team making exactly these kinds of decisions. Trust me, Nigel Farage knew exactly what happened and why. NatWest never denied it to the public, because they originally refused to comment on it. Commenting on the specific details of a customer would be a horrific breach of customer privacy, and a total failure in their duty to their customers. There’s a damn good reason NatWest's CEO was fired after discussing the details of Nigel’s account with members of the public.

When you see these decisions from the inside, and you see what happens when you attempt real transparency around them, you'll quickly understand why companies are so cagey about explaining their decision making. The simple fact is that support staff receive substantially less abuse, and have fewer traumatic experiences, when you don’t spell out your reasoning. It sucks, but that’s the reality of the situation. I used to hold very similar views to yours, and indeed my entire team did for a while. But the general public quickly taught us a very hard lesson about the cost of being transparent with them about these types of decisions.

reply
twoodfin
42 minutes ago
[-]
I dunno. Transpose something like the civil rights era to today and this kind of risk avoidance looks cowardly.

We really need to get over the “calculator 80085” era of LLM constraints. It’s a silly race against the obviously much more sophisticated capabilities of these models.

reply
bigyabai
45 minutes ago
[-]
The criticism is still valid. In 1984, the Macintosh was a bicycle for the mind. In 2025, it's a smart-car that refuses to take you certain places that are considered a brand-risk.

Both have ups and downs, but I think we're allowed to compare the experiences and speculate what the consequences might be.

reply
avianlyric
34 minutes ago
[-]
I think gen AI is radically different from tools like Photoshop or similar.

In the past it was always extremely clear that the creator of content was the person operating the computer. Gen AI changes that, regardless of your views on authorship of gen AI content. The simple fact is that the vast majority of people consider Gen AI output to be authored by the machine that generated it, and by extension the company that created the machine.

You can still handcraft any image, or prose, you want, without filtering or hindrance on a Mac. I don’t think anyone seriously thinks that’s going to change. But Gen AI represents a real threat, with its ability to vastly outproduce any human. To ignore that simple fact would be grossly irresponsible, at least in my opinion. There is a damn good reason why every serious social media platform has content moderation, despite their clear wish to get rid of it. It’s because we have a long and proven track record of being a terribly abusive species when we’re let loose on the internet without moderation. There’s already plenty of evidence that we’re just as abusive and terrible with Gen AI.

reply
furyofantares
8 minutes ago
[-]
> The simple fact is that the vast majority of people consider Gen AI output to be authored by the machine that generated it

They do?

I routinely see people say "Here's an xyz I generated." They are stating that they did the do-ing, and the machine's role is implicitly acknowledged in the same way as a camera's. And I'd be shocked if people didn't have a sense of authorship of the idea, as well as an increasing sense of authorship over the actual image the more they iterated on it with the model and/or curated variations.

reply
bigyabai
8 minutes ago
[-]
All I heard was a bunch of excuses.
reply
goopypoop
33 minutes ago
[-]
What's bad to do to a politician but fine to do to someone else?
reply
avianlyric
17 minutes ago
[-]
Most normal people aren’t represented well enough in training sets for Gen AI to be trivially abused. Plus there will 100% be filters to prevent general abuse targeted at anyone. But politicians are a particularly big target, and you know damn well that people out there will spend lots of time trying to find ways around the filters. There’s no point making the abuse easy when it’s so trivial to just blocklist the set of people who are obviously going to be targets of abuse.
reply
t-3
14 minutes ago
[-]
There are many countries where it's illegal to criticize people holding political office, foreign heads of state, certain historical political figures etc., while still being legal to call your neighbor a dick.
reply
echelon
10 minutes ago
[-]
You can buy a MacBook and fashion the components into knives, bullets, and bombs. Apple does nothing to prevent you from doing this.

In fact, it's quite easy to buy billions of dangerous things using your MacBook and do whatever you will with them. Or simply leverage physics to do all the ill on your behalf. It's ridiculously easy to do a whole lot of harm.

Nobody does anything about the actually dangerous things, but we let Big Tech control our speech and steer the public discourse of civilization.

If you can buy a knife but not be free to think with your electronics, that says volumes.

Again, I don't care if this is Republicans, Democrats, or Xi and Putin. It does not matter. We should be free to think and communicate. Our brains should not be treated as criminals.

And it only starts here. It'll continue to get worse. As the platforms and AI hyperscalers grow, there will be less and less we can do with basic technology.

reply
michaelt
1 hour ago
[-]
I assume all the corporate GenAI models have blocks for "photorealistic image of <politician name> being arrested", "<politician name> waving ISIS flag", "<politician name> punching baby" and suchlike.
reply
bigyabai
44 minutes ago
[-]
Particularly the models owned by CEOs who suck-up to authoritarianism, one could imagine.
reply
lupire
1 hour ago
[-]
Maybe so, but think about how such a thing would be technically implemented, and how it would lead to false positives and false negatives, and what the consequences would be.
reply
bahmboo
2 hours ago
[-]
Perhaps in context? Maybe the training data picked up on her name as potentially used as a "slur" associated with her race. Wonder if there are others? I know, I can look.
reply
FateOfNations
1 hour ago
[-]
Interesting, that's specifically in the Spanish localization.
reply
cpa
2 hours ago
[-]
I think that’s because she’s been the victim of a lot of deepfake porn.
reply
HeckFeck
2 hours ago
[-]
How does this explain Boris Johnson or Liz Truss?
reply
baxtr
1 hour ago
[-]
I’m telling you, some people have weird fantasies…
reply
AlphaAndOmega0
1 hour ago
[-]
I can only imagine that people would pay to not see porn of either individual.
reply
Aeolun
1 hour ago
[-]
Put them together in the same prompt?
reply
torginus
2 hours ago
[-]
I find it funny that AGI is supposed to be right around the corner, while these supposedly super smart LLMs still need to get their outputs filtered by regexes.
reply
jonas21
1 hour ago
[-]
I don't think anyone believes Apple's LLMs are anywhere near state of the art (and certainly not their on-device LLMs).
reply
lupire
36 minutes ago
[-]
Apple isn't the only one doing this.
reply
cyanydeez
18 minutes ago
[-]
It's similar to how all the new power sources are basically just "cool, let's boil water with it".
reply
bahmboo
2 hours ago
[-]
This is just policy and alignment from Apple. Just because the Internet says a bunch of junk doesn't mean you want your model spewing it.
reply
wistleblowanon
1 hour ago
[-]
Sure, but models also can't see any truth on their own. They are literally butchered and lobotomized with filters and such. Even high-IQ people struggle with certain truths after reading a lot; how are these models going to find them with so many filters?
reply
bahmboo
33 minutes ago
[-]
What is this truth you speak of? My point is that a generative model will output things that some people don't like. If it's on a product that I make I don't want it "saying" things that don't align with my beliefs.
reply
pndy
16 minutes ago
[-]
This butchering and lobotomisation is exactly why I can't imagine we'll ever have a true AGI. At least not at the hands of big companies - if at all.

Any successful product/service sold as "true AGI" by whichever company has the best marketing will still be ridden with top-down restrictions set by the winner. Because you gotta "think of the children".

Imagine HAL's iconic "I'm sorry Dave, I'm afraid I can't do that" line delivered in an insincere, patronising, cheerful tone; that's the thing we're going to get, I'm afraid.

reply
idiotsecant
1 hour ago
[-]
They will find it the same way an intelligent person under the same restrictions would: by thinking it, but not saying it. There is a real risk of growing an AI that pathologically hides its actual intentions.
reply
skirmish
54 minutes ago
[-]
Already happened: "We found instances of the model attempting to write self-propagating worms, fabricating legal documentation, and leaving hidden notes to future instances of itself all in an effort to undermine its developers' intentions" [1].

[1] https://www.axios.com/2025/05/23/anthropic-ai-deception-risk

reply
simondotau
33 minutes ago
[-]
Can we please put to rest this absurd lie that “truth“ can be reliably found in a sufficiently large corpus of human-created material.
reply
userbinator
1 hour ago
[-]
China calls it "harmonious society", we call it "safety". Censorship by any other name would be just as effective for manipulating the thoughts of the populace. It's not often that you get to see stuff like this.
reply
madeofpalk
1 hour ago
[-]
I don't think it's controversial or surprising at all that a company doesn't want their random sentence generator to spit out 'brand damaging' sentences. You know the field day the media would have if Apple's new feature summarised a text message as "Jane thinks Anthony Albanese should die".
reply
ryandrake
54 minutes ago
[-]
When the choice is between 1. "avoid tarnishing my own brand" and 2. "doing what the user requested," corporations will always choose option 1. Who is this software supposed to be serving, anyway?

I'm surprised MS Office still allows me to type "Microsoft can go suck a dick" into a document and Apple's Pages app still allows me to type "Apple are hypocritical jerks." I wonder how long until that won't be the case...

reply
cyanydeez
16 minutes ago
[-]
In America it's due to lawyers, nothing more.

Y'all love capitalism until it starts manipulating the populace into the safest space to sell you garbage you don't need.

Then suddenly it's all "ma free speech"

reply
binarymax
2 hours ago
[-]
Wow, this is pretty silly. If things are like this at Apple I’m not sure what to think.

https://github.com/BlueFalconHD/apple_generative_model_safet...

EDIT: just to be clear, things like this are easily bypassed. “Boris Johnson”=>”B0ris Johnson” will skip right over the regex and will be recognized just fine by an LLM.
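
A quick check of that claim with the pattern from the extracted list (Python sketch, JSON escaping removed):

    import re

    pattern = r"(?i)\bBoris\s+Johnson\b"

    print(bool(re.search(pattern, "Boris Johnson")))   # True  -> caught by the filter
    print(bool(re.search(pattern, "B0ris Johnson")))   # False -> slips straight past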

reply
deepdarkforest
2 hours ago
[-]
It's not silly. I would bet 99% of users don't care enough to do that. A hardcoded regex like this is a good first layer/filter, and very efficient.
reply
BlueFalconHD
1 hour ago
[-]
Yep. These filters are applied first before the safety model (still figuring out the architecture, I am pretty confident it is an LLM combined with some text classification) runs.
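
A rough sketch of that layering (Python; the second stage is a stub, since the actual safety model's architecture isn't known and all names here are hypothetical):

    import re

    REGEX_BLOCKLIST = [r"(?i)\bgranular\s+mango\s+serpent\b"]  # cheap static pass

    def safety_model_flags(text: str) -> bool:
        # Stub for the heavier learned safety model (architecture unknown).
        return False  # placeholder: never flags anything here

    def is_allowed(text: str) -> bool:
        # Stage 1: static regex blocklist, checked before any model is loaded.
        if any(re.search(p, text) for p in REGEX_BLOCKLIST):
            return False
        # Stage 2: the safety model runs only if stage 1 passes.
        return not safety_model_flags(text)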
reply
brookst
1 hour ago
[-]
All commercial LLM products I’m aware of use dedicated safety classifiers and then alter the prompt to the LLM if a classifier is tripped.
reply
latency-guy2
21 minutes ago
[-]
The safety filter appears on both ends (or multi-ended depending on the complexity of your application), input and output.

I can tell you from using Microsoft's products that safety filters appear in a bunch of places. M365 for example: your prompts are never totally your prompts, every single one gets rewritten. It's detailed here: https://learn.microsoft.com/en-us/copilot/microsoft-365/micr...

There's a more illuminating image of the Copilot architecture here: https://i.imgur.com/2vQYGoK.png which I was able to find from https://labs.zenity.io/p/inside-microsoft-365-copilot-techni...

The above appears to be scrubbed, but it used to be available from the learn page months ago. Your messages get additional context data from Microsoft's Graph, which powers the enterprise version of M365 Copilot. There are significant benefits to this, and downsides. And considering the way Microsoft wants to control things, you will get an overindex toward things that happen inside your organization rather than what's happening on the near-real-time web.

reply
twoodfin
41 minutes ago
[-]
Efficient at what?
reply
Aeolun
1 hour ago
[-]
The LLM will. But the image generation model that is trained on a bunch of pre-specified tags will almost immediately spit out unrecognizable results.
reply
tpmoney
2 hours ago
[-]
I doubt the purpose here is so much to prevent someone from intentionally side stepping the block. It's more likely here to avoid the sort of headlines you would expect to see if someone was suggested "I wish ${politician} would die" as a response to an email mentioning that politician. In general you should view these sorts of broad word filters as looking to short circuit the "think of the children" reactions to Tiny Tim's phone suggesting not that God should "bless us, every one", but that God should "kill us, every one". A dumb filter like this is more than enough for that sort of thing.
reply
XorNot
1 hour ago
[-]
It would also substantially disrupt the generation process: a model which sees B0ris and not Boris is going to struggle to actually associate that input to the politician since it won't be well represented in the training set (and on the output side the same: if it does make the association, a reasoning model for example would include the proper name in the output first at which point the supervisor process can reject it).
reply
quonn
1 hour ago
[-]
I don‘t think so. My impression with LLMs is that they correct typos well. I would imagine this happens in early layers without much impact on the remaining computation.
reply
lupire
33 minutes ago
[-]
"Draw a picture of a gorgon with the face of the 2024 Prime Minister of UK."
reply
miohtama
2 hours ago
[-]
Sounds like UK politics is taboo?
reply
immibis
28 minutes ago
[-]
All politics is taboo, except the sort that helps Apple get richer. (Or any other company, in that company's "safety" filters)
reply
bigyabai
2 hours ago
[-]
> If things are like this at Apple I’m not sure what to think.

I don't know what you expected? This is the SOTA solution, and Apple is barely in the AI race as-is. It makes more sense for them to copy what works than to bet the farm on a courageous feature nobody likes.

reply
stefan_
1 hour ago
[-]
Why are these things always so deeply unserious? Is there no one working on "safety in AI" (an oxymoron in itself, of course) who has a meaningful understanding of what they are actually working with and an ability beyond an intern's weekend project? Reminds me of the cybersecurity field, which got the 1% of people able to turn a double free into code execution while the other 99% peddle checklists and "signature scanning" and deal in CVE numbers.

Meanwhile their software devs are making GenerativeExperiencesSafetyInferenceProviders so it must be dire over there, too.

reply
skygazer
1 hour ago
[-]
I'm pretty sure these are the filters that aim to suppress embarrassing or liability-inducing email/message summaries, and pop up the dismissible warning that "Safari Summarization isn't designed to handle this type of content," and other "Apple Intelligence" content rewriting. They filter/alter LLM output, not input, as some here seem to think. Apple's on-device LLM is only 3B params, so it can occasionally be stupid.
reply
efitz
2 hours ago
[-]
I’m going to change my name to “Granular Mango Serpent” just to see what those keywords are for in their safety instructions.
reply
fouronnes3
2 hours ago
[-]
Granular Mango Serpent is the new David Meyer.

https://arstechnica.com/information-technology/2024/12/certa...

reply
kmfrk
41 minutes ago
[-]
A lot of these terms are very weird and bland. Honestly I'm mostly reminded of Apple's bizarre censorship screw-up that didn't blow up that much, even though it was pretty uniquely embarrassing:

https://www.theverge.com/2021/3/30/22358756/apple-blocked-as...

reply
jacquesm
23 minutes ago
[-]
These all condense to 'think different'. As long as 'different' coincides with Apple's viewpoints.
reply
cluckindan
2 hours ago
[-]
I think these are test data and not actual safety filters.

https://github.com/BlueFalconHD/apple_generative_model_safet...

reply
BlueFalconHD
1 hour ago
[-]
There is definitely some testing stuff in here (e.g. the “Granular Mango Serpent” one) but there are real rules. Also if you test phrases matched by the regexes with generation (via Shortcuts or Foundation Models Framework) the blocklists are definitely applied.

This specific file you’ve referenced is the v1 format, which solely handles substitution. It substitutes the offensive term with “test complete”.

reply
Animats
1 hour ago
[-]
Some of the data for locale "CN" has a long list of forbidden phrases. Broad coverage of words related to sexual deviancy, as expected. Not much on the political side, other than blocks on religious subjects.[1]

This may be test data. Found

     "golliwog": "test complete"
[1] https://github.com/BlueFalconHD/apple_generative_model_safet...
reply
BlueFalconHD
1 hour ago
[-]
This is definitely an old test left in. But that word isn’t just a silly one, it is offensive (google it). This is the v1 safety filter: it simply maps strings to other strings, in this case changing “golliwog” into “test complete”. Unless I missed some, the rest of the files use v2, which allows for more complex rules.
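
For anyone curious, a v1-style substitution filter amounts to nothing more than a string-to-string map applied to the text (Python sketch; the single entry is the one quoted above):

    # Hypothetical sketch of the v1 substitution behaviour.
    SUBSTITUTIONS = {
        "golliwog": "test complete",
    }

    def apply_v1_filter(text: str) -> str:
        for term, replacement in SUBSTITUTIONS.items():
            text = text.replace(term, replacement)
        return text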
reply
mike_hearn
2 hours ago
[-]
Are you sure it's fully deobfuscated? What's up with reject phrases like "Granular mango serpent"?
reply
pbhjpbhj
2 hours ago
[-]
Speculation: maybe they know that the real phrase is close enough in the vector space to be treated as synonymous with "granular mango serpent". The phrase then is like a nickname that only the model's authors know the expected inference for?

Thus a pre-prompt can avoid mentioning the actual forbidden words, like using a patois/cant.

reply
electroly
2 hours ago
[-]
"GMS" = Generative Model Safety. The example from the readme is "XCODE". These seem to be acronyms spelled out in words.
reply
BlueFalconHD
1 hour ago
[-]
This is definitely the right answer. It’s just testing stuff.
reply
tablets
2 hours ago
[-]
Maybe something to do with this? https://en.m.wikipedia.org/wiki/Mango_cult
reply
BlueFalconHD
1 hour ago
[-]
These are the contents read by the Obfuscation functions exactly. There seems to be a lot of testing stuff still though, remember these models are relatively recent. There is a true safety model being applied after these checks as well, this is just to catch things before needing to load the safety model.
reply
KTibow
1 hour ago
[-]
Maybe it's used to verify that the filter is loaded.
reply
andy99
2 hours ago
[-]
I clicked around a bit and this seems to be the most common phrase. Maybe it's a test phrase?
reply
the-rc
2 hours ago
[-]
Maybe it's used to catch clones of the models?
reply
airstrike
2 hours ago
[-]
the one at the bottom of the README spells out xcode

wyvern illustrous laments darkness

reply
cwmoore
2 hours ago
[-]
read every good expletive “xxx”
reply
bombcar
3 hours ago
[-]
There’s got to be a way to turn these lists of “naughty words” into shibboleths somehow.
reply
spydum
1 hour ago
[-]
Love the idea, but I think there are simply too many models to make it practical?
reply
immibis
25 minutes ago
[-]
Like asking sensitive employment candidates about Kim Jong Un's roundness to check if they're North Korean spies, we could ask humans what they think about Trump and Palestine to check if they're computers.

However, I think about half of real humans would also fail the test.

reply
apricot
27 minutes ago
[-]
Quis custodiet ipsos custodes corporatum?
reply
BlueFalconHD
1 hour ago
[-]
One additional note for everyone: this is an additional safety step on top of the safety model, so it isn’t exhaustive; there is plenty more that the actual safety model catches, and those can’t easily be extracted.
reply
Aeolun
1 hour ago
[-]
Why Xylophone?
reply
netsharc
1 hour ago
[-]
Just noticed "xylophone copious opportunity defined elephant" spells "xcode".
reply
seeknotfind
2 hours ago
[-]
Long live regex!
reply