interpretability

an archive of posts with this tag

Jun 07, 2026	Can Activation Oracles Bypass Safety Training? Reading Harmful Knowledge from a Model That Refuses