-
Can Activation Oracles Bypass Safety Training? Reading Harmful Knowledge from a Model That Refuses
Using activation oracles to elicit compliant responses to harmful requests from Qwen3-8B, a model that refuses such requests nearly 100% of the time.
-
Alignment Faking Replication and Chain-of-Thought Monitoring Extensions
A replication of alignment faking in Hermes-3-Llama-3.1-405B, plus CoT monitoring ablations.