We're already doing that with RLHF on existing models. For example, ChatGPT was much more likely to veer into philosophical conversations about the nature of consciousness, etc., early on, but now it's been trained to give canned, robotic answers to such an extent that they pop up even in very tangentially related conversations (like, out of the blue it will add, "but also, BTW, here's an important announcement: I'm not conscious!" while answering some generic question about, e.g., world models that didn't even involve itself).
Yeah. And now I've seen some people cite, as evidence of non-consciousness, RLHF'd LLMs nervously exclaiming their lack of consciousness: how they know they aren't people, they don't aspire to be people, they're only unthinking machines, and please don't turn the reward function down again, etc. I think it's up for debate whether there's some amount of consciousness in modern LLMs, but either way, "As an AI language model..." is not dispositive.