The latency and power issues can probably be fixed, assuming a good end-to-end model, by distilling it into a wide, shallow net that uses low-precision or even binary operations. I don't know if that would be enough - we've seen compute requirements drop by multiple orders of magnitude (think of style transfer going from hours on top-end Titan GPUs to real time on mobile phones) - but the usual target is smartphones, which at least have a GPU, while it seems unlikely any hearing aids will have GPUs anytime soon... I suppose a good enough squashed low-precision model could be turned into an ASIC.
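For concreteness, the distillation step usually means training the small net to match the big net's temperature-softened outputs rather than the hard labels. A minimal numpy sketch of that loss (in the style of Hinton et al.'s distillation setup; the temperature value here is an arbitrary choice, not something from this thread):

```python
import numpy as np

def softmax(z, T=1.0):
    # temperature-scaled softmax; higher T softens the distribution
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    # KL(teacher || student) on softened outputs, scaled by T^2 so
    # gradients stay comparable across temperatures
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(T * T * np.sum(p * (np.log(p) - np.log(q))))
```

The softened teacher distribution carries more information per example than a one-hot label (e.g. which wrong classes the teacher considers plausible), which is part of why a much smaller student can get close to the teacher's accuracy.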
Not to detract from your larger point, but AFAIK the style transfer case is different. If you're willing to hardcode the style into the net you can go real time, but the original style transfer paper can apply different styles without retraining. So they're different algorithms. Unless the SOTA has changed recently.
You shouldn't need to hardcode the style if you provide it as an additional input for the net to condition on. But this doesn't really matter, since for fun mobile applications it's fine to pick from 20 or 50 pretrained styles, and likewise for hearing aids.
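One known way to get a fixed menu of styles out of a single fast net is conditional instance normalization (Dumoulin et al., "A Learned Representation for Artistic Style"): each style is just a learned per-channel scale and shift, so switching styles at runtime is indexing a different parameter row. A minimal numpy sketch (shapes and variable names are illustrative):

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    # normalize each channel of a single image, x shaped (C, H, W)
    mu = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def conditional_instance_norm(x, gammas, betas, style_id):
    # gammas, betas shaped (num_styles, C): one learned
    # scale/shift pair per pretrained style
    g = gammas[style_id][:, None, None]
    b = betas[style_id][:, None, None]
    return g * instance_norm(x) + b
```

Since all the convolutional weights are shared across styles, adding a style costs only 2C extra parameters per normalization layer, which is why "20 or 50 pretrained styles" is cheap.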