It accepts it as input, but it encodes in 4:2:0 format, simply because the encoder hardware does not support it.
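
To make the size difference concrete, here is a rough sketch of what going down to 4:2:0 means for the chroma planes. It assumes the input is planar 4:2:2 purely for illustration, and the row-averaging filter is my own assumption: the actual conversion path in the driver or hardware may work differently. The function names are hypothetical and not part of ffmpeg or mpp.

```python
def chroma_plane_size(width, height, fmt):
    """Return (chroma_width, chroma_height) for a planar YUV format."""
    if fmt == "yuv422p":
        # 4:2:2: chroma is halved horizontally, full vertical resolution
        return width // 2, height
    if fmt == "yuv420p":
        # 4:2:0: chroma is halved both horizontally and vertically
        return width // 2, height // 2
    raise ValueError(f"unsupported format: {fmt}")


def chroma_422_to_420(plane):
    """Halve the vertical chroma resolution by averaging each pair of rows.

    `plane` is a list of rows (lists of ints). Averaging with rounding is an
    assumed filter, chosen only to illustrate the idea.
    """
    return [
        [(a + b + 1) // 2 for a, b in zip(top, bottom)]
        for top, bottom in zip(plane[0::2], plane[1::2])
    ]


if __name__ == "__main__":
    print(chroma_plane_size(1920, 1080, "yuv422p"))
    print(chroma_plane_size(1920, 1080, "yuv420p"))
    print(chroma_422_to_420([[10, 20], [30, 40], [50, 60], [70, 80]]))
```

So half of the vertical chroma detail of the input is dropped before the frame is encoded.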
No, the hardware is frame-based (to my knowledge). However, there is something called split mode which reduces latency; it may be what you are looking for, but I have not implemented it yet.
You do not need to, because the encoder hardware is single-core. Even if there were multiple hardware encoder cores, the underlying mpp API has its own internal threads to handle that parallelism, so ffmpeg threads are not necessary, at least from today's point of view.