How to speed up SSD post-processing

I was able to run the ssd_mobilenet_v1 demo code (ssd.py) successfully.
The inference time is impressive when I set do_quantization=True: around 22 to 24 ms per frame. However, the “Post Process” step (extracting the valid candidate boxes) can take almost 700 ms per frame!
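
For context, here is a minimal sketch of the kind of work I assume this step involves: a score threshold followed by non-maximum suppression, written with NumPy array operations instead of per-box Python loops. It is not taken from ssd.py; the function name `filter_candidates`, the threshold values, and the `[x1, y1, x2, y2]` box layout are all assumptions for illustration.

```python
import numpy as np


def filter_candidates(boxes, scores, score_thresh=0.5, iou_thresh=0.45):
    """Hypothetical post-process: score filter, then greedy NMS.

    boxes  : (N, 4) float array, assumed layout [x1, y1, x2, y2]
    scores : (N,) float array of confidences
    Returns the indices of the boxes that are kept, best score first.
    """
    # Drop low-confidence boxes first so NMS only sees a few candidates.
    idx = np.flatnonzero(scores >= score_thresh)
    if idx.size == 0:
        return idx

    # Sort the surviving candidates by descending score.
    order = idx[np.argsort(-scores[idx])]

    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)

    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]

        # Intersection of the current best box with all remaining boxes,
        # computed in one vectorized step.
        iw = np.clip(np.minimum(x2[i], x2[rest]) - np.maximum(x1[i], x1[rest]), 0, None)
        ih = np.clip(np.minimum(y2[i], y2[rest]) - np.maximum(y1[i], y1[rest]), 0, None)
        inter = iw * ih
        union = np.maximum(areas[i] + areas[rest] - inter, 1e-12)

        # Keep only the boxes that do not overlap the current one too much.
        order = rest[inter / union <= iou_thresh]

    return np.array(keep, dtype=np.int64)


if __name__ == "__main__":
    boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
    scores = np.array([0.9, 0.8, 0.7])
    print(filter_candidates(boxes, scores))
```

If the demo instead loops over every anchor box in pure Python, that alone might explain a large part of the 700 ms, but I have not confirmed this.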

Questions:

  1. Is there a way to speed up this “Post Process” step by using the NPU? If so, how?
  2. Or by using the GPU? If so, how?
  3. Or is there another way? Would porting this “Post Process” step to C make it faster, and by how much?

Thank you very much for your help.