I feel like the issue is more that their pathing algorithm is very inefficient. I'm not sure why using multiple cores would solve the problem if the cause of the lag is that their pathing algorithm is cubic time or something.
I think this can actually be used to optimize the pathfinding. Basically, create a fake unit that computes the real/complex path, then have the other units follow it using basic collision or a heavily pared-down pathfinding algorithm, similar to the hack I did where I just counted loop iterations and bailed after some threshold.
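Here's a rough sketch of what I mean, not anything from the actual game. Everything in it is made up for illustration: the grid format (0 = walkable), the names `bfs_path` and `plan_group`, the budget of 50, and the choice to start the fake leader on the first unit's cell. BFS stands in for whatever the expensive search really is, and unit-to-unit collision is left out entirely.

```python
from collections import deque


def bfs_path(grid, start, goal, max_iters=None):
    """Shortest path on a 4-connected grid (0 = walkable), or None.

    If max_iters is set, give up once the main loop has run that many
    times (the "count iterations and bail" hack).
    """
    rows, cols = len(grid), len(grid[0])
    queue = deque([start])
    came_from = {start: None}
    iters = 0
    while queue:
        iters += 1
        if max_iters is not None and iters > max_iters:
            return None  # iteration budget exhausted: bail out
        cell = queue.popleft()
        if cell == goal:
            # Walk the parent links back to the start, then reverse.
            path = []
            while cell is not None:
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and nxt not in came_from):
                came_from[nxt] = cell
                queue.append(nxt)
    return None


def plan_group(grid, units, goal, follower_budget=50):
    """One full search for a fake leader, cheap capped searches for the rest.

    Assumption: the fake leader starts on the first unit's cell.
    Returns one path per unit, or None if the leader cannot reach the goal.
    """
    leader_path = bfs_path(grid, units[0], goal)  # the only uncapped search
    if leader_path is None:
        return None
    index_on_path = {cell: i for i, cell in enumerate(leader_path)}
    plans = []
    for unit in units:
        # Pick the leader-path cell closest to this unit (Manhattan distance).
        join = min(leader_path,
                   key=lambda p: abs(p[0] - unit[0]) + abs(p[1] - unit[1]))
        # Capped search just to reach the leader's path.
        approach = bfs_path(grid, unit, join, max_iters=follower_budget)
        if approach is None:
            plans.append([unit])  # over budget: this unit waits in place
        else:
            plans.append(approach + leader_path[index_on_path[join] + 1:])
    return plans


grid = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0],
]
plans = plan_group(grid, [(0, 0), (0, 1), (1, 0)], (4, 4))
```

The point is that only one search per group is unbounded; each follower's cost is capped at `follower_budget` loop iterations, so the worst case no longer scales with the number of units times the full search cost. The tradeoff is that a follower whose nearest path cell is behind a wall may blow its budget and just sit there until you retry or fall back to a full search.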