Why is Module X slow? Or is it really Module Y? Debugging performance at scale