Why is implementing it correctly not performant? For context, I have no idea how rounding is typically implemented anyway.
For a lot of applications, the difference between a denormal and zero is small enough to be irrelevant, so if you expect near-zero values to be common, enabling a denormals-to-zero compiler flag might give you a pretty nice performance boost for free.
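
To make "the difference between a denormal and zero" concrete, here is a rough Python sketch. It only simulates in software what a denormals-to-zero mode would do; it does not switch on any hardware mode or compiler flag, and `is_denormal` and `flush_to_zero` are names I made up for illustration.

```python
import math
import sys

# Smallest positive *normal* double, about 2.2e-308.
SMALLEST_NORMAL = sys.float_info.min


def is_denormal(x: float) -> bool:
    """True if x is nonzero but smaller in magnitude than any normal double."""
    return x != 0.0 and abs(x) < SMALLEST_NORMAL


def flush_to_zero(x: float) -> float:
    """Hypothetical helper: replace a denormal with a zero of the same sign."""
    if is_denormal(x):
        return math.copysign(0.0, x)
    return x


# Dividing past the bottom of the normal range gives a denormal, not zero.
tiny = SMALLEST_NORMAL / 4

print(is_denormal(tiny))      # True
print(flush_to_zero(tiny))    # 0.0
print(flush_to_zero(1e-300))  # 1e-300 (a normal value, left alone)
```

The value being thrown away here is below 2.2e-308 in magnitude, which is why a lot of code never notices the difference.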