Overflow is the reason. Intel's vpmaddubsw multiplies uint8_t values by int8_t values and produces int16_t results. If both inputs were unsigned, 255 * 255 = 65,025 would be out of int16_t range (−32,768 to +32,767), so the instruction was presumably designed around one signed and one unsigned operand. With that mix the extremes, −128 * 255 = −32,640 and 127 * 255 = 32,385, always fit in int16_t. Overflow (or rather saturation, with this instruction) can still occur, though, because it sums adjacent pairs of products: two products of −32,640 sum to −65,280, which saturates to −32,768. See my comment in PyTorch: https://github.com/pytorch/pytorch/blob/a37db5ae3978010e1bb7...
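A minimal sketch of that saturation, using the SSSE3 variant _mm_maddubs_epi16 (same semantics as vpmaddubsw; compile with -mssse3). The input values are just chosen to hit the worst case:

```c
#include <stdio.h>
#include <stdint.h>
#include <immintrin.h>

int main(void) {
    // First operand is treated as unsigned bytes, second as signed
    // bytes -- that's the fixed signedness contract of [v]pmaddubsw.
    __m128i a = _mm_set1_epi8((char)255);  // all lanes = 255 (unsigned)
    __m128i b = _mm_set1_epi8(-128);       // all lanes = -128 (signed)

    // Each int16 lane should be 255*(-128) + 255*(-128) = -65280,
    // but the instruction saturates it to INT16_MIN = -32768.
    __m128i r = _mm_maddubs_epi16(a, b);

    int16_t out[8];
    _mm_storeu_si128((__m128i *)out, r);
    printf("lane 0 = %d (exact sum would be %d)\n", out[0], 255 * -128 * 2);
    return 0;
}
```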
I believe a better argument appeals to the structure of neural networks. Activation inputs to a matrix multiply come out of a non-linear function, and ReLU is a popular choice that makes activations non-negative, hence naturally unsigned. The weights then need to be signed so the matrix multiplication can produce negative outputs; without negative outputs, you would lose the non-linearity of ReLU.
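To make that concrete, here is an illustrative scalar version of the u8 x s8 dot product that vpmaddubsw accelerates (my own sketch, not PyTorch's code): unsigned post-ReLU activations, signed weights, and a wide accumulator so the result can go negative:

```c
#include <stdint.h>
#include <stdio.h>

// Quantized dot product: uint8_t activations (non-negative, as after
// ReLU) times int8_t weights, accumulated in int32_t so negative
// outputs are representable.
static int32_t dot_u8s8(const uint8_t *act, const int8_t *w, int n) {
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (int32_t)act[i] * (int32_t)w[i];
    return acc;
}

int main(void) {
    uint8_t act[4] = {0, 10, 255, 3};   // post-ReLU, all non-negative
    int8_t  w[4]   = {-128, 5, -2, 1};  // signed weights
    printf("%d\n", dot_u8s8(act, w, 4));  // prints -457: output can be negative
    return 0;
}
```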