I wish the author had elaborated at all on why they felt that way, even if it was just "existing solutions are too easy and I want to learn the hard way". They linked to a pretty big list of established microcontroller neural network frameworks. I still have my little SparkFun microcontroller that runs TensorFlow Lite neural networks powered by just a coin cell battery. They were free in the goodie bags at TensorFlow Summit 2019. "Edge Computing" on the "Internet of Things" was the hype that year.
Edit: Ah, I see they do have elaboration linked - "By simplifying the model architecture and using a full-custom implementation, I bypassed the usual complexities and memory overhead associated with Edge-ML inference engines." Nice work!
What would one do with such a system?
[0] https://pickandplace.wordpress.com/2012/05/16/2d-positioning...
One area where very low-resolution images are used is 3D and IR sensing, for example an 8x8 depth image from a time-of-flight sensor like the ST VL53L5CX. It could be mounted somewhere in a household and detect, for example, human vs. pet vs. static object. Though the sensor is the expensive part, so one could probably afford a larger microcontroller :D
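To make the idea concrete, here is a toy sketch of what such a detector could look like: the VL53L5CX reports an 8x8 grid of zone distances, which can be flattened and fed to a tiny classifier. The weights, normalization constant, and "person-shaped" test frame below are all hypothetical, just to show the data shapes involved.

```python
import numpy as np

def classify_depth_frame(frame_mm, w, b):
    """Toy classifier for an 8x8 time-of-flight depth frame.

    frame_mm: (8, 8) array of zone distances in millimetres,
              like the grid a VL53L5CX reports.
    w, b: weights/bias of a tiny logistic model (hypothetical, untrained here).
    Returns a score in [0, 1], e.g. P(person).
    """
    x = frame_mm.reshape(-1) / 4000.0      # normalise by an assumed ~4 m max range
    z = x @ w + b                          # 64 multiply-accumulates
    return 1.0 / (1.0 + np.exp(-z))       # sigmoid

# Hypothetical frame: a blob closer than the background, roughly person-shaped.
frame = np.full((8, 8), 3500.0)
frame[2:7, 3:5] = 1200.0

rng = np.random.default_rng(0)
p = classify_depth_frame(frame, rng.normal(0.0, 0.1, 64), 0.0)
```

A model this small (64 weights plus a bias) would fit comfortably even on a 2 kB-RAM part, which is rather the point of the original project.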
Inference for this RISC-V implementation takes 13.7 ms; 7 seconds was cited for an Arduino version as a reference.
When the NN is quantized only after training, a lot of information is lost, or you have to use less aggressive quantization, which leaves a lot of redundancy.
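The core difference can be sketched in a few lines: post-training quantization just snaps already-trained float weights onto a coarse grid (and eats whatever rounding error results), whereas quantization-aware training inserts that same rounding op into the forward pass so the optimizer can adapt the weights to it. The weights and bit width below are made up for illustration.

```python
import numpy as np

def fake_quantize(w, bits=4):
    """Round weights onto a symmetric n-bit grid.

    This is the op that QAT inserts into the forward pass during training;
    applied once to finished weights, it is plain post-training quantization.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    return np.round(w / scale) * scale

# Post-training quantization: quantize once, after the fact.
w_trained = np.array([0.11, -0.92, 0.48, 0.03])
w_ptq = fake_quantize(w_trained, bits=4)
ptq_error = np.abs(w_ptq - w_trained).max()

# In QAT, the loss is computed through fake_quantize(w), so gradient descent
# pushes the float weights toward values that survive the rounding -- at
# inference time only the small-integer grid values remain.
```

With aggressive widths (2-4 bits) that rounding error is exactly the "lost information" mentioned above, which is why QAT tolerates far smaller grids than post-training quantization.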
I'd imagine there are differences between the two approaches?
This project's MCU runs at 48 MHz and is inferring 16x16 images.
So, 3x fewer pixels at 4x the clock: 7000 / 12 ≈ 583 ms, versus 13.7 ms, a ~43x speedup. That does seem high, though it may come down to differences between the AVR and RISC-V hardware and ISAs. (E.g., might there be a RAM bottleneck on the AVR chip?)
3 cycles for the load + 2 for the multiply + 1 for the store = 6 clocks for multiplying against an array in program ROM [0]. I couldn't find the corresponding document for the CH32V003/QingKe V2A (RV32EC), but some of the PDFs mention pipelines, so I suppose users aren't expected to count clock cycles and it's simply vastly more efficient. That could just be it.
0: pp.70- https://ww1.microchip.com/downloads/en/devicedoc/atmel-0856-...
This publication seems to describe the Arduino implementation in more detail:
https://arxiv.org/abs/2105.02953
It appears that the code even uses floats in some implementations, which have to be emulated. So I'd wager there are discrepancies at both the algorithmic level (QAT NN) and the implementation level that lead to the better performance on the CH32V003.
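One concrete implementation-level difference: an emulated float multiply costs dozens of instructions on a core with no FPU, while a QAT network can run its inner loops as pure small-integer multiply-accumulates, with a single rescale at the end. A sketch of that integer-only dot product (the scales and values are hypothetical, not taken from the project):

```python
import numpy as np

def quantized_dot(x_q, w_q, x_scale, w_scale):
    """Integer-only dot product of quantized activations and weights.

    x_q, w_q: int8 arrays; x_scale, w_scale: per-tensor dequantization scales.
    All MACs accumulate in int32 -- cheap single instructions on an MCU --
    and the one float (or fixed-point) rescale happens once per output.
    """
    acc = np.dot(x_q.astype(np.int32), w_q.astype(np.int32))
    return float(acc) * x_scale * w_scale

# Hypothetical quantized vectors and scales:
x_q = np.array([12, -3, 7], dtype=np.int8)
w_q = np.array([5, 9, -2], dtype=np.int8)
y = quantized_dot(x_q, w_q, 0.05, 0.1)
```

Replacing a soft-float multiply per weight with an integer MAC per weight is exactly the kind of change that could account for a large chunk of the otherwise unexplained speedup discussed above.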