Implementing Neural Networks on a "10-cent" RISC-V MCU
125 points
17 days ago
| 5 comments
| cpldcpu.wordpress.com
geor9e
15 days ago
>I felt there is no solution that I really felt comfortable with

I wish the author elaborated at all about why they felt that way. Even if it was just "existing solutions are too easy and I want to learn the hard way". They linked to a pretty big list of established microcontroller neural network frameworks. I still have my little "sparkfun" microcontroller that runs TensorFlow Lite neural networks powered by just a coin cell battery. They were free in the goodie bags at TensorFlow Summit 2019. "Edge Computing" on "Internet of Things" was the hype that year.

Edit: Ah, I see they do have an elaboration linked - "By simplifying the model architecture and using a full-custom implementation, I bypassed the usual complexities and memory overhead associated with Edge-ML inference engines." Nice work!

_Microft
17 days ago
How would one get these 16x16 images generated in a way that does not need a lot more compute power than the inference itself? Maybe by using a sensor from an optical mouse, which seems to have a similar resolution? [0] According to a quick web search, the CH32V003 seems to support SPI and I²C out of the box [1], which the mentioned sensor supports. (Rough readout sketch below the links.)

What would one do with such a system?

[0] https://pickandplace.wordpress.com/2012/05/16/2d-positioning...

[1] https://www.wch-ic.com/products/CH32V003.html
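
For illustration, a rough sketch of what grabbing a frame from such a sensor could look like. The 18x18 array size, the register address and the dump protocol are hypothetical (loosely modeled on small optical-mouse parts), and spi_transfer() would have to be implemented against the CH32V003's SPI peripheral or bit-banged GPIO:

    #include <stdint.h>

    #define SENSOR_DIM      18      /* assumed native pixel-array size */
    #define REG_PIXEL_GRAB  0x08    /* hypothetical "pixel dump" register */

    /* Shift one byte out over SPI and return the byte shifted in. */
    extern uint8_t spi_transfer(uint8_t out);

    static void grab_frame_16x16(uint8_t img[16][16])
    {
        /* Hypothetical protocol: writing the register (MSB set) restarts the dump. */
        spi_transfer(REG_PIXEL_GRAB | 0x80);
        spi_transfer(0x00);

        for (int y = 0; y < SENSOR_DIM; y++) {
            for (int x = 0; x < SENSOR_DIM; x++) {
                spi_transfer(REG_PIXEL_GRAB);       /* request the next pixel */
                uint8_t px = spi_transfer(0x00);
                if (y >= 1 && y <= 16 && x >= 1 && x <= 16)
                    img[y - 1][x - 1] = px;         /* crop the border -> 16x16 */
            }
        }
    }

At 16x16 with one byte per pixel that is only 256 bytes per frame, so even the 2 kB of SRAM on the CH32V003 leaves room for the network's activations.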

jononor
15 days ago
IO does tend to take considerable resources/power. In fact, it is one of the reasons it is desirable to run ML as close to the sensor as possible: it allows one to extract and transmit onwards just the information of interest (usually very low bitrate) instead of the raw sensor data. That is especially important for wireless and battery-powered devices.

One area where very low resolution images are used is 3D and IR sensing. For example, an 8x8 depth image from a time-of-flight sensor like the ST VL53L5CX. It could be mounted in, say, a household setting and detect for example human vs pet vs static object. Though the sensor is the expensive part, so one could probably afford a larger microcontroller :D

cpldcpu
15 days ago
Indeed, using a mouse sensor for data input would be quite interesting. Maybe another option would be just a row of phototransistors.
imtringued
15 days ago
Searching "ESP32 CAM" gets me a couple of hits for the camera module alone (excluding the ESP32) for under $1.50 at a 500 minimum order quantity.
mianos
14 days ago
Especially considering the ESP32 has hardware multiply for both integer and floating point.
imtringued
15 days ago
7 seconds is an eternity, even for microcontrollers.
robxorb
14 days ago
Unless I'm misreading your comment, you may have misread the article.

Inference for this RISC-V implementation takes 13.7ms. 7 seconds was cited from an Arduino version as a reference.

jononor
15 days ago
Image classification is a good demo/test case. However, image sensors still cost multiple dollars, so one would likely spend a bit more on the microcontroller in that case. An accelerometer or microphone, on the other hand, adds just 30 cents to the BOM and can be processed on a similarly cheap microcontroller. That is at least what I have found so far while trying to build a sub-1-dollar ML-powered system https://hackaday.io/project/194511-1-dollar-tinyml
cpldcpu
15 days ago
Great project! I used MNIST because it is easy to work with as a dataset. Audio classification would be quite interesting as a follow-up, but I assume one would need some kind of transform to make the data easier to work with.
jononor
15 days ago
Thanks! Yeah, transforming into a time-frequency representation is the standard method. Short-Time Fourier Transform (STFT) using an FFT is the most common, though one can also use FIR/IIR filterbanks. It is however quite challenging to do in just a few kB of RAM. It looks doable with 4 kB in total, miiight be possible with 2 kB.
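
To make the RAM estimate concrete, a back-of-envelope buffer budget for a 256-point Q15 frame (the sizes and the mel-style filterbank are my own assumptions, not measurements from a real pipeline):

    #include <stdint.h>

    #define FRAME_LEN 256
    #define N_MELS     16

    int16_t frame[FRAME_LEN];      /* 512 B: windowed samples, real FFT done in place */
    int16_t window[FRAME_LEN];     /* 512 B if the window is kept in RAM (could live in flash) */
    int16_t mel_energy[N_MELS];    /*  32 B: filterbank output fed to the classifier */

    /* Roughly 1 kB of DSP buffers before weights, activations and stack --
       which is why 4 kB looks doable and 2 kB gets tight. */
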
cpldcpu
11 days ago
Maybe something simpler, like a Haar wavelet, would also work? Or a DFT using the Goertzel algorithm?
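
For reference, Goertzel needs only three state variables per frequency bin and no FFT buffer at all. A minimal float sketch (on the CH32V003 one would want a fixed-point version, since there is no FPU):

    #include <math.h>
    #include <stdint.h>

    /* Squared magnitude of a single DFT bin over n samples (scaling omitted). */
    float goertzel_power(const int16_t *x, int n, float bin_hz, float fs_hz)
    {
        float w     = 2.0f * 3.14159265f * bin_hz / fs_hz;
        float coeff = 2.0f * cosf(w);
        float s1 = 0.0f, s2 = 0.0f;

        for (int i = 0; i < n; i++) {
            float s0 = (float)x[i] + coeff * s1 - s2;
            s2 = s1;
            s1 = s0;
        }
        return s1 * s1 + s2 * s2 - coeff * s1 * s2;
    }

Running it over a dozen hand-picked bins per frame would give a small filterbank-style feature vector without ever holding a full spectrum in RAM.
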
bjornsing
15 days ago
Impressive numbers compared with the linked Arduino project. Makes me wonder, what’s the difference in approach?
cpldcpu
15 days ago
The difference is in using quantization-aware training, where the quantization of the weights is already simulated during training. This helps restructure the network in a way where it can optimally store information in the allotted number of bits per weight.

When the NN is quantized only after training, a lot of information is lost, or you have to use less aggressive quantization, which leaves a lot of redundancy.
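
As an illustration of the idea (a sketch, not the actual training code): the forward pass runs each weight through a quantize-dequantize round trip, while the full-precision master weights keep receiving gradient updates:

    #include <math.h>

    /* Symmetric quantize-dequantize of one weight to `bits` of precision.
       `max_abs` would typically be the largest absolute weight in the tensor. */
    float fake_quant(float w, int bits, float max_abs)
    {
        int   levels = (1 << (bits - 1)) - 1;   /* e.g. 7 for 4-bit weights */
        float step   = max_abs / (float)levels;
        float q      = roundf(w / step);
        if (q >  levels) q =  (float)levels;    /* clamp to the representable grid */
        if (q < -levels) q = -(float)levels;
        return q * step;                        /* value the forward pass actually sees */
    }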

UncleEntity
15 days ago
Does that mean you're training a lower bit-rate(?) network or you are training a full network to 'know' it will eventually be running under quantization?

I'd imagine there are differences between the two approaches?

cpldcpu
15 days ago
The latter one. The network is trained in full precision (this is required for the gradient calculation), but the weights are nudged towards the quantized values.
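
To make that concrete, a rough sketch of the usual straight-through-style update, reusing the hypothetical fake_quant() from the sketch above (a real training loop is of course more involved):

    float fake_quant(float w, int bits, float max_abs);  /* see sketch above */

    /* One SGD step: gradients are computed with the quantized weights in the
       forward pass, but applied to the full-precision master copy, which is
       then re-quantized -- this is what nudges the weights onto the grid. */
    void qat_step(float *w_master, float *w_quant, const float *grad,
                  int n, float lr, int bits, float max_abs)
    {
        for (int i = 0; i < n; i++) {
            w_master[i] -= lr * grad[i];
            w_quant[i]   = fake_quant(w_master[i], bits, max_abs);
        }
    }
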
bjornsing
15 days ago
Thanks, that explains the accuracy. But it doesn’t explain why it took 7 seconds to run inference on the Arduino, and milliseconds in this project…
robxorb
14 days ago
The paper [0] regarding the Arduino implementation mentions that their MCU runs at 16 MHz, and that they run inference on 28x28 images.

This project's MCU runs at 48 MHz and infers on 16x16 images.

So roughly 3x fewer pixels at 3x the clock, i.e. about a 9x advantage: 7000 / 9 ≈ 780ms. Versus 13.7ms, still a ~57x speed increase. Does seem high, depending maybe on differences between the AVR and RISC-V hardware and ISA. (Eg, might there be a RAM bottleneck on the AVR chip?)

[0] https://arxiv.org/ftp/arxiv/papers/2105/2105.02953.pdf

numpad0
14 days ago
Looks like an SRAM load on AVR takes 3 cycles, EEPROM 4 cycles [0], with 1 cycle subtracted for consecutive reads. An SRAM store is 1-2 cycles. FMUL (fixed-point multiply) is 2 cycles. The CPU is neither pipelined nor cached.

3 cycles for the load + 2 for the multiplication + 1 for the store = 6 clocks for multiplying one value against an array held in program ROM. I just couldn't find the corresponding document for the CH32V003/QingKe V2A/RV32EC, but some of the PDFs mention pipelines, so I suppose users are not supposed to count clock cycles and it's just vastly more efficient. That could just be it.

0: pp.70- https://ww1.microchip.com/downloads/en/devicedoc/atmel-0856-...

cpldcpu
14 days ago
On the CH32V003, a load should be two cycles if the code is executed from SRAM; there are additional wait states for loads from flash. The V2A only caches a single 32-bit instruction word, so there is basically no cache.

This publication seems to describe the Arduino implementation in more detail:

https://arxiv.org/abs/2105.02953

It appears that the code is even using floats in some implementations, which have to be emulated. So I'd wager that both at the algorithmic level (QAT-NN) and at the implementation level there are discrepancies that lead to the better performance on the CH32V003.
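
For a feel of the implementation-level difference, here is a simplified sketch (not the exact kernel from the post) of a fully connected inner loop with 2-bit weight codes, where the multiply degenerates to add/subtract, so neither float emulation nor a hardware multiplier is needed:

    #include <stdint.h>

    /* Dot product of int8 activations with 2-bit weight codes packed four per
       byte: 0 -> zero, 1 -> +1, 2 -> -1. No multiply instruction required. */
    int32_t fc_dot_2bit(const uint8_t *w_packed, const int8_t *act, int n)
    {
        int32_t acc = 0;
        for (int i = 0; i < n; i++) {
            uint8_t code = (w_packed[i >> 2] >> ((i & 3) * 2)) & 0x3;
            if (code == 1)      acc += act[i];
            else if (code == 2) acc -= act[i];
        }
        return acc;   /* caller applies the layer's shared scale (a shift) and ReLU */
    }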

mbb70
15 days ago
I believe the Arduino project used only a single hidden layer, whereas the author's quantization scheme allowed them to use multiple.
ladyanita22
15 days ago
How would Rust behave here? It'd be interesting to know if it's flexible enough to work as efficiently on these machines.
__s
15 days ago
Rust can work in these contexts with no_std: https://docs.rust-embedded.org/book/intro/index.html
persnickety
15 days ago
Apparently it has been done https://noxim.xyz/blog/rust-ch32v003/