Show HN: I built an embeddable Unicode library with MISRA C conformance
117 points
by hgs3
6 days ago
| 10 comments
| railgunlabs.com
| HN
Hello, everyone. I built Unicorn: an embeddable MISRA C:2012 implementation of essential Unicode Algorithms.

Unicorn is designed to be fully customizable: you can select which Unicode algorithms and character properties are included or excluded from compilation. You can also exclude Unicode character blocks wholesale for scripts your application does not support. It's perfect for resource-constrained devices like microcontrollers and IoT devices.

About me: I quit my Big Corp job a few years back to pursue my passion for software development and this is one of my first commercial releases.

Someone
5 days ago
[-]
On https://railgunlabs.com/unicorn/manual/misra-compliance/, I think you will want to fix a typo in

  1.2    Required    Compliant (verified by compiling with Clang's -pdentic flag)
                                                                   ^^^^^^^^
Or am I too pedantic?
reply
hgs3
5 days ago
[-]
This is the most ironic typo I've ever made. Thanks for the catch. I've corrected it.
reply
canucker2016
5 days ago
[-]
A couple more suggestions.

List the platforms (& compilers) that you've tested on.

Compare (pros/cons) against other Unicode libs (like others have done elsewhere in this thread, i.e. https://news.ycombinator.com/item?id=42424637 and https://news.ycombinator.com/item?id=42424638)

reply
hgs3
5 days ago
[-]
Thanks for the suggestions. I think a comparison table would be useful, but I want to make sure I do it right since I'd be comparing my work to someone else's.

As for the compilers, I’ve tested the library with GCC, Clang, and MSVC, and with the -pedantic flag like the GP mentioned. The library should build with any standard-compliant C99 compiler.

reply
rurban
5 days ago
[-]
This is commercial only. Free and small is my safeclib, which does about half of it. ICU is not usable on small devices, and also pretty slow. It's much faster to use precomputed tables per algorithm, such as here or in safeclib. libunistring is also extremely slow. This was tried for grep and failed.
reply
hgs3
5 days ago
[-]
> This is commercial only.

You can use Unicorn for non-commercial use [1], but yes, for commercial use you need to buy a license.

> It's much faster to use precomputed tables per algorithm

You're absolutely right about using precomputed tables per algorithm. That is the secret to the library's speed.
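For readers wondering what "precomputed tables" means in practice, here is a minimal sketch of one common form: a sorted range table searched with binary search. It is an illustration only and makes no claim about how Unicorn or safeclib lay out their data. The names `cp_range`, `white_space_ranges`, and `demo_is_white_space` are made up for this example; the ranges are the code points carrying the Unicode White_Space property.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive code point range. */
struct cp_range {
    uint32_t first;
    uint32_t last;
};

/* Code points with the Unicode White_Space property, sorted ascending.
   A real library would typically generate a table like this from the
   Unicode Character Database at build time instead of typing it by hand. */
static const struct cp_range white_space_ranges[] = {
    {0x0009u, 0x000Du}, {0x0020u, 0x0020u}, {0x0085u, 0x0085u},
    {0x00A0u, 0x00A0u}, {0x1680u, 0x1680u}, {0x2000u, 0x200Au},
    {0x2028u, 0x2029u}, {0x202Fu, 0x202Fu}, {0x205Fu, 0x205Fu},
    {0x3000u, 0x3000u}
};

/* Binary search over the range table. Hypothetical name, not a real API. */
bool demo_is_white_space(uint32_t cp)
{
    size_t lo = 0u;
    size_t hi = sizeof white_space_ranges / sizeof white_space_ranges[0];
    bool found = false;

    while ((lo < hi) && !found) {
        size_t mid = lo + ((hi - lo) / 2u);
        if (cp < white_space_ranges[mid].first) {
            hi = mid;            /* continue in the lower half */
        } else if (cp > white_space_ranges[mid].last) {
            lo = mid + 1u;       /* continue in the upper half */
        } else {
            found = true;        /* cp lies inside this range */
        }
    }
    return found;
}
```

The lookup needs no heap and no runtime initialization, and a table for a property the application does not need can simply be left out of the build.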

> Free and small is my safeclib, which does about half of it.

I like safeclib! It's nice to hear from the author. It's worth distinguishing that safeclib is a safer string library whereas Unicorn is a Unicode algorithms library, not a string library.

[1] https://github.com/railgunlabs/unicorn/blob/master/LICENSE

reply
rurban
5 days ago
[-]
Well, a string is unicode nowadays. And for sure not just a zero-terminated blob. That would be a buffer. Only the Linux kernel still holds this invalid view.

So every string library needs at least a compare function to find strings, with all the variants of same graphemes. Which leads us to NFC normalization for a start. Upcase tables and wordlength tables are also needed.

reply
fao_
5 days ago
[-]
> This is commercial only. Free and small is my safeclib

Is it me or does this feel a bit weird? It seems like you're using the comments section here to self-advertise for exposure.

I read it like — "businesses can't use this without paying the OP, however, if you're a business you can get 50% of the way there by using _my_ library, and you don't even have to pay me!". It comes off incredibly rude to try to undercut the OP like this.

reply
ykonstant
5 days ago
[-]
Self promotion in comments is perfectly fine on HN, as long as the comment is on topic and informative---both of which are true here.
reply
josephcsible
5 days ago
[-]
In general I'd agree with that, but IMO the benefit of having a FOSS alternative to a proprietary product overrides it in this case.
reply
fao_
5 days ago
[-]
See this is where I and the FSF/OSI diverge, because

> the right to use, copy, modify, merge, publish and distribute the Software [as long as you're not selling it or derivatives of the Software]

seems to line up with exactly what the folks involved in Free Software originally wanted — the ability to fix, patch, debug software that runs on their systems. I also think it's incredibly important to have non-commercial clauses given that the vast majority of technical infrastructure in the modern world is built on FOSS, all while the companies give nothing back and developers of FOSS starve.

If Valve can dump hundreds of developers into FOSS and within, what, 7 years? bring Linux almost to parity with and performance of Windows for gaming, imagine what would happen if FOSS developers were actually given funding!

reply
chris_wot
5 days ago
[-]
I don’t get the whole MISRA requirement that functions should only have one exit point. Honestly, nobody has been able to explain why this is important, other than it’s a historical anomaly inherited from FORTRAN. (Which was actually for a good reason)
reply
AlotOfReading
5 days ago
[-]
That's one of many rules in MISRA that originate from antiquated "best practices" from the dark ages that don't actually improve safety. We have it today by way of IEC 61508, which gets it from a book on structured programming called Structured Design. That book didn't recommend banning multiple exit points, but it recommended minimizing them to simplify the control flow graph and said code should minimize the distance between black boxes (bits of code that do something without leaky abstractions and have only one return statement). The IEC authors and MISRA thought the logical extension of that was to make everything have one exit point.
reply
elcritch
5 days ago
[-]
I recall reading a study showing that MISRA actually tended to _increase_ the average number of bugs in software projects.
reply
layer8
5 days ago
[-]
This is an old rule from when structured programming was introduced. The prior state of affairs was that code would jump via gotos between functions to different labels within those functions (labels were global). The requirement that every function should have only a single entry point and a single exit point seemed like a good rule to establish sanity.

MISRA C states the following rationale:

“A single point of exit is required by IEC 61508 and ISO 26262 as part of the requirements for a modular approach.

Early returns may lead to the unintentional omission of function termination code.

If a function has exit points interspersed with statements that produce persistent side effects, it is not easy to determine which side effects will occur when the function is executed.”

Note that the MISRA C rule is merely advisory, meaning it is a recommendation and not a hard requirement (i.e. it’s a “should” and not a “shall”).

reply
gwd
5 days ago
[-]
Having been only lightly exposed to MISRA, my impression is that in MISRA the "Required" label doesn't mean "You must always do this"; rather it means, "You must do this or document why it's necessary and safe not to do it".

It's a bit like an operating system write-protecting pages. Sometimes you write-protect pages that the process really shouldn't write to, like shared libraries or something like that. But sometimes you write-protect pages that you actually expect a program to write to, like memory-mapped pages, because then when the write happens it triggers something else (like copy-on-write or marking a page dirty or something).

Rules like "Don't use dynamically allocated memory" are one example of this -- not that they really expect you never to do it, but that marking it "Required" is a way to force you to document how you plan to make it safe.

Similarly, if it's easier to rearrange a function to have only a single exit point than to explain why you need multiple exit points, just rearrange it; if you really need multiple exit points, just document why.

reply
rcxdude
5 days ago
[-]
This is true, in most of these things you can document your way around the problem, though sometimes you also have a third-party to convince who may or may not be reasonable. But either way, MISRA is a collection of stating the obvious ("don't do things the language says you shouldn't do" is like fully half of the list) and arbitrary restrictions that have little justification, so the fact that you can document your way around it still basically means you're doing extra work for no real benefit (because it's real difficult within such a conservative field to say "hey, that industry standard? It's crap, has always been crap, and we're going to ignore it").
reply
champijone
5 days ago
[-]
One reason to prefer it in C is to be able to easily add locally scoped functionality like profiling markers and temp allocators.

  profile_begin("func");
  a = temp_arena_begin();
  // ... code
  temp_arena_end();
  profile_end();
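
A compilable sketch of why the pairing matters, with the caveat that everything here is a stand-in: the snippet above does not say how `profile_begin` or `temp_arena_begin` are implemented, so this version reduces them to nesting counters (and drops the arena handle) purely to make a missed `end` call observable. `work_early_return`, `work_single_exit`, and `open_markers` are hypothetical names.

```c
#include <stdbool.h>

/* Stand-ins for the profiler and temp arena: each begin/end pair
   only bumps a nesting counter so that imbalance can be observed. */
static int profile_depth = 0;
static int arena_depth = 0;

static void profile_begin(const char *name) { (void)name; profile_depth++; }
static void profile_end(void) { profile_depth--; }
static void temp_arena_begin(void) { arena_depth++; }
static void temp_arena_end(void) { arena_depth--; }

/* Early return: the failure path skips both end calls. */
int work_early_return(bool fail)
{
    profile_begin("work_early_return");
    temp_arena_begin();
    if (fail) {
        return -1; /* markers are left open on this path */
    }
    temp_arena_end();
    profile_end();
    return 0;
}

/* Single exit: every path falls through to the same end calls. */
int work_single_exit(bool fail)
{
    int status = 0;

    profile_begin("work_single_exit");
    temp_arena_begin();
    if (fail) {
        status = -1;
    }
    /* ... work that should run only when status == 0 goes here ... */
    temp_arena_end();
    profile_end();
    return status;
}

/* Number of begin calls not yet matched by an end call. */
int open_markers(void)
{
    return profile_depth + arena_depth;
}
```

Calling `work_early_return(true)` leaves two markers open, while `work_single_exit(true)` returns the same error code with none left open.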
reply
actionfromafar
5 days ago
[-]
Lots of MISRA code is proprietary and receives no upstream patches from customers. Still, it's not unusual to deliver a source blob to your customer instead of a binary blob, often for debugging. (But sometimes only the binary blob has the blessing of the vendor, so you can only use the binary blob in your released product.)

I would not be in the least surprised if someone has a compiler/transpiler from a higher level language to some C code which checks all MISRA boxes.

reply
aulin
5 days ago
[-]
the abominations I've seen in code review from people trying to fulfill this rule still wake me up at night
reply
daghamm
5 days ago
[-]
Yeah, in this particular case MISRA C is doing more harm than good.

I wish we could get an update on these rules, but this issue has been brought up many, many times before and has always been brushed away without a proper analysis.

reply
dark-star
5 days ago
[-]
one reason I can think of off the top of my head (although I never had to deal with MISRA C at all) is that if you have to add some cleanup code before your function returns, then there is exactly one place and one place only to do that.

Otherwise this leads to duplication of cleanup code similar to

  allocate_something();
  // ...
  if (failed(foo)) {
    deallocate_something();  // cleanup, first copy
    return FAILED;
  }
  // ...
  deallocate_something();    // cleanup, second copy
  return SUCCESS;
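
For comparison, a sketch of the single-exit shape, where a status variable carries the result down to one cleanup site. `measure_copy` is a made-up function chosen only to have something to allocate and free; `SUCCESS` and `FAILED` mirror the names above. This is about control flow only and is not claimed to be MISRA compliant (MISRA C restricts `malloc` and `free` themselves).

```c
#include <stdlib.h>
#include <string.h>

enum { SUCCESS = 0, FAILED = -1 };

/* Copies src into a scratch buffer and reports its length via out_len.
   Hypothetical example; on failure *out_len is left untouched. */
int measure_copy(const char *src, size_t *out_len)
{
    int status = FAILED;
    char *scratch = NULL;

    if ((src != NULL) && (out_len != NULL)) {
        size_t n = strlen(src);
        scratch = malloc(n + 1u);
        if (scratch != NULL) {
            memcpy(scratch, src, n + 1u); /* includes the terminator */
            *out_len = strlen(scratch);
            status = SUCCESS;
        }
    }

    free(scratch); /* the only cleanup site; free(NULL) is a no-op */
    return status;
}
```

The trade-off is extra nesting: each step that can fail adds another level of `if`, which is where a lot of the complaints about this rule come from.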
reply
samatman
5 days ago
[-]
This is, more than anything, an argument for a `defer` statement, of the sort you can enjoy in Zig right now.

Or hopefully, eventually, in C, thanks to the tireless efforts of JeanHeyde Meneide:

https://thephd.dev/just-put-raii-in-c-bro-please-bro-just-on...

reply
eddd-ddde
5 days ago
[-]
> certain name manglings are not guaranteed to be 1:1 and can infact “demangle” into multiple different plausible entities.

Now I'm really curious, doesn't that mean some valid C++ code would fail to link for having multiple definitions of the same symbol??

I would expect name mangling to be a bijection from function prototype to string.

reply
layer8
5 days ago
[-]
MISRA C can’t mandate new language features though.
reply
AlotOfReading
5 days ago
[-]
The MISRA people work with the C/C++ committees on upcoming language changes, which gives them loud voices to push things they want promoted.
reply
pklausler
5 days ago
[-]
FORTRAN II introduced the RETURN statement.
reply
bee_rider
5 days ago
[-]
Languages like Matlab, where the values returned are listed at the top of the function and you don’t even need a return statement to tell it what to return, always feel so funky and fun.
reply
rubicks
5 days ago
[-]
This is not a comment about open/closed-source software and/or licensing models.

Projects like this never fail to impress me vis-a-vis source obfuscation. The 'generate.pyz' is an interesting twist on the usual practice.

reply
layer8
5 days ago
[-]

    #  You may not reverse engineer, decompile, disassemble, or otherwise attempt
    #  to derive the source code or underlying structure of this script
This prohibition is void in certain relevant jurisdictions, for any publicly available product.
reply
__turbobrew__
5 days ago
[-]
There is not much to show if I can’t read the source code.
reply
hgs3
5 days ago
[-]
You can download a prebuilt amalgamation from GitHub to see the amalgamated C code [1]. The GitHub repo contains the code that generates the amalgamation.

[1] https://github.com/railgunlabs/unicorn/releases/

reply
kiritanpo
5 days ago
[-]
This looks interesting. Most embedded projects I know use ICU/libicu for their Unicode needs. As a potential customer, I would like to know how it compares against ICU for performance and code size. Why should I switch?
reply
hgs3
5 days ago
[-]
> I would like to know how does it compare against ICU

ICU is a large library, typically around 40 MB depending on the platform, whereas Unicorn, with all features enabled, is only about 600 KB.

ICU has a broader scope: it's not just a Unicode library, but also an internationalization library. Unicorn, on the other hand, is specifically focused on Unicode algorithms.

ICU wasn't designed to be customized. It's also non-MISRA compliant and written in C++11. In contrast, Unicorn is written in C99, fully customizable, MISRA compliant, and only requires a few features from libc [1]. It's far more portable.

[1] https://github.com/railgunlabs/unicorn/?tab=readme-ov-file#u...

reply
ranger_danger
3 days ago
[-]
The most important difference to me, which is a deal-breaker, is that Unicorn is non-commercial.

Note that I am not interested in actually using Unicorn commercially, but my understanding is that this restriction makes the library incompatible with FOSS licenses such as GPL.

reply
garganzol
5 days ago
[-]
My comment is not directly related to the particular project which is impressive, but more to its presentation. If you go to the author's website, you will find neat to-the-point manuals and other useful information. This is what I call the real Web 3.0. Simple and to the point. Also the main company page is humorous in a good way, about the mad scientist etc.
reply
hgs3
5 days ago
[-]
I appreciate the kind words. When I was designing the website, I wanted to inject my personality into it. I wasn't sure how well the "playfulness" would go over, but I'm glad you enjoyed it.
reply
biosboiii
5 days ago
[-]
Since MISRA is targeted at automotive, as a software dev in the automotive space I would suggest adding a note that this is able to run on POSIX-compliant OSes like QNX :)

If you would like to chat, hit me up.

reply
tocariimaa
5 days ago
[-]
It uses a proprietary license if you're wondering.
reply
sushidev
6 days ago
[-]
Nice!
reply
sushidev
5 days ago
[-]
But not interesting for me in any way since it’s not open source.
reply
hgs3
5 days ago
[-]
Unfortunately, the entire reason I didn't release Unicorn under an OSI approved license is because I see many (most?) FOSS projects are chronically underfunded. Now, I did not quit my job and build this to get rich or anything, but I do need to earn enough to sustain myself. If there's enough interest, I would consider crowdfunding a release under an OSI license.
reply
kouteiheika
5 days ago
[-]
Why not dual license it under a commercial license and something like GPL?
reply
hgs3
5 days ago
[-]
I went back and forth on this and in my uncertainty I decided it was better to start more "closed" first with the potential to become more "open" over time.
reply
h4ck_th3_pl4n3t
5 days ago
[-]
Therefore it will never be open source, and if so then only once you've lost interest in the project. Got it.
reply
hgs3
5 days ago
[-]
Thank you, let me know if you have any questions.
reply
sushidev
5 days ago
[-]
Who is your main target audience?
reply
hgs3
5 days ago
[-]
Primarily, companies developing for embedded systems or other resource constrained devices.
reply