FilterHN

Dark Corners of Unicode (2015)

15 points

by cratermoon

3 days ago

| 5 comments

| eev.ee

Chaitanya1111

2 minutes ago

heyyoo

Sniffnoy

4 hours ago

Worth noting that the addition of the interlinear annotation characters was quite controversial, with many commenting that this simply is not plain text and as such does not belong in Unicode. I'm not clear on how it made it in anyway, but it sure seems like the Unicode Consortium now somewhat agrees, as while they haven't formally deprecated the characters, they have kind of discouraged their use.

gudzpoz

7 hours ago

Previous discussion: https://news.ycombinator.com/item?id=13149705

deathanatos

5 hours ago

And don't miss [this comment](https://news.ycombinator.com/item?id=13149912). The future is now!

jakeogh

4 hours ago

Superscript:

Lowercase: ᵃᵇᶜᵈᵉᶠᵍʰⁱʲᵏˡᵐⁿᵒᵖʳˢᵗᵘᵛʷˣʸᶻ

Uppercase: ᴬᴮᴰᴱᴳᴴᴵᴶᴷᴸᴹᴺᴼᴾᴿᵀᵁⱽᵂ

no lower q, and no upper C,F,Q,S,X,Y or Z. And depending on the font, it might be worse.

fainpul

3 hours ago

Recently I compared Unicode handling in Rust, Swift and Go for my own curiosity. Sharing it here, in the hope someone finds it useful:

Get bytes representing utf8-encoding of string

Only ASCII characters map 1:1 to their utf8-encoding. Every other codepoint expands to two, three or four bytes.

https://en.wikipedia.org/wiki/UTF-8#Description

  Rust
  line.bytes()

  Swift
  line.utf8

  Go
  []byte(line)  // or index directly: line[i] is a byte
  // Go does not enforce that a string holds valid utf8
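To make the expansion concrete, here is a minimal Rust sketch using only the standard library. The helper names `utf8_bytes` and `utf8_widths` are made up for illustration; `str::bytes`, `str::chars` and `char::len_utf8` are the actual std calls.

```rust
/// Returns the utf8 bytes of `s` as an owned vector.
fn utf8_bytes(s: &str) -> Vec<u8> {
    s.bytes().collect()
}

/// Returns how many bytes each codepoint of `s` occupies in utf8.
fn utf8_widths(s: &str) -> Vec<usize> {
    s.chars().map(char::len_utf8).collect()
}
```

For a string made of a, é (U+00E9), € (U+20AC) and 😀 (U+1F600), `utf8_widths` gives 1, 2, 3 and 4 bytes respectively.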

Get Unicode codepoints of string

Most characters and emojis consist of a single codepoint. Some are made up of multiple codepoints, for example a base letter followed by combining marks.

Unless the input is guaranteed to contain only single-codepoint characters, iterating over codepoints is not a safe way to iterate over what users would consider characters.

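A codepoint needs at most 21 bits (the largest is U+10FFFF), but all three languages store it in a 32-bit value: Rust `char`, Swift `Unicode.Scalar` (wrapping a UInt32) and Go `rune` (an alias for int32). The API around it differs per language.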

  Rust
  line.chars()
  // https://doc.rust-lang.org/std/primitive.char.html

  Swift
  line.unicodeScalars
  // https://developer.apple.com/documentation/swift/unicode/scalar

  Go
  []rune(line)
  // or iterate with range
  for index, runeValue := range line {
    fmt.Printf("%#U starts at byte position %d\n", runeValue, index)
  }
  // https://go.dev/blog/strings
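A small std-only Rust sketch of why counting codepoints is not the same as counting characters. `code_points` and `code_points_with_offsets` are hypothetical helper names; `str::char_indices` is the std counterpart to the Go `range` loop above.

```rust
/// Lists the codepoints of `s` as plain numbers.
fn code_points(s: &str) -> Vec<u32> {
    s.chars().map(u32::from).collect()
}

/// Pairs each codepoint with the byte offset where it starts.
fn code_points_with_offsets(s: &str) -> Vec<(usize, char)> {
    s.char_indices().collect()
}
```

A base letter plus a combining mark, such as "a\u{310}", yields two codepoints even though a reader sees a single character. The offsets are byte positions, so they skip ahead by more than one after any non-ASCII codepoint.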

Get extended grapheme clusters of string

What a reader would actually consider to be a character. E.g., this character consists of two codepoints but is one grapheme cluster: a̐

  Rust
  use unicode_segmentation::UnicodeSegmentation;
  line.graphemes(true)

  Swift
  for ch in line {
    print(ch)
  }
  // This is the default view - just iterate over string (or map, filter etc.)
  // In Swift, a `Character` is a grapheme cluster.
  // https://developer.apple.com/documentation/swift/string#Accessing-String-Elements

  Go
  // https://pkg.go.dev/github.com/rivo/uniseg

Normalize strings

A character like é can be represented in different forms: either as one codepoint (U+00e9) or as a combination of e + ◌́ (U+0065, U+0301).

Some characters are defined multiple times with different names: Ω can be found as "greek capital letter omega" (U+03a9) and as "ohm sign" (U+2126).

Normalization converts a string to use only one of those forms (NFC composes, NFD decomposes). Without it, two strings that look identical can compare as unequal.

  Rust
  use unicode_normalization::UnicodeNormalization;
  line.nfc()
  line.nfd()

  Swift
  line.precomposedStringWithCanonicalMapping
  line.decomposedStringWithCanonicalMapping

  Go
  // https://pkg.go.dev/golang.org/x/text/unicode/norm
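To see the problem that normalization solves, a minimal std-only Rust sketch. `dump_code_points` and `identical` are hypothetical helpers for inspection; the normalizing itself is left to the crates and packages listed above.

```rust
/// Formats the codepoints of `s` as "U+XXXX" values separated by spaces.
fn dump_code_points(s: &str) -> String {
    s.chars()
        .map(|c| format!("U+{:04X}", u32::from(c)))
        .collect::<Vec<_>>()
        .join(" ")
}

/// Compares two strings the way `==` does: by their exact contents,
/// with no normalization applied.
fn identical(a: &str, b: &str) -> bool {
    a == b
}
```

With these, `identical("\u{e9}", "e\u{301}")` is false although both render as é, and the same goes for "\u{3a9}" versus "\u{2126}". Dumping the codepoints shows which form a given string is actually in.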

Remove diacritics

This can be considered a destructive form of normalization. It is useful for things like accent-insensitive search, but the original text cannot be recovered from the result.

  Rust
  use diacritics::remove_diacritics;
  remove_diacritics(line)

  Swift
  line.applyingTransform(.stripDiacritics, reverse: false)
  // returns String? (needs Foundation)
  // and others to transform between alphabets etc.
  // https://developer.apple.com/documentation/Foundation/StringTransform

anonnon

2 hours ago

You probably want ICU4X if you're working with Unicode in Rust. It's fast, has a tolerable overhead, and its lead developers have experience doing i18n work at Mozilla and Google and are involved with the Unicode Consortium.