Lowercase: ᵃᵇᶜᵈᵉᶠᵍʰⁱʲᵏˡᵐⁿᵒᵖʳˢᵗᵘᵛʷˣʸᶻ
Uppercase: ᴬᴮᴰᴱᴳᴴᴵᴶᴷᴸᴹᴺᴼᴾᴿᵀᵁⱽᵂ
There is no lowercase q and no uppercase C, F, Q, S, X, Y or Z. Depending on the font, coverage may be even worse.
Get bytes representing utf8-encoding of string
Only ASCII characters (U+0000 to U+007F) map 1:1 to their UTF-8 encoding. Every other codepoint expands to two, three or four bytes.
https://en.wikipedia.org/wiki/UTF-8#Description
Rust
line.bytes()
Swift
line.utf8
Go
[]byte(line)
// a Go string is a read-only sequence of bytes, so line[i] is already a byte
// Go does not enforce that line is valid UTF-8
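To make the expansion concrete, here is a minimal Rust sketch using only the standard library. The helper name utf8_bytes is made up for this example.

```rust
/// Returns the UTF-8 encoding of `s` as an owned byte vector.
/// (`utf8_bytes` is a hypothetical helper name, not a std function.)
fn utf8_bytes(s: &str) -> Vec<u8> {
    // `str::bytes` iterates over the underlying UTF-8 bytes without copying;
    // collecting gives us something easy to inspect.
    s.bytes().collect()
}
```

Calling utf8_bytes("a") gives a single byte (0x61), while utf8_bytes("\u{e9}") (é) gives two (0xC3, 0xA9). The euro sign U+20AC takes three bytes and an emoji such as U+1F600 takes four.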
Get Unicode codepoints of string
Most characters and emojis consist of a single codepoint. Some are made up of multiple codepoints.
Unless the input is guaranteed to contain only single-codepoint characters, this is not a safe way to iterate over what users would consider characters.
A codepoint fits in 4 bytes. It is usually stored internally as a u32 or i32, but each language wraps it in its own type (char in Rust, Unicode.Scalar in Swift, rune in Go).
Rust
line.chars()
// https://doc.rust-lang.org/std/primitive.char.html
Swift
line.unicodeScalars
// https://developer.apple.com/documentation/swift/unicode/scalar
Go
[]rune(line)
// or iterate with range
for index, runeValue := range line {
    fmt.Printf("%#U starts at byte position %d\n", runeValue, index)
}
// https://go.dev/blog/strings
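The Go loop above has a direct Rust counterpart in char_indices. A minimal std-only sketch follows; the function names are made up for this example.

```rust
/// Pairs every codepoint of `s` with the byte offset where it starts,
/// mirroring Go's `for index, runeValue := range line`.
/// (Hypothetical helper name.)
fn codepoints_with_offsets(s: &str) -> Vec<(usize, char)> {
    s.char_indices().collect()
}

/// Size in bytes of Rust's `char`, the type that holds one Unicode scalar value.
fn codepoint_size() -> usize {
    std::mem::size_of::<char>()
}
```

For the input "a\u{310}b" the offsets are 0, 1 and 3, because the combining mark U+0310 occupies two bytes. codepoint_size() returns 4, matching the note above.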
Get extended grapheme clusters of string
What a reader would actually consider to be a character. E.g., this character consists of two codepoints but is one grapheme cluster: a̐
Rust
use unicode_segmentation::UnicodeSegmentation;
line.graphemes(true)
Swift
for ch in line {
    print(ch)
}
// This is the default view - just iterate over string (or map, filter etc.)
// In Swift, a `Character` is a grapheme cluster.
// https://developer.apple.com/documentation/swift/string#Accessing-String-Elements
Go
// https://pkg.go.dev/github.com/rivo/uniseg
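Why this matters can be shown without any crate. The sketch below deliberately does the wrong thing: it reverses a string codepoint by codepoint instead of by grapheme cluster. The function name is made up for this example.

```rust
/// Reverses `s` one codepoint at a time.
/// This is NOT grapheme-aware: combining marks get detached from their base
/// letter. It is here only to show the failure mode.
fn reverse_by_codepoint(s: &str) -> String {
    s.chars().rev().collect()
}
```

Reversing "xa\u{310}" this way yields "\u{310}ax": the combining mark that belonged to the a now leads the string. Iterating with line.graphemes(true) and reversing those pieces would keep a and its mark together.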
Normalize strings
A character like é can be represented in different forms: either as one codepoint (U+00e9) or as a combination of e + ◌́ (U+0065, U+0301).
Some characters are defined multiple times with different names: Ω can be found as "greek capital letter omega" (U+03a9) and as "ohm sign" (U+2126).
Normalization converts a string to use only one of those forms and is required to consistently compare strings.
Rust
use unicode_normalization::UnicodeNormalization;
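line.nfc().collect::<String>() // composed form (NFC); nfc() returns an iterator of chars
line.nfd().collect::<String>() // decomposed form (NFD)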
Swift
line.precomposedStringWithCanonicalMapping
line.decomposedStringWithCanonicalMapping
Go
// https://pkg.go.dev/golang.org/x/text/unicode/norm
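A small std-only Rust sketch of the problem normalization solves. It does not normalize anything itself; it only shows that the two spellings of é compare as different. The constant and function names are made up for this example.

```rust
/// "é" as a single precomposed codepoint (U+00E9).
const PRECOMPOSED: &str = "\u{e9}";
/// "é" as the letter e followed by a combining acute accent (U+0065, U+0301).
const DECOMPOSED: &str = "e\u{301}";

/// Compares two strings codepoint by codepoint, without normalizing first.
/// A plain `==` on `&str` gives the same answer.
fn equal_without_normalization(a: &str, b: &str) -> bool {
    a.chars().eq(b.chars())
}
```

equal_without_normalization(PRECOMPOSED, DECOMPOSED) is false even though both render as é, and the same holds for Ω written as U+03A9 versus U+2126. Running both sides through the same form first (for example nfc() from the unicode_normalization crate shown above) is what makes such comparisons consistent.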
Remove diacritics
This can be considered a destructive form of normalization, which can be useful in some cases.
Rust
use diacritics::remove_diacritics;
remove_diacritics(line)
Swift
line.applyingTransform(.stripDiacritics, reverse: false)
// and others to transform between alphabets etc.
// https://developer.apple.com/documentation/Foundation/StringTransform