Shouldn't it be a "tit" if you want to be consistent? In fact, please let's make that happen. The HR execs are gonna go crazy when they hear the devs talking about tits all day.
- BInary digiT [BIT]
- BInary digIT [BIT] (i.e. "merge on the shared vowel")
- Binary digIT [BIT]
That's consistent with the common usage of trit if you take it as a "trinary digit":
- TRInary digiT [TRIT]
- TRInary digIT [TRIT]
- TRinary digIT [TRIT]
But if you're one to stick to ternary digit then you can end up with all sorts of unique combos:
- TErnary digiT [TET]
- TErnary digIT [TEIT] (no shared vowel to merge on)
- Ternary digIT [TIT] (the above humor)
Personally, the middle options for the first two sections (BIT and TRIT via the shared vowel) are what I always mentally jumped to. If I were going to stick with "ternary" terminology, then from a pure mapping of mental mechanics I'd have ended up at TET (i.e. when there is no shared vowel, assume the vowel comes from the first word instead of the second). The problem being that TET conflicts with base 4, leaving you with either the awkward double vowel or the awkward end-vowel pull. Or just call it a trinary digit, where you get the same consistency results as with BIT.
Thank you for coming to my TED talk (the things a hacker thinks about while burning their PTO I suppose?)
If I had been trying to come up with something, I probably would have started by picking something as similar-sounding to "bit" as possible, so it would be recognizable, and then looking for a justification I could use to argue why it was consistent, rather than figuring out the origin of "bit" and applying the same formula. Interestingly, this would have ended up at basically the same place as you: I would have first thought of "tit", rejected it as unlikely to get adopted due to the connotations, and then picked "trit" as the next closest thing.
A 'ter'nary dig'it' would be a 'terit', which elides to 'trit'. That sounds like a natural construction, probably why it came about in the first place. You can't drop the 'r' willy nilly because the proto-Indo-European root etymon is 'tr-' not 't-', same in most descendant languages, and 'ter-' is one unexplained corruption of 'tr-' somewhere in the Latin line.
Of course, for GIF alone I wouldn't mind if your method were the universally obvious one... but now I might as well talk about religion and politics too!
Personally, I'm fascinated by boobies (especially Sula nebouxii).
Also, bits have the dual nature of being Boolean logic values and integers (in fact, arithmetic is done by logic). Bits deserve the honor of their name; pretenders to the throne are hoping for stolen valor.
https://sweis.medium.com/revisiting-radix-economy-8f642d9f3c...
In this case, the context is {-1, 0, 1} weights in an LLM, which I don't think is being used for any hardware-efficiency argument. I think it's just quantizing weights into 3 states.
At the outset, it seems like this is more about storing ternary weights in binary without having to convert a multi-digit ternary number? That is, with 20 weights, if this were binary you could store all 20 as a single 20-bit number. I'm assuming you could convert the corresponding base-3 number (up to 3^20) to binary and do the same? Is that roughly what is happening here, with most of the effort in how to get out each trit?
Edit: I should add that I was largely misunderstanding parts of the article. I was reading this as akin to BCD, but with ternary. Reading it without that misunderstanding, I can see that they are largely doing what I said here.
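Concretely, the byte-level version of that idea can be sketched like this (illustrative helper names, assuming the {-1, 0, 1} weights have already been shifted to {0, 1, 2}): treat 5 trits as one base-3 number, since 3^5 = 243 fits in a byte.

```python
# Sketch: pack 5 trits into one byte by treating them as one base-3
# number (3^5 = 243 <= 256 = 2^8).

def pack_trits(trits):
    """Pack exactly 5 trits (each 0, 1, or 2) into one byte (0..242)."""
    value = 0
    for t in reversed(trits):   # last trit becomes the most significant
        value = value * 3 + t
    return value

def unpack_trits(byte):
    """Recover the 5 trits with one mod/div per digit."""
    trits = []
    for _ in range(5):
        trits.append(byte % 3)
        byte //= 3
    return trits

# Round trip
assert unpack_trits(pack_trits([2, 0, 1, 1, 2])) == [2, 0, 1, 1, 2]
```

The div/mod per trit in `unpack_trits` is the extraction cost discussed elsewhere in the thread.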
3^5 ~ 8b
5^3 ~ 7b
2 * 5 * 7^2 ~ 9b
2 * 3 * 5 ~ 5b
11 * 11 ~ 7b
3 * 5 * 7 ~ 7b
etc.
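Each of those bit budgets is just ceil(log2(product of the radices)); a quick sketch to check them (hypothetical helper name):

```python
# Sketch: the bit budget for a mixed-radix tuple is
# ceil(log2(product of radices)).
import math

def bits_needed_for(radices):
    return math.ceil(math.log2(math.prod(radices)))

assert bits_needed_for([3] * 5) == 8        # 3^5 = 243 -> 8b
assert bits_needed_for([5] * 3) == 7        # 5^3 = 125 -> 7b
assert bits_needed_for([2, 5, 7, 7]) == 9   # 490       -> 9b
assert bits_needed_for([2, 3, 5]) == 5      # 30        -> 5b
assert bits_needed_for([11, 11]) == 7       # 121       -> 7b
assert bits_needed_for([3, 5, 7]) == 7      # 105       -> 7b
```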
Famously? Where can I read about Doom's odd trigonometry?
In this case, a fixed-size block of pixels stores one or more pairs of colors using a lower resolution than your usual 8bpp. Then, these color "endpoints" are interpolated using weights that allow you to approximate the original colors. For storing the endpoints, sometimes you can get a better trade-off of resolution/bit if you can pack ternary numbers in the bitstream instead of being restricted to power-of-two boundaries.
[1]: https://developer.arm.com/documentation/102162/0430/The-ASTC...
They are more robust to certain errors.
Also, I forget which RAM format it was, but I recall that one of the RAM formats still in use is essentially ternary at the physical hardware level and uses the "sign" of the (-1, 0, 1) trit as a "check half-digit" for very fast checksums in hardware (and then the absolute value as the bit), because the physical hardware was going to be three-state anyway and cheap checksums are something RAM will always want.
Maybe we should also have a post about storing bits in trits, then.
That is, the confusion I had in my thread was that I thought "pack numbers" meant something more than just storing a radix-3 number, when the "packing" was just taking advantage of the standard positional values.
For example, consider that the number 123 can be seen as three numbers 0-9 as opposed to one number. How do you extract the three numbers? It can take a few instructions for similar reasons as the ternary case.
Stated differently, if you can see how the numbers 0x12, 18, 022 (octal), and 0b10010 (binary) are all the same number, then you can see how the storage of them is roughly all the same.
Now, for fun, you can also see how getting out the different "digits" of a number can be very different based on the base and limiting yourself to only working with binary shifts.
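As a sketch of that difference (illustrative helper, least-significant digit first): binary digits come out with a mask and a shift, while every other base needs a divide and a remainder per digit.

```python
def digits(value, base, count):
    """Extract `count` least-significant digits of `value` in `base`."""
    out = []
    for _ in range(count):
        if base == 2:
            out.append(value & 1)     # binary: a mask...
            value >>= 1               # ...and a shift
        else:
            out.append(value % base)  # other bases: divide, keep remainder
            value //= base
    return out

assert digits(123, 10, 3) == [3, 2, 1]
# The "same number, different digits" point from above:
assert digits(18, 16, 2) == [2, 1]             # 0x12
assert digits(18, 8, 2) == [2, 2]              # 022 (octal)
assert digits(0b10010, 2, 5) == [0, 1, 0, 0, 1]
```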
For the other side perspective, as amelius asked: the historic Setun used 18-trit words, and those are surprisingly efficient at storing 28-bit words (18/28 ≈ 0.643 trits per bit; the most efficient option up to 32 bits of possible word alignment is 30 bits in 19 trits, at 0.633 repeating trits per bit) if for some reason you needed to do interop. Of course, 28-bit words aren't a standard binary word size, and if you wanted a word alignment that fits both modern binary computers and the Setun, it might be 16 bits in 11 trits (0.6875 trits per bit).
I suppose if you were building ternary hardware today you might shoot for a 41-trit word to align with modern 64-bit words (41/64 ≈ 0.641 trits per bit).
For a software emulation of ternary, I have no idea.
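Those alignment claims are easy to check exactly, since an m-trit word holds an n-bit word precisely when 3^m >= 2^n (hypothetical helper name):

```python
# Sketch: an m-trit word can represent every n-bit word iff 3^m >= 2^n.
def fits(bits, trits):
    return 3**trits >= 2**bits

assert fits(28, 18)       # Setun's 18-trit word holds 28 bits...
assert not fits(29, 18)   # ...but not 29
assert fits(30, 19)       # 30 bits in 19 trits
assert fits(16, 11)       # 16 bits in 11 trits
assert fits(64, 41)       # 64 bits in 41 trits
assert not fits(64, 40)   # 40 trits fall just short of 64 bits
```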
[1]: Mihai Patrascu, Succincter, FOCS 2008 best student paper.
For inference, trits/sec decoded into L1 cache is a lot more important than random access, and is going to be hardware specific (e.g. what SIMD instructions are available). 8bit seems an odd choice given most available instructions, but the Succincter paper's use of mod and individual decoding is (unless I am misreading it) much slower.
So why would it be different in ternary?
It's a fun rabbit hole. I'm glad it's (seemingly) finding some practical use cases.
[2] http://nedopc.org/ternary/tunguska/math.tools.for.balanced.b...
[3] http://www.nedopc.org/forum/viewtopic.php?f=111&t=143
Also, low-key missing the days of special-interest forums where you'd paste cool source code at people.
The scheme I came up with assumes a representation of ternary numbers as bit pairs 00, 01, and 10 in a byte. Since there are 4 bit pairs in a byte, we can represent 4 ternary digits t0, t1, t2, and t3 no issue.
Since 11 doesn't map to a valid ternary digit, I had an idea that the presence and position of 11 bit pairs could change how the other bit pairs are interpreted - so that's where t4 comes from. If there are no 11 bit pairs, t4 is 0.
Working through the combinations, I just needed 3 cases: 0, 1, and 2 11 bit pairs in the byte. The byte values with three 11 bit pairs didn't have to be used.
The remaining pattern with all four 11 bit pairs (all ones), I decided could be used to signal end of stream.
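As a sanity check on that scheme (my reading of it, not the author's actual code): bucketing all 256 byte values by how many of their four bit pairs equal 0b11 shows that the 0-, 1-, and 2-pair cases together give exactly 3^5 = 243 codes.

```python
# Count bytes by how many of their four bit pairs are 0b11.
from collections import Counter

counts = Counter(
    sum(1 for shift in (0, 2, 4, 6) if (byte >> shift) & 0b11 == 0b11)
    for byte in range(256)
)

assert counts[0] == 81    # no 11 pairs: 3^4 plain 4-trit codes (t4 = 0)
assert counts[1] == 108   # one 11 pair: 4 positions x 3^3 remaining codes
assert counts[2] == 54    # two 11 pairs: 6 position choices x 3^2
assert counts[0] + counts[1] + counts[2] == 243   # exactly 3^5
assert counts[4] == 1     # 0xFF left over for the end-of-stream marker
```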
I was lost on the first sentence. The article should start with the definition of a ternary number. And without added context, "3 possible values, which could actually be anything" doesn't make any sense.
Edit: probably not when easy retrieval is needed.
243 is pretty good. You have to go up to 3^17 = 129140163 < 2^27 = 134217728 to beat it (columns: n, 3^n, 2^b, b = bits needed, difference, utilization 3^n/2^b, with * marking a new best):
1 3 4 2 1 75.0%*
2 9 16 4 7 56.2%
3 27 32 5 5 84.4%*
4 81 128 7 47 63.3%
5 243 256 8 13 94.9%*
6 729 1024 10 295 71.2%
7 2187 4096 12 1909 53.4%
8 6561 8192 13 1631 80.1%
9 19683 32768 15 13085 60.1%
10 59049 65536 16 6487 90.1%
11 177147 262144 18 84997 67.6%
12 531441 1048576 20 517135 50.7%
13 1594323 2097152 21 502829 76.0%
14 4782969 8388608 23 3605639 57.0%
15 14348907 16777216 24 2428309 85.5%
16 43046721 67108864 26 24062143 64.1%
17 129140163 134217728 27 5077565 96.2%*
18 387420489 536870912 29 149450423 72.2%
19 1162261467 2147483648 31 985222181 54.1%
20 3486784401 4294967296 32 808182895 81.2%
21 10460353203 17179869184 34 6719515981 60.9%
22 31381059609 34359738368 35 2978678759 91.3%
23 94143178827 137438953472 37 43295774645 68.5%
24 282429536481 549755813888 39 267326277407 51.4%
25 847288609443 1099511627776 40 252223018333 77.1%
26 2541865828329 4398046511104 42 1856180682775 57.8%
27 7625597484987 8796093022208 43 1170495537221 86.7%
28 22876792454961 35184372088832 45 12307579633871 65.0%
29 68630377364883 70368744177664 46 1738366812781 97.5%*
30 205891132094649 281474976710656 48 75583844616007 73.1%
31 617673396283947 1125899906842624 50 508226510558677 54.9%
32 1853020188851841 2251799813685248 51 398779624833407 82.3%
33 5559060566555523 9007199254740992 53 3448138688185469 61.7%
34 16677181699666569 18014398509481984 54 1337216809815415 92.6%
35 50031545098999707 72057594037927936 56 22026048938928229 69.4%
36 150094635296999121 288230376151711744 58 138135740854712623 52.1%
37 450283905890997363 576460752303423488 59 126176846412426125 78.1%
38 1350851717672992089 2305843009213693952 61 954991291540701863 58.6%
39 4052555153018976267 4611686018427387904 62 559130865408411637 87.9%
Generated with this:

    import math

    max_use = 0
    for i in range(1, 40):
        trinary = 3**i
        power = math.ceil(math.log2(trinary))
        binary = 2**power
        diff = binary - trinary
        use = trinary / binary
        max_star = ''
        if use > max_use:
            max_use = use
            max_star = '*'
        print(f' {i:2} {trinary:19} {binary:19} {power:2} {diff:18} {use:.1%}{max_star}')
You're looking at the alphabet utilization rate. Whereas the standard way to measure efficiency (which is the way that the author did it) is to look at entropy.
Here's how I would phrase it. Obviously, if you have a message that has 256 possible states and each one has the same probability of occurring, then the entropy is 8 bits per message.
If you have a message having 243 uniformly distributed states, then its entropy is defined as log base-2 of 243 = log_2 (243) = ln(243) / ln(2) ≈ 7.925 bits. But if you used 8 bits to express this message, then you've wasted 0.075 bits (or 75 millibits, mb) per message, so your efficiency is 7.925/8 ≈ 99.06%.
How would you achieve 7.925 bits per message in practice? By packing more messages together and sharing the wasted space. Using some continued-fraction magic, let me propose storing 15867 of these base-243 messages together. There are 243^15867 possibilities, which is approximately equal to 2^125742.999994713, or just a hair under 125743 bits. So, a block of 125743 bits can store 15867 base-243 messages losslessly, which means each base-243 message uses 125743/15867 ≈ 7.925 bits on average.
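That claim can be verified with exact integer arithmetic (a sketch; the constants come from the paragraph above):

```python
# Verify that 15867 base-243 messages fit losslessly in 125743 bits.
import math

messages, bits = 15867, 125743
assert 243**messages < 2**bits        # the block really fits (exact bigints)
assert bits / messages < 7.9249       # ~7.925 bits per message on average
# and the bound is tight to within a tiny fraction of a bit:
assert bits - messages * math.log2(243) < 1e-4
```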
For a binary number with N bits, you can represent 2^N values. Easy example: There are 2^8 = 256 possible values that can be represented by an 8-bit value. You can go the other way, too, and ask “how many bits do I need to represent a given value?” by using some math.
2^N = 256
Log2(2^N) = Log2(256)
N = Log2(256) = Log(256)/Log(2) = 8
And you can use this to figure out how many bits you need to represent an arbitrary number: N = Log2(1000) = Log(1000)/Log(2) = 9.966. That makes intuitive sense because a 10-bit number has 1024 values.
To get to the theoretical limit you’re asking about we do the same thing. For an arbitrary number x, how many trits do we need to represent it in trinary and how many bits N do we need to represent it in binary? What is the ratio of trits to bits?
x = 3^M and x = 2^N
Log3(x) = M
Log(x) / Log(3) = M
Log2(x) = N
Log(x) / Log(2) = N
And then we can just take that ratio N/M to figure out how many bits N we need to represent M trits:
R = N/M = Log(x)/Log(2) × Log(3)/Log(x) = Log(3)/Log(2)
3^N <= 2^M
log(3^N) <= log(2^M)
N*log(3) <= M*log(2)
log(3)/log(2) <= M/N
where M/N is the bits per ternary digit.

So, for powers of two, it's obviously log_2(N).
Now we just extrapolate. For three things, on average, you need to ask log_2(3) questions. Sometimes you ask just one, sometimes you must ask two (depending on where you split the group). Either way, the formula is the same.
log(3)/log(2) is just a way of writing log_2(3) (since the value of log3/log2 is the same regardless of which base the log is in).
Note that two isn't special. It's only because we've limited ourselves to 'how many yes/no questions'. If you wanted to know how many multiple-choice questions with 3 responses you would need to exactly determine which object of a set of 4, the answer is log_3(4).
[1] You can think of any byte, word, double word, quad word as simply a path in a binary tree of all numbers in the range 0..2^N-1 (where N is bit width). The path uniquely identifies the number.
lim_(n→∞) bits_needed(3ⁿ) / n
= lim_(n→∞) ⌈log2(3ⁿ)⌉ / n
= lim_(n→∞) ⌈n log2(3)⌉ / n
= log2(3)
= log(3) / log(2)
where bits_needed(m) is the minimum number of bits needed to encode m distinct values.

For fun, consider how much space a base-10 number takes in binary. It takes about 4 bits with slack. Specifically, about 3.32 bits, which is log(10)/log(2).
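A quick numeric check of this limit (illustrative helper matching the definition above):

```python
# The ratio bits_needed(3^n)/n approaches log2(3) ~ 1.585 from above
# as n grows; the same argument prices a decimal digit at log2(10) bits.
import math

def bits_needed(m):
    return math.ceil(math.log2(m))  # min bits for m distinct values

ratios = [bits_needed(3**n) / n for n in (1, 10, 100, 1000)]
assert ratios == sorted(ratios, reverse=True)       # shrinking toward the limit
assert abs(ratios[-1] - math.log2(3)) < 0.001       # ~1.585 bits per trit
assert 3.32 < math.log2(10) < 3.33                  # ~3.32 bits per decimal digit
```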