(https://dl.acm.org/doi/epdf/10.1145/192844.192905 although they don't call it cosine similarity; they do compute a "correlation coefficient" between two people by adding together the products of scores each gave to a post)
That being said, weirdly, the normalization by standard deviation happens outside the call to `cov` in the paper (page 181, column 1, equations (unnumbered) 1 and 2). And in equation 2 they've expanded `cov` to be the sum of pointwise multiplication of the (scores - average score) people have given to posts.
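To make that concrete, here is a minimal sketch of the computation as I read it: subtract each person's average score, sum the pointwise products, then divide by the two spreads. The function name and the example scores are mine, not the paper's, and returning 0.0 when someone gave the same score to everything is my own assumption.

```python
from math import sqrt

def correlation(scores_a, scores_b):
    """Correlation coefficient between two people's scores on the same posts."""
    n = len(scores_a)
    mean_a = sum(scores_a) / n
    mean_b = sum(scores_b) / n
    # Mean-centre each person's scores (score - average score).
    dev_a = [s - mean_a for s in scores_a]
    dev_b = [s - mean_b for s in scores_b]
    # "Expanded cov": sum of pointwise products of the deviations.
    cov_sum = sum(a * b for a, b in zip(dev_a, dev_b))
    # Normalisation applied outside the covariance sum.
    spread_a = sqrt(sum(a * a for a in dev_a))
    spread_b = sqrt(sum(b * b for b in dev_b))
    if spread_a == 0 or spread_b == 0:
        return 0.0  # assumption: treat a constant scorer as uncorrelated
    return cov_sum / (spread_a * spread_b)

# Made-up scores for illustration only.
agree = correlation([1, 2, 3], [2, 4, 6])
disagree = correlation([1, 2, 3], [3, 2, 1])
```

The 1/n factors in the covariance and the standard deviations cancel, which is why they don't show up in the sketch.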
Again, not my area of expertise, just looking at the math here.
I've heard the term "cosine similarity" before but not really looked into it. What does this computation have to do with trigonometry?
(Strictly speaking we have that the angle is actually defined in terms of the dot/inner product in more abstract spaces like function spaces or L^p/l^p)
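For the trigonometry question, a small sketch may help: the dot product divided by the two lengths is the cosine of the angle between the vectors, so taking the arccosine gives the angle back. The function names are mine, and this is plain 2D/3D-style vectors only, not the more abstract spaces mentioned above.

```python
import math

def cosine_similarity(a, b):
    """dot(a, b) / (|a| * |b|), i.e. the cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def angle_degrees(a, b):
    """Recover the angle itself from the cosine."""
    # Clamp to [-1, 1] so floating-point rounding cannot break acos.
    c = max(-1.0, min(1.0, cosine_similarity(a, b)))
    return math.degrees(math.acos(c))

right_angle = angle_degrees([1, 0], [0, 1])   # perpendicular vectors
diagonal = angle_degrees([1, 0], [1, 1])      # halfway between the axes
same_direction = cosine_similarity([1, 2], [2, 4])
```

Vectors pointing the same way score 1 regardless of length, perpendicular ones score 0, and opposite ones score -1.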
I just wish I could scroll further down the "Similar to you" list.
So you’re left with things you “should” star, but there very well could be a reason you didn’t.
- The Idea: People use GitHub Stars as bookmarks. This is an excellent signal for understanding which repositories are semantically similar.
- The Data: Processed ~1TB of raw data from GitHub Archive (BigQuery) to build an interest matrix of 4 million developers.
- The ML: Trained embeddings for 300k+ repositories using Metric Learning (EmbeddingBag + MultiSimilarityLoss).
- The Frontend: Built a client-only demo that runs vector search (KNN) directly in the browser via WASM, with no backend involved.
- The Result: The system finds non-obvious library alternatives and allows for semantic comparison of developer profiles.
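For readers who haven't seen vector search before, here is a rough sketch of the KNN step only, in plain Python rather than WASM. It is a brute-force search by cosine similarity, which is an assumption on my part about the metric; the repository names and three-dimensional vectors are invented, and nothing here is the project's actual code.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def nearest(query, embeddings, k=3):
    """Return the k repository names most similar to `query` (brute force)."""
    q = normalize(embeddings[query])
    scored = []
    for name, vec in embeddings.items():
        if name == query:
            continue  # skip the query repository itself
        score = sum(a * b for a, b in zip(q, normalize(vec)))
        scored.append((score, name))
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]

# Hypothetical toy embeddings; real ones would have many more dimensions.
toy = {
    "web-framework-a": [0.9, 0.1, 0.0],
    "web-framework-b": [0.8, 0.2, 0.1],
    "plotting-lib": [0.0, 0.1, 0.9],
    "http-client": [0.6, 0.5, 0.0],
}
neighbours = nearest("web-framework-a", toy, k=2)
```

With a few hundred thousand vectors a linear scan like this is still feasible, which is presumably part of why a client-only demo can work at all.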
"Error: Repository not found: Lerc/stackie"
People complain about The Algorithm but it can be useful...
If GitHub started using this submission (GitStars) to recommend repos in people's GitHub feeds, I don't think people would get their pitchforks out about "The Algorithm" in that case. But if GitHub started tuning the feed so you spend as much time there as possible, by whatever means and with potentially irrelevant stuff, then my guess is that many would start counting the GitHub feed as one of "The Algorithms".
Makes me wonder if there is something in his stars that is skewing the results.
I don't know how to feel about this lmao
I have got around 1.8k projects starred. Mostly it's because I once lost my bookmarks, and a lot of GitHub projects along with them, so I decided to use stars as my bookmarks, or just to mark whatever I was feeling at the time. I have also starred some 100 projects or so just because I thought they were interesting enough and nobody else had starred them, to show my support.
Supporting is another aspect: I really like to share my support, and I feel like even these tiny actions at scale really help these projects, whether by gaining legitimacy or otherwise.
I have been such a star fanatic that I have even opened a GitHub issue asking who has starred the most projects, just to have a clear reference.
I have even downloaded the README.md files of all the GitHub projects I have starred and made a simple vibe-coded HTML project so that I can browse them manually and search them, similar to Algolia you could say.
Oh, btw, there are some gists which can help you list all of a person's stars on GitHub. I used one to get my star list (the list of repos), then downloaded all their README.md files and converted them as described. It's on my other computer, but I should probably back it up as well.
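For reference, a minimal sketch of what such a script might look like (not the gist I used). It pages through GitHub's REST endpoint for a user's starred repositories. The function names, the `max_pages` cap, and the User-Agent string are made up for illustration, and unauthenticated requests are rate-limited, so heavy star lists would need a token.

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "https://api.github.com"

def starred_page_url(username, page, per_page=100):
    """Build the URL for one page of a user's starred repositories."""
    query = urllib.parse.urlencode({"per_page": per_page, "page": page})
    return f"{API_ROOT}/users/{urllib.parse.quote(username)}/starred?{query}"

def list_starred(username, max_pages=50):
    """Collect 'owner/repo' names, stopping at the first empty page."""
    names = []
    for page in range(1, max_pages + 1):  # max_pages is an arbitrary safety cap
        request = urllib.request.Request(
            starred_page_url(username, page),
            headers={
                "Accept": "application/vnd.github+json",
                "User-Agent": "star-lister-sketch",  # hypothetical client name
            },
        )
        with urllib.request.urlopen(request) as response:
            repos = json.load(response)
        if not repos:
            break
        names.extend(repo["full_name"] for repo in repos)
    return names

# Usage (performs network requests):
#   for name in list_starred("some-username"):
#       print(name)
```

From the list of names, fetching each README is a separate request per repository, which is where the rate limit starts to matter.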
I wish there was something like GitHub stars for the whole web. Yes, bookmarks exist, but I mean a more public form of bookmarks, similar to GitHub stars, without monetization up front (yes, I know they are doing AI shenanigans in the background).
GitHub is still an okay platform, so much so that nowadays I am thinking of uploading media to GitHub wikis for projects instead of YouTube, especially for open source projects. Plus, GitHub wikis can be downloaded via git, whereas YouTube does everything in its control to stop you from downloading, so much so that after some recent changes even yt-dlp now requires Deno or an npm engine, and the solution is always a hacky cat-and-mouse game of sorts.
I don't think that there are any services which can provide the amount of free bandwidth github provides in the way it does. Sure one can get
To be honest, if someone wants, they can probably use OVH or UpCloud, which have unlimited egress under a fair use policy.
That fair use policy, though, basically means your server first gets high-bandwidth access, I think around 1 Gbps or 500 Mbps, and is then capped to something like 100 Mbps; OVH can throttle.
UpCloud has an extremely high fair use threshold, around 24 TB I think, after which they throttle a 1 Gbps connection to 100 Mbps, which on many VPSes could be the highest connection speed anyway, and 100 Mbps ain't bad.
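Some back-of-the-envelope arithmetic on those numbers, as a sketch. I am assuming the 24 TB is a monthly allowance, a 30-day month, and decimal terabytes; none of that is confirmed above, and the function names are just for illustration.

```python
def sustained_mbps(terabytes, days=30):
    """Average rate (Mbps) needed to move `terabytes` in `days` days."""
    bits = terabytes * 1e12 * 8        # decimal TB -> bits
    seconds = days * 24 * 3600
    return bits / seconds / 1e6

def hours_to_transfer(terabytes, link_mbps):
    """Hours a fully saturated link of `link_mbps` needs to move `terabytes`."""
    bits = terabytes * 1e12 * 8
    return bits / (link_mbps * 1e6) / 3600

average_rate = sustained_mbps(24)               # about 74 Mbps around the clock
hours_at_gigabit = hours_to_transfer(24, 1000)  # about 53 hours flat out
```

So under those assumptions, 24 TB a month is roughly what a constant 74 Mbps stream adds up to, and a saturated 1 Gbps link would burn through it in a bit over two days.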
But also pardon me for this but I asked chatgpt and it seems that civo provides completely unrestricted
Extra Small 1 GB 1 core 30GB NVMe FREE $5.43
UpCloud is around 3.50 euros for the same thing, but if your project is pushing even more than 24 TB and you want other options, there are always options.
So in a sense, there just isn't much point in self-hosting either, and I feel like GitHub can be the freemium stepping stone when transitioning away from YouTube to something else.
Just me rambling, but I feel like in its early days YouTube used one of these deals to get its bandwidth as well. I feel there are companies that can do that too; YouTube is moving backwards, and things like fediverse PeerTube with genuinely unlimited bandwidth are very much possible for very cheap.
YouTube's monopoly only goes as far as we let it; it's a monopoly of the channels and the viewers. Architecturally it's not much of an issue, as I mentioned previously.
EDIT: Looks like I got sidetracked, but overall I am really impressed by your project. It's really good, kudos!