This is actually super easy. The data is available in BigQuery.[0] It's up to date, too. I tried the following query, and the latest comment was from yesterday.
SELECT
  id,
  text,
  `by` AS username,
  FORMAT_TIMESTAMP('%Y-%m-%dT%H:%M:%SZ', TIMESTAMP_SECONDS(time)) AS timestamp
FROM
  `bigquery-public-data.hacker_news.full`
WHERE
  type = 'comment'
  AND EXTRACT(YEAR FROM TIMESTAMP_SECONDS(time)) = 2025
ORDER BY
  time DESC
LIMIT
  100
https://console.cloud.google.com/bigquery?ws=!1m5!1m4!4m3!1s...

For example:
SELECT * FROM hackernews_history ORDER BY time DESC LIMIT 10;
https://gh-api.clickhouse.tech/play?user=play#U0VMRUNUICogRl...

I subscribe to this issue to keep up with updates:
https://github.com/ClickHouse/ClickHouse/issues/29693#issuec...
And ofc, for those that don't know, the official API https://github.com/HackerNews/API
1. Create a table with styles by authors:
CREATE TABLE hn_styles (name String, vec Array(UInt32)) ENGINE = MergeTree ORDER BY name
2. Calculate and insert style vectors (the insert takes 27 seconds): INSERT INTO hn_styles WITH 128 AS vec_size,
cityHash64(arrayJoin(tokens(lower(decodeHTMLComponent(extractTextFromHTML(text)))))) % vec_size AS n,
arrayMap((x, i) -> i = n, range(vec_size), range(vec_size)) AS arr
SELECT by, sumForEach(arr) FROM hackernews_history GROUP BY by
3. Find nearest authors (the query takes ~50 ms): SELECT name, cosineDistance(vec, (SELECT vec FROM hn_styles WHERE name = 'antirez')) AS dist FROM hn_styles ORDER BY dist LIMIT 25
┌─name────────────┬─────────────────dist─┐
1. │ antirez │ 0 │
2. │ geertj │ 0.009644324175144714 │
3. │ mrighele │ 0.009742538810774581 │
4. │ LukaAl │ 0.009787061201638525 │
5. │ adrianratnapala │ 0.010093164015005152 │
6. │ prmph │ 0.010097599441156513 │
7. │ teilo │ 0.010187607877663263 │
8. │ lukesandberg │ 0.01035981357655602 │
9. │ joshuak │ 0.010421492503861374 │
10. │ sharikous │ 0.01043547391491162 │
11. │ lll-o-lll │ 0.01051205287096002 │
12. │ enriquto │ 0.010534816136353875 │
13. │ rileymat2 │ 0.010591026237771195 │
14. │ afiori │ 0.010655186410089112 │
15. │ 314 │ 0.010768594792569197 │
16. │ superice │ 0.010842615688153812 │
17. │ cm2187 │ 0.01105111720031593 │
18. │ jorgeleo │ 0.011159407590845771 │
19. │ einhverfr │ 0.011296755160620009 │
20. │ goodcanadian │ 0.011316316959489647 │
21. │ harperlee │ 0.011317367800365297 │
22. │ seren │ 0.011390119122640763 │
23. │ abnry │ 0.011394133096140235 │
24. │ PraetorianGourd │ 0.011508457949426343 │
25. │ ufo │ 0.011538721312575051 │
└─────────────────┴──────────────────────┘
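For anyone who wants to play with the same hashing-trick fingerprint outside ClickHouse, here is a minimal Python sketch of the idea. The tokenizer, the hash, and the comments_by_user input are illustrative stand-ins, not the exact tokens()/cityHash64() pipeline used above.

import hashlib
import math
import re

VEC_SIZE = 128  # same bucket count as the ClickHouse example

def bucket(token):
    # Stand-in for cityHash64: any stable hash mapped into VEC_SIZE buckets works for a sketch.
    return int.from_bytes(hashlib.sha1(token.encode()).digest()[:8], "little") % VEC_SIZE

def fingerprint(comments):
    # Sum of one-hot token buckets, mirroring sumForEach(arr) above.
    vec = [0] * VEC_SIZE
    for text in comments:
        for token in re.findall(r"[a-z']+", text.lower()):
            vec[bucket(token)] += 1
    return vec

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / norm if norm else 1.0

def nearest(comments_by_user, who, limit=25):
    # comments_by_user: dict of username -> list of comment texts (assumed to be loaded elsewhere)
    vecs = {user: fingerprint(c) for user, c in comments_by_user.items()}
    ref = vecs[who]
    return sorted(vecs, key=lambda user: cosine_distance(vecs[user], ref))[:limit]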
But I have also seen some accounts that seem to be from other non-native English speakers. They may even have a Latin-derived language as their native one (I just read some of their comments and, at minimum, some of them seem to also be from the EU). So I guess it is also grouping people by a native language other than English.
So maybe it is grouping many accounts by the shared bias of a different native language. We probably make the same types of mistakes while using English.
My guess is that the accounts of native Indian or Chinese speakers will also be grouped together, for the same reason. Even more so, as those languages are more different from English and the bias is probably stronger.
It would be cool if Australians, Britons, and Canadians tried the tool. My guess is that their probability of finding alt accounts is higher, as their populations are smaller and their writing is more distinctive than Americans'.
Thanks for sharing the projects. They are really interesting.
Also, do not trust the comments too much. There is an incentive to lie so as not to acknowledge alt accounts if they were created to remain hidden.
That is most likely the case. Case in point: My native language doesn't have articles, so locally they're a common source of mistakes in English.
The project referenced in the post put me next to Brits on the similarity list, and indeed I am using an English (UK) dictionary. Meanwhile this iteration aligns me with Americans, despite the only change being the keyboard vendor (formerly Samsung, now Google).
I guess the Samsung keyboard corrects to proper Bri'ish.
I picked up the language as a child from a collection of people, half of whom weren't native speakers, so I don't speak any specific dialect.
But, if I do the reverse (search using my original account), this one shows up as #2.
The main difference between the accounts is this one has a lot more posts, and my original account was actively posting ~11 years ago.
A <-> B: 80%
A <-> C: 90%
B <-> C: 70%
When you search for A the best match will be C, but if you start with B it will be A. If one of the accounts has a smaller sample set as in GP's case, the gap could be quite big.

My comments underindex on "this" - because I have drilled into my communication style never to use pronouns without clear one-word antecedents, meaning I use "this" less frequently that I would otherwise.
They also underindex on "should" - a word I have drilled OUT of my communication style, since it is judgy and triggers a defensive reaction in others when used. (If required, I prefer "ought to")
My comments also underindex on personal pronouns (I, my). Again, my thought on good, interesting writing is that these are to be avoided.
In case anyone cares.
I suppose it's possible the "analyze"-reported proportions are a lot more precise and reliably diagnostic than I imagine. I haven't yet looked in detail at the statistical method.
Also, of course, it would require integration with NLP tooling such as WordNet (or whatever's SOTA there, something like a decade and a half on) and a bit of Porter stemming to do the part-of-speech tagging. If one 0.7GB dataset is heavyweight where this is running, that could be a nonstarter; stemming is trivial, and I recall WordNet being acceptably fast, if maybe memory-hungry, on a decade ago's kinda crappy laptop, but I could see it requiring some expensive materialization just to get datasets to inspect. (How exactly do we define "more common" for, e.g., "smooth"? Versus semantic words, all words, both, or some combination? Do we need another dataset filtered to semantic words? Etc.)
If we're dreaming and I can also have a pony, then it would be neat to see the current flavor, one focused on semantics as above, and one focused specifically on syntax, which this one coincidentally often seems to act like. I would be tempted to offer an implementation, but I'm allergic to Python this decade.
> I would prefer the "analyze" feature focus on content rather than structure words. I forget the specific linguistic terms but to a first approximation, nouns and verbs would be of interest, prepositions and articles not. Let's call the former "semantic" and the latter "syntactic."
"Ought to" is essentially a synonym. Anyone that gets upset when you said they should do something but is fine when you say that they ought to do something is truly a moron.
> If someone [test] then he is an idiot.
> Anyone [test] when [test] is truly a moron.
These structures are worse habits in communication than subtle, colloquially interchangeable word choices.

Most people are emotion-first: how the words make them feel matters more than their definitions. Being emotion-first doesn't make them stupid.
Otherwise, if someone wants to take the time to dissect meaning from add-on meaningless words like should in a sentence, they should find something better to do with their time. Or just ask instead of being a moron.
> I use "this" less frequently that I would otherwise
Isn't it "less than" as opposed to "less that"?
I prefer to avoid such absolutes and portray causality instead.
For example, in place of “you should not do drugs at work” I prefer “if you take drugs at work you’ll get in trouble”.
I think this mentality is also where the term 'sheeple' comes from.
I too like when others use it, since a very easy and pretty universal retort against "you ought to..." is "No, I don't owe you anything".
And the construct "you owe it to <person> to <verb>" still exists even today but is not nearly as popular as "you should <verb>", precisely because it has to state to whom exactly you owe the duty; with "should" it sounds like an impersonal, quasi-objective statement of fact, which suits the manipulative uses much better.
“You should” has a much more generic and less persuasive sentiment. “Why should I?” is a common and easy response which now leaves the suggester having to defend their suggestion to a skeptical audience.
The only place today I see "shall" used correctly, where most would say "should" or "will," is in legal documents and signage.
> used to indicate duty or correctness
A duty to others is something you owe them; think, a duty of care and its lack, which is negligence.
You mean, ”I think this should be avoided”? ;)
You would need more computation to hash, but I bet adding frequency of the top 50 word-pairs and top 20 most common 3-tuples would be a strong signal.
( Noting that the accuracy is already good, of course. I am indeed user eterm. I think I've said on this account or that one before that I don't sync passwords, so they are simply different machines that I use. I try not to cross-contribute or double-vote. )
I then realised that "[period] <word>" would likely dominate the most common pairs, and that a lot of time could be saved by simply recording the first word of each sentence as its own vector set, in addition to, but separate from, the regular word vector.
Whether this would be a stronger or weaker signal per vector space than the tail of words in the regular common-words vector, I don't know.
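A quick Python sketch of that split, assuming plain word tokens; the regexes and counters below are illustrative, not tied to any particular implementation.

import re
from collections import Counter

def features(text):
    bigrams = Counter()           # within-sentence word pairs
    sentence_openers = Counter()  # first word of each sentence, kept as its own vector set
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = re.findall(r"[a-z']+", sentence)
        if not words:
            continue
        sentence_openers[words[0]] += 1
        bigrams.update(zip(words, words[1:]))
    return bigrams, sentence_openers

bigrams, openers = features("I tried this. I think it works. But maybe not.")
print(openers)                  # Counter({'i': 2, 'but': 1})
print(bigrams.most_common(3))   # most frequent within-sentence word pairs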
When I ran it, it gave me 20 random users, but when I do the analyze, it says my most common words are [they because then that but their the was them had], which is basically just the most common English words.
Probably would be good to exclude those most common words.
I figured it maybe would cluster me with other non-native speakers but it doesn't appear to. Of all the accounts where I could identify a country of origin, all were American.
you, are, have, they, at, an, we, if, do, to
I'm frankly not quite sure how I've avoided them given how common they are.
https://antirez.com/hnstyle?username=pg&threshold=20&action=...
Instead of just HN, now do it with the whole internet, imagine what you'd find. Then imagine that it's not being done already.
Using throwaways whenever possible mitigates a lot of the risk, too.
But if I were a government agency I would be pressing AI providers for data, or fingerprinting the output with punctuation/whitespace or something more subtle.
Though I guess with open models that people can run on-device, that's mitigated a lot.
Taking a look at comments from those users, I think the issue is that the algorithm focuses too much on the topic of discussion rather than style. If you are often in conversations about LLMs or Musk or self driving cars then you will inevitably end up using a lot of similar words as others in the same discussions. There's only so many unique words you can use when talking about a technical topic.
I see in your post that you try to mitigate this by reducing the number of words compared, but I don't think that is enough to do the job.
It focuses on topic a lot, that's true.
I’m not going to try comparing it with apostrophes normalised, but I’d be interested in how much of a difference it would make. It could easily be just that the sorts of people who choose to write in curly quotes are more likely to choose words carefully and thus end up more similar.
- remove super-high-frequency, non-specific words from the comparison bags, because they don’t distinguish much, have less semantic value, and may skew the data
- remove stop words (the NLP definition of stop words)
- perform stemming/tokenization/depluralization, etc. (again, NLP standard); a rough sketch of these first steps follows below
- implement commutativity and transitivity in the similarity function
- consider words as hyperlinks to the sets of people who use them often enough, and do something Pageranky to refine similarity
- consider word bigrams, etc
- weight variations and misspellings higher as distinguishing signals
What are your ideas?
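Not the author, but here is a toy sketch of the first few items (high-frequency/stop-word removal plus crude stemming); the stop list and suffix rules below are placeholders for a real NLP library.

import re

STOP_WORDS = {"the", "a", "an", "and", "or", "but", "if", "of", "to",
              "in", "on", "that", "this", "it", "is", "was", "you", "i"}

def crude_stem(word):
    # Placeholder for a proper stemmer (e.g. Porter): strip a few common suffixes.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = re.findall(r"[a-z']+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The cats were sleeping on the servers"))
# ['cat', 'were', 'sleep', 'server']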
https://antirez.com/hnstyle?username=dang&threshold=20&actio...
> Please don't post unsubstantive comments to HN. [link to guidelines]
My guess is it was a parody/impersonator account.
You can enable "showdead" in your profile to see [dead] comments and posts. Most of them are crap, but there are some false positives and errors from time to time.
HN silently black holes any comment made through a VPN, so I would expect a decent amount of false positives.
Anyway, I guess this would be useful to cluster the "Matt Walsh"-y commenters together.
Secondly, if you want to make an alt account harder to cross-correlate with your main, would rewriting your comments with an LLM work against this method? And if so, how well?
Maybe some "like attracts like" phenomena
don't site comment we here post that users against you're
Quite a stance, man :)
And me clearly inarticulate and less confident than some:
it may but that because or not and even these
I noticed that randomly remembered usernames tend to produce either lots of utility words like the above, or very few of them. Interestingly, it doesn't really correlate with my overall impression about them.
Is there anything that can be inferred from that? Is my writing less unique, so ends up being more similar to more people?
Also, someone like tptacek has a top 20 with matches all >0.87. Would this be a side-effect of his prolific posting, so matches better with a lot more people?
Thanks for the interesting tool!
- aaronsw and jedberg share danielweber
- aaronsw and jedberg share wccrawford
- aaronsw and pg share Natsu
- aaronsw and pg share mcphage
Well, and worked a lot with Americans over text-based communication...
This is impressive and scary. Obviously I had to create a throwaway to say this.
https://scikit-learn.org/stable/modules/generated/sklearn.ma...
I think other methods are more fashionable today
https://scikit-learn.org/stable/modules/manifold.html
particularly multidimensional scaling, but personally I think t-SNE plots are less pathological (they don't have as many of these crazy cusps that make me think it's projecting down from a higher-dimensional surface which is near-parallel to the page)
After processing documents with BERT I really like the clusters generated by the simple and old k-Means algorithm
https://scikit-learn.org/stable/modules/generated/sklearn.cl...
It has the problem that it always finds 20 clusters if you set k=20, and a cluster which really oughta be one big cluster might get treated as three little clusters, but the clusters I get from it reflect the way I see things.
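A small sketch of that k-means + t-SNE combination with scikit-learn; the embeddings array is a random placeholder standing in for real BERT document vectors.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

embeddings = np.random.rand(500, 384)  # placeholder for an (n_docs, dim) array of document vectors

labels = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(embeddings)
coords = TSNE(n_components=2, random_state=0).fit_transform(embeddings)
# Plot coords[:, 0] vs coords[:, 1] colored by labels to eyeball whether k=20
# is over-splitting what is really one big cluster.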
You have three points nearby, and a fourth a bit more distant. Point 4's best match is point 1, but point 1's best matches are points 2 and 3.
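A tiny 2D toy showing the same thing; the coordinates are made up just to reproduce the asymmetry.

import math

points = {1: (0.0, 0.0), 2: (-0.1, 0.0), 3: (0.0, 0.1), 4: (1.0, 0.0)}

def by_distance(p):
    # Other points sorted from nearest to farthest.
    return sorted((q for q in points if q != p),
                  key=lambda q: math.dist(points[p], points[q]))

print(by_distance(4))  # [1, 3, 2] -> point 4's best match is point 1
print(by_distance(1))  # [2, 3, 4] -> but point 1's best matches are points 2 and 3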
redis-cli -3 VSIM hn_fingerprint ELE pg WITHSCORES | grep montrose
montrose 0.8640020787715912
redis-cli -3 VSIM hn_fingerprint ELE montrose WITHSCORES | grep pg
pg 0.8639097809791565
So while cosine similarity is commutative, the quantization steps lead to slightly different results. But the difference is 0.000092, which in practical terms is not important. Redis can use non-quantized vectors via the NOQUANT option in VADD, but this makes the vector elements take 4 bytes per component: given that the recall difference is minimal, it is almost always not worth it.
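A generic numpy illustration of why the two directions can disagree in the last decimal places once per-vector quantization is involved; this is a sketch of the effect, not the exact scheme Redis uses for VADD/VSIM.

import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=300), rng.normal(size=300)

def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def quantize_int8(x):
    # Scale each vector by its own maximum absolute value, then round to int8.
    return np.round(x / np.abs(x).max() * 127).astype(np.int8)

# Full-precision query against the other side's quantized stored vector:
print(cosine(a, quantize_int8(b).astype(np.float64)))
print(cosine(b, quantize_int8(a).astype(np.float64)))
# The two scores agree to several decimals but are not bit-identical, because
# each vector accumulates its own rounding error.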
don't +0.9339
It's also a tool for wannabe impersonators to hone their writing-style mimicry skills!
(The Satawalese language has 460 speakers, most of whom live on Satawal Island in the Federated States of Micronesia.)
What a profiler would do to identify someone, I imagine, requires much more. Like the ability to recognize someone's tendency of playing the victim to leverage social advantage in awkward situations.
But limiting to the top couple hundred words probably does limit me to sounding like a pretentious dickhole, as I often use "however", "but", and "isn't". Corrections are a little too frequent in my post history.
I'd expect things might be a tiny bit less precise if something small like stop words were removed. Though it'd be interesting to do the opposite: if you were only measuring stop words, would that show a unique cadence?
tablespoon is close, but has a missing top 50 mutual (mikeash). In some ways, this is an artefact of the "20, 30, 50, 100" scale. Is there a way to describe the degree to which a user has this "I'm a relatively closer neighbour to them than they are to me" property? Can we make the metric space smaller (e.g. reduce the number of Euclidean dimensions) while preserving this property for the points that have it?
Also note that vector similarity is not reciprocal: one item can have a top-scoring match, yet that match may have many other items nearer to it, like in 2D space when you have a cluster of points and another point nearby but a bit farther away.
Unfortunately I don't think this technique works very well for actual duplicate-account discovery, because oftentimes people post just a few comments from fake accounts. So there is not enough data, except in the case where someone consistently uses another account to cover their identity.
EDIT: at the end of the post I added the visual representations of pg and montrose.
I worked on a search engine for patents that used the first; our evaluations showed it was much better than other patent search engines, and we had no trouble selling it because customers could feel the difference in demos.
I tried dimensionality reduction on the BERT vectors, and in all the cases I tried it made relevance worse. (BERT has already learned a lot which is being thrown away; there isn't more to learn from my particular documents.)
I don't think either of these helps with "finding articles authored by the same person", because one assumes the same person always uses the same words, whereas documents about the topic use synonyms that will be turned up by (1) and (2). There is a big literature on determining authorship based on style
https://en.wikipedia.org/wiki/Stylometry
[1] With https://sbert.net/ this is so easy.
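For reference, footnote [1] boils down to something like this with sentence-transformers; the model name is just one common choice, not necessarily what was used here.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model; any SBERT model works
docs = ["first document text", "second document text"]
embeddings = model.encode(docs, normalize_embeddings=True)  # unit-length vectors
similarity = embeddings @ embeddings.T                      # cosine similarity matrix
print(similarity[0, 1])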
Edit: ChatGTP, my bad
not very useful for newer users like me :/
https://antirez.com/hnstyle?username=gfd&threshold=20&action...
zawerf (Similarity: 0.7379)
ghj (Similarity: 0.7207)
fyp (Similarity: 0.7197)
uyt (Similarity: 0.7052)
I typically abandon an account once I reach 500 karma since it unlocks the ability to downvote. I'm now very self conscious about the words I overuse...
I suspect, antirez, that you may have greater success removing some of the most common English words in order to find truly suspicious correlations in the data.
cocktailpeanuts and I for example, mutually share some words like:
because, people, you're, don't, they're, software, that, but, you, want
Unfortunately, this is a forum where people will use words like "because, people, and software."
Because, well, people here talk about software.
<=^)
Edit: Neat work, nonetheless.
Yes, that's good! I didn't state my interest clearly, though. I'd like to see the "analyze" result with the stop words excluded, not for the style comparison part, but for the reasons you state and others.
The usage frequency of simple words is a powerful tell.
There are so many people that write like me apparently, that simple language seems more like a way to mask yourself in a crowd.