Show HN: I wrote a full text search engine in Go
102 points
16 hours ago
| 12 comments
| github.com
| HN
Xeoncross
12 hours ago
[-]
I really liked the README, that was a good use of AI.

If you're interested in the idea of writing a database, I recommend you check out https://github.com/thomasjungblut/go-sstables which includes sstables, a skiplist, a recordio format and other database building blocks like a write-ahead log.

Also https://github.com/BurntSushi/fst, which has a great blog post explaining its compression (and has been ported to Go). It is really helpful for autocomplete/typeahead when recommending searches to users or doing spelling correction for search inputs.

reply
fuzztester
10 hours ago
[-]
>>I wrote a full text search engine in Go

>I really liked the README, that was a good use of AI.

Human intelligences, please start saying:

(A)I wrote a $something in $language.

Give credit where it is due. AIs have feelings too.

reply
novocayn
9 hours ago
[-]
> AIs have feelings too

Ohh boi, that’s exactly how the movie "Her" started! XD

reply
novocayn
12 hours ago
[-]
tysm, i love this, FST is vv cool
reply
kaycey2022
1 hour ago
[-]
I don't care that you vibe coded it. Run some benchmarks on it to show how it compares to other stuff.

We are soon entering the territory of "no one cares if you did it, but can you say something interesting?". "I created X software" is soon leaving the ranks of cool stuff.

reply
eudoxus
14 hours ago
[-]
Would love to hear how this compares to another popular Go-based full text search engine (with a not too dissimilar name): https://github.com/blevesearch/bleve
reply
novocayn
13 hours ago
[-]
Bleve is an absolute beast! Built with <3 at Couchbase. Fun fact: the folks who maintain it sit right across from me at work.
reply
Copenjin
13 hours ago
[-]
Did you vibe code this? A few things here and there are a bit of a giveaway imho.
reply
novocayn
13 hours ago
[-]
On my way to make a Dexter meme on this

When you think OP vibe-coded the project but can’t prove it yet

https://x.com/FG_Artist/status/1974267168855392371

reply
Imustaskforhelp
8 hours ago
[-]
OH YES! I Love it/this meme lol!!

Dexter's memes have been popping up recently and I am loving them

I don't know who bay harbor butcher is though :sob: but I don't want spoilers, I will watch it completely some day

My friend says that he watched complete dexter just via clips lol.

reply
novocayn
7 hours ago
[-]
ohh boy, second this :D
reply
ge96
12 hours ago
[-]
Another possible tell (not saying this is vibe coded) is when every function is documented, with almost too many comments.
reply
novocayn
12 hours ago
[-]
Ohh, I thought that inline comments would make it grokkable and be a low-friction way in. Seems this didn’t land the way I intended :'

Would a multi-part blog have been better?

reply
ok_dad
9 hours ago
[-]
I like it, I comment exactly like you do. Comments are free, storage is plentiful, why not add comments everywhere?!
reply
Copenjin
3 hours ago
[-]
Can't you just say that you used Claude and be done with it? I mean, I don't really care but in general I would refrain from posting "Show HN" with stuff that has been vibed, or state that you did since in that case it's even more interesting from my point of view, like you could describe how you kept everything under control during generation, etc...

I mean, no biggie and I don't understand the reaction but hey, I'm old so...

reply
kaycey2022
1 hour ago
[-]
They seem to be vibing the HN comments too? Prompt: Do not admit AI involvement under any circumstance.
reply
haute_cuisine
12 hours ago
[-]
I put the Overview section from the README into an AI content detector and it says 92% AI. Some comment blocks inside the codebase are rated as 100% AI generated.
reply
novocayn
12 hours ago
[-]
> comment blocks inside codebase

Is vibe-commented a thing yet? :D

Wanted to give fellow readers a good on-ramp for understanding the FTS internals. Figured leaning into readability wouldn’t hurt

For me this makes the structure super easy to grok at a glance

https://github.com/wizenheimer/blaze/blob/27d6f9b3cd228f5865...

That said, totally fair read on the comments. Curious if they helped/landed the way I intended, or if a multi-part blog series would’ve worked better :)

reply
haute_cuisine
58 minutes ago
[-]
Thanks for the link, very interesting data structure.

I'm wondering, is it really worth dumping general knowledge articles into code comments? To me it feels like the wrong place. Would just the Wikipedia link be enough here?

I also notice a lot of comments like this

  // IsEnd checks if this is the EOF sentinel
  //
  // Example usage:
  //
  // if pos.IsEnd() {
  //     // We've reached the end, stop searching
  // }
  func (p *Position) IsEnd() bool {
      return p.Offset == EOF
  }
Is it really necessary to have a text description for code like "a == b"? It would be really annoying to update the comment section on every code change.

This is one of the typical issues when AI creates "code comments", because it always describes "What" is happening. A good comment should answer the question "Why" instead.

For the linked skip list module, a good comment could say why a skip list was chosen over a B-tree or another data structure and which trade-offs were made. AI will never know that.
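
A rough sketch of what a "why" comment could look like on that same method. The rationale written in the comments is invented purely for illustration and is not the author's actual reasoning; the value of `EOF` and the shape of `Position` are also assumptions, since only `Offset` and `EOF` appear in the snippet above.

```go
package main

import (
	"fmt"
	"math"
)

// EOF marks "no more positions".
// Hypothetical rationale (illustration only): a sentinel offset lets an
// iterator return a plain Position instead of a (Position, bool) pair,
// so merge loops can compare offsets without a second branch.
const EOF = math.MaxInt

// Position is a simplified stand-in for the type in the quoted snippet.
type Position struct {
	Offset int
}

// IsEnd exists so callers never compare against EOF directly; if the
// sentinel value ever changes, only this method has to change.
func (p *Position) IsEnd() bool {
	return p.Offset == EOF
}

func main() {
	end := Position{Offset: EOF}
	mid := Position{Offset: 3}
	fmt.Println(end.IsEnd(), mid.IsEnd())
}
```

The code is the same one-liner; the difference is that the comments record a decision that cannot be read off `p.Offset == EOF`.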

reply
novocayn
12 hours ago
[-]
Claude: "You're absolutely right" :D
reply
fatty_patty89
13 hours ago
[-]
What makes you think so?
reply
Copenjin
4 hours ago
[-]
I wonder if I should really explain or if that would provide a list of things to sanitize before publishing stuff.

Anyone who has ever written any code is well aware of what can be done in a weekend, and especially that no one doing something "in a weekend" will ever add all those useless comments everywhere, literally a few thousand lines of comments. That takes more time than writing code. Comments in Claude style. Other Claude-isms all around.

It's ok to vibe things, but just say so, no shame.

And yes, after 5 minutes of looking around I had enough evidence to "prove it". Any moderately competent engineer could.

reply
niux
13 hours ago
[-]
Probably the commit history.
reply
novocayn
13 hours ago
[-]
Yayiee, the “can't prove it” Doakes Dexter meme, making it to HN
reply
throwaway7783
8 hours ago
[-]
You are neither confirming nor denying. Why won't you just say whether you vibe-coded it or not?
reply
ludicrousdispla
5 hours ago
[-]
you posted the Dexter meme earlier, why are you acting surprised?
reply
kdawkins
15 hours ago
[-]
This is very cool! Your readme is interesting and well written - I didn't know I could be so interested in the internals of a full text search engine :)

What was the motivation to kick this project off? Learning or are you using it somehow?

reply
novocayn
14 hours ago
[-]
I’m learning the internals of FTS engines while building a vector database from scratch. Needed a solid FTS index, so I built one myself :)

It ended up being a clean, reusable component, so I decided to carve it out into a standalone project

The README is mostly notes from my Notion pages, glad you found it interesting!

reply
n_u
14 hours ago
[-]
What are you building a vector database from scratch for?
reply
novocayn
14 hours ago
[-]
Mostly wanted a refresher on GPU accelerated indexes and Vector DB internals. And maybe along the way, build an easy on-ramp for folks who want to understand how these work under the hood
reply
mwsherman
11 hours ago
[-]
Shameless plug, you may wish to do Lucene-style tokenizing using the Unicode standard: https://github.com/clipperhouse/uax29/tree/master/words
reply
novocayn
11 hours ago
[-]
Got to admit, initial impressions, this is pretty neat. Will spend some time with this. Thanks for the link :)
reply
atrettel
8 hours ago
[-]
This is pretty interesting.

Could you explain more why you avoided parsing strings to build queries? Strings as queries are pretty standard for search engines. Yes, strings require you to write an interpreter/parser, but the power in many search engines comes from being able to create a query language to handle really complicated and specific queries.
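
To make the trade-off concrete, here is a minimal sketch of about the smallest string query parser one could write. The syntax (double quotes for phrases, a leading `-` for exclusion, everything else a required term) and the names `Clause` and `ParseQuery` are assumptions for illustration, not anything taken from the project.

```go
package main

import (
	"fmt"
	"strings"
)

// Clause is one parsed piece of a query string (hypothetical type).
type Clause struct {
	Text    string // the term, or the words inside the quotes
	Phrase  bool   // true when the clause was written in double quotes
	Exclude bool   // true when the clause was prefixed with '-'
}

// ParseQuery splits a query on spaces, treating "quoted text" as one
// phrase clause and a leading '-' as an exclusion marker.
func ParseQuery(q string) []Clause {
	var clauses []Clause
	i := 0
	for i < len(q) {
		if q[i] == ' ' {
			i++
			continue
		}
		var c Clause
		if q[i] == '-' {
			c.Exclude = true
			i++
		}
		if i < len(q) && q[i] == '"' {
			c.Phrase = true
			i++
			end := strings.IndexByte(q[i:], '"')
			if end < 0 {
				end = len(q) - i // unterminated quote: take the rest
			}
			c.Text = q[i : i+end]
			i += end + 1
		} else {
			end := strings.IndexByte(q[i:], ' ')
			if end < 0 {
				end = len(q) - i
			}
			c.Text = q[i : i+end]
			i += end
		}
		if c.Text != "" {
			clauses = append(clauses, c)
		}
	}
	return clauses
}

func main() {
	for _, c := range ParseQuery(`"united states" capital -canada`) {
		fmt.Printf("%+v\n", c)
	}
}
```

Even this toy version needs decisions about unterminated quotes and stray `-` characters, which gives a feel for how quickly a real query language grows.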

reply
novocayn
7 hours ago
[-]
You're right, string-based queries are very expressive. I intentionally avoided that here so readers could focus on how FTS indexes work internally. Adding a full query system would have shifted the focus away from the internals.

If you notice, there are vv obvious optimizations we could make. I’m planning to collect them and put these up as code challenges for readers, and building string-based queries would make a great one :)

reply
n_u
14 hours ago
[-]
Cool project!

I see you are using a positional index rather than doing bi-word matching to support positional queries.

Positional indexes can be a lot larger than non-positional. What is the ratio of the size of all documents to the size of the positional inverted index?
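
For anyone following along, a minimal sketch of what I mean by a positional index, assuming whitespace tokenization and an in-memory map. `PositionalIndex`, `Add`, `HasPhrase` and `PostingCount` are hypothetical names and not the project's actual API.

```go
package main

import (
	"fmt"
	"strings"
)

// PositionalIndex maps term -> docID -> token positions of that term.
type PositionalIndex map[string]map[int][]int

// Add lowercases text, splits it on whitespace, and records the
// position of every token occurrence.
func (idx PositionalIndex) Add(docID int, text string) {
	for pos, tok := range strings.Fields(strings.ToLower(text)) {
		if idx[tok] == nil {
			idx[tok] = make(map[int][]int)
		}
		idx[tok][docID] = append(idx[tok][docID], pos)
	}
}

// HasPhrase reports whether the phrase's terms occur at consecutive
// positions in docID.
func (idx PositionalIndex) HasPhrase(docID int, phrase string) bool {
	terms := strings.Fields(strings.ToLower(phrase))
	if len(terms) == 0 {
		return false
	}
	for _, start := range idx[terms[0]][docID] {
		matched := true
		for i := 1; i < len(terms); i++ {
			if !contains(idx[terms[i]][docID], start+i) {
				matched = false
				break
			}
		}
		if matched {
			return true
		}
	}
	return false
}

// contains is a linear scan; a real index would exploit sorted order.
func contains(positions []int, want int) bool {
	for _, p := range positions {
		if p == want {
			return true
		}
	}
	return false
}

// PostingCount returns the number of stored positions, which is one
// entry per token occurrence across all documents.
func (idx PositionalIndex) PostingCount() int {
	n := 0
	for _, docs := range idx {
		for _, positions := range docs {
			n += len(positions)
		}
	}
	return n
}

func main() {
	idx := PositionalIndex{}
	idx.Add(1, "the quick brown fox")
	idx.Add(2, "brown quick fox")
	fmt.Println(idx.HasPhrase(1, "quick brown"), idx.HasPhrase(2, "quick brown"))
	fmt.Println(idx.PostingCount())
}
```

Since one position is stored per token occurrence, the index grows with the total token count of the corpus, which is why the size ratio is worth measuring.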

reply
novocayn
14 hours ago
[-]
Observation is spot on. Biword matching would definitely ease this. Stealing bi-word matching for a future iteration, tysm :D
reply
n_u
13 hours ago
[-]
Well bi-word matching requires that you still have all of the documents stored to verify the full phrase occurs in the document rather than just the bi-words. So it isn't always better.

For example the phrase query "United States of America" doesn't occur in the document "The United States is named after states of the North American continent. The capital of America is Washington DC". But "United States", "states of" and "of America" all appear in it.

There's a tradeoff because we still have to fetch the full document text (or some positional structure) for the filtered-down candidate documents containing all of the bi-word pairs. So it requires a second stage of disk I/O. But as I understand it, most practitioners assume you can get away with fewer IOPS vs. a positional index, since that info only has to be fetched for a much smaller filtered-down candidate set rather than for the whole posting list.

But that's why I was curious about the storage ratio of your positional index.
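
The false positive above can be reproduced with a small sketch. It assumes a simple tokenizer (lowercase, split on anything that is not a letter or digit), and `tokenize`, `biwords`, `biwordCandidate` and `containsPhrase` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// tokenize lowercases text and splits on non-letter, non-digit runes.
func tokenize(text string) []string {
	return strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// biwords returns the set of adjacent token pairs.
func biwords(tokens []string) map[string]bool {
	set := make(map[string]bool)
	for i := 0; i+1 < len(tokens); i++ {
		set[tokens[i]+" "+tokens[i+1]] = true
	}
	return set
}

// biwordCandidate is stage one: every bi-word of the phrase must
// appear somewhere in the document.
func biwordCandidate(doc, phrase string) bool {
	docPairs := biwords(tokenize(doc))
	for pair := range biwords(tokenize(phrase)) {
		if !docPairs[pair] {
			return false
		}
	}
	return true
}

// containsPhrase is stage two: verify the phrase against the full
// token stream of the document.
func containsPhrase(doc, phrase string) bool {
	d, p := tokenize(doc), tokenize(phrase)
	if len(p) == 0 {
		return false
	}
	for i := 0; i+len(p) <= len(d); i++ {
		match := true
		for j := range p {
			if d[i+j] != p[j] {
				match = false
				break
			}
		}
		if match {
			return true
		}
	}
	return false
}

func main() {
	doc := "The United States is named after states of the North American continent. " +
		"The capital of America is Washington DC"
	phrase := "United States of America"
	fmt.Println(biwordCandidate(doc, phrase), containsPhrase(doc, phrase))
}
```

Stage one accepts the document and stage two rejects it, which is exactly the second fetch described above.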

reply
add-sub-mul-div
15 hours ago
[-]
Why did you create this new account if there are already 3 existing accounts promoting your stuff and only your stuff?
reply
novocayn
14 hours ago
[-]
Because running a three-account bot‑net farm is fun :D Okay, jk, please don’t mod me out.

One’s for browsing HN at work, the other’s for home, and the third one has a username I'm not too fond of.

I’ll stick to this one :) I might have some karma on the older ones, but honestly, HN is just as fun from everywhere

reply
wolfgarbe
15 hours ago
[-]
Great work! Would be interesting to see how it compares to Lucene performance-wise, e.g. with a benchmark like https://github.com/quickwit-oss/search-benchmark-game
reply
novocayn
14 hours ago
[-]
Thanks! Honestly, given it's hacked together in a weekend, not sure it’d measure up to Lucene/Bleve in any serious way.

I intended this to be an easy on-ramp for folks who want to get a feel for how FTS engines work under the hood :)

reply
llllm
13 hours ago
[-]
Not _that_ long ago Bleve was also hacked together over a few weekends.

I appreciate the technical depth of the readme, but I’m not sure it fits your easy on-ramp framing.

Keep going and keep sharing.

reply
pstuart
8 hours ago
[-]
You'll need to license it if you want others to consider using it.
reply
novocayn
7 hours ago
[-]
Ohh good catch, I think I missed it. Thanks for the note :)
reply
oldgregg
13 hours ago
[-]
looks great! would love to see benchmark with bleve and a lightweight vector implementation.
reply
novocayn
12 hours ago
[-]
tysm, would try pairing it with HNSW and IVF, halfway through :)
reply