It turns any public GitHub repository into a text extract that you can easily give to your favourite LLM.
Today I added this URL trick to make it even easier to use!
How I use it myself:
- Quickly generate a README.md boilerplate for a project
- Ask LLMs questions about an undocumented codebase
It is still very much a work in progress, and I plan to add many more options (file size limits, exclude patterns, ...) and a public API.
I hope this tool can help you. Your feedback is very valuable to help me prioritize, and contributions are welcome!
I made https://uithub.com 2 months ago. Its speciality is that seeing a repo's raw extract is just a matter of changing the 'g' to a 'u' in the URL. It also works for subdirectories, so if you just want the docs of Upstash QStash, for example, go to https://uithub.com/upstash/docs/tree/main/qstash
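For anyone who wants to script the trick rather than edit the address bar, here is a minimal Python sketch (not uithub's own code) that rewrites a GitHub URL and fetches the extract. The Accept: text/plain header is my assumption, based on the curl behaviour mentioned further down the thread.

    import urllib.request

    # The "trick": github.com -> uithub.com, path (including /tree/...) unchanged.
    def uithub_url(github_url: str) -> str:
        return github_url.replace("github.com", "uithub.com", 1)

    req = urllib.request.Request(
        uithub_url("https://github.com/upstash/docs/tree/main/qstash"),
        headers={"Accept": "text/plain"},  # assumption: ask for the plain-text variant
    )
    print(urllib.request.urlopen(req).read().decode("utf-8")[:500])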
Great to see this continues to be worthwhile!
I think it makes sense.
YAML is shorter and easier to read, and Markdown code blocks add no extra syntax between the lines compared to normal code.
But for JSON vs. JSONL I can't come up with any big advantage for the LLM; they're mostly the same.
The previous example, but in JSON:
https://uithub.com/upstash/docs/tree/main/qstash?accept=appl...
Is there any reason to prefer JSONL besides it being more efficient to edit? I'm happy to add it to my backlog if you think it has any advantages for LLMs
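To make the comparison concrete, here is the same pair of hypothetical files serialized both ways in a short Python snippet; the field names (path, content) are illustrative, not uithub's actual schema.

    import json

    # Two hypothetical files; the field names are made up for illustration.
    files = [
        {"path": "README.md", "content": "# QStash docs\n"},
        {"path": "qstash/index.mdx", "content": "..."},
    ]

    as_json = json.dumps({"files": files}, indent=2)       # one JSON document
    as_jsonl = "\n".join(json.dumps(f) for f in files)     # one record per line

    print(as_json)
    print(as_jsonl)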
// Fetch stars when page loads
fetchGitHubStars();
I do not understand why in the world so much of the code is related to poking the GH API to fetch the star count.

I made a similar CLI tool[0] with the added feature that you can pass `--outline` and it'll omit function bodies (while leaving their signatures). I've found it works really well for giving a high-level overview of huge repos.
You can then progressively expand specific functions as the LLM needs to see their implementation, without bloating up your context window.
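To illustrate the idea (this is not the linked tool's implementation, just a rough Python sketch using the standard ast module): parse a source file, drop every function body, and keep the signatures.

    import ast

    def outline(source: str) -> str:
        tree = ast.parse(source)
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                # Keep the signature and decorators, replace the body with `...`
                node.body = [ast.Expr(ast.Constant(...))]
        return ast.unparse(tree)

    # "some_module.py" is a placeholder file name
    print(outline(open("some_module.py").read()))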
A few observations from building large-scale repo analysis systems:
1. Simple text extraction often misses critical context about code dependencies and architectural decisions
2. Repository structure varies significantly across languages and frameworks - what works for Python might fail for complex C++ projects
3. Caching strategies become crucial when dealing with enterprise-scale monorepos
The real challenge is building a universal knowledge graph that captures both explicit (code, dependencies) and implicit (architectural patterns, evolution history) relationships. We've found that combining static analysis with selective LLM augmentation provides better context than pure extraction approaches.
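As a toy example of what selective static analysis can add on top of plain text extraction, here is a Python sketch that collects import edges from a checkout; a real system would obviously need per-language parsers and far richer relationships.

    import ast
    import pathlib
    from collections import defaultdict

    # Collect "file -> imported module" edges from every Python file in a checkout.
    edges = defaultdict(set)
    for path in pathlib.Path(".").rglob("*.py"):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except SyntaxError:
            continue
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                edges[str(path)].update(alias.name for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                edges[str(path)].add(node.module)

    for src, deps in sorted(edges.items()):
        print(src, "->", ", ".join(sorted(deps)))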
Curious about others' experiences with handling cross-repository knowledge transfer, especially in polyrepo environments?
Ctrl-a + ctrl-c would remain fast.
- for browsers it shows HTML
- for curl it gets raw text
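Presumably that's done with content negotiation. A minimal Python sketch of the behaviour, assuming the Accept header is what decides (I haven't checked uithub's actual logic):

    from http.server import BaseHTTPRequestHandler, HTTPServer

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            accept = self.headers.get("Accept", "")
            if "text/html" in accept:        # browsers advertise text/html
                body, ctype = b"<pre>repo extract...</pre>", "text/html"
            else:                            # curl defaults to Accept: */*
                body, ctype = b"repo extract...", "text/plain"
            self.send_response(200)
            self.send_header("Content-Type", ctype)
            self.end_headers()
            self.wfile.write(body)

    HTTPServer(("", 8080), Handler).serve_forever()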
What would you say are the differences compared to using something like Cursor, which already has access to your codebase?
I actually use txtar with a custom CLI to quickly copy multiple files to my clipboard and paste them into an LLM chat. I try not to stray too far from the chat paradigm so I can stay flexible about which LLM provider I use.
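For context, txtar is Go's simple one-file archive format ("-- name --" separator lines between files). A small Python sketch of bundling files that way (not the commenter's actual CLI, just the general shape):

    import pathlib
    import sys

    # Concatenate files in the txtar layout: a "-- name --" line, then the contents.
    def to_txtar(paths):
        chunks = []
        for p in paths:
            text = pathlib.Path(p).read_text(encoding="utf-8")
            if not text.endswith("\n"):
                text += "\n"
            chunks.append(f"-- {p} --\n{text}")
        return "".join(chunks)

    print(to_txtar(sys.argv[1:]))  # e.g. python bundle.py main.go go.mod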
It's quite useful, with some filtering options (hidden files, gitignore, extensions) and support for Claude-style tags.
Do you have any plans to expand it?
What you can do with something like this is store it in a database and then query it for relevant chunks, which you then feed to the LLM as needed.
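A naive sketch of that store-and-retrieve loop in Python, with keyword overlap standing in for a real embedding search and a hypothetical repo_extract.txt as input; a real setup would use embeddings and a vector database.

    # Split the extract into fixed-size chunks.
    def chunk(text: str, size: int = 1500):
        return [text[i:i + size] for i in range(0, len(text), size)]

    # Score chunks against the question by keyword overlap and keep the top k.
    def top_chunks(chunks, question, k=3):
        q = set(question.lower().split())
        return sorted(chunks, key=lambda c: -len(q & set(c.lower().split())))[:k]

    extract = open("repo_extract.txt").read()   # hypothetical dump of the repo text
    context = "\n\n".join(top_chunks(chunk(extract), "how is auth handled?"))
    # `context` then goes into the prompt alongside the question.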
I also really like this idea in general of APIs being domains, eventually making the web a giant supercomputer.
Edit: There is literally nothing wrong with this comment but feel free to keep downvoting, only 5,600 clicks to go!