Show HN: Replace "hub" with "ingest" in GitHub URLs for a prompt-friendly extract
185 points
20 days ago
| 26 comments
| gitingest.com
| HN
Gitingest is an open-source micro dev-tool that I made over the last week.

It turns any public GitHub repository into a text extract that you can easily give to your favourite LLM.

Today I added this URL trick to make it even easier to use!

How I use it myself:

- Quickly generate a README.md boilerplate for a project
- Ask LLMs questions about an undocumented codebase
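
The URL trick is trivial to script as well; a minimal sketch (the helper name is mine, not part of gitingest):

```python
def to_ingest_url(github_url: str) -> str:
    """Turn a GitHub repo URL into its gitingest equivalent.

    Replacing the first "hub" with "ingest" maps
    github.com/user/repo -> gitingest.com/user/repo.
    """
    return github_url.replace("hub", "ingest", 1)


print(to_ingest_url("https://github.com/cyclotruc/gitingest"))
# https://gitingest.com/cyclotruc/gitingest
```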

It is still very much a work in progress, and I plan to add many more options (file size limits, exclude patterns...) and a public API.

I hope this tool can help you. Your feedback is very valuable in helping me prioritize, and contributions are welcome!

wwoessi
20 days ago
[-]
Hi, great tool!

I made https://uithub.com 2 months ago. Its speciality is that getting a repo's raw extract is just a matter of changing 'g' to 'u'. It also works for subdirectories, so if you just want the docs of Upstash QStash, for example, just go to https://uithub.com/upstash/docs/tree/main/qstash

Great to see this keeps being worthwhile!

reply
Arcuru
20 days ago
[-]
That looks awesome. You didn't mention it, but uithub.com also has an API; I can definitely see myself using this for a new tool.
reply
helsinki
20 days ago
[-]
I wonder why nobody uses the JSONL format to represent an entire codebase? It's what I do, and LLMs seem to prefer it. In fact, an LLM suggested this strategy to me. It uses fewer characters, too.
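
For illustration, one way such a JSONL dump might look: one JSON object per file, with path and content (a sketch of the idea, not the commenter's actual tooling):

```python
import json
from pathlib import Path


def repo_to_jsonl(root: str) -> str:
    """Serialize a directory tree as JSONL: one {"path", "content"} object per line."""
    lines = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            content = path.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            continue  # skip binary files
        lines.append(json.dumps({
            "path": str(path.relative_to(root)).replace("\\", "/"),
            "content": content,
        }))
    return "\n".join(lines)
```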
reply
addaon
20 days ago
[-]
Are you suggesting that there's a correlation between what input formats provide best performance for an LLM input, and what sequence of tokens the same LLM outputs when prompted about what input formats provide best performance? Why would that be?
reply
wwoessi
20 days ago
[-]
I don't think there's much difference, but I've read that Markdown code blocks (or YAML, or XML) are better for code than JSON, for example: https://aider.chat/2024/08/14/code-in-json.html

I think it makes sense.

YAML is shorter and easier to read; Markdown code blocks have no added syntax between the lines compared to normal code.

But for JSON vs. JSONL, I can't come up with any big advantages for the LLM; it's mostly the same.
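
The escaping overhead that JSON adds is easy to see side by side (a small illustrative snippet; the fence string is built dynamically only to avoid nesting code fences):

```python
import json

snippet = 'def add(a, b):\n    return "sum: " + str(a + b)\n'

# In JSON, every newline becomes \n and every quote must be escaped...
as_json = json.dumps({"file.py": snippet})

# ...while a Markdown code block carries the code verbatim.
fence = "`" * 3
as_markdown = fence + "python\n" + snippet + fence

print(as_json)
print(as_markdown)
```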

reply
TeMPOraL
19 days ago
[-]
Why wouldn't that be? We've had several generations of LLMs since ChatGPT took the world by storm; current models are very much aware of LLMs that came before them, as well as associated discussions on how to best use them.
reply
wwoessi
20 days ago
[-]
You can use JSON via the accept parameter of the API. The URL structure remains the same. It also supports YAML, and I've found that's the easiest for LLMs to read.

Previous example but in JSON:

https://uithub.com/upstash/docs/tree/main/qstash?accept=appl...

Is there any reason to prefer JSONL besides it being more efficient to edit? I'm happy to add it to my backlog if you think it has any advantages for LLMs.

reply
prophesi
20 days ago
[-]
Since the site was hugged to death by HN, this appears to be the repo[0] for anyone wanting to run it locally.

[0] https://github.com/cyclotruc/gitingest

reply
bryant
20 days ago
[-]
and of course, using the repo as an input for the service renders this[1]

[1] https://gitingest.com/cyclotruc/gitingest

reply
mdaniel
20 days ago
[-]

  // Fetch stars when page loads
  fetchGitHubStars();
I do not understand why in the world so much of the code is related to poking the GH api to fetch the star count
reply
johnisgood
20 days ago
[-]
Probably generated by AI, prompted by a novice or junior dev. This is my opinion, of course, but it looks like code generated by an LLM.
reply
cyclotruc
20 days ago
[-]
I know the code is not great, but contributions are very much welcome because there's a lot of low-hanging fruit.
reply
ugexe
20 days ago
[-]
But why did you code it to fetch stars at all? You would have had to go out of your way to do that. If AI has written most of this I suspect people will be less inclined to contribute.
reply
Mockapapella
20 days ago
[-]
https://uithub.com is also a good one for this. They also have an API with more options.
reply
Fokamul
20 days ago
[-]
Nothing against gitingest.com, but this is really the peak of technology. Having LLMs that require feeding them info via copy & paste - the peak of efficiency too. OMFG.
reply
evmunro
20 days ago
[-]
Great idea to make it just a simple URL change. Reminds me of the YouTube download websites.

I made a similar CLI tool[0] with the added feature that you can pass `--outline` and it'll omit function bodies (while leaving their signatures). I've found it works really well for giving a high-level overview of huge repos.

You can then progressively expand specific functions as the LLM needs to see their implementation, without bloating up your context window.

[0] https://github.com/everestmz/llmcat
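
llmcat's actual implementation may differ, but the `--outline` idea (keep signatures, drop bodies) can be sketched in a few lines of Python using the standard `ast` module:

```python
import ast


def outline(source: str) -> str:
    """Replace every function body with '...', keeping the signatures intact."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # A single Ellipsis expression stands in for the whole body.
            node.body = [ast.Expr(value=ast.Constant(value=...))]
    return ast.unparse(tree)


print(outline("def double(x):\n    y = x * 2\n    return y\n"))
```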

reply
lukejagg
20 days ago
[-]
Is Unicode really the best way to display the file structure? The special Unicode characters are encoded as 2 tokens each, so I doubt it would work better overall for larger repos.
reply
shawnz
20 days ago
[-]
Also, even if different characters were used, the 2D ascii art style representation of the directory tree in general strikes me as something that's not going to be easily interpreted by an LLM, which might not have a conception of how characters are laid out in 2D space
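
A flat list of relative paths carries the same information as a box-drawing tree without the 2D layout or the exotic characters; a sketch of the alternative:

```python
from pathlib import Path


def flat_listing(root: str) -> str:
    """Plain relative paths, one per line: same info as a tree, fewer odd tokens."""
    return "\n".join(
        str(p.relative_to(root)).replace("\\", "/")  # normalize separators on Windows
        for p in sorted(Path(root).rglob("*"))
        if p.is_file()
    )
```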
reply
Jet_Xu
19 days ago
[-]
Interesting approach! While URL-based extraction is convenient, I've been working on a more comprehensive solution for repository knowledge retrieval (llama-github). The key challenge isn't just extracting code, but understanding the semantic relationships and evolution patterns within repositories.

A few observations from building large-scale repo analysis systems:

1. Simple text extraction often misses critical context about code dependencies and architectural decisions
2. Repository structure varies significantly across languages and frameworks - what works for Python might fail for complex C++ projects
3. Caching strategies become crucial when dealing with enterprise-scale monorepos

The real challenge is building a universal knowledge graph that captures both explicit (code, dependencies) and implicit (architectural patterns, evolution history) relationships. We've found that combining static analysis with selective LLM augmentation provides better context than pure extraction approaches.

Curious about others' experiences with handling cross-repository knowledge transfer, especially in polyrepo environments?

reply
ComputerGuru
20 days ago
[-]
Instead of a copy icon, it would be better to just generate the entire content as plaintext in the result (not in an html div on a rich html page) so the entire url could be used as an attachment or its contents piped directly into an agent/tool.

Ctrl-a + ctrl-c would remain fast.
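
A sketch of how a service could pick the response format from the client (purely hypothetical; neither gitingest nor uithub necessarily works this way):

```python
def negotiate(user_agent: str, extract: str) -> tuple[str, str]:
    """Return (body, content_type): raw text for curl-like clients, HTML otherwise."""
    if user_agent.lower().startswith("curl"):
        return extract, "text/plain; charset=utf-8"
    return f"<pre>{extract}</pre>", "text/html; charset=utf-8"
```

A real service would more likely honor the `Accept` header than sniff the User-Agent, but the effect for `curl | llm`-style pipelines is the same.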

reply
vallode
20 days ago
[-]
Agreed, it's a missed opportunity to be able to change a URL from github.com/cyclotruc/gitingest to gitingest.com/cyclotruc/gitingest and simply receive the result as plain text. A very useful little tool nonetheless.
reply
cyclotruc
20 days ago
[-]
Yeah I'm going to do that very soon with the API :)
reply
wwoessi
20 days ago
[-]
for that you can use https://uithub.com (g -> u)

- for browsers it shows HTML
- for curl it gets raw text

reply
nfilzi
20 days ago
[-]
Looks neat! From what I understood, it's like zipping up your codebase into a streamlined TXT version for LLMs to ingest better?

What would you say are the differences from using something like Cursor, which already has access to your codebase?

reply
cyclotruc
20 days ago
[-]
It's in the same lane; it's just that sometimes you need a quick and handy way to get that streamlined TXT from a public repo without leaving your browser.
reply
fastball
20 days ago
[-]
Might be good to have some filtering as well. I added a repo that has a heap of localized docs that don't make much sense to ingest into an LLM but probably use up a majority of the tokens.
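
Exclude patterns like that are often done with glob matching; a minimal sketch (the patterns shown are hypothetical examples, not gitingest defaults):

```python
from fnmatch import fnmatch

# Hypothetical exclude list: localized docs, translation files, minified JS.
EXCLUDE = ["docs/locale/*", "*.po", "*.min.js"]


def keep(path: str) -> bool:
    """True if the path matches none of the exclude patterns."""
    return not any(fnmatch(path, pattern) for pattern in EXCLUDE)


print([p for p in ["src/app.py", "docs/locale/fr/index.md"] if keep(p)])
# ['src/app.py']
```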
reply
cyclotruc
20 days ago
[-]
Hey! OP here: gitingest is getting a lot of love right now, sorry if it's unstable but please tell me what goes wrong so I can fix it!
reply
smcleod
19 days ago
[-]
I wrote a tool some time ago called ingest to do exactly this from local directories, files, web URLs, etc., as well as estimating token and VRAM usage: https://github.com/sammcj/ingest
reply
nonethewiser
20 days ago
[-]
I implemented this same idea in bash for local use. Useful but only up to a certain size of codebase.
reply
Cedricgc
20 days ago
[-]
Does this use the txtar format created for developing the Go language?

I actually use txtar with a custom CLI to quickly copy multiple files to my clipboard and paste it into an LLM chat. I try not to get too far from the chat paradigm so I can stay flexible with which LLM provider I use
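
For anyone unfamiliar, txtar is a very simple archive format: a `-- name --` marker line followed by that file's contents. A minimal Python sketch of a writer:

```python
def to_txtar(files: dict[str, str]) -> str:
    """Serialize files in Go's txtar format: '-- name --' then contents."""
    out = ""
    for name, content in files.items():
        if content and not content.endswith("\n"):
            content += "\n"  # txtar requires file contents to end with a newline
        out += f"-- {name} --\n{content}"
    return out


print(to_txtar({"a.txt": "hello", "b.txt": "world"}))
```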

reply
maleldil
20 days ago
[-]
If I understand correctly, this sounds like https://github.com/simonw/files-to-prompt/.

It's quite useful, with some filtering options (hidden files, gitignore, extensions) and support for Claude-style tags.

reply
wonderfuly
20 days ago
[-]
reply
anamexis
20 days ago
[-]
It seems to be broken, getting errors like "Error processing repository: Path ../tmp/pallets-flask does not exist"
reply
cyclotruc
20 days ago
[-]
Thank you, I'll look into it
reply
modelorona
20 days ago
[-]
Very cool! I will try this over the weekend with a new android app to see what kind of README I can generate.

Do you have any plans to expand it?

reply
cyclotruc
20 days ago
[-]
Yes, I want to add a way to target a token count, to control your LLM costs.
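
Targeting a token count could be as crude as the common ~4 characters/token rule of thumb (a sketch of the idea, not gitingest's actual plan):

```python
CHARS_PER_TOKEN = 4  # rough heuristic for English text and code


def estimate_tokens(text: str) -> int:
    """Very rough token estimate without a real tokenizer."""
    return len(text) // CHARS_PER_TOKEN


def truncate_to_token_budget(text: str, max_tokens: int) -> str:
    """Cut the extract so it fits an approximate token budget."""
    return text[: max_tokens * CHARS_PER_TOKEN]
```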
reply
bosky101
17 days ago
[-]
For some reason it was giving me a large file instead of reading from the README.
reply
Exuma
20 days ago
[-]
Isn't there a limit on prompt size? How would you actually use this? I'm not very up to speed on this stuff.
reply
xnx
20 days ago
[-]
Gemini Pro has a 2 million token context window, which is ~1000 pages of code.
reply
lolinder
20 days ago
[-]
Most projects would be way too big to put into a prompt. Even if you're technically within the official context window, those figures are often misleading: the window where input is actually useful is usually much smaller than advertised.

What you can do with something like this is store it in a database and then query it for relevant chunks, which you then feed to the LLM as needed.
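
That chunk-and-retrieve loop can be sketched naively, with keyword overlap standing in for the embeddings a real setup would use:

```python
def split_chunks(text: str, size: int = 1000) -> list[str]:
    """Split the repo extract into fixed-size character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def top_chunks(question: str, text: str, k: int = 3, size: int = 1000) -> list[str]:
    """Rank chunks by word overlap with the question; feed the top-k to the LLM."""
    words = set(question.lower().split())
    return sorted(
        split_chunks(text, size),
        key=lambda chunk: -len(words & set(chunk.lower().split())),
    )[:k]
```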

reply
tom1337
20 days ago
[-]
I wonder about building a local version of this which resolves the dependency paths of the file you're currently working on, to a certain depth, so the LLM gains more context from related files instead of the whole repo (which could be insane if you use a monorepo).
reply
jackstraw14
20 days ago
[-]
Ideally let the LLM chunk it up and figure out when to use those chunks.
reply
matt3210
20 days ago
[-]
The example buttons are a nice touch
reply
hereme888
16 days ago
[-]
It's like a web version of Repomix
reply
spencerchubb
20 days ago
[-]
Github already has a way to get the raw text files
reply
barbazoo
20 days ago
[-]
All of them in one operation? How?
reply
johnisgood
20 days ago
[-]
I think he is confusing it with the "plain" or "raw" view, so probably not all of them.
reply
seventytwo
20 days ago
[-]
It’s dead Jim
reply
gardenhedge
20 days ago
[-]
Very clever!
reply
dim13
20 days ago
[-]
It did not digest https://github.com/torvalds/linux ¯\_(ツ)_/¯
reply
moralestapia
20 days ago
[-]
This is really nice, congrats on shipping.

I also really like this idea in general of APIs being domains, eventually making the web a giant supercomputer.

Edit: There is literally nothing wrong with this comment but feel free to keep downvoting, only 5,600 clicks to go!

reply