Obligatory mention for RedBean, the server that you can package along with all assets (incl db, scripting and TLS support) into a single multi-platform binary.
RedBean uses @jart's Cosmopolitan Libc (the "Actually Portable Executable" trick) to run the same binary on multiple platforms.
Although UNIX philosophy posits that it's good to have many small files, I like your idea for its contribution to reducing clutter (imagine running 'tree' in both scenarios) and to avoiding running out of inodes on some file systems (maybe less of a problem nowadays in general; I haven't generated millions of tiny files recently, so I'm not sure).
When I last looked, Firefox didn't support it natively, but it was a requested feature.
That sounds familiar, unfortunately.
> Although UNIX philosophy posits that it's good to have many small files, I like your idea for its contribution to reducing clutter (imagine running 'tree' in both scenarios) and to avoiding running out of inodes on some file systems (maybe less of a problem nowadays in general; I haven't generated millions of tiny files recently, so I'm not sure).
Pretty rare for any website to have many files, as they optimize to have as few files as possible (fewer network requests, which could otherwise be slower than just shipping one big file). I crawled the React docs as a test, and it's a zip file of 147 MB with 3,803 files (including external resources).
https://docs.solidjs.com/ is 12 MB (including external resources) with 646 files.
Glad to see easier methods!
wget \
  --header "Cookie: <cf or other>" \
  --user-agent="<UA>" \
  --recursive \
  --level 5 \
  --no-clobber \
  --page-requisites \
  --adjust-extension \
  --span-hosts \
  --convert-links \
  --domains <example.com> \
  --no-parent \
  <example.com/sub>
> Can resume if the process exits, saves a checkpoint every 250 URLs
- status codes 200-299 are all OK
- status codes 300-399 are redirects, which can also turn out OK eventually
- a 403, in my experience, occurs quite often where it is not really an error but a hint that your user agent is not accepted
- robots.txt should be scanned to check if any resource is prohibited, or if there are speed requirements. It is always better to be _nice_. I plan to add something like that; it is missing from my project too
- it would be interesting to generate a hash from the app and update only if the hash is different? (rough sketch of this and the robots.txt idea below)
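Something like the sketch below is what I have in mind for the robots.txt and hash checks — rough Python with made-up names (`requests` assumed as the HTTP client), not code from either project:

  import hashlib
  import time
  import urllib.robotparser

  import requests  # assumed HTTP client; anything similar works

  USER_AGENT = "my-archiver/0.1"  # made-up user agent string

  robots = urllib.robotparser.RobotFileParser("https://example.com/robots.txt")
  robots.read()
  seen_hashes = {}  # url -> sha256 of the last saved body

  def polite_fetch(url):
      # robots.txt: skip prohibited resources and honour Crawl-delay if present
      if not robots.can_fetch(USER_AGENT, url):
          return None
      delay = robots.crawl_delay(USER_AGENT)
      if delay:
          time.sleep(float(delay))

      resp = requests.get(url, headers={"User-Agent": USER_AGENT}, allow_redirects=True)
      # 2xx is fine; 3xx has already been followed; a 403 is often just
      # the server disliking the user agent rather than a real error
      if not resp.ok:
          print(f"skipping {url}: HTTP {resp.status_code}")
          return None

      # hash the body so an update run can skip pages that did not change
      digest = hashlib.sha256(resp.content).hexdigest()
      if seen_hashes.get(url) == digest:
          return None  # unchanged since the last crawl
      seen_hashes[url] = digest
      return resp.content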
I thought about robots.txt, but as this is software that you are supposed to run against your own website, I didn't consider it worth it. You have a point on speed requirements and prohibited resources (but it's not like skipping over them will add any security).
I haven't put much time/effort into an update step. Currently it resumes via checkpoints if the process exited (it saves the current state every 250 URLs; if any URL is still missing it can continue from there, otherwise it is done).
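The checkpoint logic is conceptually something like this (a simplified sketch with made-up file and function names, not the actual code):

  import json
  import os

  CHECKPOINT = "checkpoint.json"  # made-up file name
  SAVE_EVERY = 250

  def load_state():
      # resume from the last checkpoint if a previous run exited mid-crawl
      if os.path.exists(CHECKPOINT):
          with open(CHECKPOINT) as f:
              return json.load(f)
      return {"pending": ["/"], "done": []}

  def save_state(state):
      with open(CHECKPOINT, "w") as f:
          json.dump(state, f)

  def crawl(state, fetch_and_discover):
      processed = 0
      while state["pending"]:
          url = state["pending"].pop()
          # download the page and queue any newly discovered links
          state["pending"].extend(fetch_and_discover(url, state["done"]))
          state["done"].append(url)
          processed += 1
          if processed % SAVE_EVERY == 0:  # checkpoint every 250 URLs
              save_state(state)
      save_state(state)  # nothing pending left -> the crawl is done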
Thanks, btw what's your project!? Share!
You might be interested in reddit webscraping thread https://www.reddit.com/r/webscraping/
My passion project is https://github.com/rumca-js/Django-link-archive
Currently I use only one thread for scraping; I do not require more, and it gets the job done. Also, I know too little to play more with Python's Celery workers.
My project can be used for various things, depending on your needs. Recently I have been playing with using it as a 'search engine': I am scraping the Internet to find cool stuff. Results are in https://github.com/rumca-js/Internet-Places-Database. Not all domains are interesting, though.
What? Why? Regardless of the programming language used to generate content, the standard, well-known HTTP status codes should be returned as expected. If your JS-served site gives me a 200 code when it should be a 404, you're wrong.
I am not sure if HTTrack has progressed beyond fetching resources; it's been a long time since I last used it. What my project does is spin up a real web browser (Chrome in headless mode, which means it's hidden) and then let the JavaScript on that website execute, which means it will display/generate the fancy HTML that you can then save as-is into an index.html. It saves all kinds of files; it doesn't care about the extension or MIME type, it tries to save them all.
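The core of it looks roughly like this — a simplified sketch using Playwright rather than my project's actual code (URL, output folder and names are made up):

  from pathlib import Path
  from urllib.parse import urlparse

  from playwright.sync_api import sync_playwright  # stand-in for whatever driver the project uses

  OUT = Path("archive")  # made-up output folder
  OUT.mkdir(parents=True, exist_ok=True)

  def save_response(response):
      # save every fetched file as-is, regardless of extension or MIME type
      name = urlparse(response.url).path.lstrip("/") or "index.html"
      target = OUT / name
      target.parent.mkdir(parents=True, exist_ok=True)
      try:
          target.write_bytes(response.body())
      except Exception:
          pass  # e.g. redirect responses have no body

  with sync_playwright() as p:
      browser = p.chromium.launch(headless=True)  # a real browser, just hidden
      page = browser.new_page()
      page.on("response", save_response)
      page.goto("https://example.com/", wait_until="networkidle")  # let the JavaScript run
      (OUT / "index.html").write_text(page.content())  # the rendered HTML, not the raw source
      browser.close()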
That’s awesome to know, I will give it a try. One website I remember trying to download had all sorts of animations with a .riv extension, and it didn't work well with HTTrack; I will try it with this soon. Thanks for sharing it!
I never dug deeper into whether I can unzip and decode the packing, but saving as a simple ZIP does somewhat guarantee future-proofing.
In Chrome DevTools, Network tab, the last icon, which looks like an arrow pointing into a dish (Export HAR file).
I guess a .har file has a ton more data, though I used it to extract data from sites that either intentionally or unintentionally make it hard to get the data. For example, when signing up for an apartment, the apartment management site used pdf.js and provided no way to save the PDF. So I saved the .har file and extracted the PDF.
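A HAR is just JSON, so pulling a binary body like that PDF back out takes only a few lines — a rough Python sketch, with the file names made up:

  import base64
  import json

  # pull every PDF out of a saved HAR file
  with open("session.har") as f:
      har = json.load(f)

  for i, entry in enumerate(har["log"]["entries"]):
      content = entry["response"]["content"]
      if not content.get("mimeType", "").startswith("application/pdf") or "text" not in content:
          continue
      body = content["text"]
      # DevTools stores binary bodies base64-encoded and marks them with "encoding"
      data = base64.b64decode(body) if content.get("encoding") == "base64" else body.encode()
      with open(f"extracted_{i}.pdf", "wb") as out:
          out.write(data)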
- irrelevant HTTP headers (including cache control) taking up too much space
- auth data / cookies, credentials, personal info that you don't want saved, for privacy and security reasons, especially if you want to share your archive.
HAR is also not very efficient for this use case: it's a list of requests represented in JSON. A folder representation is far better, both for storage efficiency and for reading the archive (with a HAR you basically need to implement some custom logic that can read the page by replaying the requests).
I would love to see better support for SPAs, where we can't just start from a sitemap. If you're interested, you can check out some of the code from my old app [0] for inspiration on how to crawl pages (it's Electron, so it shares a lot of interfaces with Puppeteer) [1].
[0] https://github.com/CGamesPlay/chronicler/tree/master
[1] https://github.com/CGamesPlay/chronicler/blob/master/src/mai...
If what I care about is the website (and that's usually going to be the case), then there's a single familiar box containing all the messy details. I don't have to see all the files I want to ignore.
That might not be a benefit for you, and, not having used it, it is only a theoretical benefit in an unlikely future for me.
But just from the title of the post, I had a very clear picture of the mechanism, and it was not obvious why I would want to start with a different mechanism (barring ordinary issues with open source projects).
But that's me and your mileage may vary.
I have also updated the local server to fetch missing resources from the origin. For example, a web app may load some JS modules only when you click buttons or links; when that happens and the requested file is not in the zip, it will fetch it from the origin and update the zip. So you can mostly back up an SPA by crawling it first and then using it for a bit so that the missing resources/modules get fetched.
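Conceptually the fallback is something like this (a toy Python sketch with made-up names and no content-type handling, not the project's real code):

  import urllib.request
  import zipfile
  from http.server import BaseHTTPRequestHandler, HTTPServer

  ARCHIVE = "site.zip"            # made-up names; the real server differs
  ORIGIN = "https://example.com"

  class Handler(BaseHTTPRequestHandler):
      def do_GET(self):
          name = self.path.lstrip("/") or "index.html"
          try:
              with zipfile.ZipFile(ARCHIVE) as z:
                  body = z.read(name)          # serve straight from the zip
          except KeyError:
              # not in the zip yet: fetch from the origin and add it for next time
              body = urllib.request.urlopen(ORIGIN + self.path).read()
              with zipfile.ZipFile(ARCHIVE, "a") as z:
                  z.writestr(name, body)
          self.send_response(200)
          self.end_headers()                   # content-type handling omitted for brevity
          self.wfile.write(body)

  HTTPServer(("localhost", 8000), Handler).serve_forever()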
My main use case is my docs site, https://pota.quack.uy/ , which Google cannot index properly. On https://www.google.com/search?q=site%3Apota.quack.uy you will see that some titles/descriptions don't match what the content of the page is about. As the full site is rendered client-side via JavaScript, I can just crawl it myself and save the HTML output to actual files. Then I can serve that content with nginx or any other web server, without having to do the expensive thing of SSR via Node.js. Not to mention that being able to do SSR with modern JavaScript frameworks is not trivial and requires engineering time.
For example, if you look at this site in the browser, https://pota.quack.uy/ , and then do `curl https://pota.quack.uy/`, do you see any of the text that is rendered in the browser in the output of the curl command?
You don't, because curl doesn't execute JavaScript, and that text comes from JavaScript. One way to fix this problem is to have a Node.js instance running that does SSR, so when your curl command connects to the server, a Node instance executes the JavaScript and the result is streamed/served to curl (Node is running a web server).
Another way, without having to execute JavaScript on the server, is to crawl the site yourself, let's say on localhost (you do not even need to deploy), and then upload the result to a web server that can serve the files.
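The "crawl yourself" step can be as small as this — a rough sketch with Playwright, a made-up local port, and simplified link handling, not what my crawler actually does:

  from pathlib import Path
  from urllib.parse import urljoin, urlparse

  from playwright.sync_api import sync_playwright  # assumed driver

  BASE = "http://localhost:3000"  # made-up local dev server; no deploy needed
  OUT = Path("dist")

  with sync_playwright() as p:
      page = p.chromium.launch(headless=True).new_page()
      todo, done = {"/"}, set()
      while todo:
          path = todo.pop()
          done.add(path)
          page.goto(urljoin(BASE, path), wait_until="networkidle")  # let the JavaScript render
          out = OUT / path.lstrip("/") / "index.html"
          out.parent.mkdir(parents=True, exist_ok=True)
          out.write_text(page.content())  # rendered HTML, ready for nginx to serve
          # queue internal links that haven't been visited yet
          for href in page.eval_on_selector_all("a[href]", "els => els.map(e => e.href)"):
              if urlparse(href).netloc == urlparse(BASE).netloc:
                  link = urlparse(href).path or "/"
                  if link not in done:
                      todo.add(link)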
I have no access to the source and can't even republish the site directly without violating Squarespace's copyright.
But having the old site frozen in amber will be great for the redesign.
It would be a good backup for the backup, & your designer will thank you.