New budget financial API, based on EDGAR data
1 day ago
Hey everyone,

I'm the developer of an open-source (MIT License) Python package that converts SEC submissions into useful data. I've recently started hosting a few services in the cloud for a nominal convenience fee.

Cloud:

1. SEC Websocket - notifies you of new submissions as they come out. (Free)

2. SEC Archive - download SEC submissions without rate limits. ($1/100,000 downloads)

3. MySQL RDS ($1/million rows returned)

- XBRL

- Fundamentals

- Institutional Holdings

- Insider Transactions

- Proxy Voting Records
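
As a rough illustration of the pricing above, here is a minimal sketch that estimates a bill. The two rates come from the list; the function name, the usage numbers, and the assumption of simple pro-rata billing with no minimums are made up for the example.

```python
# Rates taken from the list above; everything else is illustrative.
ARCHIVE_USD_PER_DOWNLOAD = 1 / 100_000   # $1 per 100,000 downloads
RDS_USD_PER_ROW = 1 / 1_000_000          # $1 per million rows returned


def estimate_cost(downloads: int, rows_returned: int) -> float:
    """Estimate cost in USD, assuming simple pro-rata billing (an assumption)."""
    return downloads * ARCHIVE_USD_PER_DOWNLOAD + rows_returned * RDS_USD_PER_ROW


# Hypothetical usage: 250,000 archive downloads and 5 million rows returned.
print(f"${estimate_cost(250_000, 5_000_000):.2f}")  # → $7.50
```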

Posting here, in case someone finds it useful.

Links:

Datamule (Package) GitHub: https://github.com/john-friedman/datamule-python

Documentation: https://john-friedman.github.io/datamule-python/datamule-python/sheet/sheet/

Get an API Key: https://datamule.xyz/dashboard2.html

conditionnumber
22 hours ago
Cool, EDGAR is an amazing public service. I think they use Akamai as their CDN so the downloads are remarkably fast.

A few years ago I wrote an SGML parser for the full SEC PDS specification (super tedious). But I have trouble leveraging my own efforts for independent research because I don't have a reliable securities master to link against. I can't take a historical CUSIP from 13F filings and associate it with a historical ticker/return. Or my returns are wrong because of data errors, so I can't fit a factor model to run an event study using Form 4 data.

I think what's missing is a serious open source effort to integrate/cleanse the various cheapo data vendors into something reasonably approximating the quality you get out of a CRSP/Compustat.

jgfriedman1999
7 hours ago
Yep! Pretty sure it is still Akamai. Via testing I've noticed they cap downloads at ~6mbps from e.g. home internet, but not GitHub or AWS.

SGML parsing is fun! I've open-sourced an SGML parser here: https://github.com/john-friedman/secsgml

Securities master to link against - Interesting. Here's a pipeline off the top of my head:

1. Get CUSIP, nameOfIssuer, titleOfClass using the Institutional Holdings database.

2. Use the company metadata crosswalk to link CUSIP + titleOfClass to nameOfIssuer to get cik: https://github.com/john-friedman/datamule-data/blob/master/d... (recompiled daily using GH Actions)

3. Get e.g. us-gaap:EarningsPerShareBasic from the XBRL database. Link using cik. Types of stock might be a member - so e.g. Class A, Class B? Not sure there.
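
A minimal sketch of that pipeline using plain Python rows. The field names CUSIP, nameOfIssuer, titleOfClass, and cik come from the steps above; the shape of the crosswalk (a name-to-cik dict), the shape of the XBRL rows, the function name, and every value are assumptions for illustration, not the actual schema.

```python
# Toy rows standing in for the three sources; all values are made up.
holdings = [  # Institutional Holdings: what a manager reports
    {"cusip": "00000AAA1", "nameOfIssuer": "EXAMPLE CORP", "titleOfClass": "COM"},
    {"cusip": "00000BBB2", "nameOfIssuer": "UNKNOWN LTD", "titleOfClass": "CL A"},
]
crosswalk = {  # hypothetical crosswalk: upper-cased issuer name -> cik
    "EXAMPLE CORP": 1234567,
}
xbrl_facts = [  # hypothetical XBRL rows, keyed by cik
    {"cik": 1234567, "tag": "us-gaap:EarningsPerShareBasic", "value": 2.5},
]


def link_holdings_to_facts(holdings, crosswalk, xbrl_facts, tag):
    """Attach cik and one XBRL fact to each holding; None when no link is found."""
    facts_by_cik = {f["cik"]: f["value"] for f in xbrl_facts if f["tag"] == tag}
    linked = []
    for row in holdings:
        cik = crosswalk.get(row["nameOfIssuer"].strip().upper())
        linked.append({**row, "cik": cik, "value": facts_by_cik.get(cik)})
    return linked


for row in link_holdings_to_facts(
    holdings, crosswalk, xbrl_facts, "us-gaap:EarningsPerShareBasic"
):
    print(row["cusip"], row["cik"], row["value"])
```

Rows that fail to link keep cik and value as None rather than being dropped, so the unmatched share is easy to measure.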

For form 4, not sure what you mean by event study. Would love to know!

conditionnumber
4 hours ago
Event study: A way to measure how returns respond to events. Popularized by Fama in "The Adjustment of Stock Prices to New Information" but ubiquitous in securities litigation, academic financial economics, and equity L/S research. The canonical recipe is MacKinlay's "Event Studies in Economics and Finance". Industry people tend to just use residual returns from Axioma / Barra / an in-house risk model.

So let's say your hypothesis is "stock go up on insider buy". Event studies help you test that hypothesis and quantify how much up / when.
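
A minimal sketch of the idea for a single event, assuming a one-factor market model (stock return = alpha + beta * market return) fitted on a pre-event estimation window. The function names and all return numbers are made up; industry risk models use many more factors.

```python
from statistics import mean


def market_model(stock, market):
    """OLS fit of stock = alpha + beta * market over the estimation window."""
    m_bar, s_bar = mean(market), mean(stock)
    beta = sum((m - m_bar) * (s - s_bar) for m, s in zip(market, stock)) / sum(
        (m - m_bar) ** 2 for m in market
    )
    return s_bar - beta * m_bar, beta


def cumulative_abnormal_return(est_stock, est_market, evt_stock, evt_market):
    """Sum of (actual - predicted) returns over the event window."""
    alpha, beta = market_model(est_stock, est_market)
    return sum(s - (alpha + beta * m) for s, m in zip(evt_stock, evt_market))


# Made-up daily returns: estimation window before the insider buy, then the event window.
est_market = [0.01, -0.02, 0.005, 0.015, -0.01]
est_stock = [0.012, -0.025, 0.004, 0.02, -0.011]
evt_market = [0.0, 0.01]
evt_stock = [0.03, 0.02]
print(cumulative_abnormal_return(est_stock, est_market, evt_stock, evt_market))
```

A real study would average this over many events and test whether the mean is statistically different from zero.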

Cool metadata table, I'm curious about the ticker source (Form 4, 10-K, some SEC metadata publications?).

My comment about CUSIP linking was trying to illustrate a more general issue: it's difficult to use SEC data extractions to answer empirical questions if you don't have a good securities master to link against (reference data + market data).

Broadly speaking a securities master will have 2 kinds of data: reference data (identifiers and dates when they're valid) and market data (price / volume / corporate actions... all the stuff you need to accurately compute total returns). CRSP/Compustat (~$40k/year?) is the gold standard for daily frequency US equities. With a decent securities master you can do many interesting things. Realistic backtests for the kinds of "use an LLM to code a strategy" projects you see all over the place these days. Or (my interest) a "papers with code" style repository that helps people learn the field.

What you worry about with bad data is getting a high t-stat on a plausible sounding result that later fails to replicate when you use clean data (or worse, try to trade it). Let's say your securities master drops companies 2 weeks before they're delisted... just holding the market is going to have serious alpha, because the worst returns vanish from the sample (survivorship bias). Ditto if your fundamental data reflects restatements, since you'd be using numbers that weren't known at the time (look-ahead bias).
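
A toy illustration of the delisting problem, with made-up weekly returns for one stock that collapses right before it is delisted. The only assumption is that the vendor silently drops the final two weeks.

```python
def compound(returns):
    """Total return from a sequence of periodic simple returns."""
    total = 1.0
    for r in returns:
        total *= 1.0 + r
    return total - 1.0


# Made-up path for a stock that collapses in its last two weeks before delisting.
weekly_returns = [0.01, 0.02, -0.01, -0.40, -0.60]

true_return = compound(weekly_returns)
truncated_return = compound(weekly_returns[:-2])  # vendor drops the last 2 weeks
print(f"true: {true_return:.1%}, truncated: {truncated_return:.1%}")
# → true: -75.5%, truncated: 2.0%
```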

On the reference data front, the Compustat security table has (from_date, thru_date, cusip, ticker, cik, name, gics sector/industry, gvkey, iid) etc all lined up and ready to go. I don't think it's possible to generate this kind of time-series from cheap data vendors. I think it could be possible to do it using some of the techniques you described, and maybe others. E.g. get (company-name, cik, ticker) time-series from Form 4 or 10-K. Then get (security-name, cusip) time-series from the 13F security lists the SEC publishes quarterly (PDFs). Then merge on date/fuzzy-name. Then validate. To get GICS you'd need to do something like extract industry/sector names from a broad index ETF's quarterly holdings reports, whose format will change a lot over the years. Lots of tedious but valuable work. Also a lot of surface area to leverage LLMs. I dunno, at this point it may be feasible to use LLMs to extract all this info (annually) from 10-Ks.
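
A minimal sketch of the "merge on date/fuzzy-name" step, assuming each company row carries a validity range and using difflib's SequenceMatcher as the similarity measure. The row layout, the word list in normalize, the 0.85 threshold, and all values are assumptions for illustration.

```python
from datetime import date
from difflib import SequenceMatcher


def normalize(name):
    """Uppercase, strip punctuation, and drop a few common corporate words."""
    cleaned = "".join(ch for ch in name.upper() if ch.isalnum() or ch == " ")
    drop = {"INC", "CORP", "CORPORATION", "CO", "LTD", "COM"}
    return " ".join(w for w in cleaned.split() if w not in drop)


def best_match(security_name, as_of, companies, threshold=0.85):
    """Return the company row valid on as_of with the most similar name, or None."""
    best, best_score = None, threshold
    for row in companies:
        if not (row["from_date"] <= as_of <= row["thru_date"]):
            continue
        score = SequenceMatcher(
            None, normalize(security_name), normalize(row["name"])
        ).ratio()
        if score >= best_score:
            best, best_score = row, score
    return best


# Made-up (company-name, cik, ticker) rows with validity dates.
companies = [
    {"name": "Example Widgets Inc.", "cik": 1111111, "ticker": "EXW",
     "from_date": date(2010, 1, 1), "thru_date": date(2019, 12, 31)},
    {"name": "Example Holdings Corp", "cik": 2222222, "ticker": "EXH",
     "from_date": date(2020, 1, 1), "thru_date": date(9999, 12, 31)},
]
# Made-up 13F list entry (security-name) as of a given quarter end.
match = best_match("EXAMPLE WIDGETS INC", date(2015, 3, 31), companies)
print(match["ticker"] if match else "no match")
```

Filtering on the date range first keeps a renamed or reused name from matching the wrong era; the validation step would then review anything near the threshold.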

On the market data front, the vendors I've seen have random errors. They tend to be worst for dividends/corporate-actions. But I've seen BRK.A trade $300 trillion on a random Wednesday. Haven't noticed correlation across vendors, so I think this one might be easy to solve. Cheap fundamental data tends to have similar defects to cheap market data.
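
If errors really are uncorrelated across vendors, one simple approach is to take the cross-vendor median and flag whoever disagrees with it. A minimal sketch, with hypothetical vendor names, made-up numbers, and an arbitrary 5% tolerance:

```python
from statistics import median


def consensus(values, rel_tol=0.05):
    """Median across vendors plus the vendors that deviate by more than rel_tol."""
    mid = median(values.values())
    outliers = sorted(v for v, x in values.items() if abs(x - mid) > rel_tol * abs(mid))
    return mid, outliers


# Made-up dollar volume for one ticker on one day from three hypothetical vendors.
dollar_volume = {"vendor_a": 1.02e9, "vendor_b": 1.00e9, "vendor_c": 3.0e14}
print(consensus(dollar_volume))
```

With only three vendors this catches a single bad print but not two vendors sharing the same upstream error.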

Sorry for the long rant, I've thought about this problem for a while but never seriously worked on it. One reason I haven't undertaken the effort: validation is difficult so it's hard to tell if you're actually making progress. You can do things like make sure S&P500 member returns aggregate to SPY returns to see if you're waaay off. But detailed validation is difficult without a source of ground truth.
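
The coarse aggregation check can be sketched like this, assuming start-of-period index weights are available. The tickers, weights, returns, and the 20 bps tolerance are all made up; a real check would use every index member and allow for ETF fees and tracking error.

```python
def aggregate_return(weights, returns):
    """Weighted average of member returns, using start-of-period weights."""
    total_weight = sum(weights.values())
    return sum(weights[t] * returns[t] for t in weights) / total_weight


def is_way_off(weights, returns, benchmark_return, tol=0.002):
    """True when the aggregate misses the benchmark by more than tol (20 bps here)."""
    return abs(aggregate_return(weights, returns) - benchmark_return) > tol


# Made-up three-stock "index" for one day.
weights = {"AAA": 0.5, "BBB": 0.3, "CCC": 0.2}
returns = {"AAA": 0.010, "BBB": -0.005, "CCC": 0.020}
print(is_way_off(weights, returns, benchmark_return=0.0075))
```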

jgfriedman1999
3 hours ago
Love the long rant.

re: metadata table - it's constructed from the SEC's submissions.zip, which they update daily. What my script does is download the zip, decompress just the bytes where the information (ticker, SIC code, etc) is stored, then convert it into a CSV.
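
A simplified sketch of the zip-to-CSV step. It is not the actual script: it reads each JSON member in full instead of decompressing only the relevant bytes, and the member naming and the field names (cik, name, tickers, sic) are assumptions. It runs against a tiny in-memory zip with made-up content so no download is needed.

```python
import csv
import io
import json
import zipfile


def zip_to_csv(zip_file, out):
    """Write one CSV row per JSON member that has a 'cik' field."""
    writer = csv.writer(out)
    writer.writerow(["cik", "name", "tickers", "sic"])
    with zipfile.ZipFile(zip_file) as archive:
        for member in archive.namelist():
            if not member.endswith(".json"):
                continue
            data = json.loads(archive.read(member))
            if "cik" not in data:
                continue
            writer.writerow([
                data["cik"],
                data.get("name", ""),
                "|".join(data.get("tickers") or []),
                data.get("sic", ""),
            ])


# Build a tiny in-memory zip with made-up content so the sketch runs offline.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "CIK0001234567.json",
        json.dumps({"cik": "1234567", "name": "Example Corp", "tickers": ["EXM"], "sic": "9999"}),
    )
buffer.seek(0)
out = io.StringIO()
zip_to_csv(buffer, out)
print(out.getvalue())
```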

And yep! Agree with most of this. Currently, I'd say my data is in the stage where it's useful for startups / phd research and some hedge funds / quant stuff (at least that's who is using it so far!)

I've seen the trillion dollar trades, and they're hilarious! You see them every so often in Form 3, 4, and 5 disclosures.

re: LLMs, this is something I'm planning to move into in a month or two. I'm mostly planning to use older NLP methods which are cheaper and faster, while using LLMs for specific stuff like structured output. e.g. WRDS BoardEx data can be constructed from 8-K Item 5.02 filings.

I think the biggest difficulty wrt data is just that the raw data ingest is annoying AF. My approach has been to make each step easy -> use it to build the next step.

jgfriedman1999
1 day ago
How it works:

Websocket:

1. Two AWS EC2 t4g.nano instances polling the SEC's RSS and EFTS endpoints. (RSS is faster, EFTS is complete.)

2. When new submissions are detected, they are sent to the Websocket (t4g.micro websocket, using Go for greater concurrency).

3. Websocket sends signal to consumers.
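
The real service is written in Go, so this is only a Python sketch of one piece of logic it needs: making sure a submission seen by both pollers is announced once. The class name and the accession numbers are made up, and no actual polling or websocket code is shown.

```python
class SubmissionDeduper:
    """Tracks accession numbers already announced so each is emitted once."""

    def __init__(self):
        self._seen = set()

    def new_items(self, accession_numbers):
        """Return unseen accession numbers in input order and remember them."""
        fresh = []
        for acc in accession_numbers:
            if acc not in self._seen:
                self._seen.add(acc)
                fresh.append(acc)
        return fresh


# Made-up poll results: RSS sees things first, EFTS later fills in what RSS missed.
deduper = SubmissionDeduper()
print(deduper.new_items(["0000000000-25-000001"]))                           # from RSS
print(deduper.new_items(["0000000000-25-000001", "0000000000-25-000002"]))  # from EFTS
```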

Archive:

1. One t4g.micro instance. Receives notifications from websocket, then gets submissions SGML from the SEC.

2. If submission is over size threshold, compresses with zstandard.

3. Uploads submissions to Cloudflare R2 bucket. (Zero egress fee, just class A / B operations.)

4. Cloudflare R2 bucket is proxied behind my domain, with caching.
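
A minimal sketch of the "compress only above a size threshold" step. It uses the standard library's zlib as a stand-in so it runs without extra packages; the service described uses zstandard. The threshold value, the function name, and the encoding labels are assumptions.

```python
import zlib

SIZE_THRESHOLD_BYTES = 1024  # hypothetical cutoff; the real threshold is not stated


def prepare_for_upload(payload: bytes):
    """Return (body, encoding); compress only payloads above the threshold."""
    if len(payload) > SIZE_THRESHOLD_BYTES:
        # zlib is a stdlib stand-in here; the service described uses zstandard.
        return zlib.compress(payload), "zlib"
    return payload, "identity"


small = b"<SEC-DOCUMENT>tiny</SEC-DOCUMENT>"
large = b"<SEC-DOCUMENT>" + b"x" * 10_000 + b"</SEC-DOCUMENT>"
for doc in (small, large):
    body, encoding = prepare_for_upload(doc)
    print(encoding, len(doc), "->", len(body))
```

Storing the encoding next to the object lets a downloader know whether it has to decompress.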

RDS:

1. ECS Fargate instances set to run daily at 9 AM UTC.

2. They download submissions from the archive, parse them, and upload the results into an AWS dbt.medium MySQL RDS instance.

3. They also handle reconciliation for the archive in case any filings were missed.
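
The reconciliation step boils down to a set difference between what the SEC lists and what the archive holds. A minimal sketch, with a hypothetical function name and made-up accession numbers; how the two lists are actually obtained is not shown.

```python
def find_missing(expected, archived):
    """Accession numbers the SEC lists that the archive does not have, sorted."""
    return sorted(set(expected) - set(archived))


# Made-up accession numbers: what an SEC index reports vs. what is in the bucket.
expected = ["0000000000-25-000001", "0000000000-25-000002", "0000000000-25-000003"]
archived = ["0000000000-25-000001", "0000000000-25-000003"]
print(find_missing(expected, archived))
```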
