Instead of that, companies are sucking up as much crap as possible, tokenizing it, scrubbing it, and then adding “safety” to it.
Reality is always much stranger than fiction.
The biggest technical hurdle to sharing the work among interested parties is that the web only authenticates the pipe, not the content.
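To make "authenticates the pipe, not the content" concrete, here's a minimal sketch in Python; the digest-publishing scheme is an assumption for illustration, not an existing standard. TLS only proves which server you talked to, whereas a published content digest lets anyone verify a copy regardless of which mirror served it.

    import hashlib

    def content_digest(payload: bytes) -> str:
        # Identify the content itself, independent of which server handed it over.
        return hashlib.sha256(payload).hexdigest()

    def verify_copy(payload: bytes, published_digest: str) -> bool:
        # A copy fetched from any mirror checks out if the digests match.
        return content_digest(payload) == published_digest

    original = b"<html>...archived page...</html>"
    digest = content_digest(original)      # publisher announces this once
    assert verify_copy(original, digest)   # consumers verify the copy, not the pipe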
"Our goal is to democratize the data so that everyone, not just big companies, can do high-quality research and analysis."
Because they share it openly, including with those doing AI, they wind up on "AI crawler" lists, which are increasingly used by blocking tools that just "use the AI list", whether by people who don't like AI or, quite ironically, by people trying to prevent the excess traffic that poorly mannered AI crawlers cause. (Common Crawl's crawler is well mannered: it uses a clear user-agent, respects robots.txt including crawl-delay, etc.)
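For what "respects robots.txt including crawl-delay" looks like mechanically, here's a minimal sketch using Python's standard urllib.robotparser; the "ExampleBot" token and URLs are placeholders, not Common Crawl's actual configuration.

    import time
    import urllib.robotparser

    USER_AGENT = "ExampleBot"  # illustrative token, not any real crawler's

    rp = urllib.robotparser.RobotFileParser("https://example.org/robots.txt")
    rp.read()
    delay = rp.crawl_delay(USER_AGENT) or 1.0  # honor Crawl-delay, else a polite default

    for url in ("https://example.org/", "https://example.org/page2"):
        if not rp.can_fetch(USER_AGENT, url):
            continue            # disallowed by robots.txt, skip it
        # ... fetch url here with a descriptive User-Agent header ...
        time.sleep(delay)       # wait between requests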
No, it's really not, as most of the people who actually spend the time and effort to produce that content did not consent to it being used to train AI.
> copyright & capitalism
That's a really disingenuous way to say "the creators of that data didn't consent to training or commercial use and I want to steal their effort".
To clarify: the creators of the majority of online content haven't consented to their content being used to build AI models for any company or organization. For US-based "creators", that includes both domestic companies like Anthropic, OpenAI, and Google, and foreign companies like ByteDance.
I don’t see how copyright survives long term in this sort of context
Narrow tariffs, competitive subsidies, sanctions, divestment, export restrictions are all viable deterrents.
Aside from this behavior, China has been subsidizing industries and dumping products into economic rivals for years. Never mind all the IP theft. It’s absurd the US has responded so weakly for the last 30 years.
We've known for decades that protectionist policies and subsidies make industries less competitive, not more. It is literally textbook stuff [1].
A typical result of protectionism is something like GM in the US, where they grow uncompetitive because they don't have to compete with foreign rivals for domestic demand.
You get similar uncompetitive dependent behavior from subsidies - just look at Intel right now.
1. https://web.pdx.edu/~ito/Krugman-Obstfeld-Melitz/8e-text-PDF...
Note the author here is Paul Krugman, who won a Nobel Prize for his work on international trade.
2. VCs routinely use the strategy of subsidizing their startups to "disrupt" industries until they dominate a market. China does the same thing.
3. The costs of sacrificing domestic supply chains and development capacities do not fit neatly into macroeconomic models. National security issues present similar difficulties. Do arguments around comparative advantage apply to hostile adversaries that routinely break laws (e.g. ByteDance, IP theft) and provide natural resources to enemies?
4. While the US did not succeed with Intel, China has routinely subsidized industries while enforcing antitrust with far more success than the US. See the Alibaba breakup or the recently implemented antimonopoly laws as examples: https://www.gibsondunn.com/antitrust-in-china-2023-year-in-r...
An argument could be made that any increase in competition is a side effect, rather than the main goal, of their antimonopoly changes. Until China explains the full Jack Ma story, anything Alibaba-related will be seen as politically driven rather than economic.
The point remains that, in general, if a foreign country is oversubsidizing an industry, it's a good idea -- even if you don't like them -- to just buy a ton of the stuff.
2. It remains to be proven how good an idea the Silicon Valley VC model has been now that the zero-interest-rate environment has ended. Uber has had all of 5 profitable quarters in 15 years. Twitter had something like 4.
Many of those VC hypergrowth companies, except for a dozen or so, are effectively a big game of hot potato. The gap between investment and profit made is often still in the 9 or 10 figure range.
I'd wait another decade or so before proclaiming it's a good strategy. Predatory pricing doesn't even work in theory - there was effectively a chapter on it in an industrial organization class I took, though I'd have to find the material again. It might work in practice if there are other effects not taken into account in the theoretical model, though.
3. I would agree with you there; both banning TikTok and subsidizing Intel (foundries only) are ideas I agree with, even though they're controversial.
4. I wouldn't argue that the Alibaba breakup was a good example - this sort of move creates a huge chilling effect on investors and entrepreneurs in China. To be realpolitik about it, the breakup was much more about Xi consolidating his grip on power than anything else.
Reminder that in the last 30 years economists have variously told us that there's a (high) natural rate of unemployment that we couldn't change (which has recently been completely debunked). That raising the minimum wage costs jobs. That bank deregulation is good. And so on. It's just not an empirical field, so I don't believe what's in the textbook. It's also open to lobbying from for-profit entities for specific viewpoints, in a way that a real science usually isn't.
Agreed, especially when the textbook being referenced was written by a polarizing figure like Krugman. I wonder, how much "textbook stuff" was removed from textbooks after 2008?
https://www.nobelprize.org/prizes/economic-sciences/2008/kru...
His NYT column might be controversial, but his work in international trade absolutely isn't.
It's like saying Chomsky is controversial to counter-argue his work in linguistics. Chomsky might be a political hack, but his opinion on formal grammars is probably sound.
> I wonder, how much "textbook stuff" was removed from textbooks after 2008?
Basically nothing, to be honest? What should have changed?
The banks that collapsed into a financial crisis were effectively committing fraud. The Federal Reserve was publishing opinions that the housing sector was at risk as early as 2005-2006.
Also, it's difficult for a central bank to know the extent of the mispricing when there's active concealment of risks (e.g. backroom deals with insurers and risk assessors) - you need full-on auditing to spot that.
The bigger problem with 2008 is that almost no one responsible went to jail.
If that's your opinion it's pretty clear your engagement with the field of macroeconomics is several degrees removed from the actual research.
Assuming you're here in good faith, I would ask you to actually browse a dozen or so recent, randomly picked papers in the field you claimed is "not empirical", skim them, and note whether each is empirical or theoretical work.
Then come back here and seriously argue that the field is "not empirical". I'll give you a jump start, here's two good sources for recent macro papers:
NBER Macro preprints: https://www.nber.org/topics/macroeconomics?page=1&perPage=50
AEJ Macro: https://www.aeaweb.org/journals/mac/forthcoming
Of course that won't be your current view of the field if your knowledge comes from the opinion section of newspapers and HN comments. But, again, I'm assuming you want to challenge your views in good faith here.
> Reminder that in the last 30 years economists have variously told us that there's a (high) natural rate of unemployment that we couldn't change (has recently been completely debunked).
Not sure where you get that opinion from; the NAIRU published by the CBO went from a high of 6.2% during the energy crisis of the 1970s to around 4.4% today:
https://fred.stlouisfed.org/series/NROU
30 years ago the NAIRU was 5.4% and today it's 4.4%, saying it was "completely debunked" makes no sense and I'm seriously wondering which source you got this claim from.
Moreover, the concept of a natural rate of unemployment that's somewhere above 0% is uncontroversial: there's naturally a time gap when looking for a new job, even in an economy at "full employment capacity".
> That raising the minimum wage costs jobs.
Unless your economics education stopped at the first week of Microeconomics 101, or comes entirely from political discourse or Reddit, this isn't the position of basically any economist.
Seriously, here's the first recent (2024) highly cited research review I could find from 4 seconds of googling:
https://www.nber.org/system/files/working_papers/w32878/w328...
First, note the review is 123 pages long. There's clearly some subtlety past "minimum wage bad, unemployment high!" But we can skim and jump to the conclusion. To quote:
""" While the evidence is not unanimous, a reasonable conclusion from the existing literature is that minimum wage policies have had limited direct employment effects while significantly increasing the earnings of low-wage workers—at least at certain levels and in particular economic contexts.
"""
Also, by the way, the minimum wage employment effect is studied in your labor economics class, which is micro, not macro. Which points again to the question of where you're sourcing your claims from.
I don't see a proper response for other countries when dealing with such entities. Most likely it's going to be an equally blurry mess of trade policy, foreign policy, and military policy.
Google Intel subsidy
US subsidises and protects tons of industries (agriculture, chips, automobiles, aerospace).
Does that mean that other countries can impose tariffs and sanctions on the US to punish this obviously anticompetitive and anti-free market behaviour? Or is it just the normal stuff we'd expect a country to do?
Which is why their chunk of amazon asia is currently behind a ban.
I kinda feel like when people say "indiscriminate" they really mean it. There is no regard for courtesy or common sense.
I think there was a story a while ago, possibly apocryphal, about someone who ran a disposable email service with a bunch of random-looking domains. They noticed bot traffic repeatedly hitting the page that shows one of their domains but never clicking through to actually activate an address. Guessing that the scraper was trying to find and block all these domains from being used to sign up for its services, the admin of the disposable email site added a function where, if it detected bot traffic, it would occasionally return domains like "gmail.com" in the text field.
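A hypothetical reconstruction of that trick, just to illustrate the idea; the bot heuristic, probability, and domain names are all made up here.

    import random

    REAL_DISPOSABLE_DOMAINS = ["mailbin-example.test", "tempbox-example.test"]
    POISON_DOMAINS = ["gmail.com"]

    def looks_like_bot(user_agent: str, clicked_activation: bool) -> bool:
        # e.g. repeated hits on the domain page without ever activating an address
        return "bot" in user_agent.lower() or not clicked_activation

    def domain_for_request(user_agent: str, clicked_activation: bool) -> str:
        # Occasionally hand a suspected blocklist-harvester a domain you'd be
        # happy to see blocked; real visitors always get a working domain.
        if looks_like_bot(user_agent, clicked_activation) and random.random() < 0.3:
            return random.choice(POISON_DOMAINS)
        return random.choice(REAL_DISPOSABLE_DOMAINS)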
Of course they can. They already do.
The only thing that stops this is when a nation has more to lose than to gain... and that will happen soon, as other emerging economies follow the grand tradition of cheating their way to prosperity. Then slowing down the competition will be the only play.
All of us born after 1970 have seen Japan, Taiwan, Hong Kong (before being re-absorbed), and China run the cheat your way to the top playbook and will see it at least a few more times.
American industry was built on IP theft[1]
British empire was built on IP theft [2]
Byzantine empire was built on IP theft [3]
[1] https://en.m.wikipedia.org/wiki/Samuel_Slater
[2] https://www.smithsonianmag.com/history/the-great-british-tea...
It was originally a way to motivate creation of artistic works, since they used to involve a lot of effort.
Copyright isn't required if you use a tool built upon violating copyright?
Breathing isn't required if someone strangles everyone to death.
(Now we can all transcend breathing, in the new post-living higher plane of existence. Which surely is viable and great, and totally won't be abused to enrich the worst people, to the detriment of everyone else.)
Do you believe that is going to happen?
That's why it's copyright and not artistright.
not a lot of effort by the parasites
You are aware of the way things are trending, right? Is the trend showing any sign that it might reverse, for the rest of human civilization's time?
I liked when things were simpler too, but the reality (for better or worse) seems to be that AI is not going away.
AI is in a hype cycle at the moment. Once tech companies realise that they're not going to be able to recoup the billions of dollars they've dumped into the money hole, they'll either raise prices or withdraw products (or a mixture of both).
Consumers, by and large, don't like generative AI. Or at least they don't like it enough to make it pay for itself.
we'll see what happens once the parasites have killed the host
So true, then abstract expressionism appeared and suddenly copyright wasn't a thing anymore.
This is a way to permanently entrench their positions while maintaining ownership. Not an eradication of copyright.
TikTok[’s] parent launched a [web] scraper [that’s] gobbling up [the] world’s [online] data 25x[-times] faster than OpenAI
But if you’re going to do it, do it properly. I would have hung it off the Like button with an ungodly ZooKeeper ensemble and trained a GBDT on which parts of which URLs I could just obliterate with Proxygen.
We’d have it all in about 4 days. Don’t ask me how I know.
The second worst thing about the AI megacorps after being evil is being staffed by people who use Cursor.
Edit: on the back of the valued feedback of a valued commenter, I'd like to acknowledge that I made a sloppy mistake and have corrected it in haste, making no excuses. It would be super great if the largest private institutions in the history of the world took the same care with, give or take, everything they do that I take with trolling on a forum.
Top shelf unintentional irony.
"Free market" to them = the market where they get to write the rulebook.
Any specific reason?
- web design (basic features take years to implement, and when done break the website on mobile)
- UI/UX patterns (cookie-cutter component library elements forced into every interface without any tailoring to suit how the product is actually used, which also makes a Series C venture indistinguishable from something set up in a weekend)
- backend design (turns out they've been hemorrhaging money on serverless Vercel function calls instead of using Lambda and spending a minute implementing caching for repeat requests; a minimal sketch of that kind of caching follows this list)
- developer docs (even when crucial to business model, often seems AI generated, incomplete, incoherent)
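For the caching point above, a minimal sketch of what "caching for repeat requests" could look like; the decorator, TTL, and handler are illustrative, and a real setup would likely use a shared cache rather than per-process memory.

    import time
    from functools import wraps

    def ttl_cache(ttl_seconds: float):
        def decorator(fn):
            store = {}
            @wraps(fn)
            def wrapper(key):
                now = time.monotonic()
                hit = store.get(key)
                if hit and now - hit[0] < ttl_seconds:
                    return hit[1]          # serve the cached response
                value = fn(key)            # only pay for the expensive call on a miss
                store[key] = (now, value)
                return value
            return wrapper
        return decorator

    @ttl_cache(ttl_seconds=60)
    def render_page(path: str) -> str:
        return f"<html>expensive render of {path}</html>"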
And this usually comes from hiring far fewer developers than needed, where those that are hired are 10x Cursor/GPT developers who trust it to have done a comprehensive job at what seems like a functional interface on the surface, and who have little frame of reference or training for what constitutes good design in any of these aspects.
Don’t downvote the person who submitted a substantial comment far more valuable than its GP.
Oh but why can't the AI do basic backend programming anymore? /s
I meant people who don’t work at Cursor.
What's Cursor?
I would not be surprised if there are still some auto-generated link directories left over from the "golden ages" of blackhat SEO.
Do any of these scrapers uniquely and unambiguously identify themselves as a bot?
Or are those days long over?
Whether those days are over or not will greatly depend on the outcome of the ongoing New York Times vs OpenAI lawsuit. If OpenAI wins, then it pretty much green-lights all the other scrapers to feast upon the web.
They have dedicated user agents for search crawling, for when a user directly asks about a site, and for training data.
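For reference, the separate agent tokens OpenAI has documented are GPTBot (training), OAI-SearchBot (search), and ChatGPT-User (user-triggered fetches); a robots.txt that treats them differently might look like the sketch below, though check the current documentation before relying on the exact tokens.

    # Sketch: allow user-triggered and search fetches, opt out of training crawls.
    User-agent: GPTBot
    Disallow: /

    User-agent: OAI-SearchBot
    Allow: /

    User-agent: ChatGPT-User
    Allow: /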
Maybe that's their intent, but this was only a month ago: https://www.gamedeveloper.com/business/-this-was-essentially...
> "The homepage was being reloaded 200 times a second, as the [OpenAI] bot was apparently struggling to find its way around the site and getting stuck in a continuous loop," added Coates. "This was essentially a two-week long DDoS attack in the form of a data heist."
The ones that don't are the ones people are trying to block the most. Sometimes Google or Bing go crazy and start scraping the same resource over and over again, but most scraping tools causing load peaks are the badly written/badly configured/malicious ones.
I realize this is somewhat off-topic, but the big companies kind of destroyed the internet with all the JavaScript frameworks and whatnot.
It seems like all of them do, yeah: https://github.com/eob/isai/blob/b9060db7dc1a7789b322b8c2838...
Not sure if they're really "scrapers" though; if they're initiated by a user for a single webpage/website, they're more like "user agents" in that case, unless they automatically fan out from there to get more content.
It would be nice, then, for the investigators to help people with the identifying markers for such crawlers, apart from a mention of darkvisitors, which it seems is a paid service to "Block agents who try to ignore your robots.txt".
I'm not sure how much that could be trusted, given their business model, either.
Which does not respect robots.txt and definitely is just scraping.
AS blocks are the only really effective tool now; there are many scrapers that don't even send an honest user agent.
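A minimal sketch of the AS-level approach: assuming you've already pulled the prefixes announced by the AS you want to block from your routing data source of choice (not shown), checking a client IP against them is simple. The prefixes below are documentation ranges, not any real network's.

    import ipaddress

    # Prefixes announced by the AS you want to block, from your routing data source.
    BLOCKED_PREFIXES = [ipaddress.ip_network(p) for p in ("192.0.2.0/24", "198.51.100.0/24")]

    def is_blocked(client_ip: str) -> bool:
        addr = ipaddress.ip_address(client_ip)
        return any(addr in net for net in BLOCKED_PREFIXES)

    # Drop the request early in your proxy/middleware when is_blocked() is True.
    assert is_blocked("192.0.2.10")
    assert not is_blocked("203.0.113.5")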
Today is actually pretty good; there’s some real-looking UA traffic in the top 10.
Uh, no... bytespider has been around for a long time...