I guess the question for leadership is this: two of the three pillars, namely security and quality, are at odds with the third pillar, AI innovation. Which side do you pick?
(I know you mean well and I love you, Scott Hanselman, but please don't answer this yourself. Please pass this on to the leadership.)
Microsoft was unique among the companies I worked for in that they gave you some guidelines and then let you blog without having to go through some approval or editing process. It made blogging much more personal and organic IMO; company-curated blog posts read like marketing.
I didn’t see the original post but it looks like somebody made a bad judgment call on what to put in a company blog post (and maybe what constitutes ethical activity) and that it was taken down as soon as someone noticed.
I care much less about whether the person exercised good judgment in posting, and don’t care (and am happy) that there was not some process that would have caught it pre-publication.
I care much more if the person works in a team that believes that copyright infringement for AI training is a justifiable behavior in a corporate environment.
And now we know that is a thing, and I suspect that there will be some hard questions asked by lawyers inside the company, and perhaps by lawyers outside the company.
If you or anyone else who sees this wants to see the original post, it's still available in the Wayback Machine: https://web.archive.org/web/20260105115129/https://devblogs....
It feels out of character for a company like Microsoft to have such a policy, but I agree that it's insanely cool that some very cool folks get to post pretty freely. Raymond Chen could NEVER run his blog like that at FAANG.
Bruce Dawson was publishing debugging stories (including things debugged about Google products done as part of his job) for the entire time he was working at Google: https://randomascii.wordpress.com/
I was/am a nobody, I have no idea how that happened and it was mind blowing that MS was interacting with me.
Why do you assume that reviewing docs is a lower bar than reviewing code, and that if docs aren't being reviewed it's somehow less likely that code is being reviewed?
There's a formal process for reviewing code because bugs can break things in massive ways. There may not be the same degree of rigor for reviewing documentation, because it's not going to stop the software from working.
But one doesn't necessarily say anything about the other.
I realize BSOD is no longer nearly as common as it once was, but let's not forget that Windows used to be very fragile indeed.
It was more robust 5 years ago than it is today.
Or at least that's been my impression. I can't back that up with hard data.
I have never even heard of a software company that acts otherwise (except IBM, and much of the world of Silicon Valley software engineering is reactionary to IBM's glacial pace).
I'm not saying docs == code for importance is a bad way to be, just that if you can name firms that treat them that way other than IBM (or aerospace), I'd be interested to learn more.
What I'm saying is, you have to review code to get it out the door with a certain degree of quality. That's your core product. That's the minimum standard you have to pass, the lowest bar.
In contrast, reviewing documentation is usually less core. You do that after the code gets reviewed. If there's time. If it doesn't get done, that's not necessarily saying anything about code quality.
Even if it's easier to review documentation, that doesn't mean it's getting prioritized. So it's not a lower bar in the sense that lower bars get climbed first.
You reason in circles
Organizations are large, so much so that different levels of rigor apply across different parts of the organization. Furthermore, more rigorous controls would be applied to code than to documentation (you would assume).
I wasn't mad, just disappointed.
https://www.kaggle.com/datasets/shubhammaindola/harry-potter...
More than just using the data, it seems that linking to a copy that claims the dataset is public domain would be problematic copyright-wise.
Also interesting: this blog post has been up since November of 2024. It's very surprising to me that Microsoft hasn't taken it down yet.
Would it? Sounds to me like the blame lies on the person uploading the dataset under that license, unless there is some reasonable person standard applied here like 'everyone knows Harry Potter, and thus they should know it is obviously not CC0'
Yes, there's an expectation that you put in some minimum amount of effort. The license issue here is not subtle: the Kaggle page says they just downloaded the eBooks and converted them to txt. The author is clearly familiar enough with HP to know that it's not old enough to be public domain, and the Kaggle page makes it pretty clear that they didn't get some kind of special permission.
If you want to get more specific on the legal side then copyright infringement does not require that you _knew_ you were infringing on the copyright, it's still infringement either way and you can be made to pay damages. It's entirely on you to verify the license.
Why wouldn't that apply?
So in short, I kept my mouth shut. I assumed I would lose my job if my public comment reached the right people.
https://github.com/Azure-Samples/azure-sql-db-vector-search/...
https://devblogs.microsoft.com/azure-sql/?p=4796
"Build a RAG App in 5 Minutes
Ever tried setting up an AI-powered project on Azure and felt overwhelmed? As a student or first-time user to cloud computing, I've been there too. The idea of creating a chatbot or search app using GPT sounds exciting, but the process of setting up everything right from the vector database, provisioning OpenAI models, to integrating them, it can f..."
I'm disappointed people continue to use it.
I'm surprised that JKR's people haven't come down like a tonne of bricks on Kaggle / Microsoft.
Does anyone know whether there is some special reason why this has lasted so long without being taken down?
[1] https://news.ycombinator.com/item?id=47057829, "Microsoft morged my diagram". It was in a discussion there that someone pointed out this article linking to full downloads of the Harry Potter novels, which I thought deserved more visibility.
But this is just a lie.
Approximately nobody is prosecuted for copyright infringement.
We’re moving the goalposts from the government systematically targeting normal people “if caught”, to only a handful of civil cases.
(done, contacted her lawyers too)
If anything, Kaggle would be on the hook for including the data as CC0. Or perhaps Shubham Maindola, for uploading it. In fact, the "provenance" listed would give me chills. Crazy how this got a 10.0 score. "I downloaded the ebooks of Harry Potter. Then converted them to txt files."
Archived copy: https://web.archive.org/web/20260105115129/https://devblogs....
It is very worrying that people with no ethics work for these trillion dollar companies who are supposed to be shaping the technology of tomorrow.
Disrespecting the copyright on a multi-billion dollar franchise hardly comes close to the major unethical behavior the trillion dollar companies are committing.
[1] actual indian
So far, the only thing I've found AI to be consistently good at is entertainment of the humorous kind.
Everything new is AI slop, and there seems to be no coming back from it.
Very low code. Infinite scale. Name a better AI startup to invest in.
https://news.microsoft.com/source/2004/02/12/statement-from-...
In case the new anti-copyright Microslop memory-holes that link:
https://web.archive.org/web/20260215220230/https://news.micr...
The tutorial could have used that leaked source code for "educational purposes", as many here claim.
Although it seems this is not reciprocal. Rule for thee, but not for me.
https://web.archive.org/web/20260105115129/https://devblogs....
Nevertheless pretty egregious oversight (incompetence?) and something that shouldn't have been published.
If it comes from a site claiming it was under a licence when it was not, the misdeed is done by the person who provided the version carrying the licence.
Even if MS could claim that they were acting in good faith, there really isn't much legal wiggle room for that. But it doesn't even come to that, because I don't think anyone would buy that they really thought that the Harry Potter books were under CC0.
Same thing applies here.
Up to 80% of all works that are within their copyright term are accidentally in the public domain. A well-known example is Night of the Living Dead. It is not your job to check that the copyright on a work you use is the correct one.
And it is your job to check that you have the rights to use other people's work. Ignorance is not a defence.
Which ones? As far as I was aware, it's a crime to redistribute copyrighted works, not receive.
A search index might also contain copyrighted material. As long as it's used for search queries as opposed to regurgitation there's no problem. Search indexes and LLMs are both clearly very beneficial tools to have access to.
Since we're talking about an electronic system the search index example is the more directly relevant one. Anyone who wants to object to LLMs is going to need to take care to ensure consistency with his views on Google's search index.
Also, can you point out how copyright law changes because we're using an "electronic system" as opposed to an "analog system"?
Why exactly?
the merge commits in those repositories are all digitally signed by GitHub public key, so the previous history is fully authenticated and non-repudiable
so any copies now can be trivially proven to be genuine output by Microslop
hoisted by your own petard
signed merge commit is: 987eee6af61788647ae0cab82ae8a5d9402a5bd0
PGP signature (using GitHub's key: B5690EEEBB952194) is:
for posterity:
-----BEGIN PGP SIGNATURE-----
wsFcBAABCAAQBQJnPIphCRC1aQ7uu5UhlAAAUgMQACyp7apkh0e413K7ipGd7Z+K
JCMq93GoJm4OSgzzZzCp1DbeEq2u1mX1ZAXLq5XKqM0cL6cTg13IF4oumq8QmTzQ
bFykqKfrkCDSTIa2v5CucJedmIoJl976jX96bnV8YXgoKx8/43044galo23bjoJ8
9tUcVnC10FYj7NTI9/uCN9C3f2Up3t9xUaJzJv3OdgjJ9B3cNwYBfF6sDCj3QnUu
AWRNdGIyqyO1WKnj2XL2Qo9jMWNX3uHSBYYGqIvZqu2bjpYS89Dt3X086JlLdQG9
Pef2PHX6VeZ6j8J4NPqi28mB2n9Dn7V6q0SQIF1z4hsa9fLC0kljyrrO3T/RT6Ut
D8r3Y7vjGUHPNkVXSo1oNCiNMV9LjDQwiJc/AuF6smupxivIFCKe8nDPBlCvi6gr
uPz5KK5MfpmG5rO2+NA0LcrUPAk6F3nxDI46+Lsu2nCvO+pOauQQ+oUvxJNCnI3Y
5PAReulGOZHXbiCj/9j6+H7rUBCGk2phVtXOsXxitCorigNXAeAJ8hP2cgjXZH25
NGGtjyp75VVBydzSCz9yY+VypITovsDmEC1CxfbJRS7SaTdU7bGCLN08JcmfOzNb
u/3iPkKMXXWMNYO6J1bUeAqVpueGkqsAqnhY32NylIni07Oz/he8nEsQCXC+4ueG
uYgSpEu8IaERBIQLVntK
=yDvq
-----END PGP SIGNATURE-----
The biggest irony would be if the page itself was generated by an LLM.
I'm sure the scripts of Star Wars would be similarly ignored if they were used.
ftfy
If I write an article on training an LLM on the leaked Windows XP source code, blithely mark the source code repo as in 'the public domain', but use Azure resources for the how-to steps, would that make it OK, Microsoft? You know, your Azure division might get some money...
Seriously, this is just so...blatant. It's like we've all collectively decided that copyright just doesn't matter anymore. Just reading this article, I feel like I'm taking crazy pills.
But come on … these guides really are for learning purposes. Doesn’t seem like a big deal to me at all. They aren’t even hosting it, just pointing to Kaggle, who is hosting it.
On principle copyright law should allow this kind of learning use case anyway.
This however is a very, VERY poor situation when you end up placing your employer at risk because you think copyright doesn't matter and everything on the internet is fair game.
This is probably the most polite way I would describe this to most, UG. For the rest, just stop acting like cheating through a situation to get a step up is the norm; it's just dirty behaviour.
Rowling is known for actively protecting her rights as an author; they couldn't have picked a worse author to slop up.
Everyone should torrent and rip off those books, anyway.
In fact if you do this as a nonprofit or at an educational institution in a teaching context it’s explicitly allowed by fair use already.
If you do it individually, idk I’m not a lawyer. But it should be allowed on principle.
But if you then go take your trained AI and deploy it for commercial purposes that’s a different story and should have protections for the original rights holders.