With over a decade of open source software I've written freely available online, I actually really appreciate the value that AI and LLMs have provided me.
The thing that leaves a bad taste in my mouth is that my works were likely included in the training data and, even if that doesn't violate my licenses (GPL v2/v3), it certainly feels against the spirit of what I intended when distributing my works.
I was made redundant recently "due to AI" (questionable) and it feels like my works in some way contributed to my redundancy: they fed the profits made by these AI megacorps while I am left a victim.
I wish I could be provided a dividend or royalty, however small, for my contribution to these LLMs but that will never happen.
I've been looking for a copyleft "source available" license that allows me to distribute code openly but has a clause that says "if you would like to use these sources to train an LLM, please contact me and we'll work something out". I haven't yet found that.
I'm guessing that such a license would not be enforceable because I am not in the US, but at least it would be nice to declare my intent and who knows what the future looks like.
FWIW, a lot of open source caused other people to lose their jobs too, all pre-AI. So what goes around comes around. The Free Software movement was from day one built on cloning proprietary programs: UNIX was a commercial OS that AT&T sold, the early Linux desktop environments all looked exactly like a mashup of Windows 95 and commercial DEs, etc. Every commercial UNIX got wiped out except Apple's; do you think that didn't lead to layoffs? Because it very much did. Nor did it ever really change. systemd started out as "heavily inspired" by launchd. Wayland is basically the same ideas as SkyLight in macOS, etc.
And who was it who benefited from this stuff? A lot of the benefit went to "megacorps" who took the savings and banked the higher profits.
So I don't think open source, which for many years was unashamedly about just cloning designs that were funding other people's salaries, can really cry too much about LLMs. And I say that as someone who has written a lot of open source software, including working on Wine.
FWIW, AIX and, to a far lesser extent, Solaris still exist. I'm not exactly sure why people are using them. AIX I can maybe understand because "no one got fired for buying IBM" or whatever, but there really isn't any excuse to be running Solaris nowadays, since ZFS runs on Linux and two of the BSD-based systems, and Oracle seems desperate to let it die.
So wait, SPARC Solaris is the only production Unix with hardware memory tagging, but Linux also has it? Are we talking strictly SUS-compliant systems (current or former, because for some reason Solaris is no longer listed as such despite ostensibly still being compliant, unless the SRUs have seriously FUBARed some things) or unices in general? Because I'd argue anyone running SUS-compliant systems for any reason other than their choice happening to be compliant is arguably even more niche than running AIX or Solaris for anything else.
Why would someone use unsupported OpenBSD on SPARC for the clients that pay for it? Probably the same reason so many servers run Rocky or Alma instead of RHEL: money. Perhaps they bought the hardware without the support contract for Solaris, or they don't want to keep paying for it.
I think there's no meaningful case by the letter of the law that use of training data that includes GPL-licensed software in models that comprise the core component of modern LLMs doesn't obligate every producer of such models to make both the models and the software stack supporting them available under the same terms. Of course, it also seems clear in the present landscape that the law often depends more on the convenience of the powerful than on its actual construction and intent, but I would love to be proven wrong about that, and this kind of outcome would help.
> I think there's no meaningful case by the letter of the law that use of training data that includes GPL-licensed software in models that comprise the core component of modern LLMs doesn't obligate every producer of such models to make both the models and the software stack supporting them available under the same terms.
Why do you think "fair use" doesn't apply in this case? The prior Bartz v. Anthropic ruling laid out pretty clearly how training an AI model falls within the realm of fair use. Authors Guild v. Google and Authors Guild v. HathiTrust were both decided much earlier, and both found that digitizing copyrighted works for the sake of making them searchable is sufficiently transformative to meet the standards of fair use. So what is it about GPL-licensed software that you feel would make AI training on it not subject to the same copyright and fair use considerations that apply to books?
I’m not a lawyer, but I read the decision, and how is this section not a ruling on fair use?
“To summarize the analysis that now follows, the use of the books at issue to train Claude
and its precursors was exceedingly transformative and was a fair use under Section 107 of the
Copyright Act. And, the digitization of the books purchased in print form by Anthropic was
also a fair use but not for the same reason as applies to the training copies. Instead, it was a
fair use because all Anthropic did was replace the print copies it had purchased for its central
library with more convenient space-saving and searchable digital copies for its central
library — without adding new copies, creating new works, or redistributing existing copies.
However, Anthropic had no entitlement to use pirated copies for its central library. Creating a
permanent, general-purpose library was not itself a fair use excusing Anthropic’s piracy.”
Or in the final judgement, “This order grants summary judgment for Anthropic that the training use was a fair use.
And, it grants that the print-to-digital format change was a fair use for a different reason.”
> it was a fair use because all Anthropic did was replace the print copies it had purchased for its central library
It is only fair use where Anthropic had already purchased a license to the work. Which has zero to do with scraping - a purchase was made, an exchange of value, and that comes with rights.
The second, which involves a section of the judgement a little before your quote:
> And, as for any copies made from central library copies but not used for training, this order does not grant summary judgment for Anthropic.
This is where the court refused to make any ruling. There was no exchange of value here, such as would happen with scraping. The court made no ruling.
I believe you are misinterpreting the ruling. Remember that a copyright claim must inherently argue that copies of the work are being made. To that end, the case analyzes multiple "copies" alleged to have been made.
1) "Copies used to train specific LLMs", for which the ruling is:
> The copies used to train specific LLMs were justified as a fair use.
> Every factor but the nature of the copyrighted work favors this result.
> The technology at issue was among the most transformative many of us will see in our lifetimes.
Notable here is that all of the "copies used to train specific LLMs" were copies made from books Anthropic purchased. But also of note is that Anthropic need not have purchased them, as long as they had obtained the original sources legally. The case references the Google Books lawsuit as an example of something Anthropic could have done to avoid pirating the books they did pirate, wherein Google obtained the original materials on loan from willing and participating libraries, and did not purchase them.
2) "The copies used to convert purchased print library copies into digital library copies", where again the ruling is:
> justified, too, though for a different fair use. The first factor strongly
> favors this result, and the third favors it, too. The fourth is neutral. Only
> the second slightly disfavors it. On balance, as the purchased print copy was
> destroyed and its digital replacement not redistributed, this was a
> fair use.
Here one might argue that the use of GPL code is different in that, in making the copy, no original was destroyed. But it's also very likely that this wouldn't apply at all in the case of GPL code, because there was also no original physical copy to convert into a digital format. The code was already digitally available.
3) "The downloaded pirated copies used to build a central library" where the court finds clearly against fair use.
4) "And, as for any copies made from central library copies but not used for training", where as you note Judge Alsup declined to rule. But notice particularly that this is referring to copies made FROM the central library AND NOT for the purposes of training an LLM. The copies made from purchased materials to build the central library in the first place were already deemed fair use. And making copies from the central library to train an LLM from those copies was also determined to be fair use. The copies obtained by piracy were not. But for uses not pertaining to the training of an LLM, the judge is declining to make a ruling here because there was not enough evidence about what books from the central library were copied for what purposes and what the source of those copies was. As he says in the ruling:
> Anthropic is not entitled to an order blessing all copying “that Anthropic has ever made after obtaining the data,” to use its words
This declination applies both to the purchased and pirated sources, because it's about whether making additional copies from your central library copies (which themselves may or may not have been fair use), automatically qualifies as fair use. And this is perfectly reasonable. You have a right as part of fair use to make a copy of a TV broadcast to watch at a later time on your DVR. But having a right to make that copy does not inherently mean that you also have a right to make a copy from that copy for any other purposes. You may (and almost certainly do) have a right to make a copy to move it from your DVR to some other storage medium. You may not (and almost certainly do not) have a right to make a copy and give it to your friend.
At best, an argument that GPL software wouldn't be covered under the same considerations of fair use that this case considers would require arguing that the copies of GPL code obtained by Anthropic were not obtained legally. But that's likely going to be a very hard argument to make given that GPL code is freely distributed all over the place with no attempts made to restrict who can access that code. In fact, GPL code demands that if you distribute the software derived from that code, you MUST make copies of the code available to anyone you distribute the software to. Any AI trainer would simply need to download Linux or emacs and the GPL requires the person they downloaded that software from to provide them with the source code. How could you then argue that the original source from which copies were made was obtained illicitly when the terms of downloading the freely available software mandated that they be given a copy?
> How could you then argue that the original source from which copies were made was obtained illicitly when the terms of downloading the freely available software mandated that they be given a copy?
By the license and terms such copies are under.
> For example, if you distribute copies of such a program, whether gratis or for a fee, you must pass on to the recipients the same freedoms that you received. You must make sure that they, too, receive or can get the source code. And you must show them these terms so they know their rights.
You _must_ show the terms. If you copy the GPL code, and it inherits the license, as the terms say it does, then you must also copy the license.
The GPL does not give you an unfettered right to copy, it comes with terms and conditions protecting it under contract law. Thus, you must follow the contract.
The GPL goes to some lengths to define its terms.
> A "covered work" means either the unmodified Program or a work based on the Program.
> Propagation includes copying, distribution (with or without modification), making available to the public, and in some countries other activities as well.
It is not just the source code that you must convey.
> So what is it about GPL-licensed software that you feel would make AI training on it not subject to the same copyright and fair use considerations that apply to books?
The poster doesn't like it, so it's different. Most of the "legal analysis" and "foregone conclusions" in these types of discussions are vibes dressed up as objective declarations.
You seem like the type of person who will believe anything as long as someone cites a case, without looking into it. Bartz v. Anthropic only looked at books, and there was still a $1.5 billion settlement that Anthropic paid out because it got those books from LibGen / Anna's Archive, and the ruling also said that the data has to be acquired "legitimately".
Whether data acquired from a licence that specifically forbids building a derivative work without also releasing that derivative under the same licence counts as a legitimate data gathering operation is anyone's guess, as those specific circumstances are about as far from that prior case as they can be.
> This License acknowledges your rights of fair use or other equivalent, as provided by copyright law.
It is legitimate to acquire GPL software. The requirements of the license only occur if you're distributing the work AND fair use does not apply.
Training certainly doesn't count as distribution, so the buck passes to inference, which leaves us dealing with the substantial similarity test, and still, fair use.
If a human reads GPL code and outputs a recreation of that code (derivative work) using what they learned - that is illegal.
If an AI reads GPL code and outputs a recreation of that code using what it "learned" - it's not illegal?
If that is the case, then copyright holds no weight any more. I should be allowed to train an LLM on decompiled firmware (say, Playstation, Switch, iPhone) in countries where decompilation is legal - then have the LLM produce equivalent firmware that I later use to build an emulator (or competing open source firmware).
> If that is the case, then copyright holds no weight any more. I should be allowed to train an LLM on decompiled firmware (say, Playstation, Switch, iPhone) in countries where decompilation is legal - then have the LLM produce equivalent firmware that I later use to build an emulator (or competing open source firmware).
It's funny you mention that, because one of the biggest fair use cases that effectively cemented "fair use" for emulators is Sony Computer Entertainment Inc. v. Connectix Corp. [1], where the copying of PlayStation BIOS files for the purposes of reverse engineering and creating an emulator was explicitly ruled to be fair use, including running that code through a disassembler.
You and I are not a fucking judge, our opinions on this don't matter one bit. We might as well print it on a piece of paper and wipe our asses with it.
As long as they don't distribute the model's weights, even a strict interpretation of the GPL should be fine. Same reason Google doesn't have to upstream changes to the Linux kernel they only deploy in-house.
But wouldn't that be like some company using GPL-licensed code to host a code generator for something? At least in a legal interpretation. Or is that different?
I mean, is the case you're making that you can run a SaaS business on GPL-derived code without fulfilling GPL obligations because you're not distributing a binary?
If true, that would seem to invalidate the entire GPL. But even by that logic, a website (such as ChatGPT) distributes JavaScript that runs the code, and programs like Claude Code also do so. Again, if you can slip the GPL's requirements through indirection, like having your application phone home to your server to get the infringing parts, the GPL would be essentially unenforceable in... most contexts.
That's where the AGPL comes in. The GPL(v2) does not require, e.g., Google or Facebook to release any of the changes they've made to the Linux kernel; that they do so is not because of a legal obligation. The "get parts" thing is the relevant detail to be very specific about. If those parts are a binary that is used, then the GPL does kick in; but for distributing source code that's possibly derived, possibly not covered by copyright, it's not been decided in a court of law yet.
You sound like you're citing the general Internet understanding of "fair use", which seems to amount to "I can do whatever I like to any copyrighted content as long as maybe I mutilate it enough and shout 'FAIR USE!' loudly enough."
On the real measures of "fair use", at least in the US: https://fairuse.stanford.edu/overview/fair-use/four-factors/ I would contend that it absolutely face plants on all four measures. The purpose is absolutely in the form of a "replacement" for the original, the nature is something that has been abundantly proved many times over in court as being something copyrightable as a creative expression (with limited exceptions for particular bits of code that are informational), the "amount and substantiality" of the portions used is "all of it", and the effect of use is devastating to the market value of the original.
You may disagree. A long comment thread may ensue. However, all I really need for my point here is simply that it is far, far from obvious that waving the term "FAIR USE!" around is a sufficient defense. It would be a lengthy court case, not a slam-dunk "well duh it's obvious this is fair use". The real "fair use" and not the internet's "FAIR USE!" bear little resemblance to each other.
A sibling comment mentions Bartz v. Anthropic. Looking more at the details of the case I don't think it's obvious how to apply it, other than as a proof that just because an AI company acquired some material in "some manner" doesn't mean they can just do whatever with it. The case ruled they still had to buy a copy. I can easily make a case that "buying a copy" in the case of a GPL-2 codebase is "agreeing to the license" and that such an agreement could easily say "anything trained on this must also be released as GPL-2". It's a somewhat lengthy road to travel, where each step could result in a failure, but the same can be said for the road to "just because I can lay my hands on it means I can feed it to my AI and 100% own the result" and that has already had a step fail.
"Real" fair use is perhaps one of the most nebulous legal concepts possible. I haven't dived deep into software, but a cursory look at how it "works" (I use that term as loosely as possible) in music, with sampling and interpolation etc., immediately reveals that there's just about nothing one can rely on in any logical sense.
I'm not really sure why you think my comment specifically citing the recent rulings by Judge Alsup and also the prior history with respect to the Google Books project is somehow declaring "I can do whatever I like to any copyrighted content", but I assure you I'm not. I'm very specifically talking about the various cases that have come about in the digital age dealing with fair use as it has been interpreted by US courts to apply to the use of computers to create copies of works for the purposes of creating other works.
I'm referring to the long history of carefully threaded fair use rulings and settlements, many of which we as an industry have benefitted greatly from: determinations that cloning a BIOS can be fair use (see IBM PC BIOS cloning, but also Sony v. Connectix), that cloning an entire API for the purposes of creating a parallel competitive product can be (Google v. Oracle), that digitizing books for the purposes of making those books searchable and even displaying portions of those books to users can be (Authors Guild v. Google), or even your cable company offering you "remote DVR" copying of broadcast TV (20th Century Fox v. Cablevision). Time and again the courts have found that copyright, and especially copyright with respect to digital transformations, is far more limited than large corporations would prefer. Further, they have found in plenty of cases that even a direct 1:1 copy of source can be fair use, let alone copies which are "transformative", as LLM training was found to be in Bartz.
Realistically, I don't see how anyone can have watched the various copyright cases that have been decided in the digital age, and seen the battles that the EFF (and a good part of the tech industry) have waged to reduce the strength of copyright and not also see how AI training can very easily fit within that same framework.
Not to cast aspersions on my fellow geeks and nerds, but it has been very interesting to me to watch the "hacker" world move from "information wants to be free" to "copyright maximalists" once it was their works that were being copied in ways they didn't like. For an industry that has brought about (and heavily promoted and supported) things like DeCSS, BitTorrent, Handbrake, Jellyfin/Plex, numerous emulators, WINE, BIOS and hardware cloning, ad blockers, web scrapers and many other things that copyright owners have been very unhappy about, it's very strange to see this newfound respect for the sanctity of copyright.
> I can easily make a case that "buying a copy" in the case of a GPL-2 codebase is "agreeing to the license" and that such an agreement could easily say "anything trained on this must also be released as GPL-2".
And I would argue that obtaining a legal copy of the GPL source to a program requires no such agreement. By downloading a copy of a GPLed program I am entitled by the terms under which that software was distributed to obtain a copy of the source code. I do not have to agree to any other terms in order to obtain that source code, downloading from someone authorized to distribute that code is in and of itself sufficient to entitle me to that source code. You can not, by the very terms of the GPL itself deny me a copy of the source code for GPL software you have distributed to me, even if you believe I intend to make distributions that are not GPL compliant. You can decline to distribute the software to me in the first place, but once you have distributed it to me, I am legally entitled to a copy of the source code. From there, now that I have a legal copy, the question becomes is making additional copies for the purposes of training an AI model fair use? So far, the most definitive case we have on the matter (Bartz) says yes it is.
So either we have to make the case that the original copy was somehow acquired from a source not authorized to make that copy, or we have to argue that the output of the AI model or the AI model is itself infringing. Given the ruling that copies made for training an AI model was ruled "exceedingly transformative and was a fair use under Section 107 of the Copyright Act"[1] it seems unlikely that the AI model itself is going to be found to be infringing. That leaves the output of the model itself, which Bartz does not rule on, as the authors never alleged the output of the model was infringing. GPL software authors might be able to prevail on that point, but they would have a pretty uphill battle I think in demonstrating that the model generated infringing output and not simply functional necessary code that isn't covered by copyright. The ability of code to be subject to copyright has long been a sort of careful balance between protecting a larger creative idea, and also not simply walling off whole avenues of purely functional decisions from all competitors.
Broadly speaking, GPL is a license that has specific provisions about creating derivative software from the licensed work, and just saying "fair use" doesn't exempt you from those provisions. More specifically, an advertised use case (in fact, arguably the main one at this stage) of the most popular closed models as they're currently being used is to produce code, some of which is going to be GPL licensed. As such, the code used is part of the functionality of the program. The fact that this program was produced from the source code used by a machine learning algorithm rather than some other method doesn't change this fundamental fact.
The current Supreme Court may think that machine learning is some sort of magic exception, but they also seem to believe whatever oligarchs will bribe them to believe. Again, I doubt the law will be enforced as written, but that has more to do with corruption than any meaningful legal theory. Arguments against this claim seem to ignore that courts have already ruled these systems not to have intellectual property rights of their own, and the argument for fair use seems to rely pretty heavily on some handwavy anthropomorphization of the models.
> Broadly speaking, GPL is a license that has specific provisions about creating derivative software from the licensed work, and just saying "fair use" doesn't exempt you from those provisions.
Broadly speaking, yes it does. The whole point of fair use is that you don’t need a license.
Claiming LLMs are fair use is ridiculous bordering on ignorant or disingenuous.
Here’s the four-part test from 17 U.S.C. § 107:
1. the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
Fail. The use is to make trillions of dollars and be maximally disruptive.
2. the nature of the copyrighted work;
Fail. In many cases at least, the copy written code is commercial or otherwise supports livelihoods, and is the result of much high-skill labor with the express stipulation of reciprocity.
3. the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
Fail. They use all of it.
4. the effect of the use upon the potential market for or value of the copyrighted work.
Fail to the extreme. There is already measurable decline in these markets. The leaders explicitly state that they want to put knowledge workers out of business.
- - -
Hell, LLMs don’t even pass the sniff test.
The only reason this stuff is being entertained is some combination of the prisoner’s dilemma and more classic greed.
This comment highlights a basic dilemma about how and where to spend your time.
Here's a basic rule of thumb I recommend people apply when it comes to these sorts of long, contentious threads where you know that not every person showing up to the conversation is limiting themselves to commenting about things they understand and that involve some of the most tortured motivated reasoning about legal topics:
If the topic is copyright and someone who is speaking authoritatively has just used the words "copy written", then ignore them. Consider whether you need to be anywhere in the conversation at all, even as a purely passive observer. Think about all the things you can do instead of wasting your time here, where the stakes for participation are so low because nothing that is said here really matters. Go do something productive.
Yet you still wasted your own time and everyone else’s time with a reply that has even less substance.
I was making an argument based on quotes from the actual legal code, and you’re saying peons who don’t use the exact correct terminology shouldn’t even consider what should or shouldn’t be legal? What a load of junk. This is a democracy. We’re supposed to be engaging with it.
You’re mixing up “using” with “copying”. You are allowed to “use” all of a book or movie or code by listening to or watching or reviewing the whole thing. Copyright protects against copying. The legal claim here is that training an LLM is sufficiently transformative such that it cannot be construed as a copy.
> Fail. The use is to make trillions of dollars and be maximally disruptive.
Fair use has repeatedly been found even in cases where the copies were used for commercial purposes. See Sony v. Connectix for example, where the cloning and disassembly of the PlayStation BIOS for the purposes of making a commercially sold (at retail, in a box) emulator of a then currently sold game console was determined to be fair use.
> Fail. In many cases at least, the copy written code is commercial or otherwise supports livelihoods, and is the result of much high-skill labor with the express stipulation of reciprocity.
Again, see Sony v. Connectix, where the sales of PlayStation consoles support the livelihoods and skilled labor of Sony engineers.
> Fail. They use all of it.
And again, see Sony v. Connectix, where the entire BIOS was copied again and again until a clone could be written that sought to reproduce all the functionality of the real BIOS. Or see Google v. Oracle, where cloning the entire Java API for a competing commercial product was also deemed fair use. Or the Google Books lawsuits, where cloning entire books for the purposes of making them searchable online was deemed fair use. Or see any of the various time/format-shifting cases over the years (cassette tapes, VCRs, DVRs, MP3 encoders, DVD ripping, etc.) where making whole and complete copies of works was deemed fair use.
> Fail to the extreme. There is already measurable decline in these markets. The leaders explicitly state that they want to put knowledge workers out of business.
Again, see Sony v. Connectix where the commercial product deemed to be fair use was directly competing with an actively sold video game console. Copyright protects the rights of creators to exploit their own works, it does not protect them against any and all forms of competition.
Or perhaps instead of referring you to the history of legislation around copyright in the digital age, I should instead simply point you at Judge Alsup's ruling in the Bartz case where he details exactly why the facts of the case and prior case law find that training an AI on copyrighted material is fair use [1]. Of particular interest to you might be the fact that each of the 4 factors is not a simple "pass/fail" metric, but a weighing of relative merits. For example, when examining factor 1, Judge Alsup writes:
> That the accused is a commercial entity is indicative, not dispositive. That
> the accused stands to benefit is likewise indicative. But what matters most
> is whether the format change exploits anything the Copyright Act reserves to
I appreciate the detailed reply and that there’s subtlety here.
I read the linked Bartz case. It’s disappointing that it seems limited to only the copying of books into a data set and not the result of training an LLM on protected works. This is not the “use” that I was discussing and not very interesting.
The plaintiffs didn’t even challenge that the outputs of the LLMs infringe. The judge seems to agree (at least by omission) that fair use wouldn’t apply to infringing outputs, but that the outputs were transformative, and in cases where they weren’t:
> [anthropic] placed additional software between the user and the underlying LLM to ensure that no infringing output ever reached the users.
So this is not true:
> he [the judge] details exactly why the facts of the case and prior case law find that training an AI on copyrighted material is fair use
The plaintiffs also make really awful arguments about “memorizing” and “learning” that falsely anthropomorphize LLMs. Which the judge shoots down.
If we’re going to give LLMs the same rights as humans, there’s unlikely to be much of an argument.
I think there’s potential for an argument about how LLMs use “compressed” versions of protected works to _mechanically_ traverse language space. It would be subtle and technical so maybe not likely to work in our current context.
I think that the claim that they make is that once a model is "contaminated" with GPL code, every output it ever produces should be considered derived from GPL code, therefore GPL-licensed as well.
So GitHub and Windows and IDEs need to be open source because they can output FOSS code? That's obviously ridiculous.
If an AI outputs copyrighted code, that is a copyright violation. And if it does and a human uses it, then you are welcome to sue the human or LLM provider for that. But you don't get to sue people for perceived "latent" thought crimes.
First of all, I'm not advocating for this claim, I'm merely trying to clarify what other people say.
That being said, I don't think that your analogy is valid in this case.
> GitHub and Windows and IDEs need to be open source because they can output FOSS code
They can output FOSS code, but they themselves are not derived from FOSS code.
It can be argued that the weights of a model are derived from the training data, because they contain something from the training data (hard to say what exactly: knowledge, ideas, patterns?).
It can also be argued that output is derived from weights.
If we accept both of those claims, then GPL training data -> GPL weights -> every output is GPL.
> If an AI outputs copyrighted code
Again, the issue is not what exactly does AI output, but where it comes from.
If that theory holds, you'd have to ensure that the models have not been trained on any code that is licensed incompatibly with the GPL, in which case the models could not be distributed at all.
Intellectual property never made much sense to begin with. But it certainly makes no sense now, where the common creator has no protections against greedy corporate giants who are happy to wield the full weight of the courts to stifle any competition for longer than we'll be alive.
Or, in the case of LLMs, recklessly swing about software they don't understand while praying to find a business model.
Hey, just don't try to copy their LLM by distilling it, because that's "theft". If we weren't all doomed anyway, this industry would never have been allowed to exist in the first place, but I guess this is just what the last few decades of our civilization will look like.
Poor billionaire Rowling has no protections against the evil corporations. Everyone using this argument has no clue about artists and writers.
Yes, corporations take a large cut, but creative people welcomed copyright, made the bargain, and got fame in the process. Which was always better for them than letting Twitch take 70% and being a sharecropper.
Silicon Valley middlemen are far worse than the media and music industry.
The individuals who get rich from copyright are a rarity.
Most mid-list authors make very little from copyright. A lot of the "authors" who make a lot of money from writing are celebs who slap their name on a ghost written work.
> Which was always better for them than letting Twitch take 70% and being a sharecropper.
Copyright predates Twitch or giant corporations and was designed to protect the profits of the publishers from the start.
The reason you mention 'poor billionaire Rowling' is most likely because she's the only billionaire author that you know by name. If authors regularly became billionaires you'd have left out that name.
Sure, but that's more a result of policy decisions than an inevitable result of some natural law. Corporate lawlessness has been reined in before and it can be again
If there were going to be a case, it would be about derivative works. [1]
What makes it all tricky for the courts is that there's no good way to really identify what the generated code is a derivative of (except in maybe some extreme examples).
One could carefully calculate exactly how much a given document in the training set has influenced the LLM's weights involved in a particular response.
However, that number would typically be vanishingly small, making it hard to argue that the whole model is a derivative of that one individual document.
Nevertheless, a similar approach might work if you took a FOSS project as a whole, e.g. "the model knows a lot about the Linux kernel because it has been trained on its source code".
However, it is still not clear that this would be necessarily unlawful or make the LLM output a derivative work in all cases.
It seems to me that LLMs are trained on large FOSS projects as a way to teach them generalisable development skills, with the side effect of learning a lot about those particular projects.
So if I used an LLM to contribute to the kernel, clearly it would be drawing on information acquired during its training on the kernel's source code. Perhaps it could be argued that the output in that case would be a derivative?
But if I used an LLM to write a completely unrelated piece of software, the kernel training set would be contributing a lot less to the output.
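To make that concrete, here is a rough sketch of what a per-document influence estimate could look like, using a first-order, TracIn-style approximation (everything here is a placeholder: the model, the loss function, and the examples; real attribution across a full training run is far more involved):

    # Hypothetical sketch: first-order (TracIn-style) estimate of how much one
    # training document influenced the model's loss on one particular response.
    import torch

    def flat_grad(loss, params):
        # Flatten the gradient of a scalar loss into a single vector.
        grads = torch.autograd.grad(loss, params)
        return torch.cat([g.reshape(-1) for g in grads])

    def influence(model, loss_fn, train_doc, response, lr=1e-4):
        params = [p for p in model.parameters() if p.requires_grad]
        g_train = flat_grad(loss_fn(model, train_doc), params)
        g_resp = flat_grad(loss_fn(model, response), params)
        # One SGD step on train_doc changes the loss on the response by roughly
        # -lr * <g_train, g_resp>, so this dot product serves as the score.
        return lr * torch.dot(g_train, g_resp).item()

Summed against billions of other documents, any single document's score would indeed be minuscule, which is exactly the problem noted above.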
> One could carefully calculate exactly how much a given document in the training set has influenced the LLM's weights involved in a particular response.
Not really.
Think of, for example, a movie like "Who Framed Roger Rabbit". It had intellectual property from all over. Had the studios not gotten the rights from each or any of those properties, they could have been sued for copyright infringement. It's not really a question of influence.
So yeah, while the LLM might have been trained on the kernel, it was also likely trained on code with commercial licenses. Conversely, because it was trained on code with GPL licenses, that might mean commercial software with LLM contributions needs to inherit the GPL to be legal (and a bunch of other licenses).
It's a big old quagmire and I think lawyers haven't caught up enough with how LLMs work to realize this.
That's always what laws existed for. A law is just a formal way of saying "we will use violence against you if you do something we don't like", and that has always been primarily written by and for the people who already have the power to do that. It's not the worst; certainly better than kings just being able to do as they please.
> certainly better than kings just being able to do as they please
That's debatable. In the case of a king you always know whom to blame and who has full responsibility. There's no opportunity to hide behind "well, you voted for this" or "I'm not making the laws, I'm merely enforcing them".
If you use GitHub, you’re automatically opted into having your code used for training. Private repo or not. You have to actually opt out and even then, will they honor that? No…
The other day I was working with some GLSL signed distance field functions for shaders. I asked Claude to review the code and it immediately offered to replace some functions with "known solutions". Turns out those functions were basically a verbatim copy of Inigo Quilez's work.
His work is available under a permissive license on the Internet, but somehow it doesn't seem right that a tool will just regurgitate someone else's work without any mention of copyright or license or original authorship.
In the pre-LLM world one would at least have had to search for this information, find the site, understand the license, and acknowledge who the author is. Post-LLM, the tool will just blatantly plagiarize someone else's work, which you can then sign off on as your own. Disgusting.
> Turns out those functions were basically a verbatim copy of Inigo Quilez's work.
Are they? A lot of these were used by people >20 years before Inigo wrote his blog posts. I wrote RenderMan shaders for VFX in the 90's professionally; you think about the problem, you "discover" (?) the math.
So they were known because they were known (a lot of them are also trivial).
Inigo's main credit is for cataloging them, especially the 3D ones, and making this knowledge available in one place, excellently presented.
And of course there's Shadertoy and the community, giving this knowledge a stage to play out on in that way. I would say no one deserves more credit than this man for getting people hooked on shader writing and proceduralism in rendering.
But I would not feel bad about the math being regurgitated by an LLM.
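To the triviality point: much of this math falls straight out of the definitions once you think about the geometry. A sketch of the two most common distance functions, transliterated into Python/numpy rather than anyone's actual GLSL (illustrative only):

    import numpy as np

    def sd_sphere(p, r):
        # Signed distance from point p to a sphere of radius r at the origin:
        # by definition, distance to the center minus the radius (negative inside).
        return np.linalg.norm(p) - r

    def sd_box(p, b):
        # Axis-aligned box with half-extents b: how far p sticks out past the
        # faces, plus a correction term for points inside the box.
        q = np.abs(p) - b
        return np.linalg.norm(np.maximum(q, 0.0)) + min(q.max(), 0.0)

Anyone who sits down with the problem lands on essentially the same formulas, which is part of why attribution gets murky here.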
There were very few people writing shaders (mostly for VFX, in RenderMan SL) in the 90's and after.
So apart from the "Texturing and Modeling -- A Procedural Approach" book, the "The RenderMan Companion" and "Advanced RenderMan", there was no literature. The GPU Gems series closed some gaps in later years.
The RenderMan Repository website was what had shader source, and all the pattern stuff was implicit (what we call 2D SDFs today) because of the REYES architecture of the renderers.
But knowledge about using SDFs in shaders mostly lived in people's heads. Whoever would write about it online would thus get quoted by an LLM.
Yeah, I find this super rude - in this example, the author distributed the code under a very permissive license, basically just wanting you to cite him as an author.
BAM, the LLM just strips all that out, basically pretending it conjured an elegant solution out of thin air.
No wonder some people started calling the current generation of "AI" plagiarism machines - it really seems more fitting by the day.
LLMs have already told you these are "known solutions", which implicitly means they are established, non-original approaches. So the key point is really on the user side—if you simply ask one more question, like where these "known solutions" come from, the LLM will likely tell you that these formulas are attributed to Inigo Quilez.
So in my view, if you treat an LLM as a tool for retrieving knowledge or solutions, there isn't really a problem here. And honestly, the line between "knowledge" and "creation" can be quite blurry. For example, when you use Newton's Second Law (F = ma), you don't explicitly state that it comes from Isaac Newton every time—but that doesn't mean you're not respecting his contribution.
> In the pre-LLM world one would at least have had to search for this information, find the site, understand the license, and acknowledge who the author is. Post-LLM, the tool will just blatantly plagiarize someone else's work, which you can then sign off on as your own
These don't contradict each other though; you could "blatantly plagiarize someone else's work" before as well. LLMs just add another layer in between.
Copyright violations happened before LLMs, yes, but they had to be committed by a person who either didn't understand copyright (which is not a valid defence in court) or intentionally chose to ignore it.
With LLMs, future generations are growing up being handed code that may or may not be a verbatim copy of something someone else originally wrote under specific licensing terms, with no mention of any license terms or origin being provided by the LLM.
It remains to be seen if there will be any lawsuits in the future specifically about source code that is substantially copied from someone else indirectly via LLM use. In any case I doubt that even if such lawsuits happen they will help small developers writing open source. It would probably be one of the big tech companies suing other companies or persons and any money resulting from such a lawsuit would go to the big tech company suing.
> I was made redundant recently "due to AI" (questionable) and it feels like my works in some way contributed to my redundancy: they fed the profits made by these AI megacorps while I am left a victim.
I think anyone here can understand and even share that feeling. And I agree with your "questionable": it's just the lame HR excuse du jour.
My 2c:
- AI megacorps aren't the only ones gaining, we all are. The leverage you have to build and ship today is higher than it was five years ago.
- It feels like megacorps own the keys right now, but that’s temporary. In a world of autonomous agents and open-weight models, control is decentralized. Inference costs continue to drop; you don’t need to be running on megacorp stacks. Millions (billions?) of agents finding and sharing among themselves. How will megacorps stop that?
- I see the advent of LLMs like the spread of literacy. Scribes once held a monopoly on the written word, which felt like a "loss" to them when reading/writing became universal. But today, language belongs to everyone. We aren't losing code; we are making the ability to code a universal human "literacy."
> AI megacorps aren't the only ones gaining, we all are.
No, no we are not.
> the leverage you have to build and ship today is higher than it was five years ago.
I don’t want more “leverage to build and ship”, I want to live in a world where people aren’t so disconnected from reality and so lonely they have romantic relationships with a chat window; where they don’t turn off their brains and accept any wrong information because it comes from a machine; where propaganda, mass manipulation, and surveillance aren’t at the ready hands of any two-bit despot; where people aren’t so myopic that they only look at their own belly button and use case for a tool that they are incapable of recognising all the societal harms around them.
> We aren't losing code; we are making the ability to code a universal human "literacy."
No, no we are not. What we are, however, is making ever increasingly bad comparisons.
Literacy implies understanding. To be able to read and write, you need to be able to understand how to do both. LLMs just spit text which you don’t need to understand at all, and increasingly people are not even caring to try to understand it. LLM generated code in the hands of someone who doesn’t read it is the opposite of literacy.
>I don’t want more “leverage to build and ship”, I want to live in a world where people aren’t so disconnected from reality and so lonely they have romantic relationships with a chat window; where they don’t turn off their brains and accept any wrong information because it comes from a machine; where propaganda, mass manipulation, and surveillance aren’t at the ready hands of any two-bit despot; where people aren’t so myopic that they only look at their own belly button and use case for a tool that they are incapable of recognising all the societal harms around them.
Preach. Every time I read people doing this weird LARP on this website of "you have so much more leverage, great time to be a founder" I want to put my head through the drywall.
Agree. Do we not understand how LLMs work? Some of us understand better than others, just like literacy is also not guaranteed just because you learned the alphabet.
Accepting the output of an LLM is really materially not different from accepting books, newspapers, opinion makers, academics at face value. Maybe different only in speed of access?
> LLM generated code in the hands of someone who doesn’t read it is the opposite of literacy.
"A pop-sci article title or paper abstract/conclusion in the mind of someone who doesn't read is the opposite of literacy."
I’m not sure I understand your point. Mind clarifying? It seems you might be trying to contradict what I said but are in fact only adding to it.
> just like literacy is also not guaranteed just because you learned the alphabet.
I didn’t claim learning the alphabet equals literacy, you did. Your argument comes down to “you’re not literate if you’re not literate”. Which, yes, of course.
> Accepting the output of an LLM is really materially not different from (…)
Multiple things can be true at once. If someone says “angry stupid people with machine guns are dangerous”, responding “angry stupid people with explosives are dangerous” does nothing to the original point. The angry stupid people are part of the problem, sure, but so are the tools enabling them to be dangerous. If poison is being dumped in a river and slowly killing the ecosystem, and then someone else comes along wanting to dump even more of a different poison, the correct response is to stop both, not shrug it off and stop neither.
What the bloody heck are you on about? That first quote is completely fabricated. I’d also like to live in a world where people don’t argue in bad faith, but since I have no pretence that will happen, at least I’m thankful when bad faith actors do such a poor job of concealing it.
But LLMs can also explain code, in fact they're fantastic at that. They can also be used to build anti-censorship, surveillance-avoidance and fact-checking tools. We are all empowered by them, it's just up to us to employ them so as to nudge society towards where we'd like it to go. Instead of giving up prematurely.
I’m not sure if the analogy is yours, but the scribe note really struck a chord with me.
I’m not a professionally trained SWE (I’m a scientist who does engineering work). LLMs have really accelerated my ability to build, ideate, and understand systems in a way that I could only loosely gain from sometimes grumpy but mostly kind senior engineers in overcrowded chat rooms.
The legality of all of this is dubious, though, per the parent. I GPL-licensed my FOSS scientific software because I wanted it to help advance biomedical research, not because I wanted it to help a big corp get rich.
But then again, maybe code like mine is what is holding these models back lol.
Sharing for advancing humanity / benefit of society, and megacorps getting rich off it, is not either-or. On the contrary, megacorps are in part how the benefit to society materializes. After all, it's megacorps that make and distribute the equipment and the software stacks I am using to write code on, that you are using to do your research on, etc.
I find the whole line of thinking, "I won't share my stuff because then a megacorp may use it without paying me the fractional picobuck I'm entitled to", to be a strong case of the Dog in the Manger mindset. And I meant that even before LLMs exploded, back when people were wringing their hands about Elasticsearch being used by Amazon, back in 2021 or so.
Sharing is sharing. One can't say "oh I'm sharing this for anyone to benefit", and then upon seeing someone using it to make money, say "oh but not like that!!". Or rather, one can say, but then they're just lying about having shared the thing. "OSS but not for megacorps/aicorps" is just proprietary software. Which is perfectly fine thing to work on; what's not fine is lying about it being open.
> "OSS but not for megacorps/aicorps" is just proprietary software
Why? It's not like it's binary. It could well be that it's open source but can't be used by a company of X size. I'm not a lawyer, but why couldn't a license have that clause? I would still class that as being open, for some definition of open.
LLMs are one thing, but when you bring up the ES-in-AWS example, as outlined in the article, the problem is not the software being used; it's the software being _made proprietary_. It's about free and open software remaining free and open, especially to the end user.
Basically, the selling point of LLMs is that you no longer need to think about problems, you can skip directly to results. Anything that you have to think about while using them today is somewhere on the product roadmap, or will be.
> It feels like megacorps own the keys right now, but that’s a temporary.
Remains to be seen. Hardware prices are increasing. Manufacturers are abandoning the consumer sector to serve the all-consuming demands of AI. Not to mention the constant attempts to lock down the computers so that we don't own them.
What does the future hold for us? Unknown. It's not looking too good though. What good is hardware if we're priced out? What good are open models and free software if we're unable to run them?
The trend I see is older hardware being able to run models that are increasingly miniaturized.
The real (but not new) danger is us giving in to the idea that we can't do it ourselves, or that we must use the megacorps' latest shiny toy to "succeed".
Welcome to late capitalism. Please enjoy the ride while people try to tell you that LLMs are the only future (you have no future), while SOTA models can barely do shit on their own consistently outside of carefully designed benchmarks and have to be made available at a loss, because otherwise no one would use them.
On your right you can see the CEOs justifying longer hours and lower pay because AI will replace your job one day anyway, then asking why you aren't 10x more productive with Claude. On your left you can see the AI companies deciding who will be in charge of the fascist regime once they no longer need workers, other than for the coal mines. They reckon they can get 120 good years before the biosphere is uninhabitable, which worries them, because what if the next LLM figures out immortality for them; maybe they'll have to close the coal mines too after all.
Can't say I disagree with you. I do recognize that we seem to be heading towards a technofeudalist cyberpunk dystopia. The only way out for humanity is to automate everything to the point we transcend capitalism into a post-scarcity society where the very concept of an economy has been abolished. If we can't do that, we'll become soylent.
>But today, language belongs to everyone. We aren't losing code; we are making the ability to code a universal human "literacy."
Literacy requires training, though. Being able to produce a spoken rendition of a text is not the same as understanding what the text is about, having a critical-analysis toolbox for texts, and being in the habit of situating what you read within a broader inferred context.
Just throwing LLMs into people’s hands won’t automatically make them able to use them in a relevant manner, as far as global social benefits are concerned.
The literacy issue is actually quite independent of whether the LLMs used are distributed or centralised.
> We aren't losing code; we are making the ability to code a universal human "literacy."
LLMs making the ability to code a universal human “literacy” is like saying Markov chains made the ability to write a universal human “literacy”.
Cheap books took hundreds of years to become accessible. Already we have models that run on "legacy" hardware. Just as large-scale publishing never disappeared, large-scale models and infra also won't. But does that mean simple paper and pen were pointless to distribute?
The foreman had pointed out his best man - what was his name? - and, joking with the puzzled machinist, the three bright young men had hooked up the recording apparatus to the lathe controls. Hertz! That had been the machinist's name - Rudy Hertz, an old-timer, who had been about ready to retire. Paul remembered the name now, and remembered the deference the old man had shown the bright young men.
Afterward, they'd got Rudy's foreman to let him off, and, in a boisterous, whimsical spirit of industrial democracy, they'd taken him across the street for a beer. Rudy hadn't understood quite what the recording instruments were all about, but what he had understood, he'd liked: that he, out of thousands of machinists, had been chosen to have his motions immortalized on tape.
And here, now, this little loop in the box before Paul, here was Rudy as Rudy had been to his machine that afternoon - Rudy, the turner-on of power, the setter of speeds, the controller of the cutting tool. This was the essence of Rudy as far as his machine was concerned, as far as the economy was concerned, as far as the war effort had been concerned. The tape was the essence distilled from the small, polite man with the big hands and black fingernails; from the man who thought the world could be saved if everyone read a verse from the Bible every night; from the man who adored a collie for want of children; from the man who . . . What else had Rudy said that afternoon? Paul supposed the old man was dead now - or in his second childhood in Homestead.
Now, by switching in lathes on a master panel and feeding them signals from the tape, Paul could make the essence of Rudy Hertz produce one, ten, a hundred, or a thousand of the shafts.
You can't avoid big corps training on your data if it's available, because "fair use".
But I hope this same 'fair use' will allow distilling of their private models into open weight models, so users are never locked in into any particular vendor. Giving back power to the user.
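For anyone unfamiliar with the term: distillation just means training an open student model to imitate a closed teacher's outputs. A minimal sketch of the classic soft-label loss (assuming you somehow have teacher logits; a commercial API typically only exposes sampled text, which makes real distillation noisier):

    import torch.nn.functional as F

    def distill_loss(student_logits, teacher_logits, T=2.0):
        # Soften both distributions with temperature T and push the student's
        # next-token distribution toward the teacher's via KL divergence.
        s = F.log_softmax(student_logits / T, dim=-1)
        t = F.softmax(teacher_logits / T, dim=-1)
        return F.kl_div(s, t, reduction="batchmean") * (T * T)

Whether the resulting weights count as a "copy" of the teacher is, of course, the same legal question in the other direction.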
It shouldn't be needed. I would argue that is more than "against the spirit" and should not be considered fair use. Instead of creating a derivative work, they created a machine that creates derivative works.
And even if it would be enforceable, would you be able and willing to go through the energy and monetary expenses to enforce it? Especially against a big corporation willing to fight you.
These companies pirated their training material and reached settlements with the copyright holders. I imagine they’d do the same with software licenced under Not For Training terms too. It’d be up to you to find out it is happening and then pursue them legally for compensation.
> it certainly feels against the spirit of what I intended when distributing my works
You can own the works, but not the vibes. If everyone owned the vibes, we would all be infringing on others. In my view abstractions should not be protected by copyright, only expression; currently the abstraction-filtration-comparison (AFC) standard protects abstractions too, and non-literal infringement is a thing.
Trying to own the vibes is like trying to own the functionality itself, no matter the distinct implementation details, and this is closer to patents than copyrights. But patents get researched for prior art and have limited duration, copyright is automatic and almost infinite duration.
Reading this I hear The Roots playing The Seed 2.0[1] in my mind.
It’s a wild thought that, of all the things that will remain on this earth after you’re gone, it’ll be your GPL contributions reconstituting themselves as an LLM’s hallucinations.
To be clear, it's going to be a lot more than that.
Our comments here on HN are almost certainly going to live in fame/infamy forever. The Twitter firehose is essentially a pathway to 140-character immortality.
You can already summon an agent to ingest essentially an entire commenter's history, correlate it across different sites based on writing style or similar nicknames, and then chat with you as that persona - even more so with a fine-tune or LoRA. I can do that with my Gmail and text-message history and it becomes eerily similar to me.
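As a sketch of how little the ingestion step takes, assuming the public HN Algolia search API (real) and leaving the persona step as plain prompt assembly (illustrative, not any particular agent framework):

    import requests

    def fetch_hn_comments(username, pages=3):
        # Pull a commenter's public history from the HN Algolia search API.
        comments = []
        for page in range(pages):
            resp = requests.get(
                "https://hn.algolia.com/api/v1/search",
                params={"tags": f"comment,author_{username}",
                        "hitsPerPage": 100, "page": page},
                timeout=10,
            )
            resp.raise_for_status()
            comments += [hit.get("comment_text") or "" for hit in resp.json()["hits"]]
        return comments

    # Hypothetical downstream use: feed the history to any chat model as a
    # style reference and ask it to answer in that voice.
    history = fetch_hn_comments("some_user")
    persona_prompt = ("Imitate the writing style of these comments:\n\n"
                      + "\n---\n".join(history[:50]))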
History is going to be much more direct and personal in the future. We can also do this with historical figures who left voluminous personal correspondence; that's possible now.
It's very interesting, because I think the era before mass LLM usage but after digitalization is going to be the most intensely studied. We've lived through something that sits right on a cusp of history, for better or worse.
Tokens will stop being given away for free at some point. Writing software was always a pretty simple white-collar job, so it makes sense that it's one of the earlier ones to be automated. The axis of evil has its shot at ruling the world now, but if they miss it they will eventually be subject to the market, and you really will need to automate a lot more than just software developers for models this large to be worth the cost.
Of course we should really be talking about using the state or otherwise to make training larger and larger models impossible. It's not in the public good if LLMs actually get good enough to replace a lot of human labor, only a small handful of billionaires and their cronies will ever benefit from that. The Luddites were not wrong after all.
Taken to a hallucinated but logical conclusion, we might define a word such as "cene" to riff off of "meme" and "gene".
The c is for code. If adopted, we could spend forever arguing how the c is pronounced and whether the original had a cedilla, a circumflex, or rhymes with bollocks, which seems somehow appropriate. Everyone uses xene instead. x is chi, but most people don't notice.
Me too, and I use LLMs often for personal and professional work. Knowing that colleagues are burning through $700/day worth of tokens, and that a small fraction of those tokens were likely derived from my work while I get made redundant, is a bit shite.
Yeah, that's the thing making my head spin: even assuming a 30% profit margin is baked in, the underlying cost would still be roughly $700 / 1.3 ≈ $540 per day.
And the margin probably needs to be higher than that, given rocket-ship growth targets and investor expectations.
Is that the game? Lock companies into this "new reality" with cheap tokens, then once they fire all their devs, bait-and-switch to 2x the cost.
If you read history widely (across millennia and geographies), you'll note that most of the power-contests follow this pattern[0]. In the modern industrial world, the pattern becomes exponential rather than incremental. What I'm saying is that this is not unique to AI Labs[1]. This is caused by the deeply flawed and unbalanced system that we have constructed for ourselves.
[0]: The pattern, or, as gamers would call it, the "meta", is that every ambitious person/entity wants to control as much of the economic/material surplus as possible. The most effective and efficient (effort per unit of control) way of doing this is to make yourself into as much of a bottleneck as humanly possible. In graph theory this corresponds to betweenness centrality, and you want to maximize that value. To put it in mundane terms, you want to be as much of a monopoly as you can be (Thiel is infamous for saying this, but it does check out, historically). To maximize betweenness, or to maximize monopoly, is to maximize how much society/the economy depends on you. This is such a dominant strategy (a game-theory term; in the modern gaming world they might call it a "cheesy strat", which just means the game lacks strategic variety, forcing players to hone that one strategy) that we even have some old laws (anti-trust, etc.) designed to prevent it. And it makes a lot of sense: Standard Oil was reviled because everything in the economy either required oil or required something that did. 20th-century USA did a lot to mitigate this. It forced monopolies like AT&T to fund general research like Bell Labs (still legendary) towards a public good (a kind of tax, but probably much more socially beneficial). It also broke up the monopolies, and passed anti-profit laws (e.g. hospitals were not allowed to make a profit until 1978; I have seen in the last 10 years a tiny cancer clinic grow into a massive gleaming hospital - a machine that transforms sickness and grief into Scrooge McDuck vaults of cash). This monopolistic tendency of the commercial sector is a tendency towards centralization, which yields efficiency, sure, but also creates the conditions for control, rent-seeking, and exploitation.
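A toy illustration of the betweenness point, assuming the networkx library (the graph is invented: one hub that sits between every producer and every buyer):

    import networkx as nx

    G = nx.Graph()
    for producer in ["oil_field_a", "oil_field_b", "oil_field_c"]:
        G.add_edge(producer, "standard_oil")   # all supply routes through the hub
    for buyer in ["railroad", "factory", "household"]:
        G.add_edge("standard_oil", buyer)      # all demand routes through the hub

    # Every shortest path between a producer and a buyer crosses the hub,
    # so its betweenness centrality dwarfs everyone else's.
    print(nx.betweenness_centrality(G))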
[1]: Much of the cloud-computing craze was similar in character (and also failed to deliver on some of its promises, such as reducing/replacing IT overhead - they just renamed IT to DevOps). And Web2 itself was about creating and monopolizing a new kind of ad channel and lead-generation machine. There is a funny twist: a capitalist society like the USA has much more deeply rooted incentives to create a panopticon than the communist states of the past ever did. Neither is pretty, of course. The communists demanded conformity and loyalty, while the capitalists demand consumption and rent.
My personal take is that LLMs are so transformative that the models are unlikely to qualify as derivative works, and therefore the GPL wouldn't hold sway. There's already some evidence that courts will consider training on copyrighted material fair use, so long as the material is otherwise obtained legally, which would be the case with software licensed under the GPL.
I realize this is an unpopular opinion on HN, but I believe it is for the best, because it's a weaker interpretation of copyright law, which is overall a good thing in my view.
You can train models locally now and use open source ones and there's a robust community of people training, retraining, and generally pulling data from anywhere. And then new models get trained on old models. The models in use now are already several generations deep even further trained on code freely given by the entire industry. It's like complaining about being 1/100000th of a soup with no real proof you're even in it. Can you provide proof that a model used your code? It's like a remix of a remix of a remix.
> It's like complaining about being 1/100000th of a soup with no real proof you're even in it.
I love a good analogy, especially one that takes a complex situation in which esoteric, unusual conditions are distilled and related back to common experiences held by the reader, such that all can understand.
Next time I'm a small part of a soup I'll think of this.
The fact that GitHub Copilot had an option to block generated code that matched public examples, and the fact that LLMs can regenerate Harry Potter books verbatim, means the training data is definitely "stored in a digital system of retrieval". But good luck having common sense win against a trillion-dollar incentive group stealing from everyone.
> I was made redundant recently "due to AI" (questionable) and it feels like my works in some way contributed to my redundancy where my works contributed to the profits made by these AI megacorps while I am left a victim.
This is increasingly common, and I don’t think it’s questionable that LLMs that software engineers help train are contributing to the obsolescence of software engineers. Large companies that operate these LLMs both 1) benefit from the huge amount of open-source software and at the same time 2) erode the very foundation that made open-source software explode in popularity (which happened thanks to copyright—or, more precisely, the ability to use copyright to enforce copyleft and thus protect the future of volunteer work made by individual contributors).
GPL was written long before this technology started to be used this way. There’s little doubt that the spirit of GPL is violated at scale by commercial LLM operators, and considering the amount of money that got sunk into this it’s very unlikely they would ever yield to the public the models, the ability to mass-scrape the entire Internet to train equivalent models, the capability to run these models to obtain comparable results, etc. The claim of “democratising knowledge” is disingenuous if you look deeper into it—somehow, they themselves will always be exempt from that democratisation and free to profit from our work, whereas our work is what gets “democratised”. Somehow, this strikes me personally more as expropriation than democratisation.
I wish Anthropic or someone would take a leadership role and re-train their models without any GPL code, or at least stop training on it going forward.
> I've been looking for a copy-left "source available" license that allows me to distribute code openly but has a clause that says "if you would like to use these sources to train an LLM, please contact me and we'll work something out". I haven't yet found that.
Personally, I want a viral (GPL-style) license that explicitly prohibits use of code for LLM training/tuning purposes — with the asterisk that while current law might view LLM training as fair use, this may not be the case forever, and blatant disregard of the terms of the license should make it easier for me to sue offenders in the future.
Alternatively, this could be expressed as: the output of any LLM trained on this code must retain this license.
> I've been looking for a copy-left "source available" license that allows me to distribute code openly but has a clause that says "if you would like to use these sources to train an LLM, please contact me and we'll work something out". I haven't yet found that
Frankly, do you think AI companies have even the remotest amount of respect for these licenses anyway? They will simply take your code if it is publicly scrapeable and train their models, exactly as they have so far. Then it will be up to you to chase them down and try to sue, or whatever. And good luck proving the license violation.
I dunno. I just don't really believe that many tech companies these days are behaving even remotely ethically, and I don't have much hope that will change anytime soon.
Traditionally, large corporations have taken very conservative legal stances with regard to integrating e.g. A/GPL code, even when there's almost no risk.
If my license explicitly says "any LLM output trained on this code is legally tainted," I feel like BigAICorp would be foolish to ignore it. Maybe I couldn't sue them today, but are they confident this will remain the case 5, 10, 20 years from now? Everywhere in the world?
GitHub has posted that they will now train on everyone's data (even private repositories) unless you opt out (until they change their mind on that). Anthropic has been training on your data on certain tiers already. Meta torrented books to train their models.
Surely if your license says "LLM output trained on this code is legally tainted", it is going to dissuade them.