Anti-piracy group BREIN has forced the developer of GEITje-7B, a Dutch large language model (LLM), to take the model offline because it was trained on datasets that included copyrighted material sourced from shadow libraries.
Developed by Edwin Rijgersberg as a non-commercial hobby project, GEITje-7B was trained using the ‘Gigacorpus’ dataset. This dataset included a breadth of Dutch texts, some of which originated from the shadow library LibGen—a repository criticized for hosting copyrighted material.
Although Rijgersberg defended the project under copyright exemptions for text and data mining for scientific purposes, he lacked the financial resources to contest BREIN's claims in court.
The rapid development of LLMs has sparked significant debate over their use of copyrighted material for training. The controversy has led to legal challenges, takedown requests, and increasing scrutiny of AI practices by anti-piracy organizations and rightsholders worldwide.
Datasets such as Books3, compiled in 2020 from materials on the pirate library Bibliotik, have drawn widespread criticism. Books3 was later integrated into other AI training datasets, including EleutherAI's "The Pile."
Although Books3 was subsequently removed from platforms under pressure from groups like the Danish Rights Alliance, the issue extends beyond datasets to also include models derived from them.
BREIN has taken a firm stance against AI models trained on copyrighted content. According to the group, the use of such material disregards the efforts and investments of creators and media companies.
The organization argues that the European Union's AI Act mandates the use of lawfully acquired content for AI training, a requirement it asserts was violated in GEITje's development.