AI and the Commons: Data Governance for Generative AI

As AI researchers chase performance gains by training ever-larger models on ever-more data, a key question arises: where does the data come from? While pretraining data is compiled from many sources, data from the internet is a major component. Much of this data is either open-source or released under a Creative Commons license, as the value of the internet as a public good (with positive externalities) depends on users' freely available and fair-use contributions. Data is a high-value input to AI, and it is therefore a high-value lever for AI governance.

In light of this, in October 2023 we collaborated with Creative Commons to run an Alignment Assembly asking the Creative Commons community a simple question: how should Creative Commons respond to the use of CC-licensed work in AI training?

During the CC Global Summit, under the theme of AI and the Commons, the Creative Commons team used the Alignment Assembly process to solicit and aggregate responses to several position statements and to a list of possible interventions for the organization to pursue, in order to see where the CC community aligned or diverged on the issue of AI and CC-licensed work.

That process, which can be viewed on the Creative Commons website, yielded specific insights into the strategic direction of Creative Commons.

First, the responses of participants fell into three main camps:

  • The "Moat Protectors" want CC to focus on its core mission while advocating for noncommercial, public-interest AI.

  • "Oversight Maximalists" believe CC should take an active watchdog role overseeing how CC works are used in AI.

  • "Equitable Benefit Seekers" want required authorization for using traditional knowledge, and a way to share benefits when CC works contribute to proprietary AI models.

Beyond these overarching viewpoints, specific concerns emerged around attribution, compensation/benefit sharing, new licenses for datasets and models, ethical guidelines, and protecting traditional cultural works. For example, some worry that AI ingesting CC works could distort their intended meaning, or override community norms around sacred knowledge.

From this process, Creative Commons developed a set of seven regulatory principles “to protect the interests of creators, people building on the commons (including through AI), and society’s interests in the sustainability of the commons:

  1. It is important that people continue to have the ability to study and analyze existing works in order to create new ones. The law should continue to leave room for people to do so, including through the use of machines, while addressing societal concerns arising from the emergence of generative AI.

  2. All parties should work together to define ways for creators and rightsholders to express their preferences regarding AI training for their copyrighted works. In the context of an enforceable right, the ability to opt-out from such uses must be considered the legislative ceiling, as opt-in and consent-based approaches would lock away large swaths of the commons due to the excessive length and scope of copyright protection, as well as the fact that most works are not actively managed in any way.

  3. In addition, all parties must also work together to address implications for other rights and interests (e.g. data protection, use of a person’s likeness or identity). This would likely involve interventions through frameworks other than copyright.

  4. Special attention must be paid to the use of traditional knowledge materials for training AI systems including ways for community stewards to provide or revoke authorization.

  5. Any legal regime must ensure that the use of copyright protected works for training generative AI systems for noncommercial public interest purposes, including scientific research and education, are allowed.

  6. Ensure that generative AI results in broadly shared economic prosperity – the benefits derived by developers of AI models from access to the commons and copyrighted works should be broadly shared among all contributors to the commons.

  7. To counterbalance the current concentration of resources in the hands of a small number of companies these measures need to be flanked by public investment into public computational infrastructures that serve the needs of public interest users of this technology on a global scale. In addition there also needs to be public investment into training data sets that respect the principles outlined above and are stewarded as commons.”

In our own recent paper Generative AI and the Digital Commons, we note that generative models could degrade the commons they rely upon without mechanisms to sustain and replenish public resources. Current data rights and copyright laws don't fully address these challenges.

We propose exploring transparency requirements, funding monitoring entities, incentivizing corporate data contributions, and novel models for shared ownership based on community data contributions. As one example, those who enable generative models by providing niche data for fine-tuning could co-own commercial outputs.

These are complex but vitally important issues. How institutions respond to questions of data provenance, ownership, fair use, and attribution will be important not just for the trajectory of AI development and the health of our digital commons, but for the integrity of our democratic systems.
