Co-owning the Future with Data Cooperatives @ Stanford HAI
Divya Siddarth
This is a lightly edited transcript of a talk given at Stanford HAI’s Fall Conference: Four Radical Policy Proposals to Shape AI’s Impact on Society. You can watch the talk here.
Let’s start with a phrase I’m sure we’ve all heard: Data is the new oil.
This phrase was actually coined by Clive Humby in 2006, but I think we can safely blame its canonization on the Economist, with its 2017 cover article stating that data had replaced oil as the world’s most valuable resource. Since then, ‘data is the new oil’ has shown up in hundreds of think tank reports, corporate strategy memos, policy documents, and even legislation.
For almost as long as ‘data is the new oil’ has been around, scholars have criticized it for both its inadequacy and its inaccuracy. And we’ll get there. But first, I think it’s important to look at what ‘data is the new oil’ gets right – perhaps more so now than in 2006.
First, data is valuable. There’s no getting around it – and increasingly so, as an input into large-scale AI systems and economy-wide processes of technology investment and innovation. Better input data (whether by sheer volume, or, more often, by quality) has led to incredible technological and scientific breakthroughs.
Second, data takes a lot of infrastructure for capture, cleaning, and transfer – infrastructure that needs to be built, funded, and regulated across the public and private sectors, often with big, large-scale projects. In fact, much like Standard Oil had a monopoly on oil refinement, we have a monopolized and captured landscape for data refinement and transfer as well, one that is ripe for scrutiny in much the same way Standard Oil was.
Third, disputes over data ownership, use, and sovereignty are increasingly matters of national and international concern, building tensions over technological interdependence and driving state and regional agendas. As AI takes off, access to data is looked at as a question of national security and national technology agendas as much as it is a domestic economic input. In fact, ‘data is the new oil’ may be most usefully viewed as a warning to the global community, highlighting the urgent need to build infrastructure and policy to ensure that data resources are productively used, equitably distributed, and legitimately governed, forestalling predictable inequality and conflict.
Okay, so there’s some truth to ‘data is the new oil’.
But this framing does miss a lot of things. And rather than just miss them, it misconstrues them, in a way that has real effects on how we think about and regulate data.
First, data is not a fungible commodity. Units of ‘data’ are not equivalent to each other, as units of oil are. In fact, data is almost too expansive a term to even be useful. The word encompasses everything from huge datasets that capture the web browsing data of millions to hospital records that may contain information on some of the most difficult events of people’s lives. We can’t look at both of these things the same way, and we shouldn’t. And we can’t use, or protect, them the same way either. Data is highly context-specific – non-fungible – and it needs to be utilized, protected, and governed as such.
Right now, the law focuses on two kinds of data – personal information, like our social security numbers, and intellectual property, like copyrights and patents. But the ‘data is the new oil’ era did not come about due to either of these – it's talking about the data that we generate, and that is now captured, as we move through the world – data like where we go, what we buy, who we talk to, what we write, who we swipe right on, and even what temperature we set our houses at. Data like this isn’t just used to sell us ads anymore, either, but also used to train algorithms that may aim to emulate our intellectual capacities.
This brings us to point two: data is subject to network effects that natural resources are not. In terms of productive use, data is only really meaningful and valuable in the aggregate, incentivizing the monopolized data collection that we’re seeing. One person’s Facebook data is valued at 7 cents, while all Facebook users’ data is valued at tens of billions of dollars, with researchers estimating that Facebook earned roughly $110 from each American user. And going back to the things that data is used for – these applications come from having access to vast quantities of data – and ideally also good quality data, but the current ecosystem often trades quality for quantity, as we’ll see.
Third, data is fundamentally collective. Going back to something like Facebook data, if I make a post and a friend likes that post, can we say the post is mine, but the ‘like’, separate from the post, is hers? If I send an email, and you respond, who could own which piece of that interaction – and who owns the metadata? The kinds of data we produce now are very difficult to assign to individuals. And, given the network effects, ‘my’ data is mostly valuable because it can be used to predict the behavior of people I associate with. Even things like genomic data, which seem so personal, are collective. If I and my parents go ahead and get our genes sequenced by 23andMe, and I have a brother who is incredibly privacy-focused, who would never agree to that – well, too bad for him, because 23andMe now has a huge amount of his information too.
Fourth, and maybe most importantly – data isn’t just a latent resource, lying in the ground for capture. It is actively created through interactions, sociality, and labor. Data is the product of collective work, not some millennia-old natural process. And this is particularly relevant when we think about the data that trains AI systems. Large-scale models like GPT-3 aren’t autonomous AI achievements – entirely the opposite. These projects only work because they’re trained on hundreds of billions of words – words written by humans, capturing centuries of human knowledge, thought, and insight, everything from books to Wikipedia.
Our existing models of data governance aren’t going to work
So what do we do with these insights?
Well, we can’t treat data as individual private property, because the nature of data is highly networked and interdependent. And also because we have already seen the empirical failure of devolving responsibility for privacy and security to the individual, which has basically become an ‘initial here please’ model.
And while privacy itself is crucial, we can’t assume that all interests in data are privacy interests alone. I may have very legitimate financial or control interests in my data that extend beyond my privacy concerns, and communities that produce data may have these concerns too, beyond the individual. This is especially true if that data is being used to train very powerful AI systems that may then be used on that very community. While regulation is crucial, it is also clear that regulatory schemes alone are insufficient to meaningfully protect privacy when the incentives in the space tend so strongly against it, and when the burden of being privacy-conscious still falls largely on the individual.
And we, on the opposite side, certainly shouldn’t rush into building data marketplaces, hoping that they’ll enable competition and increase efficiency in the productive use of data. The balance of power is so skewed, with individual data holders holding so little of it, that this would certainly result in even more entrenched inequality, particularly for the most vulnerable. It would be like the state of working conditions before both worker protections and strong collective bargaining laws, which is not something we want to return to.
Explaining the data cooperative approach
This brings us to the data cooperative approach. And the fundamental thing that sets data cooperatives apart is that this is a collective approach to the stewardship of data. It’s an approach that understands that, first, the data we produce isn’t exhaust – it’s valuable, and the product of work. And second, most of this data is collective, and it’s most productive and most accountable when it’s treated that way. Data cannot be owned – but it absolutely must be governed.
The way that data cooperatives do this is by forming a new technical and institutional layer that would exist between those that have data and those that use it. These new institutions would do all the things we’re missing right now – mediating data flows; governing data use, re-use, storage, and transfer; preserving privacy. And they would curate and enable the high-quality datasets we increasingly need, enabling competition and unlocking innovation.
Let’s take a moment on this handy diagram, which lays out a microcosm of what this could look like. So on the data holder side, at the bottom, you’ve got both individuals and organizational forms. And these data holder entities have two-way contracts with one or more data cooperatives. You can imagine one is on health data, while another is on web browsing data. Some are just for individuals and some work with other entities. The cooperative will have a fiduciary responsibility over this data, to act in the best interests of the data holder, as defined in the contract. And they’ll represent these interests to platforms, up top. This solves the principal-agent problem of the current platform economy, because these cooperatives can represent individuals to platforms rather than platforms playing both sides.
But these entities don’t just represent interests. They also provide valuable data and insights to these platforms. This ensures higher quality data than what’s currently provided, and more useful, valuable data: health data that is required for drug development, activity data, location data, legal data, financial data flows for auditing. All of which is necessary to enable advanced AI services and capabilities, by the way. Not just for the existing big platforms that hold all this data now, but for a bunch of small entities that need curated datasets for a range of activities, like driving data for self-driving cars. So they’re providing value to platforms as well as to data holders.
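To make the intermediary layer concrete, here is a minimal sketch in code. Everything here is my own illustration, not a real system: the class names, the contract fields, and the idea of matching members by a consented ‘purpose’ are all assumptions about how a fiduciary layer like this might be structured.

```python
from dataclasses import dataclass

@dataclass
class DataHolder:
    """An individual or organization on the data holder side."""
    holder_id: str
    records: dict  # e.g. {"heart_rate": [62, 64]}

@dataclass
class Contract:
    """The two-way contract between a holder and a cooperative."""
    purposes: set        # uses the holder has consented to
    revocable: bool = True

class DataCooperative:
    """Hypothetical fiduciary intermediary between holders and platforms."""

    def __init__(self, focus):
        self.focus = focus    # e.g. "health" or "web browsing"
        self.members = {}     # holder_id -> (DataHolder, Contract)

    def join(self, holder, contract):
        self.members[holder.holder_id] = (holder, contract)

    def negotiate(self, platform_purpose):
        # Represent only those members whose contract covers this purpose;
        # the cooperative, not the platform, decides whose data flows.
        return [h.records for h, c in self.members.values()
                if platform_purpose in c.purposes]
```

The point of the sketch is the direction of agency: platforms never query holders directly; they negotiate with the cooperative, which filters by each member’s contract.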
And cooperatives might have tradeoffs, because people face tradeoffs with data, right? So some co-ops may be very privacy focused. They don’t share data with most entities, and their work is really in preserving privacy. And others may be really monetization focused, where they’re all about making sure that their data holders get a fair deal on all the data labor they’re doing, which companies are already benefiting from.
What does this look like?
So for example, take me. Say I’m thinking about doing a PhD – which I am – and so, maybe I’m anticipating budgets getting a little tight. I’m more interested in trying to monetize some of my data where I can. I already wear a Fitbit, and I join a data cooperative that licenses that data from members to a healthcare company, to use it to develop better indications and predictors of cardiovascular health. This is high-quality, verified data – and sometimes I even answer questions about it for them with my running group, all of whom have joined – making it particularly valuable, both to me and to them.
But I still don’t want to share things like my messaging data, so I make sure to join another data co-op that is very privacy focused, that not only protects the data that my phone gathers on my messages but actually uses its leverage, since millions of people feel similarly about this data, to advocate for better privacy and security practices on those devices generally. And actually has the power to do that, because of these collective bargaining rights.
And then – not to get ahead of myself – I manage to finish the PhD, head back into the world. I’m no longer so interested in the data monetization side. In the current ecosystem, once a platform holds your data, that’s it. They can resell it and re-use it with minimal safeguards. But the data co-op approach has one crucial design feature: it does not allow permanent alienation from data. You can’t sell your data. You can only lease it, for short-term periods of time. So this preserves the option of meaningfully changing my preferences, and having trust that they’ll be followed.
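The ‘lease, don’t sell’ rule can also be sketched in a few lines. Again, this is purely illustrative – the class name and API are hypothetical – but it shows the key property: a platform holds a time-limited, revocable grant of access, never the data itself.

```python
import time

class DataLease:
    """A time-limited, revocable grant of access to a dataset."""

    def __init__(self, dataset, duration_seconds):
        self.dataset = dataset
        self.expires_at = time.time() + duration_seconds
        self.revoked = False

    def revoke(self):
        # The data holder changed their preferences mid-lease
        self.revoked = True

    def read(self):
        if self.revoked or time.time() > self.expires_at:
            raise PermissionError("lease expired or revoked")
        return self.dataset
```

When the lease lapses or is revoked, access lapses with it – which is exactly the option to change one’s mind that permanent sale would foreclose.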
So, how do we get there?
This requires a significant amount of regulation and oversight over the data cooperatives, which could be done through something like a data relations board, modelled after the labor relations board. There need to be guardrails on data co-ops that prevent them from becoming predatory and exploitative, similar to the kinds of guardrails that are being proposed for the predatory and exploitative platforms we already have. We need ways for data cooperatives to make horizontal claims on each other, given the collective nature of this data, so that this doesn’t result in a race-to-the-bottom scenario where the least privacy-focused set the terms of engagement.
There also needs to be an enforced separation between entities that are cooperatives and entities that use data. Otherwise the cooperative is refereeing both sides of a case. You can’t just have Google incubate an internal data cooperative startup and then negotiate with itself; that has to be prevented. It’s not just regulation of the cooperatives that is necessary, but also structural guarantees like portability, interoperability, and privacy regulation, as well as data labor regulation.
So, what will it take to get there? I’m not going to pretend it doesn’t take effort. It takes some active involvement in the governance of your data, although delegation paradigms and the collective approach can help. It takes advocacy to get the legislation necessary to enshrine the centrality of data cooperatives to the use of data, to assign cooperatives as accountable intermediaries and regulate them, and to enshrine certain rights to privacy and interoperability that are necessary. It takes careful understanding of technological feasibility and compliance, as Rumman mentioned in her talk, and some open space for cooperatives to experiment with different technical approaches. It will take trial and error, as different cooperative models exist in the same ecosystem, alongside organizations like data trusts and data commons, and they all find ways to work together to achieve their goals, and we find ways to ensure that bad actors don’t gain prominence.
Data cooperatives are not a new idea. Why believe they’ll work now?
The reason I believe this is both radical and feasible, as the conference calls for, is that we now have so many of these ingredients in place. First, data cooperatives already exist – a fast-growing number of them. There are data co-ops that follow the ‘health data for research’ paradigm we talked about earlier, like Salus and Midata. There are co-ops that aim to monetize data, like Streamr, Pool, and CitizenMe, which allows consumers to get paid for data they share with companies, with millions of data exchanges completed. And then there are existing organizations that already serve communities or groups in other capacities and also function as data co-ops, like Driver’s Seat, a worker cooperative of drivers that returns profits from aggregate data to its workers.
This really shows us that existing organizations, like credit unions and worker cooperatives, can adopt data cooperative functionality fairly easily, and in fact already have—Professor Pentland, who will be commenting on this proposal, has done pioneering work here.
And on the regulatory side – pieces of legislation like the EU’s Data Governance Act, and tabled legislation like India’s non-personal data proposal, point us in the direction of these sorts of intermediaries.
And finally, technology is catching up to what’s necessary to allow data cooperatives to work. The licensing function is only really possible with technologies that allow us to share insights from data without sharing the underlying data, like federated learning, and new privacy techniques, like differential privacy and homomorphic encryption. Approaches to tackling data lineage and provenance, while not a solved problem, are becoming much more sophisticated.
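Differential privacy, one of the techniques just mentioned, is simple enough to illustrate directly. Below is a textbook Laplace-mechanism sketch for releasing a privacy-preserving mean of values bounded in [0, 1]; the function names and parameters are my own, and a real deployment would need far more care (privacy budgeting, floating-point hardening, and so on).

```python
import math
import random

def laplace_noise(scale):
    # Sample a Laplace(0, scale) variate via the inverse-CDF transform
    if scale == 0:
        return 0.0
    u = random.random() - 0.5
    if u == -0.5:
        u = 0.0  # avoid log(0) at the measure-zero endpoint
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def dp_mean(values, epsilon):
    """Epsilon-differentially-private mean of values clipped to [0, 1].

    Clipping bounds each person's influence on the sum to 1, so the
    mean's sensitivity is 1/n, and Laplace noise with scale
    1/(n * epsilon) suffices for epsilon-DP.
    """
    clipped = [min(max(v, 0.0), 1.0) for v in values]
    n = len(clipped)
    true_mean = sum(clipped) / n
    return true_mean + laplace_noise(1.0 / (n * epsilon))
```

The practical point for cooperatives: the larger the membership (bigger n), the less noise is needed for the same privacy guarantee – another way the collective approach and the technology reinforce each other.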
So yes, making data cooperatives a necessary and accountable intermediary for a majority of data transactions will take some work. So what do we get for all of that work?
What does the future look like?
Here’s what I think. You get much more effective advocacy around things like data privacy, because those asks come from a place of leverage as well as regulation. You get to exert and benefit from the financial and control interests over your data that are just being swept under the table now. You get a trustworthy ecosystem that has functioning accountability measures. You get to benefit from the diverse set of applications and services that spring up in this opened-up ecosystem, moving from monopolies to companies that solve real problems and leverage your work to do it. And, as this data is used to underpin pioneering AI models, you get to be an active, recognized, and compensated participant, sharing insights and expertise and experience rather than being essentially scraped into providing that knowledge and then totally cut out of the benefits and the new economy created.
Really, this also means we get to move from a vision of the future that aims for large-scale autonomous AI, to one that acknowledges and takes strength and innovation from the contributions of people to existing AI systems. That leverages the vast amount of collective knowledge and experience and insight and wisdom we have, and incorporates it intentionally into the technology we build – exponentiating the reach and ambition of that technology while regenerating and elevating the human contributions that make this collective, not artificial, intelligence possible.