Policy Questions Blog 1: AI Data Transparency Remarks for NAIAC Panel 📚🔍⚖️
Context: I was recently invited to participate in a panel organized by the National AI Advisory Committee on the topic of data transparency standards. Providing a minimum universal standard is a somewhat different exercise from outlining best practices, and needs to meet different requirements. In this opening statement, I argued for a minimum standard that treats the intersection of an AI system’s development datasets and its data sources as the most appropriate level of granularity. Requiring minimal information about which data sources go into which dataset will not be sufficient to support full accountability, but it is necessary for other regulatory and governance mechanisms to be effective.
A Minimum Standard for Meaningful Data Disclosure
AI systems are first and foremost a representation of their development datasets, which define the scope of the models’ strengths, risks, and weaknesses. Yet these datasets are currently both the most under-discussed aspect of popular AI systems and the most under-represented in proposed regulatory approaches. This lack of visibility threatens to hamper efforts to make AI governance sustainable, robust to technical changes, and inclusive of perspectives beyond those of AI developers.
We have a shared responsibility to bring AI datasets back to the center of the conversation. Recent AI discourse has focused primarily on technical innovations that have been instrumental in developing ever more impressive systems from ever larger datasets. While these contributions deserve attention, regulators also need to account for the ways in which a system’s impact on society is ultimately determined by the properties of the data it leverages: from the domains, people, and perspectives the data represents to the rights of its data subjects, including privacy, labor, fair competition, and non-discrimination rights.
A renewed focus on data is also needed because the science of AI system evaluation is still in its early stages. We do not yet have the kinds of social impact or safety benchmarks that would enable fully model-performance-based regulation, and it is an open question whether model-level tests would ever capture all the social stakes of this new category of data-driven technology. Even in cases where model evaluations do provide accurate information, prevalent issues such as data contamination make them much less reliable without dataset information.
As a result, many of us have been arguing for several complementary approaches to data transparency. Dataset documentation such as datasheets and data statements is written by developers and describes “essential characteristics,” such as demographic information, that shape the behavior of AI systems. Data measurements supplement this documentation with quantitative summaries of datasets containing up to trillions of examples, where manual inspection is insufficient to understand broader social and technical dynamics. Interactive dataset visualization plays an additional role by opening up the set of questions asked, empowering specific stakeholder groups to interrogate datasets in ways that are relevant to their interests and reflect their scientific expertise. Finally, direct access to development datasets enables important research on training dynamics and transparency tools, as well as scrutiny of the effectiveness of risk mitigation strategies.
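To give a sense of what a data measurement looks like in practice, here is a minimal sketch assuming a corpus whose records each carry a source URL; the record format and field names are illustrative assumptions, not a standard schema. It computes the domain distribution, one of the simplest quantitative summaries that manual inspection cannot provide at scale:

```python
from collections import Counter
from urllib.parse import urlparse

# Illustrative records; a real measurement would stream these from a
# web-scale corpus rather than hold them in memory.
records = [
    {"text": "...", "url": "https://en.wikipedia.org/wiki/Transparency"},
    {"text": "...", "url": "https://example.com/blog/post"},
    {"text": "...", "url": "https://en.wikipedia.org/wiki/Datasheet"},
]

# Domain distribution: a basic measurement that surfaces which sources
# dominate a dataset, and therefore whose perspectives it encodes.
domain_counts = Counter(urlparse(r["url"]).netloc for r in records)

for domain, count in domain_counts.most_common():
    print(f"{domain}\t{count / len(records):.1%}")
```

Even a summary this simple, run over a full pre-training corpus, reveals which domains are over-represented and gives external stakeholders a starting point for bias analysis.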
With proper governance, these practices are unequivocally beneficial and come at marginal cost to most developers, especially larger companies. They should all be strongly encouraged, and required for AI systems with more sensitive use cases. They are, however, very much context-dependent, and would be challenging to operationalize as a universal requirement that is inclusive of open and collaborative development settings.
What, then, does a more pragmatic minimum meaningful data transparency requirement look like? To define “minimal” in this context, let’s look at what information external researchers and investigators need to assess the social and technical risks of data use, regardless of which best practices the developer is adopting.
First, a data standard to that end needs to include a list of the datasets involved in the development of a system, with their sizes and purposes. From very large pre-training datasets to preference and fine-tuning data and evaluation benchmarks, the inclusion of different data types in various development datasets will have different technical and social implications.
Second, a minimum data standard needs to include a list of the various data sources used to curate the development datasets in question. This data may come from very diverse sources including licensing deals between the developer and another organization, user data collected by a company through a service offering, publicly available data obtained through web scraping, and data created directly by the developer.
Knowing what those sources are, under which conditions the developer obtained them, and what they contribute to the various development datasets may not be sufficient to fully guide important decisions, but it is necessary to enable external stakeholders to identify potential issues – for example by looking at biases encoded in the most represented domains of a web crawl, spotting warning signs of market concentration in licensing deals, and examining the terms of use of services that collect AI training data. At a high level, the requirement can be summarized as the two questions below; a sketch of what a corresponding machine-readable disclosure might look like follows the list. For any AI system, we need to ask:
- What datasets were used, and what are their sizes and purposes?
- Where and under what conditions were the data sources that provided data for these datasets obtained?
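As an illustration, a disclosure answering both questions could be as simple as one structured record per system, organized around the dataset–source intersection. The sketch below is hypothetical; all names, sizes, and field labels are illustrative assumptions, not a proposed specification:

```python
# Hypothetical machine-readable disclosure for one AI system, organized
# around the intersection of development datasets and data sources.
# Every field name and value here is illustrative, not a proposed spec.
disclosure = {
    "system": "example-model-v1",
    "datasets": [
        {
            "name": "pretraining-corpus",
            "purpose": "pre-training",
            "size": "2T tokens",
            "sources": [
                {"origin": "web scraping",
                 "conditions": "publicly available pages, crawled 2022-2023"},
                {"origin": "licensing deal",
                 "conditions": "news archive licensed from a publisher"},
            ],
        },
        {
            "name": "preference-data",
            "purpose": "fine-tuning",
            "size": "500K comparisons",
            "sources": [
                {"origin": "user data",
                 "conditions": "collected via product, per terms of service"},
            ],
        },
    ],
}
```

The point of the structure is that each dataset entry answers the first question, and each nested source entry answers the second, without requiring the developer to publish the data itself.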
Again, such a standard is not by itself sufficient to guarantee good governance of AI systems without significant additional work – but it would provide a solid foundation to ensure that researchers, journalists, and regulators do not face insurmountable barriers when trying to make informed decisions about important topics. Further, it does so without prejudice to individual privacy, or even to trade secrets covering the technical and hardware contributions outlined at the beginning of this statement.
Of course, it would likely be valuable to go beyond this absolutely minimal standard. The intersection of development datasets and original data sources would be an ideal place to ground requirements for comprehensive datasheets. Explicit information about how data collection or licensing accounts for data subjects’ opt-out preferences would also go a long way toward making the technology more consentful and aligned with international requirements. And finally, such a standard would provide opportunities, as needed, to work on dataset measurement, visualization, and access in a flexible way that leverages interest and expertise from external researchers.
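On opt-out preferences specifically: one signal already available to developers who scrape the web is the robots.txt protocol. The sketch below, using Python’s standard library, checks a site’s directives before a page is collected; the crawler name is a hypothetical stand-in for whichever user agent a developer publicly declares, and robots.txt is only one of several possible opt-out mechanisms a fuller standard could document:

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

def may_collect(page_url: str, crawler_name: str = "ExampleAIBot") -> bool:
    """Return True if the page's site permits `crawler_name` to fetch it.

    `crawler_name` is a placeholder for whichever user agent a developer
    publicly declares for its data-collection crawler.
    """
    parts = urlsplit(page_url)
    robots = RobotFileParser()
    robots.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    robots.read()  # fetches and parses the site's robots.txt
    return robots.can_fetch(crawler_name, page_url)

# Usage: skip pages whose publishers have opted out of this crawler.
# if may_collect("https://example.com/article"):
#     collect_for_training(...)
```

Disclosing whether and how such checks were applied to each data source is exactly the kind of information the minimum standard would make legible to outside scrutiny.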
We do need to move forward with transparency requirements, and a broadly applicable minimum meaningful standard would be a substantial step forward, albeit one that will still require significant further investment to ensure that AI systems are indeed developed to the benefit of all their stakeholders.