Written by Jane Greenberg, Drexel University
AI-ready data, which refers to high-quality and well-prepared data that is optimized for use in artificial intelligence (AI) applications, increasingly encompasses the inclusion of metadata and ontologies to enhance value and usability. While metadata provides essential context and information about data, ontologies offer structured semantic representation of a particular domain. These additional layers of information help data scientists, researchers, and AI systems understand, interpret, and apply appropriate algorithms and models for analysis. Metadata and ontologies enable consistent data integration, interoperability, and knowledge sharing across systems, while facilitating more knowledgeable AI applications. Additionally, these systems are proving vital for supporting the FAIR (Findable, Accessible, Interoperable, and Reusable) Principles and reproducible computational research.
Despite these capacities, approaches for developing, implementing, and sustaining metadata and ontologies within AI-ready data pipelines remain inconsistent, cumbersome, and insufficiently supported. Challenges span the full data lifecycle, from data creation, collection, and research to the longer-term aims of data preservation, archiving, reuse, and support for research reproducibility. Collective, community-driven efforts are needed to address current obstacles and maximize the value and reliability of data. The two-day workshop AI-Ready Data: Navigating the Dynamic Frontier of Metadata and Ontologies, held in April at Drexel University, served as a step toward addressing this challenge.
Sponsored by the Institute for Data-Driven Dynamical Research, the workshop was hosted by the Metadata Research Center at Drexel and brought together more than 50 individuals with expertise across the data lifecycle to discuss issues, share solutions, and chart a path forward for addressing key challenges in preparing AI-ready data for scientific research. Participants came from the five NSF-HDR institutes, as well as other NSF initiatives (e.g., Big Data Hubs, Open Knowledge Network, FAIR OS RCNs, Research Data Alliance), industry, federal agencies (National Institute of Standards and Technology, NIST), and two U.S. national laboratories (Oak Ridge National Laboratory and Pacific Northwest National Laboratory).
Workshop participants shared case studies, methods, and goals for incorporating metadata and ontologies into AI-ready data frameworks. They also gathered in a series of breakout groups to discuss AI-ready data approaches, needs, and opportunities interconnecting with metadata and ontologies.
Christine Kirkpatrick, PI of the FAIR in ML, AI Readiness, & Reproducibility Research Coordination Network (FARR RCN), presented at the Data Management, FAIR practices, and Prepping for AI-Ready Pipelines session.
Some key takeaways from the overall workshop include the following:
Incorporating metadata and ontologies into an AI-ready data framework may prove crucial for accelerating knowledge discovery and supporting longitudinal science.
AI methods, including generative AI, can leverage metadata and ontologies to improve and validate AI-ready data.
FAIR data is a component of AI-ready data, although it is important to recognize that not all FAIR data is AI-ready.
Use cases can help determine the level of AI readiness that data requires.
Metadata- and ontology-informed AI-ready data techniques developed for domain-specific data are applicable across other domains.
More attention needs to be given to the AI-ready data spectrum, including DevOps training, AI readiness levels, and stakeholder engagement.
Training: Introducing undergraduates to metadata/data practices could have tremendous impact on preparing AI-ready data and improving overall data representations in the long term.
AI-readiness levels: Progress in codifying AI-readiness levels (e.g., ESIP; iHARP work presented by Sanjay Purushotham, UMBC; and existing machine learning schemes) can further inform work on incorporating metadata and ontological systems into AI-ready data pipelines.
On the closing day, when breakout groups presented their final reports, participants wanted to keep working on their projects; many stayed afterwards and asked whether a follow-up workshop would take place soon. Although no plans are set for a follow-up workshop, the conversation on Fully AI Ready Data will continue at the in-person 2024 FAIR in ML, AI Readiness & Reproducibility Workshop at the AGU Conference Center in Washington, D.C. on October 9-10, 2024.