By FARR
The FARR project organized a panel discussion as a Birds of a Feather (BoF) session at SC24 on HPC (high-performance computing) centers and their role in supporting AI-ready data and research. The session featured presentations by panelists representing both HPC centers and users, describing their approaches to providing and using AI, respectively. HPC venues, especially the SC (Supercomputing) Conference, emphasize discussion of computational resources and rarely host discussions focused on data. It was a triumph to assemble such a session of like-minded people in a setting like SC24, and one of its accomplishments was identifying champions within the community to keep driving forward the discussion and the identification of research gaps. The HPC community will be an important partner in building up the AI workforce, especially in building foundation models for science and in fine-tuning models for specific scientific applications. The BoF was especially illuminating in its discussion of the role of humans in making data usable for these models, and of the need to ensure that models are built in proper context: what research questions the data is appropriate for, and what limitations or uncertainty can be tolerated.
Here are the key points:
Introduction by Daniel S. Katz
AI is increasingly relevant across scholarly domains.
The AI lifecycle includes gathering and preparing data, model training, validation, storage, and reproducibility, all within the context of ethical and fair use.
HPC centers play a crucial role by providing infrastructure for storing and running models and handling large-scale tasks.
The session aimed to address how HPC centers currently support AI and what users expect from them.
Presentations from HPC Centers and Users
Paola A. Buitrago (Pittsburgh Supercomputing Center, PSC):
PSC offers systems like Bridges-2 for production-level research and Neocortex, optimized for AI workloads with unique Cerebras wafer-scale engines.
Bridges-2 supports diverse users with data management tools, workflow support, and large dataset integration.
PSC’s key focus areas include enabling flexible AI research environments and facilitating shared datasets.
Volodymyr Kindratenko (National Center for Supercomputing Applications, NCSA):
Systems like Delta and DeltaAI are purpose-built for AI and machine learning tasks, emphasizing scalability, high-speed interconnects, and storage integration.
NCSA addresses challenges of large-scale data sharing and performance optimization for AI workloads through a dedicated data-centered infrastructure design.
Matt Williams (University of Bristol):
Described the UK's AI-first systems, such as Isambard-AI, designed for diverse use cases and users, from individual researchers to large-scale projects.
Promoted modular data centers and self-service tools to allow easy and scalable use of AI resources.
Katherine Evans (Oak Ridge National Laboratory):
Focused on the user perspective, highlighting the need for AI in applications like weather and climate modeling.
Introduced ORBIT, a foundation AI model for Earth system prediction that uses standardized data drawn from global datasets.
Emphasized the importance of testing, characterization, and statistical modeling for AI effectiveness.
Robert Underwood (Aurora GPT, Argonne National Laboratory):
Discussed the development of Aurora GPT, a general-purpose scientific large language model (LLM) with multimodal capabilities for scientific data.
Focused on challenges like ensuring data consistency, quality, and traceability while addressing model correctness, safety, and robustness.
Discussion Highlights
What Makes Data AI-Ready?
AI-ready data needs proper metadata, consistent formatting, and high quality to minimize bias.
Domain experts are critical to understanding the meaning and utility of data and to aid with model evaluation.
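As a concrete (and entirely hypothetical) illustration of the kind of metadata the panel had in mind, the sketch below shows a minimal metadata record in Python; the field names and values are assumptions for illustration, not drawn from any particular standard or from the panelists' systems.

```python
# A minimal, hypothetical metadata record capturing the context discussed in
# the session: provenance, intended use, and known limitations.
# Field names are illustrative, not taken from any specific standard.
dataset_metadata = {
    "name": "example_surface_temperature_grids",  # hypothetical dataset
    "version": "1.2.0",
    "provenance": "derived from station observations, regridded to 0.25 deg",
    "units": {"temperature": "K"},
    "license": "CC-BY-4.0",
    "intended_use": "training and evaluating regional climate emulators",
    "known_limitations": [
        "sparse coverage over oceans before 1980",
        "instrument bias not fully corrected in early records",
    ],
    "contact": "data-steward@example.org",
}
```

Records like this give both domain experts and model builders a shared place to state what questions the data can support and what uncertainty it carries.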
Handling Diverse Data Formats:
No single format can work across domains, but adopting domain-specific best practices and standards helps ensure data usability.
Tools for AI Data Management:
Tools for tasks like data cleaning, formatting, and dataset versioning are critical, though challenges remain with complex scientific data types such as graphs or adaptive meshes.
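To make this more concrete, here is a rough Python sketch of two of these tasks, basic cleaning and lightweight versioning via content hashes; the file names, column handling, and manifest format are assumptions for illustration, not the API of any specific tool discussed in the session.

```python
# A rough sketch of basic tabular cleaning plus lightweight dataset
# versioning via content hashes. Illustrative only; not a specific tool's API.
import hashlib
import json
import pandas as pd

def clean_table(path: str) -> pd.DataFrame:
    """Load a CSV, drop rows with missing values, and normalize column names."""
    df = pd.read_csv(path)
    df = df.dropna()
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    return df

def record_version(path: str, manifest_path: str = "manifest.json") -> str:
    """Append the file's SHA-256 digest to a simple JSON manifest."""
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    try:
        with open(manifest_path) as f:
            manifest = json.load(f)
    except FileNotFoundError:
        manifest = {}
    manifest[path] = digest
    with open(manifest_path, "w") as f:
        json.dump(manifest, f, indent=2)
    return digest
```

In practice, dedicated tools such as DVC or Git LFS handle versioning more robustly than a hand-rolled manifest, and complex scientific data types (graphs, adaptive meshes) remain harder to clean and version than tables.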
Training for AI Users:
Users need training in system interaction, distributed AI, data loaders, and monitoring tools to effectively scale up their workloads.
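As one small example of what such training might cover, the sketch below sets up a sharded PyTorch DataLoader with a DistributedSampler, the kind of building block needed to scale data loading across many GPUs or nodes; the dataset class, tensor shapes, and parameters are hypothetical.

```python
# A minimal sketch of distributed data loading in PyTorch. The dataset,
# tensor shapes, and parameters are hypothetical and for illustration only.
import torch
from torch.utils.data import Dataset, DataLoader
from torch.utils.data.distributed import DistributedSampler

class ExampleTileDataset(Dataset):
    """Hypothetical dataset that yields (input, target) tensor pairs."""
    def __init__(self, num_samples: int = 1024):
        self.num_samples = num_samples

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        # Random tensors stand in for real preprocessed samples.
        return torch.randn(16, 64, 64), torch.randn(1, 64, 64)

def make_loader(rank: int, world_size: int, batch_size: int = 8) -> DataLoader:
    dataset = ExampleTileDataset()
    # Each rank sees a distinct shard of the dataset; call
    # sampler.set_epoch(epoch) each epoch so shuffling varies across epochs.
    sampler = DistributedSampler(dataset, num_replicas=world_size,
                                 rank=rank, shuffle=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler,
                      num_workers=4, pin_memory=True)
```

Monitoring tools and framework profilers can then show whether the loader keeps the accelerators busy, which is often the first bottleneck users hit when scaling up.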
Metrics for AI Impact:
Impact is assessed by scientific productivity, discoveries, and computational efficiency improvements.
Key Takeaways
HPC centers are evolving to accommodate the increasing demands of AI workloads with tailored infrastructure and training.
Collaboration between HPC providers and users is crucial to bridge the gap between user needs and available resources.
The focus is shifting toward accessibility, scalability, and ensuring ethical and reproducible AI research practices.
Much of this summary was AI-generated, with editing by Daniel S. Katz and Christine Kirkpatrick.