The event will take place at Penn State University. See Directions for more details.
| 9:00 - 9:30 | Registration + Breakfast (Light breakfast will be provided) | 
| 9:30 - 9:45 | Opening Remarks | 
| 9:45 - 10:45 | Keynote #1 (Xiang Lorraine Li) | 
| 10:45 - 11:00 | Coffee Break | 
| 11:00 - 12:00 | Poster Session #1 | 
| 12:00 - 13:00 | Lunch (Lunch will be provided) | 
| 13:00 - 14:00 | Keynote #2 (Daphne Ippolito) | 
| 14:00 - 14:15 | Coffee Break | 
| 14:15 - 15:15 | Oral Paper Presentation (4 Oral Papers) | 
| 15:15 - 16:30 | Ice Cream + Poster Session #2 | 
| 16:30 - 17:30 | Keynote #3 (Roger Beaty) | 
| 17:30 - 17:45 | Closing Remarks & Awards | 
 
            Associate Professor of Psychology
            Penn State University
Roger Beaty is the Dr. Frances Keesler Graham Early Career Professor and Associate Professor of Psychology at Penn State University. His lab develops AI-based tools to quantify creativity, focusing on aligning AI models with human preferences for automated evaluation. His research on creativity evaluation is supported by the National Science Foundation and the U.S. Army, with projects aimed at developing open-source educational tools and psychological test batteries. He is the recipient of the Berlyne Award from the American Psychological Association for early career contributions to creativity research and currently serves as President-Elect of the Society for the Neuroscience of Creativity.
                Talk Title:
                Beyond Generation: Language Models as Creativity Evaluators
            
                Abstract:
                Creativity is a multifaceted process, encompassing not only the generation of ideas but also the evaluation of their originality and impact. Although language models (LMs) can generate new content, their capacity for evaluating creative work—a subjective judgment traditionally requiring human expertise and nuanced understanding—has received less attention. This talk shifts the focus from generation to evaluation, exploring how LMs can be trained to assess creative outputs across diverse domains, from literature to science. I will present in-context learning and fine-tuning approaches, leveraging established psychological assessment techniques to develop robust datasets of human creativity judgments. Our studies show that LM evaluations achieve high correlation with expert ratings across diverse creativity tasks, reaching inter-rater agreement comparable to human judges. This automated creativity evaluation enables AI-powered tools that can augment human creativity through real-time feedback, helping people learn to better evaluate their own ideas. It also opens the door to a comprehensive benchmark for evaluating the creative potential of LMs and the possibility of training more creative AI models. I will conclude with future work toward mechanistic understanding of LM creativity and discuss some broader implications of creative AI for the arts and sciences.
            
 
            Assistant Professor of Computer Science
            University of Pittsburgh
Xiang Lorraine Li is an assistant professor in the Department of Computer Science at the University of Pittsburgh. Her research interests lie at the intersection of natural language processing and machine learning. In particular, she is interested in understanding model behavior through evaluation benchmark design and in exploring the meaning of model parameters in complex or long-tail situations, for example in high-impact domains such as law and education. Through this work, she aims to construct socially responsible, equitable, and robust models that cater to diverse users, populations, cultures, and scenarios. Her research plan on building pluralistic models was presented in the AAAI New Faculty Highlights in 2025, and her work has been published in NLP and ML conferences. She worked as a young investigator with Yejin Choi at AI2 before joining Pitt. Previously, she defended her Ph.D. in Computer Science at UMass Amherst in August 2022, working with Andrew McCallum.
                Talk Title:
                Every Opinion Matters: Distributional and Long-tail Evaluation for LLMs
            
                Human knowledge is inherently probabilistic and structured, often with multiple correct answers. For example, the purpose of "boiling water" could be cooking or making tea. However, people in areas with limited access to clean water might view it as a way to remove germs and make the water safe to drink. Unfortunately, this aspect is often overlooked in current LLM evaluation. To ensure that models can serve diverse populations, it is important to gather multiple responses from a wide range of people and pay extra attention to rare, yet plausible and important, situations.
                
                This talk will highlight the limitations of current LLMs in handling distributional and long-tail situations. I will discuss two benchmarks for evaluating commonsense in LLMs. One introduces a method for retrieving commonsense question-answer distributions from human annotators, and the other focuses on assessing the long-tail (uncommon) aspects of commonsense knowledge. The new evaluation benchmarks aim to shed light on how to make LLMs more robust to long-tail knowledge and better able to cater to diverse populations.
            
 
            Assistant Professor, Language Technologies Institute (LTI), School of Computer Science
            Carnegie Mellon University (CMU)
Daphne Ippolito is an assistant professor at the Language Technologies Institute at Carnegie Mellon University and a senior research scientist at Google DeepMind. Among other topics, she studies privacy and security issues around LLM systems, strategies for better evaluation of language models, and customizability of LLMs for different real-world applications.
                Talk Title:
                Troubles with Training Data for Large Language Models
            
Language models are trained on billions of words of text. Where does that text come from and how is it processed and filtered on its way to becoming training data? In this talk, we will examine how seemingly small decisions made when preparing pre-training data can have a significant impact on observed model performance. We will also discuss the problems with relying on the Internet as a primary source for training data. Gradual shifts in how content is posted on the web will limit its future usefulness as training data, and since the Internet is public, anyone can edit it, including adversaries aiming to introduce undesirable behaviours by inserting poisoned text.