The event will take place at Penn State University. See Directions for more details.

Schedule

Oral Presentations

  1. 14:20 - 14:30: Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above
  2. 14:35 - 14:45: Mitigating the Harmful Self-Preference of LM Evaluators
  3. 14:50 - 15:00: AAAR-1.0: Assessing AI's Potential to Assist Research
  4. 15:05 - 15:15: Know Me, Respond to Me: Benchmarking LLMs for Dynamic User Profiling and Personalized Responses at Scale

Poster Sessions

Session 1

  1. Questioning Privacy: Contrasting User Questions with Questions Answered by Privacy Policies
  2. DialUp! Modeling the Language Continuum by Adapting Models to Dialects and Dialects to Models
  3. Detection of Biased Phrases in the Wiki Neutrality Corpus for Fairer Digital Content Management using Artificial Intelligence
  4. How does a Multilingual LM Handle Multiple Languages?
  5. Using Contextually Aligned Online Reviews to Measure LLMs' Performance Disparities Across Language Varieties
  6. Language Models Generate Multiple-Choice Questions with Artifacts
  7. Do LLMs' math solutions align with humans?
  8. Mental Model for Machine Translation in Human-MT Scenario
  9. DnDScore: Decontextualization and Decomposition for Factuality Verification in Long-Form Text Generation
  10. Multi-LLM Collaborative Caption Generation in Scientific Documents
  11. Evaluating Vision-Language Models for Emotion Recognition
  12. Multiple LLM Agents Debate for Equitable Cultural Alignment
  13. Multilingual Dolma: Extending Open Pretraining Data Beyond English
  14. Mitigating the Harmful Self-Preference of LM Evaluators
  15. AAAR-1.0: Assessing AI's Potential to Assist Research
  16. Vietnamese Emotion-Aware Text-to-Speech (VEA-TTS) System with Tone Adjustment Based on Sentiment
  17. Do Vision-Language Models Discriminate based on Implicit Assumptions?
  18. MAC-Summ: Multi Axis Controllable Summarization
  19. eRevise+RF: A Writing Evaluation System for Assessing Student Essay Revisions and Providing Formative Feedback
  20. Shaping Perception of Emotional Storytelling with Synthesized Speech
  21. Crossroads of Continents: Automated Artifact Extraction for Cultural Adaptation with Large Multimodal Models
  22. Live Query: Maintaining Relevance in LLM Responses With Evolving Source Documents
  23. Distributive Fairness in Large Language Models: Evaluating Alignment with Human Values
  24. Crisis MT Cookbook 2.0: Updates and Challenges
  25. Mic'd Up and Misleading: Challenges and Future Directions in Detecting Falsehoods in Podcasts
  26. Using LLMs to Analyze L1 Interference in a Longitudinal Learner Corpus
  27. A Synergistic Approach to Explainable Factual Inconsistency Evaluation
  28. Understanding How Paper Writers Use AI-Generated Captions in Figure Caption Writing
  29. DecepBench: Benchmarking Multimodal Deception Detection
  30. LLMs can Perform Multi-Dimensional Analytic Writing Assessments: A Case Study of L2 Graduate-Level Academic English Writing
  31. Evaluating AI Mental Health Support Alignment with Community Needs and Expectations

Session 2

  1. Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above
  2. GraphSnapShot: A System for Graph Learning Acceleration
  3. A Systematic Evaluation of Transformer-LM Representations for Capturing Author States and Traits
  4. An Analyst-Inspector Framework for Evaluating Reproducibility of LLMs in Data Science
  5. Can LLMs Disambiguate Grounded Language? The Case of PP Attachment
  6. Enhancing Speaker Verification: Insights from CAARMA and PDAF
  7. The Role of Metalanguage in Prompt Injection Attacks
  8. Subjectivity in the Annotation of Bridging Anaphora
  9. Beyond Believability: Scalable Reliability Evaluation of LLM-Based Social Simulations Using Verifiable Setups
  10. DEER: Improving ICL-based Named Entity Recognition with Token-focused Retrieval and Reflection
  11. Vision and Speech Language Models for Emotional Interpretation
  12. Active Learning and Feature-Acquisition with LLMs and Humans
  13. MedScore: Factuality Evaluation of Free-Form Medical Answers
  14. Thai Winograd Schemas: A Benchmark for Thai Commonsense Reasoning
  15. Anti-stereotypical Predictive Text Suggestions Only Occasionally Yield Anti-stereotypical Writing
  16. Fusing LLaMa with State-Space Models to Read the Weather: Efficient Contextualization via State-Space Cross-Attention
  17. Minds Like Ours? Probing Human-like Stereotypical Causal Attribution in LLMs
  18. Code-Mixed Telugu-English Hate Speech
  19. Generative AI for Efficient and Empathetic Conversational Models in Mental Health
  20. Supporting TAs with Evaluating and Providing Feedback to Student Sensemaking in a Large Classroom
  21. Echoes of Automation: The Increasing Use of LLMs in Newsmaking
  22. Beyond the Field: Revolutionizing Football News Analytics with a Multi-Stage NLP Pipeline Integrating RAG and TEXT2SQL
  23. Beyond Checkmate: Exploring the Creative Chokepoints in AI Text
  24. Resolving UnderEdit & OverEdit with Iterative & Neighbor-Assisted Model Editing
  25. COCOLOFA: A Dataset of News Comments with Common Logical Fallacies Written by LLM-Assisted Crowds
  26. Exploring Numeracy of LLMs through Embedding Probes
  27. Examining Bias in Large Language Models on Child Maltreatment: Representational Bias, Allocative Bias, and Output Homogeneity
  28. Training AI to Assess Human Creativity across Tasks, Modalities, and Languages
  29. Amuro and Char: Analyzing the Relationship between Pre-Training and Fine-Tuning of Large Language Models
  30. AL-QASIDA: Analyzing LLM Quality and Accuracy Systematically in Dialectal Arabic
  31. Know Me, Respond to Me: Benchmarking LLMs for Dynamic User Profiling and Personalized Responses at Scale

Keynote Speakers

Roger Beaty

Associate Professor of Psychology
Penn State University


Roger Beaty is the Dr. Frances Keesler Graham Early Career Professor and Associate Professor of Psychology at Penn State University. His lab develops AI-based tools to quantify creativity, focusing on aligning AI models with human preferences for automated evaluation. His research on creativity evaluation is supported by the National Science Foundation and the U.S. Army, with projects aimed at developing open-source educational tools and psychological test batteries. He is the recipient of the Berlyne Award from the American Psychological Association for early career contributions to creativity research and currently serves as President-Elect of the Society for the Neuroscience of Creativity.


Talk Title:
Beyond Generation: Language Models as Creativity Evaluators

Abstract:
Creativity is a multifaceted process, encompassing not only the generation of ideas but also the evaluation of their originality and impact. Although language models (LMs) can generate new content, their capacity for evaluating creative work—a subjective judgment traditionally requiring human expertise and nuanced understanding—has received less attention. This talk shifts the focus from generation to evaluation, exploring how LMs can be trained to assess creative outputs across diverse domains, from literature to science. I will present in-context learning and fine-tuning approaches, leveraging established psychological assessment techniques to develop robust datasets of human creativity judgments. Our studies show that LM evaluations achieve high correlation with expert ratings across diverse creativity tasks, reaching inter-rater agreement comparable to human judges. This automated creativity evaluation enables AI-powered tools that can augment human creativity through real-time feedback, helping people learn to better evaluate their own ideas. It also opens the door to a comprehensive benchmark for evaluating the creative potential of LMs and the possibility of training more creative AI models. I will conclude with future work toward mechanistic understanding of LM creativity and discuss some broader implications of creative AI for the arts and sciences.


Lorraine (Xiang) Li

Assistant Professor of Computer Science
University of Pittsburgh


Xiang Lorraine Li is an assistant professor in the Department of Computer Science at the University of Pittsburgh. Her research lies at the intersection of natural language processing and machine learning; in particular, she is interested in understanding model behavior through evaluation benchmark design and in exploring what model parameters capture in complex or long-tail situations, including high-impact domains such as law and education. Through this work, she aims to build socially responsible, equitable, and robust models that serve diverse users, populations, cultures, and scenarios. Her research plan for building pluralistic models was presented in the AAAI New Faculty Highlights program in 2025, and her work has been published in NLP and ML conferences. Before joining Pitt, she worked as a young investigator with Yejin Choi at AI2. She defended her Ph.D. in Computer Science at UMass Amherst in August 2022, working with Andrew McCallum.


Talk Title:
Every Opinion Matters: Distributional and Long-tail Evaluation for LLMs

Abstract:
Human knowledge is inherently probabilistic, with multiple correct answers to many questions. For example, the purpose of "boiling water" could be cooking or making tea; however, people in areas with limited access to clean water might view it as a way to remove germs and make the water safe to drink. Unfortunately, this aspect is often overlooked in current LLM evaluation. To ensure that models can serve diverse populations, it is important to gather multiple responses from a wide range of people and to pay extra attention to rare yet plausible and important situations.
This talk will highlight the limitations of current LLMs in handling distributional and long-tail knowledge. I will discuss two benchmarks for evaluating commonsense in LLMs: one introduces a method for eliciting commonsense question-answer distributions from human annotators, and the other focuses on assessing the long-tail (uncommon) aspects of commonsense knowledge. These evaluation benchmarks aim to shed light on making LLMs more robust to long-tail knowledge and better at serving diverse populations.


Daphne Ippolito

Assistant Professor, Language Technologies Institute (LTI), School of Computer Science
Carnegie Mellon University (CMU)


Daphne Ippolito is an assistant professor at the Language Technologies Institute at Carnegie Mellon University and a senior research scientist at Google DeepMind. Among other topics, she studies privacy and security issues in LLM systems, strategies for better evaluation of language models, and the customizability of LLMs for different real-world applications.


Talk Title:
Troubles with Training Data for Large Language Models

Abstract:
Language models are trained on billions of words of text. Where does that text come from, and how is it processed and filtered on its way to becoming training data? In this talk, we will examine how seemingly small decisions made when preparing pre-training data can have a significant impact on observed model performance. We will also discuss the problems with relying on the Internet as a primary source for training data. Gradual shifts in how content is posted on the web will limit its future usefulness as training data, and since the Internet is public, anyone can edit it, including adversaries aiming to introduce undesirable behaviours by inserting poisoned text.

Awards

Best Poster in Session #1

Best Poster in Session #2

Best Oral Presentation