Hi 👋, I am a fourth-year PhD candidate at the University of Maryland, College Park, currently on both the industry and academic job markets. I work on human-grounded generative AI: systems that don't just generate text, images, diagrams, and video, but reason about the people, purposes, and worlds behind those outputs. I am advised by Prof. Jordan Boyd-Graber in the Department of Computer Science.
My research swaps the field's usual target — fluency and similarity — for human utility, across three threads: inferring intent, expertise, and mental models from how people naturally express themselves through implicit and explicit actions (edits, comments, critiques); maintaining coherent multimodal world models of physical and social state across long interactions; and turning dense scientific papers and figures into faithful, audience-aware diagrams, slides, and explanatory videos.
During my PhD I worked with several research groups. I was a Student Researcher at the Google Research Lab, hosted by Yiwen Song and Yale Song; spent three summers at the Adobe Research Lab; and spent a beautiful summer at the Microsoft Research Lab in Redmond, hosted by Jennifer Neville and Jay Stokes. I have also collaborated with Prof. Rachel Rudinger and Prof. Tianyi Zhou on human-centered evaluation.
Prior to starting my PhD, I worked as a Research Fellow at Microsoft Research India with Monojit Chowdhury and Kalika Bali, and collaborated with researchers from several other institutes.
Feel free to reach out if you're interested in research collaboration!
Research
I build AI that infers the hidden context behind what people do (their intent, expertise, and mental models) and uses it to generate artifacts judged by whether they actually help, not by how closely they mirror a reference. My work is organized around the following major themes:
AI for Scientific Communication Tools 4 papers
Turning dense papers and figures into diagrams, slides, and videos that are faithful, audience-aware, and useful for real understanding.
Helping Figures Tell their Story! Paper-Grounded Video Generation Explaining Complex Scientific Figures arXiv 2026
Scientific figures compress complex pipelines into a single canvas, yet understanding them requires paper-grounded, step-by-step narration aligned with visual highlights, a capability missing from current video generation systems and benchmarks. To address this, we introduce paper-grounded figure-to-video generation: generating narrated, region-grounded walkthrough videos from a figure and its paper. We propose MINARD (Multimodal Interpretation of Narrated Architecture via Region Decomposition), a pipeline that generates paper-grounded narrations and sequentially grounds them to figure regions. We also release FigTalk, a benchmark with new sequential and component-level grounding metrics. On FigTalk, MINARD generates humanlike, paper-faithful narrations and outperforms existing approaches on narration-conditioned figure spatial grounding in both automatic and human evaluation.
SciDoc2Diagrammer-MAF: Generating Scientific Diagrams from Documents via Multi-Aspect Feedback Refinement EMNLP 2024
Automating scientific-diagram creation from academic papers can streamline tutorials, presentations, and posters. Current text-to-image models struggle to produce accurate, appealing diagrams from long-context inputs. We propose SciDoc2Diagram — a task that extracts relevant information from papers and generates diagrams — with a benchmark, SciDoc2DiagramBench. Our pipeline, SciDoc2Diagrammer, generates diagrams from user intentions via intermediate code generation. Because initial drafts were often incomplete or unfaithful, we add Multi-Aspect Feedback (MAF), a refinement strategy that substantially improves factual correctness and visual appeal, outperforming existing models on automatic and human judgments.
Presentations by the Humans and For the Humans: Harnessing LLMs for Generating Persona-Aware Slides from Documents EACL 2024 Oral
Papers and slides are two representations of the same information, but both take substantial work to prepare. Prior document-to-slides efforts ignore the need to tailor presentation to the audience's persona or the talk's duration. We introduce end-user-specification-aware document-to-slides conversion: starting from the SciDuet dataset of paper/slide-deck pairs from recent *ACL conferences, we build four persona-aware configurations, then present Persona-Aware-D2S, which finetunes LLMs with target-audience feedback to create persona-aware slides. Automated metrics and human evaluation show the model produces presentations that are informative and tailored to the expectations and cognitive abilities of the target audience.
SMART-Editor: A Multi-Agent Framework for Human-Like Design Editing with Structural Integrity EACL 2026
Despite progress in natural-image editing with MLLMs, compositional layout and content editing for structured visual domains (posters, websites) remains under-explored. SMART-Editor is a multi-agent framework for compositional editing of structured images. Unlike prior models focused on isolated local edits, it maintains global coherence through two complementary strategies: Reward-Refine, an inference-time reward-guided refinement method, and Reward-DPO, a training-time preference-optimization approach over reward-aligned layout pairs. We introduce SMART-Edit-Bench, a benchmark of cascading multi-step edits that are implicit yet require layout- and semantics-preserving reasoning about edit order. Automatic and human evaluations confirm that reward-guided planning produces semantically consistent, visually coherent edits beyond what single-shot VLMs generate.
Human-Centered Evaluation 4 papers
Large language models are effective human annotation assistants, but not good independent annotators ACL 2026
Event annotation is important for identifying market changes, monitoring breaking news, and understanding sociological trends. Although expert annotators set the gold standards, human coding is expensive and inefficient. Unlike information extraction experiments that focus on single contexts, we evaluate a holistic workflow that removes irrelevant documents, merges documents about the same event, and annotates the events. Although LLM-based automated annotations are better than traditional TF-IDF-based methods or Event Set Curation, LLMs are still not reliable annotators compared to human experts. However, adding LLMs to assist experts for Event Set Curation can reduce the time and mental effort required for Variable Annotation. When LLMs extract event variables to assist expert annotators, the experts agree more with the extracted variables than with fully automated LLM annotations.
A Good Talk Doesn't Look Like a Summary, it Teaches You! Measuring Takeaways from Paper-to-Video Generation arXiv 2026
Automatically generated videos from scientific papers are increasingly used for education and research dissemination. However, existing evaluation metrics mainly measure visual quality or whether key points from the paper appear in the video, without assessing whether the video actually helps viewers understand the ideas. We introduce EffectivePresentationScorer, a framework for evaluating the instructional quality of scientific presentation videos. It checks whether a video explains the main ideas clearly, introduces needed background concepts, and connects technical details to the paper's main contribution. When we apply EffectivePresentationScorer to existing paper-to-video generation systems, we find that generated videos mention the correct topics and follow the structure of the paper but fail to explain prerequisite concepts or clarify why the method works. These failures are often ignored by existing video evaluation metrics, which focus on content presence rather than explanatory quality.
Benchmarked Yet Not Measured -- Generative AI Should be Evaluated Against Real-World Utility arXiv 2026
Generative AI systems achieve impressive performance on standard benchmarks yet fail to deliver real-world utility, a disconnect we identify across 28 deployment cases spanning education, healthcare, software engineering, and law. We argue that this benchmark-utility gap arises from three recurring failures in evaluation practice: proxy displacement, temporal collapse, and distributional concealment. Motivated by these observations, we argue that generative AI evaluation requires a paradigm shift from static, benchmark-centered transparency toward stakeholder-, goal-, and context-conditioned utility transparency grounded in human outcome trajectories. Existing evaluations primarily characterize properties of model outputs, while deployment success depends on whether interaction with AI improves stakeholders' ability to achieve their goals over time. The missing construct is therefore utility: the change in a stakeholder's capability induced through sustained interaction with an AI system within a deployment context. To operationalize this perspective, we propose SCU-GenEval, a four-stage evaluation framework consisting of stakeholder-goal mapping, construct-indicator specification, mechanism modeling, and longitudinal utility measurement. To make these stages practically deployable, we introduce three supporting instruments: structured deployment protocols, context-conditioned user simulators, and persona- and goal-conditioned proxy metrics. We conclude with domain-specific calls to action, arguing that progress in generative AI must be evaluated through measurable improvements in human outcomes rather than benchmark performance alone.
PEDANTS: Cheap but Effective and Interpretable Answer Equivalence EMNLP 2024
Question answering only progresses if we can tell whether an answer is correct, but current answer-correctness metrics struggle with the verbose, free-form answers of LLMs. Two challenges dominate short-form QA evaluation: a lack of diverse evaluation data and over-reliance on expensive, slow LLM scorers. We provide rubrics and datasets adopted from the Trivia community and propose an efficient, interpretable QA evaluation that is more stable than exact match and neural methods such as BERTScore.
Is your benchmark truly adversarial? AdvScore: Evaluating Human-Grounded Adversarialness NAACL 2025 Outstanding Paper Award
Adversarial datasets should validate AI robustness with samples humans handle well but models do not — yet as models evolve, datasets become obsolete, and there is no standardized metric for measuring how adversarial a dataset remains. We propose ADVSCORE, a human-grounded metric that captures models' and humans' varying abilities while identifying poor examples. ADVSCORE drives a new dataset-creation pipeline for realistic, high-quality adversarial samples, which we use to collect ADVQA. Applying it across 9,347 human responses and ten models' predictions over 2020–2024, we track model improvement and provide guidance for achieving robustness comparable to human capabilities, ensuring adversarial datasets test real capability rather than outdated or artificial difficulty.
SUPER-NATURALINSTRUCTIONS: Generalization via Declarative Instructions on 1600+ NLP Tasks EMNLP 2022
How well can NLP models generalize to a variety of unseen tasks when provided with task instructions? To address this question, we first introduce SUPER-NATURALINSTRUCTIONS, a benchmark of 1,616 diverse NLP tasks and their expert-written instructions. Our collection covers 76 distinct task types, including but not limited to classification, extraction, infilling, sequence tagging, text rewriting, and text composition. This large and diverse collection of tasks enables rigorous benchmarking of cross-task generalization under instructions: training models to follow instructions on a subset of tasks and evaluating them on the remaining unseen ones. Furthermore, we build Tk-INSTRUCT, a transformer model trained to follow a variety of in-context instructions (plain language task definitions or k-shot examples). Our experiments show that Tk-INSTRUCT outperforms existing instruction-following models such as InstructGPT by over 9% on our benchmark despite being an order of magnitude smaller. We further analyze generalization as a function of various scaling parameters, such as the number of observed tasks, the number of instances per task, and model sizes. We hope our dataset and model facilitate future progress towards more general-purpose NLP models.
Uncovering People from Interactions and Alignment Towards Their Needs 4 papers
Recovering intent, expertise, and mental models from the traces of how people work — edits, comments, critiques, preferences — rather than from explicit labels.
Group Preference Alignment: Customized LLM Response Generation from In-Situ Conversations EMNLP 2025
LLMs often fail to meet the specialized needs of distinct user groups due to their one-size-fits-all training paradigm, and there is limited research on what personalization aspects each group expects. We propose Group Preference Alignment (GPA), a group-aware framework that identifies context-specific variations in conversational preferences across user groups and steers LLMs to address them. It has two steps: (1) Group-Aware Preference Extraction, where maximally divergent user-group preferences are mined from real conversation logs and distilled into interpretable rubrics, and (2) Tailored Response Generation via either Context-Tuned Inference (GPA-CT), which adjusts responses through context-dependent prompt instructions, or Rubric-Finetuning Inference (GPA-FT), which uses the rubrics to generate contrastive synthetic data for group-specific alignment. Experiments show significant gains in preference alignment over baselines while maintaining robust performance on standard benchmarks.
Learning User Mental Models for Personalized Creation and Collaborative Work Splitting Ongoing Work
When and Where Does Personalization Help and Hurt? Modelling People's Needs From Longitudinal Traces Ongoing Work
Adaptive IE: Investigating the Complementarity of Human–AI Collaboration to Adaptively Extract Information on-the-fly COLING 2025 Oral
Information-extraction needs vary over time, so a flexible IE system is valuable — yet existing systems are either fully supervised (expensive annotation) or fully unsupervised (output that ignores user needs). We formally introduce "IE on-the-fly" and address it with Adaptive IE, which uses human-in-the-loop refinement to adapt to changing user questions. Through human experiments on three diverse datasets, we show Adaptive IE is a domain-agnostic, responsive, and efficient framework that helps users access useful information while quickly reorganizing it in response to evolving needs.
Multimodal World State 2 papers
Keeping physical and social state coherent — objects, layouts, constraints, gestures, roles — across long, multi-turn generation and editing.
CANVAS: Continuity-Aware Narratives via Visual Agentic Storyboard Generation arXiv 2026
Long-form visual storytelling requires maintaining continuity across shots, including consistent characters, stable environments, and smooth scene transitions. While existing generative models can produce strong individual frames, they fail to preserve such continuity, leading to appearance changes, inconsistent backgrounds, and abrupt scene shifts. We introduce CANVAS (Continuity-Aware Narratives via Visual Agentic Storyboarding), a multi-agent framework that explicitly plans visual continuity in multi-shot narratives. CANVAS enforces coherence through character continuity, persistent background anchors, and location-aware scene planning for smooth transitions within the same setting. We evaluate CANVAS on two storyboard generation benchmarks, ST-BENCH and ViStoryBench, and introduce a new challenging benchmark, HardContinuityBench, for long-range narrative consistency. CANVAS consistently outperforms the best-performing baseline, improving background continuity by 21.6%, character consistency by 9.6%, and props consistency by 7.6%.
Correlating Instruction-Tuning (in Multimodal Models) with Vision-Language Processing (in the Brain) ICLR 2025
Instruction-tuned multimodal LLMs show stronger alignment with brain activity during natural-scene viewing than vision-only models, especially when processing task-specific instructions like image captioning and visual question answering. However, not all instructions contribute equally to brain alignment, highlighting the need for more precise instruction encoding to better predict neural responses.