Mahsa Khoshnoodi
Not all seeing is understanding: when a vision-language model (VLM) looks at an image, does it reason or does it pattern-match? I build diagnostic frameworks that locate exactly where and why VLMs fail on fine-grained visual understanding, treating hallucination as a reasoning failure rather than an output artifact. My broader vision is that truly capable VLMs will need more than perception: a structured world model that captures how the visual world works, bridging the gap between seeing, understanding, and acting.
I am actively seeking research internships in multimodal AI and visual reasoning. Feel free to reach out by email (mk2524@georgetown.edu) or through the icons under my bio at the top of the page.
Perception to Reasoning
I investigate how VLMs integrate visual and linguistic information to arrive at decisions. My core observation: even when models reach correct conclusions, their internal reasoning paths are often flawed or biased. I build interpretability tools that function as a microscope for AI, tracing information flow and exposing where perception fails to become genuine reasoning.
Diagnostic Frameworks for VLMs
I develop evaluation frameworks that assess not just whether a model is correct, but whether its reasoning process is valid. Hallucination and bias manifest heterogeneously across layers and architectures, so effective diagnosis requires understanding internal dynamics rather than observing outputs alone.
From Seeing to Acting
My long-term research targets AI systems that perceive, reason, and act reliably in the world. Drawing on ideas from structured world modeling and vision-language-action architectures, I aim to build multimodal systems that remain aligned with human values across the full loop from visual input to real-world decisions.