Mahsa Khosh
Multimodal AI, Generative AI, Reasoning
I am a first-year PhD student in Computer Science at Georgetown University, advised by Dr. Sarah Adel Bargal in the Georgetown University Computer Vision (GUCV) Lab. I also collaborate closely with Dr. Michael Saxon in the UCSB NLP group. My research focuses on developing multimodal AI systems that perform explicit reasoning over visual and linguistic inputs. I work on vision-language models (VLMs) that go beyond pattern matching to construct interpretable reasoning chains, enabling models to ground their decisions in visual evidence, articulate their intermediate steps, and align their outputs with human cognitive processes.
Specifically, I focus on:
- Multimodal reasoning: Integrating visual scenes, spatial relationships, and textual context to solve complex tasks requiring geometric awareness and positional inference.
- Visual grounding and interpretability: Developing methods for VLMs to explain predictions through attention mechanisms, reasoning graphs, or natural language justifications.
- Alignment via verifiability: Using spatial grounding and explicit reasoning to ensure model faithfulness, mitigating hallucinations and shortcut learning by requiring models to produce human-auditable logic for their decisions.
Ultimately, my goal is to build multimodal AI systems whose reasoning over visual and linguistic information is transparent, interpretable, and aligned with human understanding.