A teacher walks between desks, pausing to crouch beside a student's work. Another student points to the board, explaining her approach to the class. These moments—physical positioning, gesture, eye contact—are where belonging happens in math classrooms.

But traditional classroom analysis misses them entirely. Audio transcription captures words, not the teacher circulating to honor a student's thinking. Not the invitation to share multiple solution paths. Not the subtle cues that tell historically marginalized students: you belong here.

We developed a multimodal AI pipeline using Gemini 2.5 Pro to transcribe classroom videos—capturing both speech and visual context. Then we tested whether this richer picture helps identify belonging-centered instructional practices in mathematics classrooms serving primarily Black and Latino students.
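
For the transcription step itself, a minimal sketch of one Gemini call might look like the following, using the google-genai Python SDK. The file name and the prompt wording are illustrative assumptions, not our exact production prompt.

```python
# Minimal sketch: multimodal transcription of one video chunk with Gemini.
# The file name and prompt are illustrative placeholders, and the snippet
# assumes GEMINI_API_KEY is set in the environment.
import time

from google import genai

client = genai.Client()

# Upload one video chunk, then wait until the Files API finishes processing it.
video = client.files.upload(file="classroom_chunk_01.mp4")
while video.state.name == "PROCESSING":
    time.sleep(5)
    video = client.files.get(name=video.name)

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=[
        video,
        "Transcribe this classroom video. Label each speaker, and after each "
        "utterance add a bracketed note on visual context: teacher position "
        "and movement, gestures, student activity, and materials in use.",
    ],
)
print(response.text)
```

Prompting for bracketed visual notes alongside speech is what lets the transcript carry the non-verbal signals discussed below.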

Beyond Words

Our pipeline processes video through five stages: chunking, transcription, merging, format conversion, and validation (a sketch of this stage structure follows the list below). Unlike audio-only systems, it captures visual context as well as speech.

What Audio Misses

  • Teacher positioning and movement
  • Student activities and engagement
  • Instructional materials being used
  • Non-verbal interactions that signal inclusion
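
As promised above, here is the five-stage structure as a runnable Python skeleton. Every function name and body is our own placeholder standing in for the real implementation; it is meant only to make the data flow concrete.

```python
# Skeleton of the five-stage pipeline. All names and bodies are placeholder
# stand-ins, not a library API; each stub marks where the real logic lives.
from pathlib import Path

def chunk_video(video: Path) -> list[Path]:
    """Stage 1: split the video into fixed-length chunks (e.g. with ffmpeg)."""
    return [video]  # placeholder: treat the whole file as one chunk

def transcribe_chunk(chunk: Path) -> str:
    """Stage 2: multimodal transcription of one chunk (see the Gemini sketch above)."""
    return ""  # placeholder transcript

def merge_transcripts(parts: list[str]) -> str:
    """Stage 3: stitch chunk transcripts back together in order."""
    return "\n".join(parts)

def to_standard_format(transcript: str) -> list[dict]:
    """Stage 4: convert raw text into structured utterance records."""
    return [{"speaker": None, "text": line} for line in transcript.splitlines()]

def validate(records: list[dict]) -> list[dict]:
    """Stage 5: sanity-check records; here, just drop empty utterances."""
    return [r for r in records if r["text"].strip()]

def run_pipeline(video: Path) -> list[dict]:
    chunks = chunk_video(video)
    parts = [transcribe_chunk(c) for c in chunks]
    return validate(to_standard_format(merge_transcripts(parts)))
```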

We compared our multimodal approach against three baselines: Whisper (audio-only), GPT4o-Diarize (audio with speaker identification), and Gemini-Audio (audio-focused prompting).

The Results

Multimodal transcription achieved the best accuracy: a Word Error Rate (WER) of 0.14, with error rates below 2.5% across all categories, including speaker identification and content recognition.
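
For reference, WER counts the word-level substitutions, insertions, and deletions needed to turn a hypothesis transcript into the reference, divided by the reference length. It can be computed with the open-source jiwer package; the two transcripts below are invented toy strings, not our classroom data.

```python
# Toy Word Error Rate check with the jiwer package.
import jiwer

reference = "can you explain your approach to the class"
hypothesis = "can you explain her approach to the class"

# One substitution ("your" -> "her") over 8 reference words.
print(jiwer.wer(reference, hypothesis))  # -> 0.125
```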

Visual context dramatically improved detection of belonging-centered practices. Overall precision increased from 0.79 to 0.85, a 7.6% relative gain. False positives dropped from 17.3% to 12.9%.
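
For readers checking the arithmetic, the 7.6% figure is the relative gain, not the absolute difference in precision:

```python
# Arithmetic behind the headline number (values copied from the text).
precision_audio, precision_multimodal = 0.79, 0.85
relative_gain = precision_multimodal / precision_audio - 1
print(f"{relative_gain:.1%}")  # -> 7.6%
```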

The biggest improvements came for practices that depend on non-verbal cues:

  • Honoring Multiple Ways of Knowing: +24.3% precision
  • Inviting Feedback on Instruction: +25.0% precision
  • Opportunities for Agency: +13.6% precision

These are exactly the practices enacted through physical positioning, interaction with materials, and non-verbal invitation—invisible to audio-only analysis.

A Limitation Worth Noting

Under both transcription conditions, the AI failed to identify non-examples—segments where no belonging-centered practice was present. A 100% false negative rate for these cases suggests the model struggles to recognize the absence of inclusive practices, not just their presence.

This asymmetry matters. Identifying what's missing may be just as important as identifying what's there.

Why This Matters

  • Equity-focused research at scale. Multimodal AI enables large-scale analysis of how teachers support historically marginalized students' sense of belonging.
  • No additional training required. Off-the-shelf models can be deployed for classroom analysis without fine-tuning.
  • Teacher professional development. Detailed, automated feedback on belonging-centered practices could support teacher growth.
  • Privacy remains a concern. Even with face blurring, multimodal transcription may infer sensitive attributes from visual context.

The Bottom Line

Belonging isn't just about what teachers say. It's about how they move through the room, whose thinking they elevate, and how they invite students into mathematical authority.

For the first time, we can analyze these practices at scale—seeing what audio alone could never capture.

Thanks for reading!