Audio-Visual Speaker Diarization

Low-latency multimodal diarization under streaming constraints.

Designed audio-visual speaker diarization models that run within strict streaming latency budgets. The work explores multimodal fusion of audio and visual cues for real-time speaker attribution and has been submitted to INTERSPEECH 2026.