Security
Objective
To design efficient and generalizable methods for video understanding by combining motion cues from compressed video with the semantic strength of large vision-language models.
Team
This is a collaboration between:
Description
Video has become the dominant form of digital content, yet analyzing it remains computationally expensive. Traditional deep learning models for video recognition demand substantial compute and memory, which makes them impractical for large-scale deployment or for use on lightweight devices.
Our research addresses this challenge by developing methods that combine efficiency with semantic richness. We reuse the motion information that is already embedded in compressed video streams, rather than recomputing it, and pair it with the semantic representations of large pre-trained vision-language models. This integration enables video understanding systems that are both computationally efficient and able to generalize robustly to unseen categories and tasks.
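As an illustration only, the sketch below shows one way such a pairing could work in principle: a motion embedding and an appearance embedding are fused, and the result is matched against text embeddings of category names, so that new categories need only a new text embedding. It is not the method from the publications. The function names, the weighted-sum fusion, the `motion_weight` parameter, and the toy vectors are all assumptions made for this example, and it assumes the motion and appearance embeddings already live in the same space as the text embeddings.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(appearance, motion, motion_weight=0.5):
    """Hypothetical late fusion: weighted sum of the two unit-length embeddings."""
    a = l2_normalize(appearance)
    m = l2_normalize(motion)
    mixed = [(1 - motion_weight) * x + motion_weight * y for x, y in zip(a, m)]
    return l2_normalize(mixed)


def zero_shot_classify(video_embedding, text_embeddings):
    """Return the label whose text embedding has the highest cosine similarity."""
    v = l2_normalize(video_embedding)
    scores = {}
    for label, text_vec in text_embeddings.items():
        t = l2_normalize(text_vec)
        scores[label] = sum(x * y for x, y in zip(v, t))
    best = max(scores, key=scores.get)
    return best, scores


# Toy 3-dimensional vectors standing in for real model outputs (assumed values).
appearance = [1.0, 0.0, 0.0]   # e.g. from a vision-language image encoder
motion = [0.0, 1.0, 0.0]       # e.g. derived from compressed-domain motion vectors
labels = {
    "running": [0.6, 0.8, 0.0],
    "sitting": [1.0, -0.2, 0.0],
}

video = fuse(appearance, motion, motion_weight=0.5)
best, scores = zero_shot_classify(video, labels)
print(best, scores)
```

In this toy setup, adding a category means adding one entry to `labels`; nothing is retrained. How the embeddings are actually produced and combined is described in the publications.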
The long-term goal is to build scalable, real-time video analytics systems that preserve accuracy while drastically lowering computational and data requirements. This work has a direct impact on applications such as intelligent surveillance, multimedia retrieval, human-computer interaction, and edge AI.
Results
Please refer to the publications.