Hello, I'm Mingru Huang

I am a Master’s student at Wuhan University of Technology. My research interests lie in computer vision, particularly video understanding, including video question answering, video-text retrieval, and video captioning. I also have research and engineering experience in large language models, prompt engineering, low-level operator development and optimization, knowledge graphs, and question-answering systems. My goal is to build a general-purpose multimodal video model that is affordable, secure, and trustworthy for everyone.


News

  • Aug. 2024: Joined a project on automotive maintenance inspection using a multimodal large model.
  • Jul. 2024: Joined the SpConv operator optimization project based on the MetaX MXMACA computing platform.
  • May 2024: The paper “ST-CLIP” was accepted to the ICIC 2024 conference.
  • Jan. 2024: Invited as a reviewer for the ICME 2024 conference.
  • Dec. 2023: Joined the school-enterprise cooperation program with Haluo Corporation, responsible for the AI speech generation component.
  • Nov. 2023: Completed the Transformer heterogeneous Bisheng C++ operator development project of the Huawei Crowd Intelligence Program, responsible for the Adam operator.
  • Sept. 2023: Joined a video understanding project focused on dense video captioning.

Publications

Memory Enhanced Visual-Speech Aggregation Model for Dense Video Captioning

Under review

Introducing a Memory Enhanced Visual-Speech Aggregation model for dense video captioning, inspired by cognitive informatics on human memory recall. The model enhances visual representations by merging them with relevant text features retrieved from a memory bank through multimodal retrieval involving transcribed speech and visual inputs.
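The retrieval step described above can be illustrated with a minimal sketch. This is not the paper's architecture: the query construction (averaging visual and speech embeddings), the top-k cosine retrieval, the mean-pooled aggregation, and the blending weight `alpha` are all simplifying assumptions chosen for clarity.

```python
import numpy as np

def retrieve_and_aggregate(visual, speech, memory_keys, memory_values,
                           k=3, alpha=0.5):
    """Illustrative memory-bank retrieval and fusion (not the paper's model).

    visual, speech : (D,) embeddings of the video clip and its transcribed speech
    memory_keys    : (M, D) keys of the memory bank
    memory_values  : (M, D) text features stored alongside the keys
    """
    # Hypothetical multimodal query: average of the two modalities.
    q = (visual + speech) / 2
    q = q / np.linalg.norm(q)

    # Cosine similarity between the query and every memory key.
    keys = memory_keys / np.linalg.norm(memory_keys, axis=1, keepdims=True)
    sims = keys @ q                      # (M,)

    # Retrieve the k most similar entries and mean-pool their text features.
    topk = np.argsort(sims)[-k:]
    retrieved = memory_values[topk].mean(axis=0)

    # Enhance the visual representation by blending in the retrieved text.
    return alpha * visual + (1 - alpha) * retrieved
```

With `alpha=1.0` the function reduces to the raw visual feature, so `alpha` controls how strongly the retrieved text features influence the enhanced representation.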

ST-CLIP: Spatio-Temporal enhanced CLIP towards Dense Video Captioning

2024 Twentieth International Conference on Intelligent Computing (ICIC 2024)

Proposing a new factorized spatio-temporal self-attention paradigm to address inaccurate event descriptions caused by insufficient modeling of temporal relationships between video frames, and applying it to dense video captioning.
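The factorization idea can be sketched as follows: instead of attending jointly over all (frame, location) pairs, attend along the temporal axis at each spatial location, then along the spatial axis within each frame. This is a generic illustration of factorized space-time attention, not the paper's model; the identity projections (Q = K = V = x) and the temporal-then-spatial ordering are simplifying assumptions.

```python
import numpy as np

def self_attention(x):
    """Scaled dot-product self-attention with identity projections
    (Q = K = V = x), a simplification for illustration. x: (N, D)."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                  # (N, N) similarities
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x                             # (N, D)

def factorized_st_attention(x):
    """Factorized spatio-temporal attention over x of shape (T, S, D):
    T frames, S spatial locations per frame, D channels."""
    T, S, D = x.shape
    # Temporal attention: one sequence of length T per spatial location.
    out = np.stack([self_attention(x[:, s, :]) for s in range(S)], axis=1)
    # Spatial attention: one sequence of length S per frame.
    out = np.stack([self_attention(out[t]) for t in range(T)], axis=0)
    return out
```

The factorization reduces the attention cost from O((T·S)²) for joint space-time attention to O(T·S·(T+S)), which is what makes temporal modeling affordable on long videos.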