Construction of a multimodal intelligent agent for live-line operation monitoring in distribution networks via first-person lightweight object detection fusion
DOI:10.19783/j.cnki.pspc.251328
Key Words: first-person vision technology; live-line operation; multimodal system; large language model; multi-scale object detection
Author Name    Affiliation
LIU Zihao 1. College of Electrical Engineering, Sichuan University, Chengdu 610065, China
2. State Grid Chengdu Electric Power Supply Company, Chengdu 610041, China 
LIU Youbo 1. College of Electrical Engineering, Sichuan University, Chengdu 610065, China
2. State Grid Chengdu Electric Power Supply Company, Chengdu 610041, China 
GONG Haochen 1. College of Electrical Engineering, Sichuan University, Chengdu 610065, China
2. State Grid Chengdu Electric Power Supply Company, Chengdu 610041, China 
WAN Li 1. College of Electrical Engineering, Sichuan University, Chengdu 610065, China
2. State Grid Chengdu Electric Power Supply Company, Chengdu 610041, China 
YUAN Lin 1. College of Electrical Engineering, Sichuan University, Chengdu 610065, China
2. State Grid Chengdu Electric Power Supply Company, Chengdu 610041, China 
LIU Junyong 1. College of Electrical Engineering, Sichuan University, Chengdu 610065, China
2. State Grid Chengdu Electric Power Supply Company, Chengdu 610041, China 
Abstract: Live-line operation monitoring in distribution networks is limited by a restricted field of view, insufficient edge computing capability, and the lack of dedicated lightweight intelligent solutions for accurate operation analysis. To address these limitations, a method is proposed for constructing a multimodal intelligent agent that integrates first-person lightweight object detection. First, building on the YOLO11 model, an improved architecture is introduced that incorporates multi-scale attention mechanisms, windmill-shaped convolution, and cross-view interaction modules to enhance multi-scale feature extraction, lightweight representation of low-contrast features, and view robustness. The model adopts the minimum point distance intersection over union (MPDIoU) loss function to refine bounding box regression. Then, first-person view information is extracted from the detection results to generate structured prompts. Finally, a multimodal intelligent agent is constructed by integrating the optimized detection model with DeepSeek-V3, so that visual outputs are fused for quantitative risk assessment and operation analysis. Experimental results show that the proposed method reduces computational complexity to 7.3 GFLOPs and achieves up to a 9.44% improvement in mean precision for multi-class detection with IoU thresholds below 0.5 compared with mainstream single-stage models. The generated outputs outperform leading open-source multimodal models in state judgment accuracy, degree of structuring, and interpretability, providing an efficient and scalable solution for multimodal monitoring of live-line operations in distribution networks.
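The abstract names the MPDIoU loss but does not spell it out. As a rough illustration only, the sketch below follows the commonly published definition: IoU minus the squared distances between the two boxes' top-left corners and bottom-right corners, each normalized by the squared diagonal of the input image. The function name `mpdiou_loss`, the `(x1, y1, x2, y2)` pixel box format, and the scalar (non-batched) form are assumptions made for this example and are not taken from the paper's implementation.

```python
def mpdiou_loss(pred, target, img_w, img_h):
    """Illustrative MPDIoU loss for one box pair (hypothetical helper).

    Boxes are (x1, y1, x2, y2) in pixels; img_w and img_h are the
    width and height of the input image used for normalization.
    """
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target

    # Intersection area, clamped to zero when the boxes do not overlap.
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = inter_w * inter_h

    # Union area and plain IoU.
    union = (px2 - px1) * (py2 - py1) + (tx2 - tx1) * (ty2 - ty1) - inter
    iou = inter / union if union > 0 else 0.0

    # Squared distances between matching corners.
    d1_sq = (px1 - tx1) ** 2 + (py1 - ty1) ** 2  # top-left corners
    d2_sq = (px2 - tx2) ** 2 + (py2 - ty2) ** 2  # bottom-right corners

    # Normalize by the squared image diagonal.
    norm = img_w ** 2 + img_h ** 2
    mpdiou = iou - d1_sq / norm - d2_sq / norm
    return 1.0 - mpdiou


if __name__ == "__main__":
    # Identical boxes give zero loss; disjoint boxes are still penalized
    # by corner distance even though their IoU is zero.
    print(mpdiou_loss((0, 0, 10, 10), (0, 0, 10, 10), 100, 100))
    print(mpdiou_loss((0, 0, 10, 10), (20, 20, 30, 30), 100, 100))
```

Because the corner-distance terms stay nonzero when the boxes do not overlap, the loss keeps a useful signal in cases where a plain IoU loss would be flat. A training implementation would normally be batched and differentiable in a tensor framework rather than written in scalar form.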