Power System Protection and Control

Construction of a multimodal intelligent agent for live-line operation monitoring in distribution networks via first-person lightweight object detection fusion

DOI: 10.19783/j.cnki.pspc.251328

Key Words: first-person vision technology; live-line operation; multimodal system; large language model; multi-scale object detection

Author Name	Affiliation
LIU Zihao	1. College of Electrical Engineering, Sichuan University, Chengdu 610065, China 2. State Grid Chengdu Electric Power Supply Company, Chengdu 610041, China
LIU Youbo	1. College of Electrical Engineering, Sichuan University, Chengdu 610065, China 2. State Grid Chengdu Electric Power Supply Company, Chengdu 610041, China
GONG Haochen	1. College of Electrical Engineering, Sichuan University, Chengdu 610065, China 2. State Grid Chengdu Electric Power Supply Company, Chengdu 610041, China
WAN Li	1. College of Electrical Engineering, Sichuan University, Chengdu 610065, China 2. State Grid Chengdu Electric Power Supply Company, Chengdu 610041, China
YUAN Lin	1. College of Electrical Engineering, Sichuan University, Chengdu 610065, China 2. State Grid Chengdu Electric Power Supply Company, Chengdu 610041, China
LIU Junyong	1. College of Electrical Engineering, Sichuan University, Chengdu 610065, China 2. State Grid Chengdu Electric Power Supply Company, Chengdu 610041, China

Abstract: Live-line operation monitoring in distribution networks is limited by a restricted field of view, insufficient edge computing capability, and the lack of dedicated lightweight intelligent solutions for accurate operation analysis. To address these limitations, a method is proposed for constructing a multimodal intelligent agent that integrates first-person lightweight object detection. First, building on the YOLO11 model, an improved architecture is introduced that incorporates multi-scale attention mechanisms, windmill-shaped convolution, and cross-view interaction modules to enhance multi-scale feature extraction, lightweight low-contrast feature representation, and view robustness. The model adopts the minimum point distance intersection over union (MPDIoU) loss function to refine bounding box regression. Then, first-person view information is extracted from the detection results to generate structured prompts. Finally, a multimodal intelligent agent is constructed by integrating the optimized detection model with DeepSeek-V3, enabling the fusion of visual outputs for quantitative risk assessment and operation analysis. Experimental results show that the proposed method reduces computational complexity to 7.3 GFLOPs and achieves up to a 9.44% improvement in mean precision for multi-class detection with IoU thresholds below 0.5 compared with mainstream single-stage models. The generated outputs outperform those of leading open-source multimodal models in state judgment accuracy, degree of structuring, and interpretability, providing an efficient and scalable solution for multimodal monitoring of live-line operations in distribution networks.
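
The MPDIoU loss named in the abstract can be illustrated with a minimal Python sketch. It assumes the commonly cited definition, in which the IoU is penalized by the squared distances between the top-left corners and between the bottom-right corners of the two boxes, each normalized by the squared diagonal of the input image. The abstract does not state the formula, so the exact variant used by the authors may differ. The function name `mpdiou_loss`, the `(x1, y1, x2, y2)` box format, and the `img_w`/`img_h` parameters are illustrative assumptions, not the authors' implementation.

```python
def mpdiou_loss(pred, target, img_w, img_h):
    """Sketch of an MPDIoU loss for one pair of axis-aligned boxes.

    pred, target: boxes as (x1, y1, x2, y2) with x1 < x2 and y1 < y2
                  (assumed corner format, hypothetical for this example).
    img_w, img_h: width and height of the input image, used to normalize
                  the corner distances.
    """
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target

    # Intersection area (zero when the boxes do not overlap).
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = inter_w * inter_h

    # Union area and plain IoU.
    area_pred = (px2 - px1) * (py2 - py1)
    area_target = (tx2 - tx1) * (ty2 - ty1)
    union = area_pred + area_target - inter
    iou = inter / union if union > 0 else 0.0

    # Squared distances between matching corners.
    d1_sq = (px1 - tx1) ** 2 + (py1 - ty1) ** 2  # top-left corners
    d2_sq = (px2 - tx2) ** 2 + (py2 - ty2) ** 2  # bottom-right corners

    # Normalize by the squared image diagonal.
    norm = img_w ** 2 + img_h ** 2
    mpdiou = iou - d1_sq / norm - d2_sq / norm

    return 1.0 - mpdiou
```

Under this definition the loss is zero only when the two boxes coincide, and the corner-distance terms still vary with box position when the boxes do not overlap and the IoU is zero.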
