| Construction of a multimodal intelligent agent for live-line operation monitoring in distribution networks via first-person lightweight object detection fusion |
| DOI:10.19783/j.cnki.pspc.251328 |
| Key Words: first-person vision technology; live-line operation; multimodal system; large language model; multi-scale object detection |
| Author Name | Affiliation |
| LIU Zihao | 1. College of Electrical Engineering, Sichuan University, Chengdu 610065, China; 2. State Grid Chengdu Electric Power Supply Company, Chengdu 610041, China |
| LIU Youbo | 1. College of Electrical Engineering, Sichuan University, Chengdu 610065, China; 2. State Grid Chengdu Electric Power Supply Company, Chengdu 610041, China |
| GONG Haochen | 1. College of Electrical Engineering, Sichuan University, Chengdu 610065, China; 2. State Grid Chengdu Electric Power Supply Company, Chengdu 610041, China |
| WAN Li | 1. College of Electrical Engineering, Sichuan University, Chengdu 610065, China; 2. State Grid Chengdu Electric Power Supply Company, Chengdu 610041, China |
| YUAN Lin | 1. College of Electrical Engineering, Sichuan University, Chengdu 610065, China; 2. State Grid Chengdu Electric Power Supply Company, Chengdu 610041, China |
| LIU Junyong | 1. College of Electrical Engineering, Sichuan University, Chengdu 610065, China; 2. State Grid Chengdu Electric Power Supply Company, Chengdu 610041, China |
| Abstract: Live-line operation monitoring in distribution networks is limited by a restricted field of view, insufficient edge computing capability, and the lack of dedicated lightweight intelligent solutions for accurate operation analysis. To address these limitations, a method is proposed for constructing a multimodal intelligent agent that integrates first-person lightweight object detection. First, building on the YOLO11 model, an improved architecture is introduced that incorporates multi-scale attention mechanisms, windmill-shaped convolution, and cross-view interaction modules to enhance multi-scale feature extraction, lightweight representation of low-contrast features, and view robustness. The model adopts the minimum point distance intersection over union (MPDIoU) loss function to refine bounding box regression. Then, first-person view information is extracted from the detection results to generate structured prompts. Finally, a multimodal intelligent agent is constructed by integrating the optimized detection model with DeepSeek-V3, so that visual outputs are fused for quantitative risk assessment and operation analysis. Experimental results show that the proposed method reduces computational complexity to 7.3 GFLOPs and achieves up to a 9.44% improvement in mean precision for multi-class detection with IoU thresholds below 0.5 compared to mainstream single-stage models. The generated outputs outperform leading open-source multimodal models in state judgment accuracy, degree of structuring, and interpretability, providing an efficient and scalable solution for multimodal monitoring of live-line operations in distribution networks. |
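The MPDIoU loss named in the abstract can be illustrated with a minimal sketch. It follows the commonly published definition of MPDIoU (IoU minus the squared distances between the two boxes' top-left corners and bottom-right corners, each normalized by the squared diagonal of the input image), not code from this paper. The function name `mpdiou_loss`, the `(x1, y1, x2, y2)` box layout, and the scalar (non-batched) form are assumptions made for illustration.

```python
def mpdiou_loss(pred, target, img_w, img_h):
    """Illustrative MPDIoU loss for one pair of boxes (hypothetical helper).

    pred, target: (x1, y1, x2, y2) corner coordinates in pixels.
    img_w, img_h: width and height of the input image, used for normalization.
    """
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target

    # Intersection area (zero when the boxes do not overlap).
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = inter_w * inter_h

    # Union area and plain IoU.
    area_p = (px2 - px1) * (py2 - py1)
    area_t = (tx2 - tx1) * (ty2 - ty1)
    union = area_p + area_t - inter
    iou = inter / union if union > 0 else 0.0

    # Squared distances between matching corners.
    d1_sq = (px1 - tx1) ** 2 + (py1 - ty1) ** 2  # top-left corners
    d2_sq = (px2 - tx2) ** 2 + (py2 - ty2) ** 2  # bottom-right corners

    # Normalize by the squared image diagonal.
    diag_sq = img_w ** 2 + img_h ** 2
    mpdiou = iou - d1_sq / diag_sq - d2_sq / diag_sq
    return 1.0 - mpdiou
```

Under this definition the loss is zero only when the two boxes coincide, and it keeps growing with corner distance even when the boxes do not overlap, which is the property that makes it useful for refining bounding box regression.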