Abstract: To address the limitations of live-line operation monitoring in distribution networks, including a restricted field of view, insufficient edge computing capability, and the lack of dedicated lightweight intelligent solutions for accurate operation analysis, a method is proposed for constructing a multimodal intelligent agent that integrates first-person lightweight object detection. First, an improved architecture is built on the YOLO11 model, incorporating multi-scale attention mechanisms, windmill-shaped convolution, and cross-view interaction modules to enhance multi-scale feature extraction, lightweight representation of low-contrast features, and robustness to viewpoint changes. The model adopts the minimum point distance intersection over union (MPDIoU) loss function to refine bounding box regression. Then, first-person view information is extracted from the detection results to generate structured prompts. Finally, a multimodal intelligent agent is constructed by integrating the optimized detection model with DeepSeek-V3, enabling the fusion of visual outputs for quantitative risk assessment and operation analysis. Experimental results show that the proposed method reduces computational complexity to 7.3 GFLOPs and achieves up to a 9.44% improvement in mean precision for multi-class detection with IoU thresholds below 0.5 compared with mainstream single-stage models. The generated outputs outperform those of leading open-source multimodal models in state judgment accuracy, degree of structure, and interpretability, providing an efficient and scalable solution for multimodal monitoring of live-line operations in distribution networks.
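As an illustration of the bounding box regression term named above, the following is a minimal sketch of the MPDIoU loss for a single pair of boxes, based on the commonly cited definition: the IoU minus the squared distances between the two top-left corners and between the two bottom-right corners, each normalized by the squared diagonal of the input image. The function name `mpdiou_loss`, the `(x1, y1, x2, y2)` box layout, and the `eps` stabilizer are assumptions made for this example; the abstract does not describe how the loss is implemented inside the improved YOLO11 model.

```python
def mpdiou_loss(pred, target, img_w, img_h, eps=1e-9):
    """Sketch of the MPDIoU loss for one predicted and one ground-truth box.

    Boxes are (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2 (assumed layout).
    img_w and img_h are the width and height of the input image.
    """
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target

    # Intersection area (zero when the boxes do not overlap).
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = inter_w * inter_h

    # Union area and plain IoU; eps guards against division by zero.
    area_p = (px2 - px1) * (py2 - py1)
    area_t = (tx2 - tx1) * (ty2 - ty1)
    iou = inter / (area_p + area_t - inter + eps)

    # Squared distances between matching corners.
    d1_sq = (px1 - tx1) ** 2 + (py1 - ty1) ** 2  # top-left corners
    d2_sq = (px2 - tx2) ** 2 + (py2 - ty2) ** 2  # bottom-right corners

    # Normalize by the squared image diagonal.
    diag_sq = img_w ** 2 + img_h ** 2
    mpdiou = iou - d1_sq / diag_sq - d2_sq / diag_sq

    return 1.0 - mpdiou


if __name__ == "__main__":
    # Hypothetical boxes on a 640 x 640 input, for illustration only.
    print(mpdiou_loss((100, 100, 200, 200), (110, 105, 210, 195), 640, 640))
```

Because the corner-distance terms stay non-zero even when the boxes do not overlap, the loss keeps distinguishing near misses from far misses in that case, whereas plain IoU is zero for both.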