Cost-effective Data Labelling for Graph Neural Networks

Abstract

Active learning (AL), that aims to label limited data samples to effectively train the model, stands as a very cost-effective data labelling strategy in machine learning. Given the state-of-the-art performance GNNs have achieved in graph-based tasks, it is critical to design proper AL methods for graph neural networks (GNNs). However, existing GNN-based AL methods require considerable supervised information to guide the AL process, such as the GNN model to use, and initially labelled nodes and labels of newly selected nodes. Such dependency on supervised information limits both flexibility and scalabilty. In this paper, we propose an unsupervised, scalable and flexible AL method – it incurs low memory footprints and time cost, is flexible to the choice of underlying GNNs, and operates without requiring GNN-model-specific knowledge or labels of selected nodes. Specifically, we leverage the commonality of existing GNNs to reformulate the unsupervised AL problem as the Aggregation Involvement Maximization (AIM) problem. The objective of AIM is to maximize the involvement or participation of all nodes during the feature aggregation process of GNNs for nodes to be labelled. In this way, the aggregated features of labelled nodes can be diversified to a large extent, thereby benefiting the training of feature transformation matrices which are major trainable components in GNNs. We prove that the AIM problem is NP-hard and propose an efficient solution with theoretical guarantees. Extensive experiments on public datasets demonstrate the effectiveness, scalability and flexibility of our method.

Publication
ACM Web Conference 2024 (WWW), May 13 - 17, 2024, Singapore, Singapore (CORE A*)
Shirui Pan
Shirui Pan
Professor | ARC Future Fellow

My research interests include data mining, machine learning, and graph analysis.