'The Price of Labelling: A Two-Phase Federated Self-Learning Approach'

Published: 28 May 2024

Our Distributed AI paper 'The Price of Labelling: A Two-Phase Federated Self-Learning Approach' has been accepted at ECML PKDD 2024 (September 9-13, Vilnius). It is authored by T Aladwani, S Parambath, C Anagnostopoulos, and F Delignianni. Keywords: Federated Learning, Self-learning, Pseudo-labeling, Data Augmentation.

Federated Learning (FL) is a privacy-preserving collaborative learning paradigm that eliminates the need for transferring client data. Most existing studies on FL focus on supervised learning and assume that all clients possess sufficient training data with ground-truth labels, but this assumption may not hold in practical scenarios. In many cases, data are unlabelled due to labelling costs, time constraints, or lack of expertise and resources. To address this challenge, self-learning FL has been introduced, which leverages both labelled and unlabelled data. However, self-learning relies on the availability of ground-truth labels for a subset of data, either on the server or among the clients. Furthermore, such approaches assume that the data are independent and identically distributed (IID). In real-world scenarios, data can be non-IID, leading to common issues such as class imbalance and distribution shift across clients. Unless this data heterogeneity is addressed, it is difficult to create high-quality pseudo-labels. To overcome these challenges, we propose a two-phase FL approach based on data augmentation and self-learning, coined 2PFL. In the first phase, 2PFL addresses data imbalance among fully labelled and partially labelled clients' data by adopting lightweight data augmentation to train a global model. Subsequently, 2PFL employs self-learning using pseudo-labelled data, thereby significantly improving the performance of the global model. Comprehensive experiments and comparative assessments against baselines demonstrate that 2PFL efficiently generates high-quality pseudo-labels and achieves fast convergence and strong performance in classification tasks.
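For readers new to the two ingredients named in the abstract, the following is a minimal illustrative sketch, not the 2PFL implementation. The announcement does not state how client models are aggregated or how pseudo-labels are selected, so the sketch assumes sample-weighted averaging (in the style of FedAvg) and a fixed confidence threshold. The function names `fed_avg` and `pseudo_label`, the threshold value, and the toy numbers are all hypothetical.

```python
from typing import List, Optional, Sequence


def fed_avg(client_weights: Sequence[Sequence[float]],
            client_sizes: Sequence[int]) -> List[float]:
    """Average client parameter vectors, weighted by local sample counts.

    Assumption: FedAvg-style aggregation; the announcement does not say
    which aggregation rule 2PFL uses.
    """
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


def pseudo_label(probs: Sequence[Sequence[float]],
                 threshold: float = 0.9) -> List[Optional[int]]:
    """Return the arg-max class per sample, or None if it is not confident.

    Assumption: a fixed confidence threshold; this is a common heuristic,
    not necessarily the selection rule used in the paper.
    """
    labels: List[Optional[int]] = []
    for p in probs:
        best = max(range(len(p)), key=lambda k: p[k])
        labels.append(best if p[best] >= threshold else None)
    return labels


# Phase 1 (sketch): aggregate models trained locally on labelled data.
global_w = fed_avg([[1.0, 2.0], [3.0, 4.0]], [1, 3])

# Phase 2 (sketch): keep only confident predictions on unlabelled data.
labels = pseudo_label([[0.95, 0.05], [0.6, 0.4], [0.02, 0.98]])
```

In this toy run the second sample is left unlabelled (`None`) because its top probability falls below the assumed threshold; only the confident samples would be added to the next round of local training.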
