Distributed model training at the edge. The principle of this approach is to aggregate models learned on distributed clients in order to obtain a new, more general model. The resulting model is then redistributed to the clients for further training.
Why does FL struggle with data heterogeneity?
Each client trains its model in federation on non-IID (not independent and identically distributed) data. During model aggregation, the next version of the aggregated model may therefore lose accuracy when deployed back in a client's setting, precisely because of this non-IID data.
Goals of a federated learning model aggregator?
The heterogeneity of devices and users challenges ML with a double objective:
• Generalization: achieving the overall target accuracy in a federated learning setting (pervasive devices).
• Personalization: achieving the target accuracy on each client's AI tasks, with already-seen data as well as new data.
Strategies adopted by model aggregation algorithms?
A key point in federated learning is the way specialized models are aggregated at the server. In the case of deep learning, two families of algorithms implementing different strategies are considered:
• First strategy: emphasize generalization. The aggregation algorithm considers the local models and builds a new model that potentially calls into question all layers and all weights associated with neurons. This approach is exemplified by FedAvg and FedMA.
• Second strategy: in contrast, focus more on client specialization. The algorithm does not call into question certain parts of the local models. Specifically, only the local models' base layers are sent to the server for generalization, while the last layers are kept unchanged. This approach is exemplified by the FedPer algorithm.
Which model aggregation algorithms are available/used?
Federated Averaging (FedAvg):
• Starts with a random initialization of a neural model (the server is in charge of coordinating and managing model transfers with the client devices).
• The resulting model is sent to the clients, which start their local training from it.
• When on-device training is finished, the weights of the local models are sent to the server.
• Aggregation is done by weighted averaging, where clients with more data influence the newly aggregated model more significantly (see the sketch below).
• FedAvg, however, uses a naive form of aggregation: its coordinate-wise averaging may lead to sub-optimal solutions. With non-IID data, neurons at the same coordinate may be used for entirely different purposes because of each client's specialization.
• Averaging neurons that are drastically different degrades the results.
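A minimal sketch of this weighted-averaging step using plain NumPy (the function name and the toy shapes are illustrative, not from the paper):

    import numpy as np

    def fedavg_aggregate(client_weights, client_sizes):
        """Weighted average of client models, proportional to local dataset size."""
        total = float(sum(client_sizes))
        num_layers = len(client_weights[0])
        aggregated = []
        for layer in range(num_layers):
            # coordinate-wise weighted sum of this layer across all clients
            layer_avg = sum(w[layer] * (n / total)
                            for w, n in zip(client_weights, client_sizes))
            aggregated.append(layer_avg)
        return aggregated

    # toy example: two clients, one dense layer of shape (3, 2)
    clients = [[np.ones((3, 2))], [np.zeros((3, 2))]]
    sizes = [300, 100]          # client 0 holds 3x more data and dominates the average
    global_model = fedavg_aggregate(clients, sizes)   # every weight becomes 0.75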
Federated Learning with Personalization Layers (FedPer):
• Similar to FedAvg in the way it computes the new weights of the aggregated model.
• However, it differs strongly in the parts of the model that are considered during aggregation.
• Clients only communicate the neural model's base layers to the server and keep training the upper layers locally.
• The underlying idea is that the base layers deal with “representation learning” and can be advantageously shared by the clients through aggregation.
• The upper layers are more concerned with decision making, which is more specific to each client.
• FedPer clients can thus better handle varied inputs (in the base layers) while being able to specialize on their particular data (in the upper layers). A sketch of this base/personalization split follows below.
• However, the aggregated model on the server side is only partial and is not usable for decision making because of the missing layers.
• FedPer can be seen as an adaptation of the transfer learning methodology to a federated learning scheme. Studies have shown that it can surpass centralized learning and the FedAvg approach in the HAR field.
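A minimal sketch of the FedPer split, assuming each model is a flat list of layer arrays ordered base-first; the split index and helper names are illustrative assumptions:

    import numpy as np

    def fedper_round(client_models, client_sizes, num_base_layers):
        """Aggregate only the base layers (FedAvg-style weighted average);
        the upper personalization layers never leave their client."""
        total = float(sum(client_sizes))
        new_base = [sum(w[i] * (n / total) for w, n in zip(client_models, client_sizes))
                    for i in range(num_base_layers)]
        # each client keeps its own upper layers and receives the shared base
        updated_clients = [new_base + w[num_base_layers:] for w in client_models]
        return new_base, updated_clients

    # toy usage: 2 clients, models with 2 base layers + 1 personalization layer
    c1 = [np.ones((4, 4)), np.ones(4), np.full((4, 2), 5.0)]
    c2 = [np.zeros((4, 4)), np.zeros(4), np.full((4, 2), -5.0)]
    base, clients = fedper_round([c1, c2], [100, 100], num_base_layers=2)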
Federated Matched Averaging (FedMA):
• FedMA modifies the neural model architecture by incorporating a layer-wise aggregation process in which similar neurons can be fused and new ones can be added.
• This approach, an extension for CNNs and RNNs, treats the number of nodes in a layer as a sub-problem to solve rather than a hyper-parameter to be set.
• FedMA considers that the neurons of a NN layer are permutation invariant (changing the neuron order within a layer does not impact the NN output).
• The algorithm's central intuition is that clients can contain neurons that are similar and should be merged together. All neurons in the same cluster are averaged to produce a global neuron.
• To find out which neurons can be fused, the algorithm uses a 2D permutation matrix that is computed iteratively over layers of increasing rank (a simplified matching sketch follows below).
• Experiments with deep CNN and LSTM architectures show that the FedMA algorithm outperforms FedAvg on computer vision datasets.
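A much-simplified sketch of the matching idea for a single layer and a single client, using a Hungarian assignment on pairwise L2 costs; the real FedMA solves a joint matching problem across all clients via permutation matrices and can also grow the layer, so names and shapes here are illustrative only:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def match_and_average(server_layer, client_layer):
        """Permute the client's neurons so each one lines up with its most similar
        server neuron (minimum pairwise L2 cost), then average the matched pairs."""
        # cost[i, j] = distance between server neuron i and client neuron j
        cost = np.linalg.norm(server_layer[:, None, :] - client_layer[None, :, :], axis=2)
        rows, cols = linear_sum_assignment(cost)
        return (server_layer[rows] + client_layer[cols]) / 2.0

    # toy layer: 4 neurons with 3 incoming weights each; the client is a shuffled copy
    server = np.arange(12, dtype=float).reshape(4, 3)
    client = server[[2, 0, 3, 1]]
    fused = match_and_average(server, client)   # recovers the server's neuron order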
Federated Distance (FedDist):
• A novel neuron-matching scheme and federated learning algorithm based on a Euclidean distance dissimilarity measure.
• This algorithm, which includes elements of FedAvg and FedMA, is called Federated Distance (FedDist) for its emphasis on computing distances between neurons of the same coordinates when comparing the client and server models.
• FedDist recognizes, like FedMA, that some client models may diverge because of heterogeneous, non-IID data. This results in neurons that cannot be matched with neurons from other models (because their weights are too far apart).
• FedDist also recognizes that the model structure is relatively stable, which means that neurons with the same coordinates play a similar role. This view provides the opportunity to build a coordinate-wise approach.
• FedDist identifies diverging neurons using Euclidean distances. These neurons, which are specific to certain clients, are added to the aggregated model as new neurons.
• This neuron-adding scheme can lead to larger models that are able to generalize better. As new neurons are added to a layer, a layer-wise training round is added in order to allow the neurons of the next layers to adjust to the new incoming neurons and weights.
• To do this, the layer with the new neurons and those below are frozen, and the subsequent layers are trained. When all layers have been treated, a new aggregated model is computed:
◦ A global server model is computed from all clients.
◦ Outliers are identified using the Euclidean distance and added to the aggregated model.
• A penalty function has also been implemented to raise the threshold as training continues, to prevent the never-ending addition of new neurons. If a neuron in any of the clients holds an individual distance above the threshold, it is added to the server model. The process is performed layer-wise; at each communication round, it starts on the first layer. A sketch of this per-layer outlier detection follows below.
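A sketch of the per-layer diverging-neuron detection described above (the threshold value, the penalty schedule, and the function name are assumptions made for illustration; the paper combines this with the layer-wise partial retraining step):

    import numpy as np

    def feddist_add_outliers(server_layer, client_layers, round_idx,
                             base_threshold=1.0, penalty=0.1):
        """server_layer: (n_neurons, n_inputs) aggregated weights (FedAvg-style).
        client_layers: list of (n_neurons, n_inputs) client weights, same coordinates.
        Neurons whose Euclidean distance to their server counterpart exceeds the
        (round-dependent) threshold are appended to the server layer as new neurons."""
        threshold = base_threshold + penalty * round_idx   # penalty raises the bar each round
        new_neurons = []
        for client_layer in client_layers:
            # coordinate-wise distance between each client neuron and the server neuron
            dists = np.linalg.norm(client_layer - server_layer, axis=1)
            for i in np.where(dists > threshold)[0]:
                new_neurons.append(client_layer[i])        # keep the client-specific neuron
        if new_neurons:
            server_layer = np.vstack([server_layer] + new_neurons)
        return server_layer   # a wider layer means later layers need an extra training pass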
Reference:
https://arxiv.org/abs/2110.10223 Happy reading..