
With the widespread application of distributed systems, many problems urgently need distributed solutions, and the design of distributed optimization strategies has become a research hotspot. This article focuses on the convergence rate of distributed convex optimization algorithms. Each agent in the network has its own convex cost function, and we consider a gradient-based distributed method, the push-pull gradient algorithm, to minimize the total cost function. Inspired by current multi-agent consensus protocols for distributed convex optimization, a distributed convex optimization algorithm with finite-time convergence is proposed and studied. Finally, based on a fixed undirected network topology, a fast-converging distributed cooperative learning method built on a linearly parameterized neural network is proposed. Unlike existing distributed convex optimization algorithms, which achieve at best exponential convergence, this algorithm achieves finite-time convergence. Its convergence is guaranteed by the Lyapunov method, and simulation examples illustrate its effectiveness. Compared with other algorithms, it is competitive.

Consider a network with N nodes. Each node on the network has its own cost function, expressed as f i : ℝ n → ℝ , i = 1 , 2 , ⋯ , N , which is strictly convex. All nodes cooperate to reach the optimal value of the target cost function.

x * = arg min F ( x ) (1-1)

Among them, F ( x ) = ∑ i = 1 N f i ( x ) , and x * ∈ ℝ n is the optimal value of the function F ( x ) . Generally, a problem of the form (1-1) is called an unconstrained convex optimization problem.

At present, a series of algorithms for problem (1-1) have been extensively studied. In general, these algorithms can be divided into two categories: discrete-time algorithms and continuous-time algorithms.

Based on the above research, a finite-time convergence algorithm that uses the Hessian inverse matrix to solve problem (1-1) is proposed in this chapter. This algorithm is inspired by existing work in the literature.

Distributed optimization theory and its applications have become one of the important development directions of contemporary systems and control science. The design of optimization algorithms, proofs of convergence, and algorithm complexity analysis are key issues in this research. According to whether the objective function is convex, problems can be divided into two categories: distributed non-convex optimization and distributed convex optimization. Because convexity has many useful properties, solving distributed convex optimization is comparatively simpler; therefore, non-convex optimization problems are often converted into convex ones before being solved. For distributed convex optimization, the objective function is generally the sum of the local objective functions of the nodes in the network. Common research methods include the gradient descent method, including the hybrid steepest descent method.

Consensus is the theoretical basis of distributed computing and an important performance indicator of distributed optimization and distributed cooperative learning, and convergence is a key indicator of consensus algorithms. However, most of the existing literature only establishes asymptotic consensus convergence results.

For distributed learning, the learning speed is as important as the learning effect. At present, many algorithms are dedicated to finding an optimal learning strategy.

Based on existing research results in related fields, this paper proposes a finite-time-convergent distributed optimization algorithm and a fast-converging distributed cooperative learning algorithm, whose effectiveness is verified both theoretically and experimentally. First, a new distributed optimization method and its graph variants are introduced. Based on this, a neural-network-based finite-time convergence algorithm is used to solve distributed strongly convex optimization over a fixed undirected topology in finite time. The proposed distributed convex optimization algorithm gives an explicit upper bound on the convergence time, which depends on the initial state of the algorithm, the algorithm parameters, and the network topology graph. Secondly, the proposed distributed cooperative learning algorithm preserves privacy: the global optimization goal is solved by exchanging only the learning weights of the neural network. Unlike previous distributed cooperative learning algorithms, which only achieve asymptotic or exponential convergence, this algorithm achieves rapid, finite-time convergence.

We first give the notation and basic assumptions in Section 1.4. Section 2 introduces the push-pull gradient algorithm and proves its convergence. Section 3 presents the finite-time convergence algorithm and its proof of convergence. Section 4 introduces the push-pull fast-converging distributed cooperative learning algorithm, demonstrates its convergence, and gives numerical simulations. Section 5 provides simulations and comparisons with other algorithms to show their competitiveness, and gives the conclusion.

We start with a brief description of the symbols used later. ℝ and ℝ + denote the set of real numbers and the set of non-negative real numbers, respectively; ‖ ⋅ ‖ denotes the Euclidean norm on ℝ n .

Consider the following system

x ˙ ( t ) = g ( t , x ( t ) ) , g ( t , 0 ) = 0 , x ∈ U 0 ⊂ ℝ n (1-2)

where g : U 0 × ℝ + → ℝ n is continuous on an open neighborhood U 0 of the origin x = 0 . Suppose there is a continuous positive definite Lyapunov function V ( x ( t ) ) on the set U × ℝ + , where U ⊂ U 0 is a neighborhood of the origin. If there exist real numbers λ > 0 and a ∈ ( 0 , 1 ) such that V ˙ ≤ − λ V a holds on U , then the system is stable in finite time, and the bound of its convergence time T is

T ≤ V 1 − a ( x ( t 0 ) ) λ ( 1 − a ) (1-3)
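As a numerical illustration of the bound (1-3), consider the scalar system x ˙ = − λ sign ( x ) | x | a (an assumed example, not from the text): taking V ( x ) = | x | gives V ˙ = − λ V a , so (1-3) predicts settling no later than T = V 1 − a ( x 0 ) / ( λ ( 1 − a ) ) . A short sketch checks this:

```python
import numpy as np

# Assumed scalar example: x' = -lam * sign(x) * |x|^a.
# With V(x) = |x| we have V' = -lam * V^a, so the bound (1-3) gives
# T <= V(x0)^(1-a) / (lam * (1 - a)).
lam, a, x0 = 2.0, 0.5, 1.0
T_bound = abs(x0) ** (1 - a) / (lam * (1 - a))  # = 1.0 for these values

dt, t, x = 1e-5, 0.0, x0
while abs(x) > 1e-6:                 # forward-Euler until numerically settled
    x -= dt * lam * np.sign(x) * abs(x) ** a
    t += dt

print(f"settled at t = {t:.3f}, bound T = {T_bound:.3f}")
assert t <= T_bound + 1e-2           # finite-time convergence within the bound
```

For this example the exact solution satisfies | x ( t ) | 1 − a = 1 − λ ( 1 − a ) t , so the state settles essentially at the bound; as a → 1 the decay degrades to exponential and the settling time diverges.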

For a linearly parameterized neural network with m-dimensional input, n-dimensional output, and l hidden neurons, the model can be written as

f ( x ) = ∑ i = 1 l s i ( x ) w i = S ( x ) W (1-4)

where x ∈ ℝ m represents the m-dimensional input vector, s i represents the output of the i-th hidden node, and w i ∈ ℝ n is the neural network learning weight connecting the i-th hidden node to the output nodes.
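A minimal sketch of the model (1-4); the hidden-node outputs s i ( x ) are taken to be Gaussian radial basis functions here, which is an assumption since the text does not fix a particular hidden-layer type:

```python
import numpy as np

# Linearly parameterized network (1-4): f(x) = S(x) W.
# Gaussian RBF hidden nodes are an illustrative assumption.
rng = np.random.default_rng(0)
m, n, l = 2, 1, 10                    # input dim, output dim, hidden nodes
centers = rng.uniform(-1, 1, (l, m))  # hypothetical RBF centers

def S(x):
    """Hidden-layer output S(x) in R^{1 x l} for one input x in R^m."""
    d2 = ((x - centers) ** 2).sum(axis=1)
    return np.exp(-d2 / 0.5)[None, :]

W = rng.normal(size=(l, n))           # learning weights w_i stacked as rows

x = np.array([0.3, -0.2])
f_x = S(x) @ W                        # f(x) = S(x) W, shape (1, n)
print(f_x.shape)                      # (1, 1)
```

The key point is that f ( x ) is linear in the weights W even though S ( x ) is nonlinear in x , which is what makes the ridge-type solutions later in the chapter available in closed form.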

In this section, vectors are columns by default. Let N = { 1 , 2 , ⋯ , n } be a group of agents. Each agent i ∈ N holds a local copy of the decision variable x i ∈ ℝ p and an auxiliary variable y i ∈ ℝ p that tracks the average gradient; their values at iteration k are denoted x i , k and y i , k , respectively. We use { ⋅ } to denote the trajectory of a matrix by default. Let:

x = [ x 1 , x 2 , x 3 , ⋯ , x n ] T ∈ ℝ p ∗ n (2-1.a)

y = [ y 1 , y 2 , y 3 , ⋯ , y n ] T ∈ ℝ p ∗ n (2-1.b)

Define F ( x ) as the sum of the local cost functions

F ( x ) = ∑ i = 1 n f i ( x ) , (2-2)

and stack the local gradients as

∇ F ( x ) = [ ∇ f 1 ( x 1 ) T , ∇ f 2 ( x 2 ) T , ⋯ , ∇ f n ( x n ) T ] ∈ ℝ p ∗ n (2-3)

Definition 2.1 Given an arbitrary vector norm ‖ ⋅ ‖ on ℝ p , for any x ∈ ℝ p ∗ n we define

‖ x ‖ = ‖ [ ‖ x ( 1 ) ‖ , ‖ x ( 2 ) ‖ , ⋯ , ‖ x ( p ) ‖ ] ‖ 2 , (2-5)

where x ( 1 ) , x ( 2 ) , ⋯ , x ( p ) ∈ ℝ n are the columns of x .
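When the vector norm on ℝ n inside (2-5) is itself the Euclidean norm, the definition reduces to the Frobenius norm of x ; a small check (illustrative code, not from the text):

```python
import numpy as np

# Definition 2.1: apply a vector norm to each column of x, then take
# the 2-norm of the resulting vector of p column norms.
def mixed_norm(x, col_norm=np.linalg.norm):
    return np.linalg.norm([col_norm(x[:, j]) for j in range(x.shape[1])])

x = np.arange(6.0).reshape(3, 2)   # n = 3, p = 2

# With the Euclidean inner norm, (2-5) equals the Frobenius norm of x.
assert np.isclose(mixed_norm(x), np.linalg.norm(x, "fro"))
```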

Assumption 2.1 Each node's cost function f i is strongly convex with Lipschitz continuous gradient, that is,

〈 ∇ f i ( x ) − ∇ f i ( x ′ ) , x − x ′ 〉 ≥ μ ‖ x − x ′ ‖ 2 2 (2-6.a)

‖ ∇ f i ( x ) − ∇ f i ( x ′ ) ‖ 2 ≤ L ‖ x − x ′ ‖ 2 (2-6.b)

Under this assumption, the problem we study has a unique optimal solution.

We abstractly model the interaction topology between the nodes as a directed graph G = ( N , E ) consisting of the node set N and the ordered edge set E . If a message from node i reaches node j , i.e., ( i , j ) is a directed edge in E , then i is the parent node and j is the child node, and information is passed from parent to child. In graph G , a directed path is a sequence of consecutive edges such as ( i , j ) , ( j , k ) , ⋯ . A directed tree is a directed graph in which each vertex other than the root has exactly one parent. A spanning tree of a directed graph is a directed tree that contains all the vertices of the graph.

The algebraic form of the push-pull gradient method can be written as:

x k + 1 = R ( x k − a y k ) , (2-7.a)

y k + 1 = C y k + ∇ F ( x k + 1 ) − ∇ F ( x k ) , (2-7.b)

where a = d i a g { a 1 , a 2 , ⋯ , a n } is a non-negative diagonal matrix of step sizes, and R = [ R i j ] , C = [ C i j ] ∈ ℝ n ∗ n . The assumptions on R and C are given next.
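Before the formal assumptions, a minimal sketch of iteration (2-7) may help; the 3-node network, the scalar quadratic costs f i ( x ) = ( x − b i ) 2 / 2 , and the uniform step size are all assumptions for illustration:

```python
import numpy as np

# Assumed 3-node example: f_i(x) = 0.5 * (x - b_i)^2, global minimizer mean(b).
b = np.array([1.0, 2.0, 6.0])
grad = lambda x: x - b                  # stacked local gradients

R = np.array([[0.50, 0.25, 0.25],      # row-stochastic:   R 1 = 1
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])
C = R.copy()                            # symmetric, hence also column-stochastic

alpha = 0.1                             # uniform step size (an assumption;
x = np.zeros(3)                         # the text allows per-node steps a_i)
y = grad(x)                             # y_0 = grad F(x_0)

for _ in range(2000):
    g_old = grad(x)
    x = R @ (x - alpha * y)             # "pull": mix decision variables
    y = C @ y + grad(x) - g_old         # "push": track the average gradient

print(x)                                # each entry approaches mean(b) = 3.0
assert np.allclose(x, b.mean(), atol=1e-6)
```

Because 1 T C = 1 T , the invariant 1 T y k = 1 T ∇ F ( x k ) of (2-8) is preserved at every iteration, which is what forces the consensus value to be the true minimizer.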

Assumption 2.2 The matrix R ∈ ℝ n ∗ n is non-negative and row-stochastic, and C ∈ ℝ n ∗ n is non-negative and column-stochastic, that is, R 1 = 1 and 1 T C = 1 T . In addition, we assume that the diagonal entries of R and C are positive, that is, R i i > 0 and C i i > 0 for all i ∈ N .

By induction, using the column stochasticity of C ,

1 n 1 T y k = 1 n 1 T ∇ F ( x k ) , ∀ k (2-8)

This relation is essential: the variables y k track the average gradient 1 T ∇ F ( x k ) / n .

Now consider the graphs G R and G C T induced by the matrices R and C T , respectively. Note that G C T is the same as G C except that all edges are reversed.

Assumption 2.3 The graphs G R and G C T each contain at least one spanning tree. In addition, the two graphs share at least one common root, that is, R R ∩ R C T ≠ ∅ , where R R is the set of roots of all possible spanning trees in graph G R (and similarly for R C T ).

For the choice of step sizes, we assume that at least one node in R R ∩ R C T has a positive step size.

The above assumptions constrain the weights and step sizes and pave the way for the analysis; we now explain the algorithm from another angle.

To show the feasibility of the push-pull algorithm, we first write the optimality conditions as

x * ∈ n u l l { I − R } , (2-9.a)

1 T ∇ F ( x * ) = 0 (2-9.b)

where x * = 1 x * T satisfies the conditions introduced above. Now suppose the algorithm generates two sequences { x k } and { y k } converging to x ∞ and y ∞ , respectively. We then obtain

( I − R ) ( x ∞ − a y ∞ ) + a y ∞ = 0 , (2-10.a)

( I − R ) y ∞ = 0. (2-10.b)

If the column space of ( I − R ) intersects the set a ⋅ n u l l { I − C } only at the origin, then x ∞ ∈ n u l l { I − R } and a y ∞ = 0 . Therefore x ∞ satisfies the optimality condition x * ∈ n u l l { I − R } in (2-9.a). From (2-8), 1 T y ∞ = 1 T ∇ F ( x ∞ ) = 0 , which is exactly the optimality condition in (2-9.b).

We now give further intuition for the push-pull algorithm. From the above assumptions, the matrix powers converge linearly:

lim k → ∞ R k = 1 n 1 u T , lim k → ∞ C k = 1 n v 1 T (2-11)

Therefore, for relatively small step sizes, the above relations imply x k ≈ 1 u T x k / n and y k ≈ v 1 T ∇ F ( x k ) / n . That is, the entire network pulls the state information of the agents i ∈ R R , while the agents j ∈ R C T push and track the average gradient information. This form of “push” and “pull” of information gives our proposed algorithm its name. The condition R R ∩ R C T ≠ ∅ essentially says that at least one agent must push and pull at the same time.

The algorithm in (2-7) is similar in structure to the DIGing algorithm proposed in the literature.

Above we explained the rationality of this method mathematically; we now explain it conceptually as a push-pull algorithm and discuss its reliability. So far we have analyzed it on a static network, but many real-world networks are dynamic or even unreliable, so the scope of the discussion needs to be expanded. The original algorithm derives from earlier work in the literature.

To keep the specified part of the network converging, a relatively effective method is to let the receiver perform the scaling and combining. When the network environment changes, it is difficult for a sender at the bottom layer to know about the change and adjust its weights accordingly. We can continue to use the push protocol and let the surrounding nodes keep sending messages to a node, but it is then hard to determine whether that node is still alive, because we do not know whether it will not or cannot respond. We can only “subjectively” judge whether a node or agent is dead, mainly because full synchronization is impossible: if a node does not respond within a certain period, we consider it dead until it answers again. In fact, a pull communication protocol can also be used, allowing agents to pull information from neighboring nodes for effective coordination and synchronization.

To sum up, for a general implementation of the algorithm the push protocol is indispensable; using the pull protocol on top of it can improve network efficiency, but the network cannot operate with the pull protocol alone.

We now show how the proposed algorithm unifies different types of distributed architectures to a certain extent. For a completely decentralized case, for example an undirected connected graph G , we can set G R = G C = G and let R = C be a symmetric matrix. In this case, the algorithm can be regarded as a special case of methods studied in the literature.

The algorithm can also accommodate centralized or semi-centralized networks, as the following example illustrates. Consider a four-node star network consisting of nodes { 1 , 2 , 3 , 4 } . Node 1 is located at the center, and nodes 2, 3, and 4 are connected to node 1. In this case, the matrices R and C can be set to

R = [ 1 0 0 0 ; 0.5 0.5 0 0 ; 0.5 0 0.5 0 ; 0.5 0 0 0.5 ] , C = [ 1 0.5 0.5 0.5 ; 0 0.5 0 0 ; 0 0 0.5 0 ; 0 0 0 0.5 ]

As an illustration, node 1 pushes its decision variable to nodes 2, 3, and 4 through G R , while it collects (pulls) information about y i , k through G C ; the other nodes can only passively comply with requests from node 1. This intuitively shows where the name of the push-pull algorithm comes from. Although nodes 2, 3, and 4 update their y variables, these values need not participate in the optimization process: owing to the weights in the last three rows of C , their contribution vanishes geometrically fast. In this case we can naturally set the local step sizes of nodes 2, 3, and 4 to 0. More generally, if we assume f 1 ( x ) = 0 for all x , then minimizing ∑ i = 2 4 f i ( x ) becomes a centralized algorithm in which the master node 1 performs a gradient method using gradients computed in a distributed way by nodes 2, 3, and 4.
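The claimed structure of these weight matrices, and the limiting behavior (2-11) they induce, can be verified numerically:

```python
import numpy as np

# Star-network weights: R is row-stochastic, C is column-stochastic.
R = np.array([[1., 0., 0., 0.],
              [.5, .5, 0., 0.],
              [.5, 0., .5, 0.],
              [.5, 0., 0., .5]])
C = np.array([[1., .5, .5, .5],
              [0., .5, 0., 0.],
              [0., 0., .5, 0.],
              [0., 0., 0., .5]])

assert np.allclose(R.sum(axis=1), 1)   # R 1 = 1
assert np.allclose(C.sum(axis=0), 1)   # 1^T C = 1^T

Rk = np.linalg.matrix_power(R, 60)
Ck = np.linalg.matrix_power(C, 60)
# Mass concentrates on the center node: R^k -> 1 u^T / n with
# u = (4, 0, 0, 0)^T, and C^k -> v 1^T / n with v = (4, 0, 0, 0)^T,
# matching (2-11) for this semi-centralized topology.
assert np.allclose(Rk, np.outer(np.ones(4), [1, 0, 0, 0]), atol=1e-12)
assert np.allclose(Ck, np.outer([1, 0, 0, 0], np.ones(4)), atol=1e-12)
```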

The above example is better described as semi-centralized. In R and C , node 1 can be replaced by a strongly connected subnet, and nodes 2, 3, and 4 can likewise be replaced by subnets, as long as the information of these sub-nodes can still reach the roots in G R . Using the concept of a rooted tree, the connectivity requirement on the subnet in G C can be understood similarly. A subnet playing the role of node 1 is called the leader, and the other nodes are called followers. We emphasize that a node may be replaced by a subnet, but after the replacement each subnet is internally decentralized, while the relationship between the leader subnet and the followers remains hierarchical; this is what we call a semi-centralized architecture.

In this section, we study the convergence of the algorithm. First, define x ¯ k = 1 n u T x k and y ¯ k = 1 n 1 T y k . Our approach is to bound the three quantities ‖ x ¯ k + 1 − x ⋆ ‖ , ‖ x k + 1 − 1 x ¯ k + 1 ‖ R , and ‖ y k + 1 − v y ¯ k + 1 ‖ C , where ‖ ⋅ ‖ R and ‖ ⋅ ‖ C are specific norms defined later. On this basis, a linear system of inequalities can be established.

Algorithm analysis. According to (2-7), we have

x ¯ k + 1 = 1 n u T R ( x k − α y k ) = x ¯ k − 1 n u T α y k (2-12)

y ¯ k + 1 = 1 n 1 T ( C y k + ∇ F ( x k + 1 ) − ∇ F ( x k ) ) = y ¯ k + 1 n 1 T ( ∇ F ( x k + 1 ) − ∇ F ( x k ) ) (2-13)

Let us further define g k = 1 n 1 T ∇ F ( 1 x ¯ k ) and

a ′ = 1 n u T α v (2-14)

From Assumptions 2.2 and 2.3 and the step-size condition, we get a ′ > 0 .

Then we can get

x ¯ k + 1 = x ¯ k − 1 n u T α ( y k − v y ¯ k + v y ¯ k ) = x ¯ k − a ′ g k − a ′ ( y ¯ k − g k ) − 1 n u T α ( y k − v y ¯ k ) (2-15)

According to the above definition we can get

x k + 1 − 1 x ¯ k + 1 = R ( x k − α y k ) − 1 x ¯ k + 1 n 1 u T α y k = R ( x k − 1 x ¯ k ) − ( R − 1 n 1 u T ) α y k = ( R − 1 n 1 u T ) ( x k − 1 x ¯ k ) − ( R − 1 n 1 u T ) α y k (2-16)

Similarly available

y k + 1 − v y ¯ k + 1 = ( C − 1 n v 1 T ) ( y k − v y ¯ k ) + ( I − 1 n v 1 T ) ( ∇ F ( x k + 1 ) − ∇ F ( x k ) ) (2-17)

Bounding ‖ x ¯ k + 1 − x ⋆ ‖ , ‖ x k + 1 − 1 x ¯ k + 1 ‖ R , and ‖ y k + 1 − v y ¯ k + 1 ‖ C via the corresponding lemmas, we build a linear system of inequalities

[ ‖ x ¯ k + 1 − x ⋆ ‖ ; ‖ x k + 1 − 1 x ¯ k + 1 ‖ R ; ‖ y k + 1 − v y ¯ k + 1 ‖ C ] ≤ A [ ‖ x ¯ k − x ⋆ ‖ ; ‖ x k − 1 x ¯ k ‖ R ; ‖ y k − v y ¯ k ‖ C ]

where the inequality holds componentwise, and the entries of the matrix A = [ a i j ] are

[ a 11 ; a 21 ; a 31 ] = [ 1 − a ′ μ ; a ^ σ R ‖ v ‖ R L ; a ^ c 0 δ C , 2 ‖ R ‖ 2 ‖ v ‖ 2 L 2 ]

[ a 12 ; a 22 ; a 32 ] = [ a ^ L / n ; σ R ( 1 + a ^ ‖ v ‖ R L / n ) ; c 0 δ C , 2 L ( ‖ R − I ‖ 2 + a ^ ‖ R ‖ 2 ‖ v ‖ 2 L / n ) ]

[ a 13 ; a 23 ; a 33 ] = [ a ^ ‖ u ‖ 2 / n ; a ^ c R δ R , C ; σ C + a ^ c 0 δ C , 2 ‖ R ‖ 2 L 2 ]

where a ^ = max i a i and c 0 = ‖ I − 1 n v 1 T ‖ C .

According to this linear system of inequalities, when the spectral radius satisfies ρ ( A ) < 1 , the quantities ‖ x ¯ k − x ⋆ ‖ , ‖ x k − 1 x ¯ k ‖ R , and ‖ y k − v y ¯ k ‖ C converge to 0 at the linear rate O ( ρ ( A ) k ) . It remains to show that ρ ( A ) < 1 .

Given a nonnegative irreducible matrix M = [ m i j ] ∈ ℝ 3 ∗ 3 with diagonal entries m i i < λ * for i = 1 , 2 , 3 , the condition ρ ( M ) < λ * holds if and only if det ( λ * I − M ) > 0 .

We now give convergence results for the proposed algorithm.

Suppose that in algorithm (2-7) the step sizes satisfy a ′ ≥ M a ^ for some M > 0 , and that

a ^ ≤ min { 2 c 3 / ( c 2 + √ ( c 2 2 + 4 c 1 c 3 ) ) , ( 1 − σ C ) / ( 2 σ C δ C , 2 ‖ R ‖ 2 L ) } (2-18)

where c 1 , c 2 , c 3 are given later. Then the spectral radius satisfies ρ ( A ) < 1 , and ‖ x ¯ k − x ⋆ ‖ , ‖ x k − 1 x ¯ k ‖ R , ‖ y k − v y ¯ k ‖ C converge to 0 at the linear rate O ( ρ ( A ) k ) .

Proof. According to the above lemma, it suffices to guarantee that a 11 , a 22 , a 33 < 1 and

det ( I − A ) = ( 1 − a 11 ) ( 1 − a 22 ) ( 1 − a 33 ) − a 12 a 23 a 31 − a 13 a 21 a 32 − ( 1 − a 22 ) a 13 a 31 − ( 1 − a 11 ) a 23 a 32 − ( 1 − a 33 ) a 12 a 21 = ( 1 − a 11 ) ( 1 − a 22 ) ( 1 − a 33 ) − a ′ a ^ 2 σ R c 0 δ R , C δ C , 2 ‖ R ‖ 2 ‖ v ‖ 2 L 3 / n − a ^ 2 σ R c 0 δ C , 2 ‖ u ‖ 2 ‖ v ‖ R ( ‖ R − I ‖ 2 + a ^ ‖ R ‖ 2 ‖ v ‖ 2 L / n ) L 2 / n − a ^ 2 c 0 δ C , 2 ‖ R ‖ 2 ‖ v ‖ 2 ‖ u ‖ 2 L 2 / n ( 1 − a 22 ) − a ^ σ R c 0 δ R , C δ C , 2 L ( ‖ R − I ‖ 2 + a ^ ‖ R ‖ 2 ‖ v ‖ 2 L / n ) ( 1 − a 11 ) − a ′ a ^ σ R ‖ v ‖ 2 L 2 / n ( 1 − a 33 ) > 0 (2-19)

It remains to show that a 11 , a 22 , a 33 < 1 and that the above inequality holds.

First, a 11 < 1 , 1 − a 22 ≥ ( 1 − σ R ) / 2 and 1 − a 33 ≥ ( 1 − σ C ) / 2 are guaranteed by choosing a ′ ≤ 2 / ( μ + L ) and

a ^ ≤ min { ( 1 − σ R ) n / ( 2 σ R ‖ v ‖ R L ) , ( 1 − σ C ) / ( 2 c 0 δ C , 2 ‖ R ‖ 2 L ) } (2-20)

Secondly, a sufficient condition for det ( I − A ) > 0 is obtained by replacing the factors ( 1 − a 22 ) and ( 1 − a 33 ) with their lower bounds ( 1 − σ R ) / 2 and ( 1 − σ C ) / 2 (and treating the remaining factors similarly), and setting a ′ = M a ^ . This yields c 1 a ^ 2 + c 2 a ^ − c 3 < 0 .

The constants c 1 , c 2 , c 3 are

c 1 = M σ R c 0 δ C , 2 δ R , C ‖ R ‖ 2 ‖ v ‖ 2 L 3 / n + M μ σ R c 0 δ C , 2 δ R , C ‖ R ‖ 2 ‖ v ‖ 2 L 2 / n + σ R c 0 δ C , 2 ‖ R ‖ 2 ‖ v ‖ R ‖ u ‖ 2 L 3 / ( n √ n ) = σ R c 0 δ C , 2 ‖ R ‖ 2 ‖ v ‖ 2 L 2 / ( n √ n ) [ M δ R , C √ n ( L + μ ) + ‖ v ‖ R ‖ u ‖ 2 L ] (2-21)

and

c 2 = σ R c 0 δ C , 2 ‖ R ‖ 2 ‖ v ‖ R ‖ u ‖ 2 ‖ R − I ‖ 2 L 2 / n + c 0 δ C , 2 ‖ R ‖ 2 ‖ v ‖ R ‖ u ‖ 2 ( 1 − σ C ) L 2 / ( 2 n ) + M σ R c 0 δ C , 2 δ R , C μ ‖ R − I ‖ 2 L + M σ R ( 1 − σ C ) ‖ v ‖ R L / ( 2 n ) (2-22)

and finally

c 3 = M ( 1 − σ C ) ( 1 − σ R ) μ / 4 (2-23)

As discussed above

a ^ ≤ 2 c 3 / ( c 2 + √ ( c 2 2 + 4 c 1 c 3 ) ) (2-24)

From this we obtain the final bound on a ^ .

Now we introduce the optimization algorithm for finite-time convergence. Compared with most existing distributed convex optimization algorithms that can only achieve exponential convergence, this algorithm can achieve finite-time convergence. The convergence of the algorithm can be guaranteed by Lyapunov’s finite-time stability theory.

Consider a network with N nodes. Each node on the network has its own cost function f i : ℝ n → ℝ , which is strictly convex. All nodes cooperate to obtain the optimal value of the target cost function. To aid the design of the algorithm, we make the following assumptions:

Assumption 3.1: The upper-layer communication topology network is undirected and connected.

Assumption 3.2: For each node i ∈ V of the network, its cost function f i is twice continuously differentiable and strongly convex with convexity parameter θ i > 0 , and the Hessian matrix ∇ 2 f i satisfies a local Lipschitz condition.

From this we get

{ x ˙ i ( t ) = γ ( ∇ 2 f i ( x i ( t ) ) ) − 1 ∑ j ∈ N i a i j S i g | x j ( t ) − x i ( t ) | a , x i ( 0 ) = x i * , ∀ i ∈ V (3-1)

where x i ∈ ℝ n represents the state of node i ; γ ∈ ℝ + is a gain constant that can be used to improve the convergence speed of the algorithm; N i = { j ∈ V : ( i , j ) ∈ E ( G ) } denotes the set of all neighbor nodes of node i ; a i j is an element of the adjacency matrix A ; and 0 < a < 1 .

Here x i * is the minimizer of the cost function f i .

Note 3.1: Algorithm (3-1) is inspired by the continuous-time zero-gradient-sum algorithms in the literature. Along the trajectories of (3-1),

d d t ∑ i ∈ V ∇ f i ( x i ( t ) ) = ∑ i ∈ V ∇ 2 f i ( x i ( t ) ) x ˙ i ( t ) = γ ∑ i ∈ V ∑ j ∈ N i a i j S i g | x j ( t ) − x i ( t ) | a = 0 , where the last sum vanishes because the graph is undirected and S i g is odd.

From the initialization in (3-1), ∇ f i ( x i ( 0 ) ) = ∇ f i ( x i * ) = 0 , so ∑ i ∈ V ∇ f i ( x i ( 0 ) ) = 0 , and therefore the gradients satisfy

∑ i ∈ V ∇ f i ( x i ( t ) ) = 0 , ∀ t ≥ 0

The term ∑ j ∈ N i a i j S i g | x j ( t ) − x i ( t ) | a ensures that the algorithm achieves finite-time consensus, that is, there exist a convergence time T and a consensus state x ˜ such that for every i ∈ V , lim t → T x i ( t ) = x ˜ and ∑ i ∈ V ∇ f i ( x ˜ ) = ∇ F ( x ˜ ) = 0 . From Assumption 3.2, F is strongly convex and has a unique minimizer x * satisfying ∇ F ( x * ) = ∑ i ∈ V ∇ f i ( x * ) = 0 . The above analysis therefore shows x ˜ = x * , i.e., the algorithm solves the problem we posed. Note that when a = 1 the algorithm achieves only asymptotic convergence.

Theorem 3.1: Under Assumptions 3.1 and 3.2, the proposed algorithm solves the target problem in finite time, with convergence time bounded by T ; that is, lim t → T x ( t ) = x * , where T satisfies:

T ≤ 4 V ( 1 − a ) / 2 ( x ( t 0 ) ) / ( γ ( 1 − a ) ( 4 λ 2 / Θ ) ( 1 + a ) / 2 ) (3-2)

Among them x ( t ) = ( x 1 ( t ) T , x 2 ( t ) T , ⋯ , x N ( t ) T ) T ∈ ℝ n N ;

x * = ( x * T , x * T , ⋯ , x * T ) T ; x ( t 0 ) = ( x 1 * T , x 2 * T , ⋯ , x N * T ) T ; V ( x ) is a continuous positive definite Lyapunov function; λ 2 is the algebraic connectivity of the topology graph; and Θ > 0 is a constant related to the f i ( i ∈ V ) .

Proof: We prove Theorem 3.1 by the Lyapunov method. First, take

V ( x ( t ) ) = ∑ i ∈ V [ f i ( x * ) − f i ( x i ( t ) ) − ∇ f i ( x i ( t ) ) T ( x * − x i ( t ) ) ] (3-3)

This function is taken from the existing literature.

Next, for convenience of derivation, we give the following definitions:

For i ∈ V , f i is a locally strongly convex function. From the above, each U i is a compact set. To exploit strong convexity we need a convex compact set, so let U = c o n v ( ∪ i ∈ V U i ) , where c o n v ( ⋅ ) denotes the convex hull. From Assumption 3.2, each U i ( i ∈ V ) is compact, U is a convex compact set, and for all t ≥ 0 and i ∈ V we have x * ∈ U i ⊂ U . Based on the convex compact set U , for every i ∈ V , Assumption 3.2 guarantees the existence of Θ i ≥ θ i satisfying

∑ i ∈ V Θ i 2 ‖ x * − x i ( t ) ‖ 2 ≥ V ( x ( t ) ) ≥ ∑ i ∈ V θ i 2 ‖ x * − x i ( t ) ‖ 2 , x ( t ) ∈ U (3-5)

From (3-5), V ( x ( t ) ) ≥ 0 for all x ( t ) ∈ U . Considering the derivative of V with respect to time, for all x ( t ) ∈ U the following relationship holds

V ˙ ( x ( t ) ) = − γ 2 ∑ i ∈ V ∑ j ∈ N i ( x j ( t ) − x i ( t ) ) T φ i j (3-6)

where φ i j = a i j S i g | x j ( t ) − x i ( t ) | a . Since ( x j ( t ) − x i ( t ) ) T φ i j = a i j ‖ x j ( t ) − x i ( t ) ‖ a + 1 a + 1 ≥ 0 , we have V ˙ ≤ 0 , and V ˙ = 0 holds if and only if the states reach consensus, which together with the zero-gradient-sum property means x ( t ) = x * . Hence V can be used to prove Theorem 3.1.

In addition, d d t ∑ i ∈ V ∇ f i ( x i ( t ) ) = ∑ i ∈ V ∇ 2 f i ( x i ( t ) ) x ˙ i ( t ) = γ ∑ i ∈ V ∑ j ∈ N i φ i j = 0 ; combining this with the initial conditions gives ∑ i ∈ V ∇ f i ( x i ( t ) ) = 0 . Setting η ( t ) = 1 N ∑ i ∈ V x i ( t ) ∈ U , we have F ( x * ) ≤ F ( η ( t ) ) and obtain the following inequality

V ( x ( t ) ) ≤ ∑ i ∈ V [ f i ( η ( t ) ) − f i ( x i ( t ) ) − ∇ f i ( x i ( t ) ) T ( η ( t ) − x i ( t ) ) ] (3-7)

Using the constants Θ i introduced above, for x ( t ) ∈ U , (3-7) can be written as

V ( x ( t ) ) ≤ ∑ i ∈ V Θ i 2 ‖ x i ( t ) − 1 N ∑ j ∈ V x j ( t ) ‖ 2 ≤ Θ 2 N x ( t ) T ( L ( G ¯ ) ⊗ I n ) x ( t ) , (3-8)

where Θ = max { Θ i } i ∈ V and G ¯ is the complete graph on the vertices of G . Combining Cauchy's inequality and (3-6), we get

V ˙ ( x ( t ) ) ≤ − γ 2 ∑ i ∈ V ∑ j ∈ N i [ ‖ x j ( t ) − x i ( t ) ‖ 2 ] a + 1 2 ≤ − 2 a − 1 2 γ { 1 2 ∑ i ∈ V ∑ j ∈ N i ‖ x j ( t ) − x i ( t ) ‖ 2 } a + 1 2 = − 2 a − 1 2 γ ( x ( t ) T ( L ( G ) ⊗ I n ) x ( t ) ) a + 1 2 (3-9)

Due to λ 2 x ( t ) T ( L ( G ¯ ) ⊗ I n ) x ( t ) ≤ N x ( t ) T ( L ( G ) ⊗ I n ) x ( t ) , combining (3-8) we get:

V ˙ ( x ( t ) ) ≤ − γ 2 ( 4 λ 2 / Θ ) ( a + 1 ) / 2 V ( x ( t ) ) ( a + 1 ) / 2 (3-10)

Combined with the finite-time stability result stated earlier, the algorithm converges: there exists a time T such that lim t → T V ( x ( t ) ) = 0 and V ( x ( t ) ) ≡ 0 for t > T , that is, lim t → T x ( t ) = x * . In addition, the bound on T follows from (1-3): T ≤ 4 V ( 1 − a ) / 2 ( x ( t 0 ) ) / ( γ ( 1 − a ) ( 4 λ 2 / Θ ) ( a + 1 ) / 2 ) . The convergence speed is related to the algebraic connectivity λ 2 , the curvature constant Θ of the f i , and the parameters γ and a .

In this section, a simulation experiment demonstrates the effectiveness of the algorithm. We set up a 6-node network topology, as shown in the corresponding figure, with local cost functions

f i ( x ) = 1 8 ( x − i ) 6 + 3 4 ( x − i ) 2 , i = 1 , 2 , ⋯ , 6. (3-11)

It can be verified that the optimal value of each node satisfies x i * = i , i ∈ { 1 , 2 , ⋯ , 6 } . The optimal value of (1-1) is x * = 3.5 , with V ( x ( 0 ) ) = 130.168 . Using the convex compact set U from the proof, we get Θ 1 = 41.6538 , Θ 2 = 115.7049 , Θ 3 = 1041.3 , Θ 4 = 1041.3 , Θ 5 = 115.7049 , Θ 6 = 41.6538 , which means Θ = 1041.3 .

In the simulation, we use the parameter values λ 2 ( G 1 ) = 0.5858 , a = 0.5 , γ = 10 ; the simulation results are shown in the corresponding figure.
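A simulation sketch of (3-1) with the costs (3-11); since the text does not spell out the 6-node topology with λ 2 = 0.5858 , a ring graph is assumed here, which changes the time bound (3-2) but not the limit x * = 3.5 :

```python
import numpy as np

# Costs (3-11): f_i(x) = (x - i)^6 / 8 + 3 (x - i)^2 / 4, i = 1..6.
N, gamma, a, dt = 6, 10.0, 0.5, 1e-3
idx = np.arange(1, N + 1, dtype=float)
grad = lambda x: 0.75 * (x - idx) ** 5 + 1.5 * (x - idx)
hess = lambda x: 3.75 * (x - idx) ** 4 + 1.5
sig = lambda z: np.sign(z) * np.abs(z) ** a

A = np.zeros((N, N))                    # assumed ring topology
for i in range(N):
    A[i, (i + 1) % N] = A[i, (i - 1) % N] = 1.0

x = idx.copy()                          # x_i(0) = x_i* = i (zero-gradient-sum start)
for _ in range(100_000):                # forward-Euler integration of (3-1)
    drive = (A * sig(x[None, :] - x[:, None])).sum(axis=1)
    x += dt * gamma * drive / hess(x)

print(x)                                # all states close to x* = 3.5
assert np.max(np.abs(x - 3.5)) < 0.05
assert abs(grad(x).sum()) < 0.5         # sum of gradients stays (near) zero
```

The second assertion checks the invariant ∑ i ∇ f i ( x i ( t ) ) = 0 , which the Euler discretization preserves only approximately.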

This chapter aims to extend the previously proposed algorithms to practical applications, such as common machine learning scenarios. Inspired by the previous algorithm, we design a fast-converging distributed cooperative learning (P-DCL) algorithm based on a linearly parameterized neural network in push-pull mode. First, a continuous-time P-DCL algorithm in push-pull gradient mode is given. Second, a convergence analysis based on the Lyapunov method is provided. Third, for practical use, the algorithm is discretized with the fourth-order Runge-Kutta (RK4) method. Fourth, simulations of the distributed ADMM algorithm and the push-pull-based P-DCL algorithm are given; the experiments show that our algorithm has higher learning ability and faster convergence speed. Finally, we relate the algorithm's convergence speed to its parameters; simulation results show that the convergence speed can be improved effectively by properly selecting some adjustable parameters.

To construct the algorithm systematically, the problem formulation is given first, followed by an analysis of the local cost functions and of the relationship between the global and local cost function solutions.

Consider a network with N nodes. Each node i ∈ V in the network contains M i samples, and its sample set can be expressed as D i = ∪ k = 1 M i { X i ( k ) , Y i ( k ) } , where { X i ( k ) , Y i ( k ) } represents the k-th sample on the i-th node. For each node, the local cost function can be expressed as:

E i l o c ( W i ) ≜ 1 2 ‖ Y i − S ( X i ) W i ‖ 2 2 + δ i 2 ‖ W i ‖ 2 2 (4-1)

where W i ∈ ℝ l × n is the learning weight of the i-th node, X i ∈ ℝ M i × m represents the samples of the i-th node, Y i and S ( X i ) W i represent the desired and actual outputs, respectively, and δ i is a non-negative constant. The optimal learning weight of node i then follows directly:

W i * = ( S ( X i ) T S ( X i ) + δ i I l ) − 1 S ( X i ) T Y i (4-2)
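The closed form (4-2) can be checked on random data: W i * must zero the gradient of the local cost (4-1). The shapes and values below are illustrative assumptions:

```python
import numpy as np

# Assumed shapes: M_i samples, l hidden outputs, n output dims.
rng = np.random.default_rng(1)
M_i, l, n, delta_i = 20, 5, 2, 0.1
S_Xi = rng.normal(size=(M_i, l))     # hidden-layer outputs S(X_i)
Y_i = rng.normal(size=(M_i, n))      # desired outputs

# (4-2): ridge-regression solution for the local cost (4-1).
W_star = np.linalg.solve(S_Xi.T @ S_Xi + delta_i * np.eye(l), S_Xi.T @ Y_i)

# The gradient of (4-1), S^T (S W - Y) + delta * W, vanishes at W_star.
grad = S_Xi.T @ (S_Xi @ W_star - Y_i) + delta_i * W_star
assert np.allclose(grad, 0, atol=1e-10)
```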

If the node samples satisfy ∑ i = 1 N M i = M and the regularization parameters satisfy ∑ i = 1 N δ i = K , then the global cost function (1-7) is equivalent to the sum of the local cost functions of the nodes:

E g l o b ( W ) = ∑ i = 1 N E i l o c ( W ) (4-3)

As mentioned earlier, many distributed algorithms for this problem achieve asymptotic convergence. What we need is a fast distributed optimization algorithm meeting the following requirement:

lim t → T W i ( t ) = W * , i ∈ { 1 , ⋯ , N } (4-4)

This shows that all nodes can converge to the optimal learning weight W * in a finite time T.

From the above analysis, the global cost function (1-7) can be written as:

{ min ∑ i = 1 N E i ( W i ) s .t . W i − W * = 0 , i = 1 , ⋯ , N (4-5)

This is often referred to as global consensus. Unlike the traditional multi-agent consensus problem, the consensus value here has no meaning by itself. Consensus has a long research history; the basic idea is that all nodes in the network eventually reach the same state through information exchange with their neighbors. From the learning perspective, an efficient learning algorithm is necessary, and for distributed cooperative learning algorithms the learning rate is an important performance index. In practice it is often necessary to reach a valid result within a certain time, which motivates us to design a fast consensus-based cooperative learning algorithm.

Here, based on the linearly parameterized neural network, a distributed strategy for the target problem is given. To construct the algorithm, the following assumption is stated first:

Assumption 4.1: The network topology G is undirected and connected.

Based on the previous analysis, the continuous-time distributed cooperative learning algorithm is given as:

{ W ˙ i ( t ) = ρ ( S ( X i ) T S ( X i ) + δ i I l ) − 1 ∑ j ∈ N i a i j S i g | W j ( t ) − W i ( t ) | β , W i ( t 0 ) = ( S ( X i ) T S ( X i ) + δ i I l ) − 1 S ( X i ) T Y i (4-6)

where ρ ∈ ℝ_+ is a constant used to adjust the convergence rate, 0 < β < 1, and a_{ij} is an element of the adjacency matrix A; sig|W_j(t) − W_i(t)|^β = sig(W_j(t) − W_i(t)) ⊙ |W_j(t) − W_i(t)|^β.
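A minimal sketch of the sig(·) ⊙ |·|^β operator and of the right-hand side of (4-6) for one node, assuming NumPy arrays for the weights and the hidden-layer outputs (the helper names are ours):

```python
import numpy as np

def sig_beta(v, beta):
    # Elementwise sig(v) ⊙ |v|^beta = sign(v) * |v|^beta.
    return np.sign(v) * np.abs(v) ** beta

def node_rhs(i, W, S_list, delta, adj, rho, beta):
    """Right-hand side of (4-6) for node i.

    W:      list of current weight matrices W_j(t), each of shape (l, n)
    S_list: hidden-layer output matrices S(X_j), each of shape (M_j, l)
    adj:    adjacency matrix A of the communication graph
    """
    l = W[i].shape[0]
    H = S_list[i].T @ S_list[i] + delta[i] * np.eye(l)  # S(X_i)^T S(X_i) + δ_i I_l
    coupling = sum(adj[i, j] * sig_beta(W[j] - W[i], beta)
                   for j in range(len(W)) if adj[i, j] != 0)
    return rho * np.linalg.solve(H, coupling)
```

When all weights agree, the coupling term vanishes, so only consensus states are equilibria of the dynamics.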

Let W̃(t) = [W_1(t)^T, W_2(t)^T, ⋯, W_N(t)^T]^T ∈ ℝ^{lN×n}, Q(X) = diag[S(X_1), S(X_2), ⋯, S(X_N)] ∈ ℝ^{M×lN}, and Δ = diag(δ_1 I_l, δ_2 I_l, ⋯, δ_N I_l). The algorithm can then be written in matrix form:

{ Ẇ̃(t) = ρ (Q(X)^T Q(X) + Δ)^{-1} sig(−(L ⊗ I_l) W̃(t)) ⊙ |(L ⊗ I_l) W̃(t)|^β, W̃(t_0) = (Q(X)^T Q(X) + Δ)^{-1} Q(X)^T Y (4-7)

Note 4.1: The above algorithms are inspired by [

Theorem 4.1: The algorithm (4-7) achieves the goal (4-4) in a finite time T, where T satisfies:

T ≤ 4 V^{(1−β)/2}(W̃(t_0)) / (ρ (1 − β) (4λ_2/Θ)^{(1+β)/2}) (4-8)

where V(W̃(t_0)) is a second-order continuous positive definite function evaluated at the initial state W̃(t_0) = [W_1^*; ⋯; W_N^*]; β is the constant in the algorithm (4-7); λ_2 is the algebraic connectivity of the network topology graph; Θ is a constant related to the cost functions of all nodes; and ρ is the gain constant in the algorithm.
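The bound (4-8) is straightforward to evaluate once V(W̃(t_0)), λ_2, and Θ are known; a small helper, reading the expression term by term:

```python
import numpy as np

def settling_time_bound(V0, rho, beta, lam2, Theta):
    # T <= 4 V0^{(1-beta)/2} / (rho (1-beta) (4 lam2/Theta)^{(1+beta)/2}),
    # cf. (4-8); V0 stands for V(W~(t0)).
    return (4.0 * V0 ** ((1.0 - beta) / 2.0)
            / (rho * (1.0 - beta) * (4.0 * lam2 / Theta) ** ((1.0 + beta) / 2.0)))
```

The bound shrinks as the gain ρ or the algebraic connectivity λ_2 grows, and blows up as β → 1, consistent with the β = 1 case (algorithm (4-19) below) being only asymptotically convergent.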

Proof: Based on the Lyapunov method, a rigorous proof of Theorem 4.1 is given next. Before the proof, some preparation is needed. First, select:

V(W̃(t)) = (1/2) ∑_{i∈V} (W* − W_i(t))^T ∇²E_i(W_i(t)) (W* − W_i(t)) = (1/2) ∑_{i∈V} (W* − W_i(t))^T (S(X_i)^T S(X_i) + δ_i I_l) (W* − W_i(t)) (4-9)

as a Lyapunov candidate function, V : ℝ^{lN×n} → ℝ. Since (S(X_i)^T S(X_i) + δ_i I_l) > 0, we get V(W̃(t)) > 0 for all W̃(t) ≠ W̃*; in other words, V(W̃(t)) in Formula (4-9) is positive definite. In addition,

V(W̃(t)) = ∑_{i∈V} [E_i(W*) − E_i(W_i(t)) − ∇E_i(W_i(t))^T (W* − W_i(t))] ≤ ∑_{i∈V} [E_i(η(t)) − E_i(W_i(t)) − ∇E_i(W_i(t))^T (η(t) − W_i(t))],

where η ( t ) = 1 N ∑ i ∈ V W i ( t ) , then:

V(W̃(t)) ≤ ∑_{i∈V} (Θ_i/2) ‖W_i(t) − (1/N) ∑_{j∈V} W_j(t)‖² = (Θ/(2N)) W̃(t)^T (L(G̃) ⊗ I_n) W̃(t), (4-10)

where Θ_i = λ_max(∇²E_i(W_i(t))), Θ = max{Θ_i}_{i∈V}, L(G) is the Laplacian matrix of G, and G̃ is the complete graph on the N nodes.

Next, by calculating the time derivative of V(W̃(t)) along (4-6), we can get

V̇(W̃(t)) = −∑_{i∈V} (W* − W_i(t))^T (S(X_i)^T S(X_i) + δ_i I_l) Ẇ_i(t) = −ρ W*^T ∑_{i∈V} ∑_{j∈N_i} φ_{ij} + ρ ∑_{i∈V} W_i(t)^T ∑_{j∈N_i} φ_{ij} (4-11)

where N_i = {j ∈ V : (i, j) ∈ ε(G)} denotes the neighbors of node i, and φ_{ij} = a_{ij} sig(W_j(t) − W_i(t)) ⊙ |W_j(t) − W_i(t)|^β. Since φ_{ij} = −φ_{ji}, it follows that

∑_{i∈V} ∑_{j∈N_i} φ_{ij} = 0 (4-12)

In addition, it can be concluded

ρ ∑_{i∈V} W_i(t)^T ∑_{j∈N_i} φ_{ij} = −(ρ/2) ∑_{i∈V} ∑_{j∈N_i} (W_j(t) − W_i(t))^T φ_{ij} ≤ −(ρ/2) ∑_{i∈V} ∑_{j∈N_i} ‖W_j(t) − W_i(t)‖^{β+1} (4-13)

Combining Formula (4-12) and Formula (4-13), Formula (4-11) can be written as

V̇(W̃(t)) ≤ −(ρ/2) ∑_{i∈V} ∑_{j∈N_i} ‖W_j(t) − W_i(t)‖^{β+1} = −(ρ/2) ∑_{i∈V} ∑_{j∈N_i} [‖W_j(t) − W_i(t)‖²]^{(β+1)/2} ≤ −2^{(β−1)/2} ρ {(1/2) ∑_{i∈V} ∑_{j∈N_i} ‖W_j(t) − W_i(t)‖²}^{(β+1)/2} = −2^{(β−1)/2} ρ (W̃(t)^T (L(G) ⊗ I_n) W̃(t))^{(β+1)/2} (4-14)

This indicates that V̇(W̃(t)) is negative. Since λ_2 W̃(t)^T (L(G̃) ⊗ I_n) W̃(t) ≤ N W̃(t)^T (L(G) ⊗ I_n) W̃(t), combining (4-10) with Formula (4-14) we obtain

V̇(W̃(t)) ≤ −(ρ/2) (4λ_2/Θ)^{(β+1)/2} V(W̃(t))^{(β+1)/2} (4-15)

It follows that the proposed algorithm (4-7) is finite-time stable, so there exists a time T with lim_{t→T} V(W̃(t)) = 0 and V(W̃(t)) ≡ 0 for t ≥ T, that is, lim_{t→T} W̃(t) = W̃*. Integrating Formula (4-15) then gives

T ≤ 4 V^{(1−β)/2}(W̃(t_0)) / (ρ (1 − β) (4λ_2/Θ)^{(β+1)/2}) (4-16)

Based on the above analysis, the algorithm proposed in this chapter can indeed find the optimal value of (1-7) in finite time.
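The monotone decrease of V established in the proof can be checked numerically; the sketch below runs a simple forward-Euler simulation of (4-6) on a small random three-node problem (all sizes and constants are illustrative) and records V along the trajectory:

```python
import numpy as np

# Forward-Euler simulation of (4-6) on a random 3-node problem, recording
# the Lyapunov function V of (4-9), which should be non-increasing.
rng = np.random.default_rng(3)
N, l, M = 3, 2, 8
S_list = [rng.standard_normal((M, l)) for _ in range(N)]
Y_list = [rng.standard_normal((M, 1)) for _ in range(N)]
delta = [0.2, 0.4, 0.6]
adj = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]])  # complete 3-node graph
H = [S.T @ S + d * np.eye(l) for S, d in zip(S_list, delta)]
W = [np.linalg.solve(H[i], S_list[i].T @ Y_list[i]) for i in range(N)]  # W_i(t0)
W_star = np.linalg.solve(sum(H), sum(S.T @ Y for S, Y in zip(S_list, Y_list)))

def V(W):
    # Lyapunov function (4-9) around the global minimizer W_star.
    return 0.5 * sum(((W_star - W[i]).T @ H[i] @ (W_star - W[i])).item()
                     for i in range(N))

rho, beta, h = 1.0, 0.5, 0.01
vals = [V(W)]
for _ in range(200):
    upd = [rho * np.linalg.solve(
               H[i],
               sum(np.sign(W[j] - W[i]) * np.abs(W[j] - W[i]) ** beta
                   for j in range(N) if adj[i, j]))
           for i in range(N)]
    W = [W[i] + h * upd[i] for i in range(N)]
    vals.append(V(W))
```

Starting from the zero-gradient-sum initialization W_i(t_0), the recorded values of V decay toward zero, mirroring the argument leading to (4-15).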

Based on the algorithm (4-6), this section gives its discrete form:

{ W_i(k+1) = W_i(k) + (h/6)(μ_{i1} + 2μ_{i2} + 2μ_{i3} + μ_{i4}), μ_{i1} = f_i(t(k), W_i(k), W_{N_i}(k)), μ_{i2} = f_i(t(k) + h/2, W_i(k) + (h/2)μ_{i1}, W_{N_i}(k) + (h/2)F_{N_i}^{(1)}(k)), μ_{i3} = f_i(t(k) + h/2, W_i(k) + (h/2)μ_{i2}, W_{N_i}(k) + (h/2)F_{N_i}^{(2)}(k)), μ_{i4} = f_i(t(k) + h, W_i(k) + h μ_{i3}, W_{N_i}(k) + h F_{N_i}^{(3)}(k)), W_i(0) = W_i^* (4-17)

where W_i(k) (i ∈ V) represents the k-th estimate of the i-th node with respect to W^*, h is the iteration step size, and W_{N_i} = (W_j)_{j∈N_i} ∈ ℝ^{l×n|N_i|}, where |·| denotes the cardinality of a set.

In addition, f_i(t, W_i(k), W_{N_i}(k)) = ρ (S(X_i)^T S(X_i) + δ_i I_l)^{-1} ∑_{j∈N_i} a_{ij} sig|W_j(k) − W_i(k)|^β, and F_{N_i}^{(m)}(k) = (μ_{jm})_{j∈N_i} ∈ ℝ^{l×n|N_i|}, m ∈ {1, 2, 3}.
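A compact sketch of one RK4 step (4-17), written synchronously for all nodes; here `f(t, i, W_eval)` stands for the f_i above, and each intermediate stage corresponds to one extra round of neighbor communication:

```python
import numpy as np

def rk4_step(k, W, h, f):
    """One synchronous RK4 step of (4-17) for all nodes.

    W is the list of current weights W_i(k); f(t, i, W_eval) evaluates
    f_i using node i's state and its neighbours' states from W_eval.
    Each intermediate stage requires one extra exchange of neighbour
    states, which is the added communication cost noted in Note 4.2."""
    t, N = k * h, len(W)
    mu1 = [f(t, i, W) for i in range(N)]
    W1 = [W[i] + h / 2 * mu1[i] for i in range(N)]
    mu2 = [f(t + h / 2, i, W1) for i in range(N)]
    W2 = [W[i] + h / 2 * mu2[i] for i in range(N)]
    mu3 = [f(t + h / 2, i, W2) for i in range(N)]
    W3 = [W[i] + h * mu3[i] for i in range(N)]
    mu4 = [f(t + h, i, W3) for i in range(N)]
    return [W[i] + h / 6 * (mu1[i] + 2 * mu2[i] + 2 * mu3[i] + mu4[i])
            for i in range(N)]
```

For a plain consensus dynamics f(t, i, W) = ∑_j (W_j − W_i), the node average is preserved exactly while the disagreement contracts, which is a convenient correctness check.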

Note 4.2: In order to obtain good control performance or to simplify the design process, a continuous-time system is usually discretized in modern industrial control design. In addition, effective discretization can not only reduce time and space costs but also improve the learning accuracy of the algorithm. Methods such as impulse invariance, pole-zero mapping, and hold-equivalence are commonly used to convert continuous-time systems into equivalent discrete systems. The Runge-Kutta (RK) family, with its high accuracy and good stability, is widely used. Therefore, we use the fourth-order Runge-Kutta method (RK4) to discretize the algorithm (4-6). However, for node i, this adds 4|N_i| communications per step; in other words, using the RK4 method increases the communication complexity of the computation.

In this section, we present two distributed cooperative learning algorithms to be compared with our algorithm (4-17). Specific comparison results can be found in the simulation section.

The first comparison algorithm achieves the global objective by communicating with all the remaining nodes at each iteration.

{ W_i(k+1) = ρ (S(X_i)^T S(X_i) + γ I_l)^{-1} [S(X_i)^T Y_i − t_i(k) + γ z(k)], W_i(0) = (S(X_i)^T S(X_i) + γ I_l)^{-1} S(X_i)^T Y_i (4-18)

where γ > 0 is a tuning parameter, z(k+1) = (N γ Ŵ + t̂)/(K + N γ), Ŵ = (1/N) ∑_{i∈V} W_i(k+1), t̂ = ∑_{i∈V} t_i(k), and t_i(k+1) = t_i(k) + γ (W_i(k+1) − z(k+1)). For a more detailed description of the algorithm (4-18), you can refer to [
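A sketch of one round of the updates (4-18) under the reconstruction above (the scaling in the cited reference may differ slightly, and `admm_round` is our name); note that the z- and t-updates need quantities from all nodes, which is why the scheme is not fully distributed:

```python
import numpy as np

def admm_round(W, t, z, S_list, Y_list, gamma, K):
    # One round of (4-18): local W-updates, then the global z- and
    # t-updates, which require averages over *all* nodes.
    N, l = len(W), W[0].shape[0]
    W_new = [np.linalg.solve(S_list[i].T @ S_list[i] + gamma * np.eye(l),
                             S_list[i].T @ Y_list[i] - t[i] + gamma * z)
             for i in range(N)]
    z_new = (gamma * sum(W_new) + sum(t)) / (K + N * gamma)
    t_new = [t[i] + gamma * (W_new[i] - z_new) for i in range(N)]
    return W_new, t_new, z_new
```

At a fixed point W_i = z, summing the stationarity conditions recovers (K I + ∑_i S(X_i)^T S(X_i)) z = ∑_i S(X_i)^T Y_i, i.e. the centralized solution.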

Note 4.3: The ADMM algorithm is in fact a constrained optimization algorithm, where the constraint is W_i = z, i = 1, ⋯, N. From the form above, the ADMM algorithm is not a fully distributed algorithm: each iteration requires information from all nodes rather than only from neighbors, so it is only suitable for fully connected undirected network topologies. It is known from [

Unlike the distributed ADMM algorithm, the second comparison algorithm needs only the information of neighboring nodes at each iteration.

{ W_i(k+1) = W_i(k) + ρ (S(X_i)^T S(X_i) + δ_i I_l)^{-1} [∑_{j∈N_i} a_{ij} (W_j(k) − W_i(k))], W_i(0) = (S(X_i)^T S(X_i) + δ_i I_l)^{-1} S(X_i)^T Y_i (4-19)

Lemma ( [

algorithm (4-19) can find the optimal value of the target cost function, where Ω_i = (S(X_i)^T S(X_i) + δ_i I_l).

Note 4.4: Like the algorithm (4-19), the algorithm proposed in this chapter is also built on the zero-gradient-sum property, which helps find the global optimum faster. In particular, when the parameter β = 1 in the algorithm (4-6), it reduces to the algorithm (4-19). In addition, the algorithms (4-19) and (4-6) are fully distributed and can be applied in connected distributed networks, but the algorithm (4-19) can only achieve asymptotic convergence.
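A sketch of one iteration of (4-19) (the function name is ours); a simple way to see the zero-gradient-sum property is that the weighted sum ∑_i (S(X_i)^T S(X_i) + δ_i I_l) W_i(k) is invariant across iterations when A is symmetric:

```python
import numpy as np

def zgs_step(W, S_list, delta, adj, rho):
    # One iteration of (4-19): a consensus move preconditioned by
    # (S(X_i)^T S(X_i) + delta_i I)^{-1}; with beta = 1 the continuous
    # algorithm (4-6) has the same structure.
    N, l = len(W), W[0].shape[0]
    out = []
    for i in range(N):
        H_i = S_list[i].T @ S_list[i] + delta[i] * np.eye(l)
        d = sum(adj[i, j] * (W[j] - W[i]) for j in range(N))
        out.append(W[i] + rho * np.linalg.solve(H_i, d))
    return out
```

The invariance holds because H_i (W_i(k+1) − W_i(k)) = ρ ∑_j a_{ij}(W_j − W_i), and these consensus terms cancel pairwise over a symmetric graph.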

In this section, we numerically verify our conclusions on real data sets under several different network settings. First, we give comparison results of the different algorithms with different parameters on different data sets. Four different network topologies are given and their algebraic connectivity λ_2 is calculated. Second, to simplify the calculation, each node is assigned the same number of training samples and the same adjustment constant δ_i. Finally, ρ_max is calculated from the lemma, and the corresponding simulation parameters are set, such as the number of hidden neurons l and the gain constants ρ, γ, and δ_i.

In order to better show the comparison results of the algorithms, the general form of the mean square error (MSE) is given. The MSE of the k-th iteration of the i-th node is defined as follows:

MSE_i(k) ≜ ‖Y_i − S(X_i) W_i(k)‖_2^2 (5-1)

In addition, the MSE of the entire network at the k-th iteration can be written as follows:

MSE_all(k) ≜ (1/N) ∑_{i=1}^N ‖Y_i − S(X_i) W_i(k)‖_2^2 (5-2)

By using the transformation MSE_all(k)[dB] = 10 log_10(MSE_all(k)) to magnify the error, the error curve of the iterative process can be shown more clearly.
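The network MSE (5-2) and its dB transformation combine into a few lines, taking the regressor matrix for node i to be S(X_i) (the function name is ours):

```python
import numpy as np

def mse_all_db(W, S_list, Y_list):
    # Network MSE (5-2) at one iteration, reported in dB as in the text.
    N = len(W)
    mse = sum(np.linalg.norm(Y_list[i] - S_list[i] @ W[i]) ** 2
              for i in range(N)) / N
    return 10.0 * np.log10(mse)
```

A unit residual on a single node gives 0 dB, and each factor-of-10 reduction in MSE lowers the curve by 10 dB.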

We choose f(x) = sin(x) as the objective function. The sample set {X, Y} consists of 10,000 samples generated randomly from the interval [−1, 1]. We take S(x) = {x^i}_{i=1,⋯,l} as our basis functions and choose a four-node network as the topology graph, with adjacency matrix A = [0, 1, 0, 1; 1, 0, 1, 0; 0, 1, 0, 1; 1, 0, 1, 0] and λ_2 = 2; each node is evenly assigned M_i = 2500 samples. The δ_i are drawn randomly from [0, 1]; in the experiment we obtained δ_1 = 0.3842, δ_2 = 0.7459, δ_3 = 0.9625, δ_4 = 0.0321, and K = 2.168. The distributed learning weights are [0.9813, 0.0002, −0.0813, −0.0003, −0.0734, −0.0002, −0.0196, 0.0001, 0.0078, 0.0001, 0.0156, 0, 0.0130]. Taking the difference shows that the distributed solution is very close to the centralized one. In particular, letting Θ_i = λ_max(∇²E_i(W_i(t))), we get Θ_1 = 1681.5, Θ_2 = 1812.8, Θ_3 = 1840.2, Θ_4 = 1742.9, so Θ = max{Θ_i} = 1840.2. Similarly, we get V(W̃(t_0)) = 0.0131; combining these with Theorem 4.1 gives T ≤ 31398 s. To show the convergence speed of the proposed algorithm more clearly, we randomly select one component of W to display. Its convergence speed can be seen in
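The setup just described can be reproduced in a few lines; the polynomial basis S(x) = (x, x², …, x^l), taking l = 13 to match the 13 weight components listed, and the computation of λ_2 from the stated adjacency matrix are sketched below:

```python
import numpy as np

# Sketch of the experimental setup: f(x) = sin(x), 10,000 samples on [-1, 1],
# polynomial basis S(x) = (x, x^2, ..., x^l) with l = 13, four-node ring.
rng = np.random.default_rng(0)
N, l, M = 4, 13, 10_000
X = rng.uniform(-1.0, 1.0, M)
Y = np.sin(X)
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]])
L = np.diag(A.sum(axis=1)) - A                   # graph Laplacian
lam2 = np.sort(np.linalg.eigvalsh(L))[1]         # algebraic connectivity
S = np.vander(X, l + 1, increasing=True)[:, 1:]  # columns x, x^2, ..., x^l
```

The second-smallest Laplacian eigenvalue comes out as λ_2 = 2, matching the value used in the text.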

Combined with Theorem 4.1, the relationship between the convergence speed and parameters of the algorithm will be given intuitively in this part.

In this paper, we study the distributed optimization problem over a network. We propose a new distributed method with push-pull finite-time convergence, in which each node keeps an estimate of the optimal decision variable and an estimate of the average gradient of the overall objective function. Information about the gradients is pushed to its neighbors, while information about the decision variables is pulled from its neighbors. This method uses two different graphs for information exchange between agents and is applicable to different types of distributed architectures, including decentralized, centralized, and semi-centralized architectures. Alongside this, we introduced a fast convergent distributed cooperative learning algorithm based on a linear parameterized neural network. Through a rigorous theoretical proof, the algorithm is shown to achieve finite-time convergence in continuous time. In the simulations, we investigated the influence of different parameter changes on the convergence speed, and also demonstrated the effectiveness of the algorithm compared with some typical algorithms. In future work, the proposed distributed cooperative learning algorithm can be extended and applied to large-scale distributed machine learning problems.

The authors declare no conflicts of interest regarding the publication of this paper.

Chen, X.B., Yan, K.X., Gao, Y., Xu, X.F., Yan, K. and Wang, J. (2020) Push-Pull Finite-Time Convergence Distributed Optimization Algorithm. American Journal of Computational Mathematics, 10, 118-146. https://doi.org/10.4236/ajcm.2020.101008