### Modeling Q/T {#sec-subsec_2_4_2_QT}
In this subsection, the goal is to explain how the transition properties are modeled. 
The transition properties are the two tensors $\boldsymbol Q$ and $\boldsymbol T$, which contain the transition probabilities from one centroid to another and the corresponding transition times, respectively.
For further details about the transition properties, the reader is referred to section [-@sec-sec_1_1_2_CNM].
Modeling $\boldsymbol Q / \boldsymbol T$ means finding surrogate models that capture the trained behavior and can predict the tensors for unseen model parameter values $\boldsymbol{\tilde{Q}}(\vec{\beta}_{unseen}) ,\, \boldsymbol{\tilde{T}}(\vec{\beta}_{unseen})$.
The data and workflow are depicted in figure @fig-fig_44.\newline

![Data and workflow of $\boldsymbol Q / \boldsymbol T$ modeling](../../3_Figs_Pyth/2_Task/2_Modeling/2_QT_Mod.svg){#fig-fig_44}

First, the tracked state data is loaded and converted into the data format that \gls{cnm} expects. After that, \gls{cnm} can be executed on the tracked state data.
The outcome of \gls{cnm} are the transition property tensors for all the provided model parameter values $\boldsymbol Q(\vec{\beta}) ,\, \boldsymbol T(\vec{\beta})$.
However, \gls{cnm} does not return tensors as *NumPy* [@harris2020array] arrays, but as *Python* dictionaries. 
Thus, the next step is to transform the dictionaries to *NumPy* arrays.
$\boldsymbol Q / \boldsymbol T$ are highly sparse, i.e., $85 \% - 99\%$ of the entries can be zero. 
The $99\%$ case is observed for a high model order, which for the Lorenz system @eq-eq_6_Lorenz was found to be $L \approx 7$.
Furthermore, with increasing $L$, saving the dictionaries as *NumPy* arrays becomes inefficient and at some $L$ impossible. With $L>7$, the storage cost exceeds several hundred gigabytes of RAM.
Therefore, the dictionaries are converted into sparse matrices. \newline 
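A minimal sketch of how such a dictionary could be converted into a sparse matrix is given below; the dictionary layout, centroid count, and variable names are assumptions for illustration and do not mirror \gls{cnmc}'s internal implementation.

```python
import numpy as np
from scipy import sparse

# Hypothetical example: Q as returned by CNM, keyed by the sequence of
# previously visited centroids, mapping to the next-centroid probabilities.
Q_dict = {
    (0, 1): {2: 0.7, 3: 0.3},
    (1, 2): {0: 1.0},
}

K = 4  # assumed number of centroids

# Build a sparse matrix: one row per centroid sequence, one column per
# destination centroid. Only the non-zero probabilities are stored.
rows, cols, vals = [], [], []
row_index = {seq: i for i, seq in enumerate(Q_dict)}
for seq, transitions in Q_dict.items():
    for dest, prob in transitions.items():
        rows.append(row_index[seq])
        cols.append(dest)
        vals.append(prob)

Q_sparse = sparse.csr_matrix((vals, (rows, cols)), shape=(len(Q_dict), K))
print(Q_sparse.toarray())
```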

Thereafter, the sparse matrices are reshaped or stacked into a single matrix, such that a modal decomposition method can be applied.
This is followed by training a regression model for each of the mode coefficients.
The idea is that the regression models receive a $\beta_{unseen}$ and return all the corresponding predicted mode coefficients. 
The regression models are saved and, if desired, plots can be enabled via *settings.py*. A sketch of the stacking step is given below. \newline 
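The following minimal sketch shows how the flattened transition tensors could be stacked column-wise into a single data matrix; the function name and the choice of stacking the flattened $\boldsymbol Q$ matrices as columns are assumptions for illustration.

```python
import numpy as np

def stack_transition_matrices(Q_matrices):
    """Flatten each sparse Q(beta) and stack them as columns of one matrix.

    Q_matrices: list of scipy sparse matrices, one per model parameter value.
    Returns a dense matrix A of shape (n_entries, n_beta) to which a modal
    decomposition (SVD or NMF) can be applied.
    """
    columns = [Q.toarray().ravel() for Q in Q_matrices]
    return np.column_stack(columns)

# Hypothetical usage with three trained parameter values:
# A = stack_transition_matrices([Q_beta_1, Q_beta_2, Q_beta_3])
```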

In this version of \gls{cnmc} two modal decomposition methods are available.
Namely, the \glsfirst{svd} and the \glsfirst{nmf}.
The difference between the two is outlined in [@Lee1999].
The \gls{svd} is stated in equation @eq-eq_26, where the variables are the input matrix $\boldsymbol A$, which shall be decomposed, the left singular matrix $\boldsymbol U$, the diagonal singular value matrix $\boldsymbol{\Sigma}$ and the right singular matrix $\boldsymbol V^T$.
The singular value matrix $\boldsymbol{\Sigma}$ is conventionally ordered in descending order such that the highest singular value is the first diagonal element. 
The intuition behind the singular values is that they assign importance to the modes in the left and right singular matrices $\boldsymbol U$ and $\boldsymbol {V^T}$, respectively. 
$$
\begin{equation}
    \boldsymbol A = \boldsymbol U \, \boldsymbol{\Sigma} \, \boldsymbol {V^T}
    \label{eq_26}
\end{equation}$$ {#eq-eq_26}
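To make the decomposition concrete, a minimal *NumPy* illustration of equation @eq-eq_26 applied to the stacked data matrix from the sketch above could look as follows; the matrix dimensions and the truncation rank are placeholder assumptions. The `full_matrices=False` flag corresponds to the economical \gls{svd} discussed below.

```python
import numpy as np

# A: stacked data matrix (flattened tensor entries x model parameter values),
# here filled with placeholder data for illustration.
A = np.random.rand(1000, 10)

# Economical SVD: only the non-trivial singular values and vectors are kept.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
assert np.all(np.diff(s) <= 0)  # singular values are returned in descending order

# Truncation to r modes requires no additional SVD; the factors are sliced.
r = 5
A_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]
```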


The big advantage of the \gls{svd} is observed when the so-called economical \gls{svd} is calculated. 
The economical \gls{svd} removes all zero singular values, thus the dimensionality of all 3 matrices can be reduced. 
However, with the economical \gls{svd} as a basis, the output for any number of modes $r$ is available.
There is no need to perform an additional \gls{svd} to obtain the output for $r$ modes; instead, the economical \gls{svd} is simply truncated at rank $r$.
\gls{nmf}, given in equation @eq-eq_5_NMF, on the other hand, has the disadvantage that there is no such thing as an economical \gls{nmf}. 
For every change in the number of modes $r$, a full \gls{nmf} must be recalculated.\newline

$$
\begin{equation}
    \boldsymbol {A_{i \mu}} \approx \boldsymbol A^{\prime}_{i \mu}  = (\boldsymbol W  \boldsymbol H)_{i \mu}  = \sum_{a = 1}^{r} 
    \boldsymbol W_{ia} \boldsymbol H_{a \mu}
    \tag{4.5}
\end{equation}
$$ 

The issue with \gls{nmf} is that the solution is obtained through an iterative optimization process. 
The number of iterations can be in the order of millions or higher before the convergence criteria are met.
Because the optimal $r_{opt}$ depends on the dynamical system, there is no general rule for acquiring it directly. 
Consequently, \gls{nmf} must be run with multiple different $r$ values to find $r_{opt}$. 
Apart from the mentioned parameter study, one single \gls{nmf} execution was found to be more computationally expensive than \gls{svd}. 
In [@Max2021] \gls{nmf} was found to be the performance bottleneck of *first CNMc*, which became more evident when $L$ was increased.
In subsection 
[-@sec-subsec_3_3_1_SVD_Speed]
a comparison between \gls{nmf} and \gls{svd} regarding computational time is given.\newline
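A minimal sketch of such a parameter study, here using *Scikit-learn*'s NMF implementation, is shown below; the candidate ranks, the error tolerance, and the data matrix are assumptions for illustration and not \gls{cnmc}'s actual settings.

```python
import numpy as np
from sklearn.decomposition import NMF

A = np.random.rand(1000, 10)  # placeholder non-negative data matrix

tol = 1e-3  # assumed relative reconstruction error tolerance
r_opt = None
for r in (2, 4, 6, 8):
    # Each candidate rank requires a complete, iterative refit.
    model = NMF(n_components=r, init="nndsvda", max_iter=5000)
    W = model.fit_transform(A)
    H = model.components_
    rel_error = np.linalg.norm(A - W @ H) / np.linalg.norm(A)
    if rel_error < tol:
        r_opt = r  # smallest rank that meets the tolerance
        break
```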


Nevertheless, if the user wants to apply \gls{nmf}, only one attribute in *settings.py* needs to be modified.
Because of that and the overall modular structure of \gls{cnmc}, implementation of any other decomposition method should be straightforward.
In \gls{cnmc} the study for finding $r_{opt}$ is automated and thus testing \gls{cnmc} on various dynamical systems with \gls{nmf} should be manageable. 
The benefit of applying \gls{nmf} is that the entries of the output matrices $\boldsymbol W_{ia},\, \boldsymbol H_{a \mu}$ are all non-negative.
This enables interpreting the $\boldsymbol W_{ia}$ matrix, since neither the $\boldsymbol Q$ nor the $\boldsymbol T$ tensor can contain negative entries, i.e., no negative probability and no negative transition time.\newline

Depending on whether  \gls{nmf} or \gls{svd} is chosen, $r_{opt}$ is found through a parameter study or based on $99 \%$ of the information content, respectively.
The $99 \%$ condition is met when the first $r_{opt}$ singular values add up to $99 \%$ of the total sum of all singular values. In \gls{svd}, $r_{opt}$ is automatically detected and does not require any new \gls{svd} execution. A comparison between \gls{svd} and \gls{nmf} regarding prediction quality is given in 
section
[-@sec-subsec_3_3_2_SVD_Quality].
After the decomposition has been performed, modes that capture characteristic information are available. 
If the modes can be predicted for any $\beta_{unseen}$, the predicted transition properties $\boldsymbol{\tilde{Q}}(\vec{\beta}_{unseen}) ,\, \boldsymbol{\tilde{T}}(\vec{\beta}_{unseen})$ are obtained. 
For this purpose, \gls{cnmc} has 3 built-in regression methods. 
Namely, **R**andom **F**orest (RF), AdaBoost, and Gaussian Process.\newline 

First, \gls{rf} is based on decision trees, but additionally deploys
a technique called bootstrap aggregation or bagging. 
Bagging creates multiple datasets from the original dataset, which are equivalent in size. 
However, some samples are duplicated in the new datasets, whereas others are left out entirely. 
This allows \gls{rf} to approximate very complex functions and reduces the risk of overfitting, which is commonly encountered with regular decision trees. 
Moreover, it is such a powerful tool that, e.g., Kilian Weinberger, a well-known professor of machine learning at Cornell University, considers \gls{rf} in one of his lectures to be one of the most powerful regression techniques that the state of the art has to offer.
Furthermore, \gls{rf} proved to be able to approximate the training data acceptably well, as shown in [@Max2021]. 
However, as mentioned in subsection [-@sec-subsec_1_1_3_first_CNMc], it faced difficulties in approximating spike-like curves. 
Therefore, it was desired to test alternatives as well.\newline

These two alternatives were chosen to be AdaBoost and Gaussian Process.
Both methods are well recognized and used in many machine learning applications.
Thus, instead of motivating them and giving theoretical explanations, the reader is referred to [@Adaboost], [@Rasmussen2004; @bishop2006pattern] for AdaBoost and Gaussian Process, respectively. 
As for the implementation, all 3 methods are invoked through *Scikit-learn* [@scikit-learn].
The weak learner for AdaBoost is *Scikit-learn's* default decision tree regressor.
The kernel utilized for the Gaussian Process is the **R**adial **B**asis **F**unction (RBF). A comparison of these 3 methods in terms of prediction capabilities is provided in section [-@sec-subsec_3_3_2_SVD_Quality].\newline
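A minimal sketch of how these three regressors could be set up with *Scikit-learn* to map model parameter values to a mode coefficient is given below; the training data, hyperparameters, and variable names are assumptions for illustration, not \gls{cnmc}'s actual configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Placeholder training data: model parameter values and one mode coefficient.
beta_train = np.linspace(28.0, 40.0, 12).reshape(-1, 1)
mode_coeff = np.sin(beta_train).ravel()

regressors = {
    "RF": RandomForestRegressor(),
    # AdaBoost with Scikit-learn's default decision tree as weak learner.
    "AdaBoost": AdaBoostRegressor(),
    # Gaussian Process with an RBF kernel.
    "GP": GaussianProcessRegressor(kernel=RBF()),
}

beta_unseen = np.array([[33.5]])
for name, reg in regressors.items():
    reg.fit(beta_train, mode_coeff)
    print(name, reg.predict(beta_unseen))
```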

Since the predicted $\boldsymbol{\tilde{Q}}(\vec{\beta}_{unseen}) ,\, \boldsymbol{\tilde{T}}(\vec{\beta}_{unseen})$ are based on regression techniques, their output will exhibit some deviation from the original $\boldsymbol{Q}(\vec{\beta}_{unseen}) ,\, \boldsymbol{T}(\vec{\beta}_{unseen})$. 
Due to that, the physical requirements given in equation @eq-eq_31_Q_T_prediction may be violated. \newline

$$
\begin{equation}
    \begin{aligned}
        0 \leq  \,  \boldsymbol Q \leq 1\\
        \boldsymbol T \geq 0 \\
        \boldsymbol Q \,(\boldsymbol T > 0) > 0 \\
        \boldsymbol T(\boldsymbol Q = 0) = 0 
    \end{aligned}
    \label{eq_31_Q_T_prediction}
\end{equation} $$ {#eq-eq_31_Q_T_prediction}

To manually enforce these physical constraints, the rules defined in equation @eq-eq_32_Rule are applied. 
The smallest allowed probability is defined to be 0, thus negative probabilities are set to zero. 
The largest allowed probability is 1, hence, overshooting values are set to 1.
Also, negative transition times would result in moving backward, therefore, they are set to zero. 
Furthermore, it is important to verify that a probability is zero if its corresponding transition time is less than or equal to zero.
In general, the deviation is in the order of $\mathcal{O}(1 \mathrm{e}{-2})$, such that the modification following equation @eq-eq_32_Rule can be considered reasonable. 
  \newline


$$ 
\begin{equation}
    \begin{aligned}
        & \boldsymbol Q < 0 := 0 \\
        & \boldsymbol Q > 1 := 1 \\
        & \boldsymbol T < 0 := 0\\
        & \boldsymbol Q \, (\boldsymbol T \leq 0) := 0
    \end{aligned}
    \label{eq_32_Rule}
\end{equation}$$ {#eq-eq_32_Rule}
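A minimal *NumPy* sketch of how the rules of equation @eq-eq_32_Rule could be enforced on dense predicted arrays is shown below; the function and variable names are assumptions for illustration.

```python
import numpy as np

def enforce_physical_constraints(Q_pred, T_pred):
    """Apply the rules of equation @eq-eq_32_Rule to predicted Q and T arrays."""
    # Clip probabilities into the physically valid range [0, 1].
    Q = np.clip(Q_pred, 0.0, 1.0)
    # Negative transition times would mean moving backward in time.
    T = np.maximum(T_pred, 0.0)
    # A transition with non-positive transition time gets zero probability.
    Q[T_pred <= 0.0] = 0.0
    return Q, T
```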

In conclusion, for modeling $\boldsymbol Q / \boldsymbol T$, \gls{cnmc} is equipped with two different modal decomposition methods, \gls{svd} and \gls{nmf}.
To choose between them, only one attribute in *settings.py* needs to be modified.
The application of \gls{nmf} is automated with the integrated parameter study.
For the mode surrogate models, 3 different regression methods are available.
Selecting between them is kept convenient, i.e., only one property in *settings.py* needs to be edited.