<!-- % ============================================================================== -->
<!-- % =========================== Clustering ======================================= -->
<!-- % ============================================================================== -->
## Clustering {#sec-sec_2_3_Clustering}
In this section, the second step, the clustering of all trajectories $F(\vec{\beta})$, is explained. 
The main idea is to represent $F(\vec{\beta})$ through a movement between centroids.
The data and workflow of the clustering are very similar to those of the previous step, the data generation, and are depicted in figure @fig-fig_12_Clust.
All settings for this step can be individually configured in *settings.py*.
The $F(\vec{\beta})$ and the cluster-algorithm specific parameters are filtered and provided to the clustering algorithm. The solutions are plotted and both the plots and the clustered output are saved.\newline 

<!-- % ============================================== -->
<!-- % ========== Clustering Workflow =============== -->
<!-- % ============================================== -->
![Data and workflow of the second step: Clustering](../../3_Figs_Pyth/2_Task/9_Clust.svg){#fig-fig_12_Clust}

Data clustering is an unsupervised machine learning technique. 
There is a variety of approaches that may be used for this, e.g., 
k-means, affinity propagation, mean shift, 
spectral clustering and Gaussian mixtures. The 
methods differ in their use cases, scalability, 
metric or deployed norm and required input parameters. The latter 
is an indicator of the customization abilities. Since k-means can handle very large
data sets and enables an easy and fast implementation, it is preferred. Furthermore, Arthur et al. 
[@Arthur2006] introduced k-means++, which is known 
to outperform k-means. Therefore, \gls{cnmc} uses k-means++ 
as its default method for data clustering. 
Note that applying k-means++ is not new to \gls{cnmc}; it was already applied in the regular \gls{cnm} [@Fernex2021].\newline

In order to 
cover the basics of k-means and k-means++, two terms 
should be understood.
Picture a box with 30 points in it, where 10 are located on the left, 10
in the middle and 10 on the right side of the box. Given such a 
constellation, it is appealing to create 3 groups, one for 
each overall position (left, center and right). Each group would 
contain 10 points. These groups are called clusters and the 
geometrical center of each cluster is called a centroid. 
A similar thought experiment is visually depicted in [@Sergei_Visual].
Considering a dynamical system, the trajectory is retrieved by integrating the \gls{ode} numerically at discrete time steps. 
For each time step, the obtained point is described by one x-, y- and z-coordinate. 
Applying the above-mentioned idea to, e.g., the Lorenz system [@lorenz1963deterministic], defined as the set of equations in @eq-eq_6_Lorenz, the resulting centroids can be seen in figure @fig-fig_13_Clust.
The full domains of the groups or clusters are color-coded in figure @fig-fig_14_Clust.\newline

::: {layout-ncol=2}

![Centroids of the Lorenz system  @eq-eq_6_Lorenz with $\beta =28$](../../3_Figs_Pyth/2_Task/10_Clust.svg){#fig-fig_13_Clust}

![Cluster domains of the Lorenz system @eq-eq_6_Lorenz with $\beta =28$](../../3_Figs_Pyth/2_Task/11_Clust.svg){#fig-fig_14_Clust}
:::
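
For illustration, the following minimal sketch integrates the Lorenz system and clusters the resulting trajectory, similar in spirit to figures @fig-fig_13_Clust and @fig-fig_14_Clust. It is not taken from \gls{cnmc}: the chosen constants, the assumption that the varied parameter $\beta = 28$ plays the role of the classical Rayleigh parameter, and the choice of 10 clusters are illustrative only.

```python
import numpy as np
from scipy.integrate import solve_ivp
from sklearn.cluster import KMeans

def lorenz(t, state, sigma=10.0, beta_param=28.0, b=8.0 / 3.0):
    """Lorenz system; beta_param = 28 is assumed to be the varied parameter."""
    x, y, z = state
    return [sigma * (y - x), x * (beta_param - z) - y, x * y - b * z]

# integrate the ODE numerically at discrete time steps to obtain the trajectory
t_eval = np.linspace(0.0, 100.0, 20000)
sol = solve_ivp(lorenz, (0.0, 100.0), [1.0, 1.0, 1.0], t_eval=t_eval)
trajectory = sol.y.T            # shape (n_time_steps, 3): x-, y- and z-coordinates

# cluster the trajectory into 10 centroids (representative characteristic locations)
kmeans = KMeans(n_clusters=10, init="k-means++", n_init=10).fit(trajectory)
print(kmeans.cluster_centers_)  # the 10 centroid positions in the phase space
```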


Theoretically, 
the points which are taken to calculate a center could be assigned 
weighting factors. However, this is not done in \gls{cnmc} and shall therefore only 
be outlined as a side note. Being familiar with the concept of 
clusters and centroids, the actual workflow of k-means shall be explained.
For initializing 
k-means, the number of clusters and an initial guess for the centroid 
positions must be provided. Next, the distance between all data 
points and the centroids is calculated. Each data point is assigned to the cluster whose 
centroid exhibits the smallest distance 
to the considered data point. 
Subsequently, the geometrical mean of the data points belonging to each cluster is determined and yields the 
new centroid positions, with which the assignment is 
performed again. \newline 

Calculating the mean of the clustered 
data points (the new centroids) and re-assigning each data point based on its 
distance to the centroids 
is done iteratively. The iterative process stops when 
the difference between the prior and the current 
centroid positions is equal to zero or 
satisfies a given threshold. Further explanations with pseudo-code and a 
visualization for better understanding can be found in [@Frochte2020] 
and [@Sergei_Visual], respectively.\newline
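
To make this iteration concrete, a minimal Lloyd-type loop is sketched below with NumPy. It is illustrative only and not the implementation used in \gls{cnmc}, which relies on *Scikit-learn* as described later; all names are chosen for this example.

```python
import numpy as np

def lloyd_kmeans(X, centroids, max_iter=300, tol=1e-6):
    """Minimal Lloyd iteration: X has shape (n_points, n_dims),
    centroids is an initial guess of shape (k, n_dims)."""
    for _ in range(max_iter):
        # distance of every data point to every centroid
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # assign each data point to its nearest centroid
        labels = dist.argmin(axis=1)
        # recompute each centroid as the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(len(centroids))
        ])
        # stop when the centroid movement satisfies the threshold
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return centroids, labels
```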

<!-- % ------------------------------------------------------------------------------ -->
<!-- % --------------------- PART 2 ------------------------------------------------- -->
<!-- % ------------------------------------------------------------------------------ -->
Mathematically, the k-means objective can be expressed 
as an optimization problem with the centroid 
positions $\boldsymbol{\mu}_j$ as the design variables. It is given in equation
@eq-eq_1_k_means (extracted from [@Frochte2020]), where 
$\boldsymbol{\mu}_j$ and $\mathrm{D}^{\prime}_j$ denote the centroid or 
mean of the *j*th cluster and the data points 
belonging to the *j*th cluster, respectively.
The distance between all the *j*th cluster data points 
and their corresponding *j*th centroid is 
stated as $\mathrm{dist}(\boldsymbol{x}_j, \boldsymbol{\mu}_j)$.

$$
\begin{equation}
    \label{eq_1_k_means}
    \underset{\boldsymbol{\mu}_j}{\mathrm{argmin}}\sum_{j=1}^k \; \sum_{\boldsymbol{x}_j \in \mathrm{D}^{\prime}_j }
    \mathrm{dist}(\boldsymbol{x}_j, \boldsymbol{\mu}_j)
\end{equation}
$$ {#eq-eq_1_k_means}

Usually, the k-means algorithm is deployed with the Euclidean metric, 
and equation @eq-eq_1_k_means becomes @eq-eq_2_k_Means_Ly, which
is known as the Lloyd algorithm [@Frochte2020; @Lloyd1982]. The
Lloyd algorithm can be understood as a minimization of the variance.
Note, however, that k-means is not in general equivalent to reducing 
the variance; this only holds when the Euclidean norm is used.
$$
\begin{equation}
    \label{eq_2_k_Means_Ly}
    \underset{\boldsymbol{\mu}_j}{\mathrm{argmin}}\sum_{j=1}^k \; \sum_{\boldsymbol{x}_j \in \mathrm{D}^{\prime}_j }
    \| \boldsymbol{x}_j - \boldsymbol{\mu}_j \|^2
\end{equation}
$$ {#eq-eq_2_k_Means_Ly}

The clustering algorithm highly depends on the provided 
initial centroid positions. Since in most cases these 
are guessed, there is no guarantee of a reliable outcome.
Sergei Vassilvitskii, one of the inventors of 
k-means++, says in one of his presentations [@Sergei_Black_Art]
that finding a good set of initial points is a black art.
Arthur et al. [@Arthur2006] state
that the speed and simplicity of k-means are appealing, not 
its accuracy, and that there are many natural examples for which 
the algorithm generates arbitrarily bad clusterings [@Arthur2006].\newline 


An alternative or improved version of k-means is the already
mentioned k-means++, which
only differs in the initialization step. Instead of providing 
initial positions for all centroids, only one centroid's 
position is supplied. The remaining ones are calculated based on 
maximal distances. Concretely, the distance between all 
data points and the already existing centroids is computed. The data point 
which exhibits the greatest distance is added to the 
list of collected centroids. This is repeated until all $k$ 
clusters are generated. A visual depiction of this process 
is given by Sergei Vassilvitskii in [@Sergei_Visual].
Since the outcome of k-means++ is more reliable than 
that of k-means, k-means++ is deployed in \gls{cnmc}.\newline
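
The seeding idea can be sketched as follows. The sketch implements the farthest-point description above and is for illustration only; the original k-means++ selects the next centroid randomly with a probability proportional to the squared distance, which conveys the same intuition.

```python
import numpy as np

def farthest_point_seeding(X, k, rng=np.random.default_rng(0)):
    """Pick k initial centroids: start from one random data point, then
    repeatedly add the data point farthest from all centroids chosen so far."""
    centroids = [X[rng.integers(len(X))]]
    while len(centroids) < k:
        # distance of every point to its nearest already chosen centroid
        dist = np.min(
            np.linalg.norm(X[:, None, :] - np.asarray(centroids)[None, :, :], axis=2),
            axis=1,
        )
        # the point with the greatest distance becomes the next centroid
        centroids.append(X[dist.argmax()])
    return np.asarray(centroids)
```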

After having discussed some basics of k-means++, it shall be 
elaborated how and why the solution of the dynamical system is 
clustered. The solution of any dynamical system returns a trajectory.
If the trajectory repeats itself or happens to come close 
to prior trajectories without actually touching them, 
characteristic sections can be found.
Each characteristic section in the phase space is 
captured by a centroid. The movement from one 
centroid to another is supposed to portray the original
trajectory. With a clustering algorithm, these representative 
characteristic locations in the phase space are obtained. 
Since the clusters shall capture an entire trajectory, it is 
evident that the number of clusters is an
essential parameter to choose. The latter fact becomes even 
more clear when recalling that a trajectory can be multi-modal or complex.\newline

In the case of a highly non-linear 
trajectory, it is obvious that many clusters are demanded in
order to minimize the loss of the original
trajectory. The projection of the real trajectory 
onto a cluster-based movement can be compared to 
a reduced-order model of the trajectory. In this context, 
it is plausible to refer to the centroids as 
representative characteristic locations. Furthermore, \gls{cnm} and thus \gls{cnmc} exploit graph theory. 
Therefore, the centroids can also be denoted as nodes or characteristic nodes.\newline

The remaining part of this section is devoted exclusively to the application in \gls{cnmc}. First, the leveraged k-means++ algorithm is taken from the machine learning *Python* library *Scikit-learn* [@scikit-learn]. 
Crucial settings, e.g., the number of clusters $K$, the maximal number of iterations, the tolerance as a convergence criterion and the number of different centroid seeds with which k-means is executed, can be adjusted in *settings.py*.
The operator can decide whether the clustering step shall be performed or skipped.
The path for the outputs can be modified and generating plots is also optional.
For the clustering stage, there are 4 types of plots available. 
Two of them are depicted in figures @fig-fig_13_Clust and @fig-fig_14_Clust.
With the generated HTML plots, the same features as described in section [-@sec-sec_2_2_Data_Gen] are available, e.g., receiving more information through pop-up panels and
switching between a dark and a white mode. 
 \newline 
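
A minimal sketch of how these settings could be passed to *Scikit-learn* is given below. The constant names standing in for the *settings.py* entries are hypothetical; only `KMeans` with its arguments `n_clusters`, `init`, `n_init`, `max_iter` and `tol` is an actual *Scikit-learn* interface.

```python
import numpy as np
from sklearn.cluster import KMeans

# hypothetical values standing in for the entries of settings.py
K = 10          # number of clusters
MAX_ITER = 300  # maximal number of iterations
TOL = 1e-4      # tolerance used as convergence criterion
N_INIT = 100    # number of different centroid seeds

# placeholder for an integrated trajectory of shape (n_time_steps, 3)
trajectory = np.random.default_rng(0).normal(size=(5000, 3))

kmeans = KMeans(n_clusters=K, init="k-means++", n_init=N_INIT,
                max_iter=MAX_ITER, tol=TOL)
labels = kmeans.fit_predict(trajectory)   # cluster affiliation of each time step
centroids = kmeans.cluster_centers_       # K representative characteristic locations
```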

The other two types of charts are not displayed here because they are intended to be studied as HTML graphs, where the output can be viewed from multiple angles.
The first type shows the clustered output of one system for two different $\beta$ next to each other. 
The centroids are labeled randomly, as will be shown in subsection [-@sec-subsec_2_2_1_Parameter_Study]. 
Consequently, the centroids representing immediate neighbors across the two separate $\beta$ carry different labels. 
In the second remaining type of HTML graph, the closest centroids across the two different $\beta$ are connected through lines.
Also, in the same way as it was done for the first step, the data generation, an additional HTML file containing the charts for all $\vec{\beta}$ is generated.
\newline

It can be concluded that the clustering step is performed by employing *Scikit-learn's* k-means++ implementation, which is well suited for a great number of points. As usual, all important settings can be controlled with *settings.py*.

### Parameter Study {#sec-subsec_2_2_1_Parameter_Study}
In this subsection, the effects of the parameter *n\_init* on the clustering step shall be examined. After that, the random labeling of the centroids is highlighted.
With the parameter *n\_init* it is possible to define how often the k-means algorithm is executed with different centroid seeds [@scikit-learn].
For a reliable clustering quality, *n\_init* should be chosen high. However, the drawback is that with increasing *n\_init* the calculation time grows substantially. If *n\_init* is chosen too high, the clustering part becomes the new bottleneck of the entire \gls{cnmc} pipeline. \newline

To conduct the parameter study, clustering was performed using the following *n\_init* values: 
$\textit{n\_init} = \{100,\, 200, \, 400,\, 800,\, 1000, \, 1200, \, 1500 \}$. 
Some results are presented in figures @fig-fig_15 to @fig-fig_20.
It can be observed that, when all the different *n\_init* values are compared, visually no big difference in the placing of the centroids can be perceived. 
A graphical examination is sufficient because even for *n\_init* values that differ only by one ($n_{init,1} - n_{init,2} = 1$), the centroid positions are still expected to vary slightly. 
Simply put, only deviations on a macroscopic scale, which can be noted visually, are searched for. As a conclusion, it can be said that $\textit{n\_init} = 100$ and $\textit{n\_init} = 1500$ yield equally valuable clustering outcomes. 
Hence, the computational expense can be reduced by deciding on a reasonably small value for *n\_init*.\newline
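
Such a study could be scripted along the following lines. The sketch is illustrative only: the data set is a placeholder, and quantifying the comparison via nearest-centroid distances is an assumption standing in for the visual examination described above.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist

# hypothetical data set standing in for the trajectory of figures 15 to 20
X = np.random.default_rng(1).normal(size=(20000, 3))

reference = None
for n_init in (100, 200, 400, 800, 1000, 1200, 1500):
    km = KMeans(n_clusters=10, init="k-means++", n_init=n_init).fit(X)
    centroids = km.cluster_centers_
    if reference is None:
        reference = centroids      # centroids obtained with n_init = 100
        continue
    # since labels are random, match each centroid to its nearest reference centroid
    shift = cdist(centroids, reference).min(axis=1).max()
    print(f"n_init = {n_init:5d}: worst-case distance to n_init = 100 centroids: {shift:.3e}")
```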

The second aim of this subsection was to highlight the fact that the centroids are labeled randomly. For this purpose, the depicted figures @fig-fig_15 to @fig-fig_20 shall be examined. Concretely, any of the referred figures can be compared with the remaining ones to be convinced that the labeling does not obey any evident rule.


<!-- % ============================================================================== -->
<!-- % ============================ PLTS ============================================ -->
<!-- % ============================================================================== -->
::: {layout-ncol=2}

![Lorenz @eq-eq_6_Lorenz, $\beta =28$, $\text{n\_init}= 100$](../../3_Figs_Pyth/2_Task/0_N_Study/0_K_100.svg){#fig-fig_15}

![Lorenz @eq-eq_6_Lorenz, $\beta =28$, $\text{n\_init}= 200$](../../3_Figs_Pyth/2_Task/0_N_Study/1_K_200.svg){#fig-fig_16}
:::

::: {layout-ncol=2}

![Lorenz @eq-eq_6_Lorenz, $\beta =28$, $\text{n\_init}= 400$](../../3_Figs_Pyth/2_Task/0_N_Study/2_K_400.svg){#fig-fig_17}

![Lorenz @eq-eq_6_Lorenz, $\beta =28$, $\text{n\_init}= 1000$](../../3_Figs_Pyth/2_Task/0_N_Study/3_K_1000.svg){#fig-fig_18}
:::

::: {layout-ncol=2}

![Lorenz @eq-eq_6_Lorenz, $\beta =28$, $\text{n\_init}= 1200$](../../3_Figs_Pyth/2_Task/0_N_Study/4_K_1200.svg){#fig-fig_19}

![Lorenz @eq-eq_6_Lorenz, $\beta =28$, $\text{n\_init}= 1500$](../../3_Figs_Pyth/2_Task/0_N_Study/5_K_1500.svg){#fig-fig_20}
::: 

<!-- % ============================================================================== -->
<!-- % ============================ PLTS ============================================ -->
<!-- % ============================================================================== -->