<!-- % ============================================================================== -->
<!-- % =========================== Clustering ======================================= -->
<!-- % ============================================================================== -->
## Clustering {#sec-sec_2_3_Clustering}
In this section, the second step, the clustering of all trajectories $(\vec{\beta})$, is explained.
The main idea is to represent $F(\vec{\beta})$ through movement between centroids.
The data and workflow of the clustering are very similar to those of the previous step, the data generation, and are summarized in figure @fig-fig_12_Clust .
All settings for this step can be configured individually in *settings.py*.
The $F(\vec{\beta})$ and the cluster-algorithm-specific parameters are filtered and provided to the clustering algorithm. The solutions are plotted, and both the plots and the clustered output are saved.\newline
<!-- % ============================================== -->
<!-- % ========== Clustering Workflow =============== -->
<!-- % ============================================== -->
![Data and workflow of the second step: Clustering](../../3_Figs_Pyth/2_Task/9_Clust.svg){#fig-fig_12_Clust}
<!--%-->
<!--%-->
Data clustering is an unsupervised machine learning technique.
There is a variety of approaches that may be used for this, e.g.,
k-means, affinity propagation, mean shift,
spectral clustering and Gaussian mixtures. The
methods differ in their use cases, scalability,
metric or deployed norm, and required input parameters. The latter
is an indicator of the customization abilities. Since k-means can be used for very large
data sets and enables an easy and fast implementation, it is preferred here. Furthermore, Arthur et al.
[@Arthur2006] introduced k-means++, which is known
to outperform k-means. Therefore, \gls{cnmc} uses k-means++
as its default method for data clustering.
Note that applying k-means++ is not new in \gls{cnmc}; it was already applied in the regular \gls{cnm} [@Fernex2021].\newline
In order to
cover the basics of k-means and k-means++, two terms
should be understood.
Picture a box with 30 points in it, where 10 are located on the left, 10
in the middle and 10 on the right side of the box. For such a
constellation, it is appealing to create 3 groups, one for
each overall position (left, center and right). Each group would
contain 10 points. These groups are called clusters, and the
geometrical center of each cluster is called a centroid.
A similar thought experiment is depicted visually in [@Sergei_Visual].
Considering a dynamical system, the trajectory is retrieved by integrating the \gls{ode} numerically at discrete time steps.
For each time step, the obtained point is described by one x-, y- and z-coordinate.
Applying the above-mentioned idea to, e.g., the Lorenz system [@lorenz1963deterministic], defined as the set of equations in @eq-eq_6_Lorenz, yields the centroids shown in figure @fig-fig_13_Clust .
The full domains of the groups or clusters are color-coded in figure @fig-fig_14_Clust .\newline
::: {layout-ncol=2}
![Centroids of the Lorenz system @eq-eq_6_Lorenz with $\beta =28$](../../3_Figs_Pyth/2_Task/10_Clust.svg){#fig-fig_13_Clust}
![Cluster domains of the Lorenz system @eq-eq_6_Lorenz with $\beta =28$](../../3_Figs_Pyth/2_Task/11_Clust.svg){#fig-fig_14_Clust}
:::
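To make the procedure tangible, the following minimal *Python* sketch integrates the Lorenz system and clusters the resulting trajectory with k-means++ via *Scikit-learn*. The parameter values, the time grid and the number of clusters are illustrative assumptions and do not reproduce the settings behind figures @fig-fig_13_Clust and @fig-fig_14_Clust .

```python
# Minimal, standalone sketch (not the CNMc implementation): integrate the
# Lorenz system and cluster the resulting trajectory with k-means++.
# All numerical values below are illustrative assumptions.
import numpy as np
from scipy.integrate import solve_ivp
from sklearn.cluster import KMeans


def lorenz(t, state, sigma=10.0, beta=28.0, b=8.0 / 3.0):
    """Lorenz equations; beta is used here as the varied model parameter."""
    x, y, z = state
    return [sigma * (y - x), x * (beta - z) - y, x * y - b * z]


# Integrate the ODE at discrete time steps to obtain the trajectory.
t_eval = np.linspace(0.0, 100.0, 20_000)
sol = solve_ivp(lorenz, (0.0, 100.0), [1.0, 1.0, 1.0], t_eval=t_eval)
trajectory = sol.y.T                     # shape (n_time_steps, 3): x-, y-, z-coordinates

# Cluster the snapshots: every time step is assigned to its closest centroid.
kmeans = KMeans(n_clusters=10, init="k-means++", n_init=10, random_state=0)
labels = kmeans.fit_predict(trajectory)  # cluster index for every time step
centroids = kmeans.cluster_centers_      # representative characteristic locations
```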
Theoretically,
the points which are taken to calculate a center could be assigned
weighting factors. However, this is not done in \gls{cnmc} and is therefore only
mentioned as a side note. Being familiar with the concept of
clusters and centroids, the actual workflow of k-means can now be explained.
For initializing
k-means, the number of clusters and an initial guess for the centroid
positions must be provided. Next, the distance between all the data
points and the centroids is calculated. Each data point is assigned to that cluster for which
the corresponding centroid exhibits the smallest distance
to the considered data point.
Subsequently, the geometrical center of each cluster is recomputed as the mean of the data points assigned to it. With the
new centroid positions, the clustering is
performed again. \newline
Calculating the means of the clustered
data points (the centroids) and re-assigning the data points based on their
distances to the centroids
is done iteratively. The iterative process stops when
the difference between the prior and the current
centroid positions is zero or
satisfies a given threshold. Further explanations with pseudo-code and
visualizations for better understanding can be found in [@Frochte2020]
and [@Sergei_Visual], respectively.\newline
<!-- % ------------------------------------------------------------------------------ -->
<!-- % --------------------- PART 2 ------------------------------------------------- -->
<!-- % ------------------------------------------------------------------------------ -->
Mathematically, the k-means objective can be expressed
as an optimization problem with the centroid
positions $\boldsymbol{\mu}_j$ as the design variables. It is given in equation
@eq-eq_1_k_means (extracted from [@Frochte2020]), where
$\boldsymbol{\mu}_j$ and $\mathrm{D}^{\prime}_j$ denote the centroid or
mean of the *j*th cluster and the data points
belonging to the *j*th cluster, respectively.
The distance between the *j*th cluster's data points
and the corresponding *j*th centroid is
stated as $\mathrm{dist}(\boldsymbol{x}_j, \boldsymbol{\mu}_j)$.
$$
\begin{equation}
\label{eq_1_k_means}
\underset{\boldsymbol{\mu}_j}{\mathrm{argmin}}\sum_{j=1}^k \; \sum_{\boldsymbol{x}_j \in \mathrm{D}^{\prime}_j }
\mathrm{dist}(\boldsymbol{x}_j, \boldsymbol{\mu}_j)
\end{equation}
$$ {#eq-eq_1_k_means}
Usually, the k-means algorithm is deployed with the Euclidean metric,
and equation @eq-eq_1_k_means becomes @eq-eq_2_k_Means_Ly, which
is known as the Lloyd algorithm [@Frochte2020; @Lloyd1982]. The
Lloyd algorithm can be understood as a minimization of the variance.
Note, however, that k-means is not in general equivalent to reducing
the variance; this only holds when the Euclidean norm is used.
$$
\begin{equation}
\label{eq_2_k_Means_Ly}
\underset{\boldsymbol{\mu}_j}{\mathrm{argmin}}\sum_{j=1}^k \; \sum_{\boldsymbol{x}_j \in \mathrm{D}^{\prime}_j }
\| \boldsymbol{x}_j - \boldsymbol{\mu}_j \|^2
\end{equation}
$$ {#eq-eq_2_k_Means_Ly}
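The Lloyd iteration of equation @eq-eq_2_k_Means_Ly can be sketched in a few lines of *NumPy*. The snippet below alternates the assignment and the mean-update step until the centroids stop moving; it is a didactic simplification (e.g., empty clusters are not handled) and not the implementation used in \gls{cnmc}.

```python
# Didactic sketch of the Lloyd algorithm with the Euclidean norm.
# Assumption: every cluster keeps at least one member (empty clusters
# would require special handling and are ignored here).
import numpy as np


def lloyd(data, centroids, tol=1e-6, max_iter=300):
    """Alternate assignment and mean update until the centroids converge."""
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the closest centroid.
        dist = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Update step: every centroid becomes the mean of its assigned points.
        new_centroids = np.array(
            [data[labels == j].mean(axis=0) for j in range(len(centroids))]
        )
        # Stop when the centroid positions barely change anymore.
        if np.linalg.norm(new_centroids - centroids) < tol:
            return new_centroids, labels
        centroids = new_centroids
    return centroids, labels
```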
The clustering algorithm depends strongly on the provided
initial centroid positions. Since in most cases these
are guessed, there is no guarantee of a reliable outcome.
Sergei Vassilvitskii, one of the authors of
k-means++, states in one of his presentations [@Sergei_Black_Art]
that finding a good set of initial points is a black art.
Arthur et al. [@Arthur2006] state
that the speed and simplicity of k-means are appealing, not
its accuracy. There are many natural examples for which
the algorithm generates arbitrarily bad clusterings [@Arthur2006].\newline
An alternative or improved version of k-means is the already
mentioned k-means++, which
differs only in the initialization step. Instead of providing
initial positions for all centroids, just one centroid's
position is supplied. The remaining centroids are chosen sequentially based on
the distances to the already selected ones. Concretely, for every
data point the distance to its nearest existing centroid is computed, and the next centroid
is drawn with a probability proportional to this squared distance, so that
far-away points are favored [@Arthur2006]. This is repeated until all $k$
centroids are generated. A visual depiction of this process
is given by Sergei Vassilvitskii in [@Sergei_Visual].
Since the outcome of k-means++ is more reliable than
that of k-means, k-means++ is deployed in \gls{cnmc}.\newline
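A possible sketch of this seeding strategy, following the $D^2$ sampling of [@Arthur2006], is shown below; the function name and the random-number handling are illustrative choices, not part of \gls{cnmc}.

```python
# Sketch of k-means++ seeding via D^2 sampling: each new centroid is drawn
# with a probability proportional to its squared distance to the nearest
# already chosen centroid.
import numpy as np


def kmeans_pp_init(data, k, seed=0):
    """Return k initial centroids chosen by D^2 sampling."""
    rng = np.random.default_rng(seed)
    centroids = [data[rng.integers(len(data))]]  # first centroid: a random data point
    while len(centroids) < k:
        # Squared distance of every point to its nearest already chosen centroid.
        d2 = ((data[:, None, :] - np.array(centroids)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # Far-away points are favored: sampling probability proportional to d2.
        centroids.append(data[rng.choice(len(data), p=d2 / d2.sum())])
    return np.array(centroids)
```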
After having discussed some basics of k-means++, it shall be
elaborated how and why the solution of the dynamical system is
clustered. The solution of any dynamical system is a trajectory.
If the trajectory repeats itself or happens to come close
to prior trajectories without actually touching them,
characteristic sections can be found.
Each characteristic section in the phase space is
captured by a centroid. The movement from one
centroid to another is supposed to portray the original
trajectory. With a clustering algorithm, these representative
characteristic locations in the phase space are obtained.
Since the clusters shall capture an entire trajectory, it is
evident that the number of clusters is an
essential parameter to choose. The latter fact becomes even
clearer when recalling that a trajectory can be multi-modal or complex.\newline
In the case of a highly non-linear
trajectory, it is obvious that many clusters are required in
order to minimize the loss of information about the original
trajectory. The projection of the real trajectory
onto a cluster-based movement can be compared to
a reduced-order model of the trajectory. In this context,
it is plausible to refer to the centroids as
representative characteristic locations. Furthermore, \gls{cnm}, and thus \gls{cnmc}, exploits graph theory.
Therefore, the centroids can also be denoted as nodes or characteristic nodes.\newline
The remaining part of this section is devoted exclusively to the application of \gls{cnmc}. First, the leveraged k-means++ algorithm is taken from the machine learning *Python* library *Scikit-learn* [@scikit-learn].
Crucial settings can be configured, e.g., the number of clusters $K$, the maximal number of iterations, the tolerance as a convergence criterion and the number of different centroid seeds with which k-means is executed; a sketch of the corresponding *Scikit-learn* interface is given below.
The operator can decide whether the clustering step shall be performed or skipped.
The path for outputs can be modified and generating plots is also optional.
For the clustering stage, four types of plots are available.
Two types of plots are depicted in figures @fig-fig_13_Clust and @fig-fig_14_Clust .
With the generated HTML plots, the same features as described in section [-@sec-sec_2_2_Data_Gen] are available, e.g., receiving more information through pop-up panels and
switching between a dark and a white mode.
\newline
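For orientation, the sketch below shows how the settings named above map onto the parameters of *Scikit-learn's* `KMeans` class. The shown values and the placeholder trajectory are assumptions and do not reproduce the defaults of *settings.py*.

```python
# Assumed mapping of the named settings onto Scikit-learn's KMeans parameters;
# the concrete values are placeholders, not the defaults of settings.py.
import numpy as np
from sklearn.cluster import KMeans

# Placeholder for the real trajectory data, e.g., the integrated Lorenz solution.
trajectory = np.random.default_rng(0).normal(size=(20_000, 3))

kmeans = KMeans(
    n_clusters=10,      # number of clusters K
    init="k-means++",   # k-means++ seeding
    max_iter=300,       # maximal number of iterations
    tol=1e-4,           # tolerance as the convergence criterion
    n_init=100,         # number of different centroid seeds
    random_state=0,
)
labels = kmeans.fit_predict(trajectory)   # cluster label for every snapshot
centroids = kmeans.cluster_centers_       # centroid positions to be saved and plotted
```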
The other two types of charts are not displayed here because they are intended to be studied as HTML graphs, where the output can be viewed from multiple angles.
The first type shows the clustered output of one system for two different $\beta$ next to each other.
The centroids are labeled randomly, as will be shown in subsection [-@sec-subsec_2_2_1_Parameter_Study].
Consequently, the centroids representing immediate neighbors across the two separate $\beta$ carry different labels.
In the second remaining type of HTML graph, the closest centroids across the two different $\beta$ are connected through lines.
Also, in the same way as for the first step, the data generation, an additional HTML file containing the charts for all $\vec{\beta}$ is generated.
\newline
It can be concluded that the clustering step is performed by employing *Scikit-learn's* k-means++ implementation, which is well suited for a large number of points. As usual, all important settings can be controlled with *settings.py*.
### Parameter Study {#sec-subsec_2_2_1_Parameter_Study}
In this subsection, the effects of the parameter *n\_init* on the clustering step shall be outlined. After that, the random labeling of the centroids is highlighted.
With the parameter *n\_init* it is possible to define how often the k-means algorithm is executed with different centroid seeds [@scikit-learn].
For a reliable clustering quality, *n\_init* should be chosen high. However, the drawback is that with increasing *n\_init* the calculation time rises unacceptably. If *n\_init* is chosen too high, the clustering part becomes the new bottleneck of the entire \gls{cnmc} pipeline. \newline
To conduct the parameter study, clustering was performed using the following *n\_init* values:
$\textit{n\_init} = \{100,\, 200, \, 400,\, 800,\, 1000, \, 1200, \, 1500 \}$.
Some results are presented in figures @fig-fig_15 to @fig-fig_20 .
It can be observed that, comparing all the different *n\_init* values, visually no big difference in the placement of the centroids can be perceived.
A graphical examination is sufficient here because even for *n\_init* values that differ only by one ($n_{init,1} - n_{init,2} = 1$), the centroid positions are still expected to vary slightly; exact numerical agreement is therefore not a meaningful criterion.
Simply put, only deviations on a macroscopic scale, which can be noted visually, are searched for. As a conclusion, it can be said that $\textit{n\_init} = 100$ and $\textit{n\_init} = 1500$ deliver equally valuable clustering outcomes.
Hence, the computational expense can be reduced by deciding on a reasonably small value for *n\_init*.\newline
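A possible skeleton of such a study is sketched below; the timing approach, the placeholder trajectory and the cluster count are assumptions and merely illustrate how the listed *n\_init* values could be compared.

```python
# Illustrative n_init study: re-run the clustering for several n_init values
# and record the wall-clock time and the inertia (sum of squared distances).
import time
import numpy as np
from sklearn.cluster import KMeans

# Placeholder for the real trajectory data used in the study.
trajectory = np.random.default_rng(0).normal(size=(20_000, 3))

for n_init in (100, 200, 400, 800, 1000, 1200, 1500):
    start = time.perf_counter()
    kmeans = KMeans(n_clusters=10, init="k-means++", n_init=n_init).fit(trajectory)
    elapsed = time.perf_counter() - start
    print(f"n_init={n_init:5d}  inertia={kmeans.inertia_:10.2f}  time={elapsed:6.1f} s")
```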
The second aim of this subsection was to highlight the fact that the centroids are labeled randomly. For this purpose, the depicted figures @fig-fig_15 to @fig-fig_20 shall be examined. Concretely, any of the referred figures can be compared with the remaining ones to verify that the labeling does not obey any evident rule.
<!-- % ============================================================================== -->
<!-- % ============================ PLTS ============================================ -->
<!-- % ============================================================================== -->
::: {layout-ncol=2}
![Lorenz @eq-eq_6_Lorenz, $\beta =28$, $\text{n\_init}= 100$](../../3_Figs_Pyth/2_Task/0_N_Study/0_K_100.svg){#fig-fig_15}
![Lorenz @eq-eq_6_Lorenz, $\beta =28$, $\text{n\_init}= 200$](../../3_Figs_Pyth/2_Task/0_N_Study/1_K_200.svg){#fig-fig_16}
:::
::: {layout-ncol=2}
![Lorenz @eq-eq_6_Lorenz, $\beta =28$, $\text{n\_init}= 400$](../../3_Figs_Pyth/2_Task/0_N_Study/2_K_400.svg){#fig-fig_17}
![Lorenz @eq-eq_6_Lorenz, $\beta =28$, $\text{n\_init}= 1000$](../../3_Figs_Pyth/2_Task/0_N_Study/3_K_1000.svg){#fig-fig_18}
:::
::: {layout-ncol=2}
![Lorenz @eq-eq_6_Lorenz, $\beta =28$, $\text{n\_init}= 1200$](../../3_Figs_Pyth/2_Task/0_N_Study/4_K_1200.svg){#fig-fig_19}
![Lorenz @eq-eq_6_Lorenz, $\beta =28$, $\text{n\_init}= 1500$](../../3_Figs_Pyth/2_Task/0_N_Study/5_K_1500.svg){#fig-fig_20}
:::
<!-- % ============================================================================== -->
<!-- % ============================ PLTS ============================================ -->
<!-- % ============================================================================== -->