## Contents

• High dimensinoal data
• Dimensionality Reduction Methods
• Clustering (a discrete form of reduction)

## High Dimensional Data

data matrix X:n*m.

n samples as rows, m columns as observables

How to visualize?

• m=1: list
• m=2: x-y plot

Aim of dimension reduction: reduce the number of observables, while keeping relevant information.

### Examples

Ferromagnetic Ising model of a 28*28 lattice, 50,000 samples. The 1st principal component scales with the magnetization.

500,000 DNA sites of 3,000 humans. Projections onto the first 2 principal components givees a rough map of Europe

Fashion MNIST: 60,000 pictures of 784 (28*28) pixels. UMAP projections onto 2 dimensions

## Linear dimension reduction

• Principal component analysis
• Non-negative matrix factorization
• Independent component analysis
• Factor analysis

### Principal component analysis

change of basis

Aim: to find a transformation such that

• the components are uncorrelated
• ordered by the explained variability

Math:

$$X^T X$$ is a $$m \times m$$ matrix. Eigen-decomposition gives $$(\lambda,W)$$ where $$W$$’s columns are eigen vectors sorted by $$\lambda$$

$$T = X \cdot W$$ where $$T$$ is the score onto principal components, and $$T^TT$$ is diagonal

$$T_r = X \cdot W_r$$ where $$W_r$$ is a n*r matrix of the first r components

Practice In R:

• Iris Dataset
• Pitures of Objects at Different Angles:
• Swiss Roll: the transformed data almost have the same structure as the original data. This shows the limit of PCA due to linearity

wikipedia - Nonlinear_dimensionality_reduction

### UMAP: uniform manifold approximation

Assumptions:

• uniformed distributeed on Riemann manifold
• Riemann metric is locally constant
• The manifold is locally conncected

Results:

• models the manifold with a fuzzy topological structure
• finds low dimensional projection of the data that has the cloest possible equivalent fuzzy topological structure

Simplex 0-simplex: dot 1-simplex: line 2-simplex: triangle 3-simplex: 正四面体

Problem:

When the uniform distribution is not met, the result may not be what we like. To circumvent this, we need to define the local metric with the nearest neighbours.

Finding a low dimensional representation

• derive similar fuzzy topological representation of a low dimensional representation
• interpret edge weight s as proba
• minimize cross entropy
• pair-code.github.io/understand_umap

Practice in R

• Swiss Roll becomes a 2-d ring
• Pictures at different angles are well seperated

## Clustering

### hierarchiral clustering

• metric
• linkage criterion / distance between clusters

### centroid base clustering (k-means)

• pre-specify number of clusters, k
• partition the data such that the squared distance to cluster mean position is minimal
• Lloyd algorithm
• assign data to neatest center
• compute new centers

### graph based-clustering

• graph representation of data
• partition the data such that modularity Q is optimized
• $$Q = \sum_{i=1}^c e_{ii}-\gamma {a_i}^2, a_i = \sum_{j-1}^c e_{ij}$$.
• $$e_{ij}$$: fraction of links between cluster i and j
• $$\gamma$$: resulution

## Summary

Question: How to identify order in high-dimensinoal data?

• dimensionality reductuon to visualize high-dimensional data and to identify macroscopic observables
• clustering to identify discrete order in high dimensional data
• methods to se depending on the problem at hand