.mpipks-transcript | 10. Dimension Reduction

MountAye

May 14, 2021


This lecture was given by a guest speaker and there are no slides on the official website, so the usual slide-frame markers are not included. Instead, notes were taken during the lecture; see the notes section.


00:07 okay so let’s
00:08 uh let’s join uh so welcome
00:12 to our uh lecture again now to this uh
00:16 today we have a special guest as we uh
00:19 which is fabian
00:21 rost that you see on the other window
00:23 and as you remember
00:25 uh we before christmas we introduced uh
00:28 theoretical physics concepts that
00:30 explain how order emerges
00:33 in non-equilibrium systems and then last
00:36 week i started
00:37 introducing some basic tools
00:40 from data science namely what are the
00:43 tools that we can use
00:45 to computationally deal with large
00:48 amounts of data
00:50 and today so i need some special help
00:53 from somebody who’s a real expert
00:56 which is fabian you know fabian is a
00:59 trained
01:00 uh physicist and did a phd in
01:02 mathematical biology
01:04 he then did a postdoc in my group at the
01:07 max planck institute in dresden and now
01:10 he’s a professional
01:12 uh data scientist and bioinformatician
01:16 at the center for regenerative therapies
01:18 in dresden
01:20 and so he’s talking actually about
01:21 something that he’s uh
01:23 doing for his professional job which is
01:25 very nice
01:26 and uh over to you fabian thanks a lot
01:29 for
01:30 for sharing your expertise yeah
01:33 thanks thanks for having me here uh for
01:36 giving miss
01:37 uh giving me this opportunity so to give
01:39 a lecture on
01:40 exploring high dimensional data so i
01:42 think uh stefan
01:44 introduced me very well thank you very
01:47 much and
01:48 and also made the point of where we are
01:51 in the course
01:53 so i just wanted to ask you i just would
01:56 uh to say that uh whenever you
01:58 uh wanna ask something i think during
02:00 the course you just like uh unmuted
02:01 yourself and ask the question so that’s
02:03 totally fine to me also
02:05 type something in the chat and i try to
02:07 try to answer and then uh
02:09 like if you’re not saying anything uh
02:11 please mute yourself to reduce noise
02:14 um
02:18 all right so um okay i’m
02:22 i’m going to start right away
02:25 um so today will be about um exploring
02:29 uh high dimensional data
02:30 right and as many of you
02:34 know with a high dimensional and in high
02:36 dimensions
02:37 strange things can happen so if you for
02:39 instance look at this thing here what
02:41 you see here is a three-dimensional
02:42 projection
02:43 of a four-dimensional object rotating
02:46 and this
02:47 looks very funny right and i mean like
02:49 four dimensions is where we kind of
02:51 still can kind of get a glimpse but but
02:53 what if it gets to
02:54 thousands or millions of dimensions what
02:56 do we do with these kind of data
02:59 um so before i start
03:02 uh let me let me give you a brief
03:04 structure of the course
03:06 today of the lecture um
03:18 might have to click into the window
03:21 no i actually want to share okay
03:25 my ipad screen yes sorry for the delay
03:29 um all right so so i basically start
03:33 by introducing what i mean by high
03:35 dimensional data
03:41 um i will then
03:44 cover two dimension
03:48 dimensionality reduction methods
03:58 and also talk a bit like what’s the idea
04:01 of of
04:01 reducing dimensions and third if we
04:04 still have time i’ll talk a little bit
04:06 about
04:06 clustering which we kind of can think of
04:08 as a discrete form of dimensionality
04:11 reduction i’d say
04:13 so what do i mean by high dimensional
04:16 data
04:18 well let’s let’s start with an object
04:21 that i call a data matrix
04:28 um x which is of shape n by m
04:41 and maybe you remember from last week um
04:45 tidy data formats so this um uh
04:49 matrix should be in a tidy format
04:50 already so in the rows we have
04:52 n samples
04:57 and in the columns we have m
04:59 measurements or observables
05:08 um so what do i mean by this so so an
05:11 example could be what you what you
05:12 measure here is for instance you um
05:15 uh take a uh take yourself right and you
05:19 measure for instance the length of your
05:21 thumb and the length of your index
05:22 finger and the length of your middle
05:24 finger and so on you do the same thing
05:26 on there
05:26 with the other hand you measure your
05:28 foot size your body height
05:30 maybe uh the heartbeat rate and so on
05:32 and and uh
05:33 you end up gathering more and more
05:35 observables
05:38 and that’s m observables and this gives
05:40 us the dimensionality of our data
05:44 um and now you can do this with yourself
05:47 and you can do this well you can ask
05:49 your friends to do it with them and so
05:50 on and then
05:52 you can maybe do it with all people in
05:55 asia all over the world and this uh
05:57 gives you the number of samples that you
05:58 have right
06:00 um now how do you how do you treat these
06:04 this kind of data right if we go one
06:06 step back
06:07 and uh think about how how can we
06:10 visualize
06:11 and uh stephan uploaded some slides on
06:14 that and and maybe we’ll also talk on
06:16 that a little bit but
06:18 what can we what can we actually
06:20 visualize easily
06:22 um
06:28 so if you uh if you would have one
06:31 dimensional data right that’s rather
06:33 easy right
06:34 you you can um just uh plot your data
06:38 on a one-dimensional line for instance
06:40 and also
06:41 for two dimensions
06:45 um i think this is pretty
06:47 straightforward right
06:49 um so if this would be the
06:53 entries of our data matrix x1 x2
06:57 um we can just plot them as a scatter
07:00 plot right
07:02 and we could for instance see whether
07:03 they are correlated or not
07:05 but um as soon as we are like
07:08 in in three dimensions or larger like
07:10 three dimensions we can still
07:11 handle we can do a three-dimensional
07:13 plot maybe but this is still
07:15 hard to quantify um this this gets um
07:19 this gets uh non-trivial right and so we
07:22 need techniques to
07:24 to uh think about how to actually look
07:26 at these data and how to analyze these
07:28 data so
07:30 the aim of dimensionality reduction then
07:35 i’d say is to reduce the number of
07:42 observables
07:50 while keeping
07:57 relevant information
08:04 and what this relevant information is
08:06 right this is something uh
08:08 you you have to decide by the problem at
08:11 hand
08:12 um now um
08:15 before i go
08:18 into talking about how you can reduce
08:21 dimensionality
08:22 i wanted to show you some examples
08:26 so i’ll share some slides for this
08:33 um and so
08:37 i think as many physicists are in the
08:39 audience one example i wanted to share
08:41 something
08:41 that i think many of you are already
08:44 know
08:45 which is um where we know where many of
08:48 you know that a
08:49 low number of observables actually
08:51 suffice to to describe the system very
08:53 well and that’s in the realm of
08:54 statistical physics and you
08:56 saw a lot of uh examples from stefan by
08:58 that
08:59 but even in a rather easy example like
09:01 an
09:03 ising model right you know that there’s uh
09:07 there’s many many spins in this system
09:09 right and imagine you could measure all
09:11 these spins
09:12 um you have a very high dimensional
09:15 representation of the state of the
09:17 system then
09:18 but many of these spins will be
09:20 correlated and we know from statistical
09:22 physics that actually you can
09:24 very well describe the macroscopic state
09:26 of such a system
09:27 with just the magnetization
09:30 now what will happen if we come from the
09:32 other side from the data science
09:34 perspective and that’s a
09:36 paper here an archive by sebastian
09:38 vetzel and what he did was
09:40 um he
09:44 did a simulation of a ferromagnetic
09:46 ising model on a 28 by 28 lattice
09:49 at different temperatures and so 28 by
09:52 28 gives
09:53 you 784 spins right and he recorded
09:57 the state of all these spins in 50 000
09:59 different samples at different
10:00 temperatures
10:02 and now you have a data matrix with 784
10:04 dimensions right and then you can start
10:07 to
10:07 say okay if i just would have this
10:10 matrix and i wouldn’t know nothing about
10:11 the system how can i look for order in
10:14 this system and how can i reduce
10:15 dimension and one of these techniques is
10:18 called principal components analysis and
10:20 i’ll explain later what this actually is
10:22 but if you do this you get a result like
10:24 this the first principal component
10:27 which is a one-dimensional
10:28 representation of of this
10:31 high-dimensional data matrix perfectly
10:33 scales with the magnetization
10:35 so in this way you would actually be
10:37 able to kind of discover
10:39 an order parameter without knowing
10:42 anything about the system
10:43 by having detailed microscopic data of
10:45 the system
10:47 so this is rather an artificial example
10:50 right because we know very well
10:52 how an ising model behaves but what
10:54 about
10:55 systems where we don’t really know how
10:57 they work for instance humans
11:00 um so this is a paper from nature in
11:02 2008
11:04 last authors bustamante and
11:07 what they did was they quantified human
11:09 dna
11:10 so they took 3 000 european humans
11:14 and now you can imagine that the dna is
11:18 a long string of
11:19 characters right and and to a large part
11:22 these this is conserved from one human
11:25 to
11:25 an to another human so the dna will be
11:28 pretty much the same
11:29 but at some positions we have variations
11:31 right the dna from one human to another
11:33 is not exactly the same and at half a
11:35 million positions
11:37 the authors quantified the dna state
11:40 and so this gives you a a a data set
11:43 with a dimension of half a million
11:45 approximately for 3 000 humans and then
11:48 they did the dimensionality reduction
11:49 and again they did a
11:51 principal component analysis and this is
11:53 a two-dimensional representation of this
11:55 high dimensional data set
11:57 and what i want to ask you now is to
12:00 look at the picture and you don’t
12:01 yet have to understand what the
12:03 principal components mean but i i want
12:05 to ask you to type
12:07 in the chat if you can what you see in
12:10 this picture
12:23 it’s a very difficult or very shy
12:25 audience
12:30 what does it look like
12:33 even if you think it looks like an elephant
12:38 it is a scatter plot yes and uh we see
12:41 some clusters yes
12:43 there’s also uh some sensible answers
12:47 yes and it’s actually um uh the uh
12:53 now i don’t see my mouse anymore i think
12:57 adolfo got it right
12:58 yes adolfo got it right it’s already
13:00 the second answer
13:01 um it resembles a map of europe and
13:09 that’s exactly what the title of the
13:11 paper was uh genes mirror geography
13:13 within europe right so
13:15 this was uh i think it’s quite amazing
13:17 so you just
13:18 record the dna of humans right
13:21 and basically this gives you a map of
13:24 europe
13:25 approximately it kind of makes sense
13:26 right that the genes are
13:28 like um uh like um
13:31 well somewhat remain local
13:36 um so here’s another example
13:40 um there’s a very i’d say maybe
13:42 unphysical example but i think it
13:44 it works nicely these kind of examples
13:47 using
13:48 images work nicely to explain
13:50 dimensionality reduction techniques
13:52 so this is a fashion data set called
13:55 fashion mnist
13:56 provided by zalando research and
14:01 again you can uh these these images are
14:04 on a 28 by 28 lattice
14:06 uh in a gray scale so you can record the
14:10 intensity of each pixel and you end up
14:12 with having 784 pixels on
14:14 of 60 000 pictures and there’s different
14:17 kinds of pictures
14:18 like there’s uh shirts and tops and
14:21 trousers
14:22 and there’s also shoes i think in there
14:24 and then you can do a dimensionality
14:26 reduction and here
14:29 this is an example for a umap a uniform
14:32 manifold approximation i’ll explain
14:34 later what this is it’s a non-linear
14:36 dimensionality reduction technique
14:38 and again you can see a number of things
14:41 um
14:42 by just taking this as as vectors right
14:45 and
14:45 projecting them in two dimensions you
14:47 see that uh for instance
14:50 um i think this up here which is
14:51 trousers forms one cluster
14:54 which is very different for instance
14:56 from a cluster which consists of
14:59 sandals and sneakers and
15:02 ankle boots so this is the kind of shoes
15:04 cluster
15:05 uh you have bags here which seem to have
15:07 some similarity to
15:09 tops at some point in a certain way
15:13 so um this is a again dimensionality
15:16 reduction right and
15:18 you can see how you can how you can use
15:20 it to group together different kinds of
15:22 things in in this case just pictures uh
15:25 what you also see
15:26 here is that these things cluster so
15:28 there’s a discrete kind of order
15:32 and i’ll talk about clustering also
15:35 again
15:35 in the second part of the lecture
15:38 um so the now i’ll go into
15:42 dimensionality reduction
15:44 and how it works and i’ll start with a
15:46 linear dimensionality reduction
15:48 there’s a number so so this is just the
15:50 linear transformation of your data
15:52 there’s a number of techniques around um
15:55 for instance principal component
15:57 analysis i’ll explain this and there’s
15:59 other
15:59 uh techniques like uh non-negative
16:02 matrix factorization or independent
16:04 component analysis
16:06 or factor analysis they are they are
16:09 kind of similar in that they are linear
16:12 but i only have time to cover one of
16:14 them which is the easiest one which is
16:15 pca
16:17 um for that i’ll again
16:21 share my ipad
16:41 all right what’s principal component
16:42 analysis i’ll first sketch the idea
16:47 um so for getting the idea
16:50 it’s sufficient to go into two
16:51 dimensions and i’ll
16:53 make a plot a scatter plot of
16:56 two-dimensional data
16:58 where let’s assume our data looks a bit
17:02 like this
17:06 all right it’s essentially very well
17:08 correlated and
17:09 from looking at this now if you would uh
17:12 give me
17:12 x1 right and i would know x1 i could
17:16 very well predict x2 because i
17:18 i see that these two observables are
17:21 very well correlated and then
17:23 pca is a way to formalize
17:26 this and the first principal component
17:28 is designed in a way that it points in
17:30 the direction
17:32 of highest variability of the data and
17:35 in this example this would be this
17:36 direction
17:38 and this would be principal component
17:40 one and then the next principal
17:41 component
17:42 has to be orthogonal to the uh to pc1
17:46 and again point in the direction of uh
17:48 maximum variability of the data and
17:50 there’s only one direction left here
17:52 which is
17:53 uh the orthogonal direction here and
17:54 that’s pc2 principal component two
17:58 so then you just change the basis of
18:00 your data
18:03 through the directions of the principal
18:05 components
18:06 and then you end up with your
18:08 representation of your data
18:13 like this
18:19 where now in this directions your data
18:21 would be uncorrelated
18:24 and you would have
18:27 most of the variability in this
18:29 direction right
18:32 so the aim will be
18:38 to find a transformation
18:50 such that the components are
18:54 uncorrelated
19:03 and that and that they are ordered
19:08 by the explained variability
19:28 now how does the math of this look like
19:36 sorry
19:50 mathematically we would start with
19:53 an eigendecomposition of the
19:56 covariance matrix
19:58 um the
20:01 covariance matrix is an object which we
20:03 get from
20:04 or at least this is proportional to this
20:07 object it’s the
20:08 data matrix transposed times the data
20:11 matrix itself
20:13 and you will see that this is um
20:16 a matrix of the shape m by m so it has
20:19 the dimensionality of
20:21 our data and
20:24 if you if you do the computations you
20:27 will see that the
20:28 diagonal elements of this object are
20:31 proportional to the variability of our
20:33 data in this direction and the
20:35 off-diagonal elements
20:37 are proportional to the covariance of
20:39 components and then we do an eigen
20:46 decomposition
20:50 and from this we get m eigenvalues
20:53 lambda
20:53 i and
20:57 m eigenvectors and i group them
20:59 together in a matrix w
21:02 so the columns
21:07 of this matrix are eigenvectors
21:14 of the covariance matrix and importantly
21:18 what we do
21:19 here and we can always do this is we
21:21 sort them by the
21:23 we sort them by the magnitude
21:29 of the eigenvalues lambda i
21:33 and this ensures that the first
21:34 principal component points in the
21:35 direction of highest variability
21:38 now w is just the new uh
21:42 basis for our data and then we can
21:44 transform our data
21:47 so like this the the transformed data
21:51 will look like this we take the data
21:52 matrix
21:53 x times w
21:56 um so in the language of principal
21:58 components this is what’s called
22:00 loadings
22:04 or also the rotation
22:09 and this the transformed data is called
22:11 scores
22:20 now i leave it to you as an exercise to
22:22 show that
22:24 the covariance matrix of the transformed data
22:28 is a diagonal matrix
22:33 which basically means in this transformed
22:36 space the components are uncorrelated
22:39 now the data matrix again here is of
22:43 shape
22:43 n by m
22:47 so the dimensionality of this
22:50 is just the same as the dimensionality
22:53 of our original data so how does this
22:56 help in reducing the dimensions
22:58 well there the idea is because the
23:00 eigenvectors are sorted
23:03 by the magnitude of lambda i that we can
23:05 truncate it
23:07 and if we define wr
23:11 which is an m by r matrix as the
23:16 first r columns of w
23:24 then we can get a a projection in lower
23:28 dimensions of our data
23:30 just by taking x
23:34 times wr
23:39 and this will this object will be of
23:40 shape n by r
23:42 and if we for instance select r equals
23:44 two we have a two dimensional
23:46 representation of our data of course
23:47 this
23:48 throws away some information but most of
23:50 the variability of our data is captured
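
A compact summary of the whiteboard derivation above, in the notation used in the lecture (X is the n by m data matrix, W the matrix of sorted eigenvectors, T the scores); centering of the data and the 1/(n-1) factor are left implicit here, as they were on the whiteboard:

```latex
% covariance matrix (up to centering and a factor 1/(n-1)), shape m x m
C \;\propto\; X^{\mathsf T} X

% eigendecomposition, eigenvalues sorted by magnitude
C\, w_i = \lambda_i\, w_i, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_m,
\qquad W = (w_1, \dots, w_m)

% scores (transformed data); the exercise above: the transformed covariance is diagonal
T = X W, \qquad W^{\mathsf T} C\, W = \operatorname{diag}(\lambda_1, \dots, \lambda_m)

% truncation: keep only the first r columns of W for an n x r representation
T_r = X W_r \in \mathbb{R}^{n \times r}
```
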
23:59 um yes so so the take home is that that
24:03 is that principal component
24:07 analysis is a linear transformation of
24:09 the data such that the components
24:11 are uncorrelated and ordered by the
24:13 amount of explained variability
24:16 and what i want to do next is i want to
24:20 show you how you can actually do this
24:22 in r which is uh quite easy and you
24:25 don’t actually need to know the math for
24:26 this
24:28 um so let me
24:31 um excuse me yes uh could you
24:34 repeat how do we decide uh at what
24:39 at which point to truncate w and what
24:42 exactly do we mean by sorting by
24:43 magnitude of the eigenvalues lambda
24:47 okay um if you
24:54 uh this this is the
24:58 covariance matrix right so essentially
25:00 if i
25:01 if i look at this um i saw you didn’t
25:05 do you see my mouse yes yes
25:08 um so lambda 1
25:11 will be proportional to the variability
25:13 in this direction
25:14 okay yes and lambda 2 to the variability
25:18 in this direction
25:21 okay and and so this
25:24 sorting by uh
25:28 by the eigenvalues ensures that the
25:31 first component
25:32 of the first principal component will
25:35 point in this direction of highest
25:36 variability of the data
25:38 right yes
25:42 um so this is what the what the sorting
25:45 means
25:47 uh okay so in this case lambda 1 is
25:51 that the reference that we sort through
25:54 that we get w
25:55 from because lambda 1 is proportional to
25:57 the pc
25:58 one direction um
26:02 lambda lambda 1 is the eigenvalue right
26:05 that corresponds to the
26:06 to the eigenvector which is in this
26:08 direction yes
26:10 yes and it is actually and if you
26:13 if you uh if you if you do kind of
26:17 um
26:20 if you look at this at the transformed
26:22 covariance matrix
26:24 right you will see that in the diagonal
26:26 elements of this you just have the uh
26:29 lambdas right so it’s the that’s the the
26:32 variable the variance in this direction
26:35 okay
26:36 yes thank you exactly so
26:39 was this all i think i remember
26:42 there was another question uh you also
26:44 asked how to truncate it how to decide
26:46 uh uh where we truncate it
26:50 um this is not
26:53 this is up to us um you can kind of you
26:57 you can
26:57 uh what you can what i will show you in
26:59 a moment is that you can decide that you
27:01 wanna
27:02 keep like uh ninety percent of the
27:05 variability of your data and this would
27:06 give you a threshold of where you
27:08 where you stop for instance but
27:11 uh this threshold again is arbitrary so
27:13 you can also just
27:15 i mean one usual decision that you do is
27:17 if you want to visualize your data you
27:18 truncate at two
27:20 because that’s easy to visualize um
27:23 you might also take this as an initial
27:25 step
27:26 of dimensionality reduction to maybe
27:29 just like
27:30 going from half a million to 50 or
27:32 something and then do a non-linear
27:33 transformation
27:35 but essentially like in my everyday work
27:37 this is a uh
27:39 it’s an arbitrary it’s often an
27:40 arbitrary choice and yeah in the end
27:42 have to see whether it works or not
27:44 whether it’s sensible
27:45 so that’s a tough question and
27:48 there’s also some research still on that
27:51 okay um
28:03 all right so i prepared a
28:06 um a
28:10 a document a markdown
28:13 document a notebook in rstudio and i
28:15 think stefan briefly introduced rstudio
28:17 to you
28:18 last week you can also
28:22 get the code on from this url and it’s
28:25 also
28:26 by now on the on the course home
28:28 homepage
28:30 um also the code from the last lecture i
28:33 forgot to mention that it’s also on
28:34 github
28:37 in the same folder yeah exactly
28:40 and then um well i wanted to show you
28:43 before is that like i i want to just
28:45 show you as an example one of these uh
28:47 famous uh uh data sets you you might
28:51 have heard of the uh it’s it’s a data
28:52 set where uh
28:53 botanist and statistician fisher
28:56 collected
28:56 uh flowers from a flower called iris
29:01 and there’s different species iris
29:04 virginica iris setosa iris versicolor
29:07 and these have different leaves the
29:10 petal and the sepal
29:12 and he measured the the sepal length
29:14 and the sepal width and the petal
29:16 length and the petal width
29:17 of all these species
29:21 and it’s a it’s a data set that you can
29:23 easily load in in
29:24 r that’s why i used it and you can it’s
29:27 four dimensional so here you see the
29:29 data it’s uh
29:31 the sepal length and width and petal length and width
29:33 for one species
29:34 and then the other species follow
29:36 somewhere in the data
29:38 and you have this four numbers
29:41 now you can try to visualize this uh in
29:44 a
29:44 for instance pairwise scatter plots
29:47 plotting the length
29:48 of the sepal against the sepal width
29:50 and so on
29:51 and here this is color coded by species
29:54 and you already see some separation
29:56 but it’s hard to choose like in this
29:58 four-dimensional plot
30:00 what you actually what to actually look
30:02 at and so you can perform a principal
30:04 component analysis on this
30:06 four-dimensional data set and in r
30:08 this is just a one-liner
30:10 so the principal component analysis is
30:12 done by prcomp
30:14 and here i plug in the four-dimensional
30:17 data set
30:20 and then you get all the things out that
30:23 we just talked about you get this uh
30:25 transformation matrix w called rotation
30:28 here
30:29 um where now you see that the
30:32 direction of principal
30:34 component one in this four dimensional
30:36 space
30:37 right it’s it’s a vector um you see
30:40 you can look at the transformed data
30:43 um which is again four dimensional
30:46 at first um here you can
30:50 uh look at the explained uh uh
30:53 variability or explain variance
30:55 um and this is maybe for the question of
30:59 where to
31:00 cut off so that they explained
31:03 variance here you can see the cumulative
31:05 proportion of this
31:07 and with the first two principal
31:09 component you explain
31:10 95 96 percent of the variability of your data
31:14 and you might say okay
31:16 that’s enough with uh and in in pc4
31:19 uh you from pc3 to pc4 you hardly add
31:23 something so the fourth dimension of the
31:24 data set is probably very
31:26 it’s very correlated to the first three
31:29 there’s not much information left and
31:31 then this is how the data looks now
31:34 in in the direction of principal
31:35 components one and two and what you can
31:37 observe is that
31:39 uh across principal component one that’s
31:41 the main direction which kind of
31:44 um discriminates between the species
31:47 you can also see that setosa is very
31:50 different from
31:51 versicolor and virginica and here it’s
31:54 hard to actually draw a boundary
31:56 but just based on the measurements
32:01 this is what’s called a scree plot and
32:03 this is just again a visualization of
32:05 the variance
32:06 explained by each principal component
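
A minimal sketch of the prcomp() steps just described; this is not the original course notebook (that one is on the course GitHub), it only uses base R and the built-in iris data:

```r
# pca of the iris measurements with base R
data(iris)
X <- iris[, 1:4]                  # sepal/petal length and width: 4 observables

pca <- prcomp(X)                  # principal component analysis in one line

pca$rotation                      # the loadings / rotation matrix w
head(pca$x)                       # the scores, i.e. the (centered) data times w
summary(pca)                      # cumulative proportion of explained variance

screeplot(pca, type = "lines")    # scree plot: variance per principal component

plot(pca$x[, 1:2], col = iris$Species, pch = 19,
     xlab = "PC1", ylab = "PC2")  # two-dimensional representation by species
```
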
32:10 now i want to also show you an another
32:13 data set
32:14 where actually principal component
32:16 analysis has a hard job
32:18 to work so this is
32:21 again an image data set where images
32:24 were collected of different objects and
32:25 they are rotated
32:27 right
32:31 and now you can again do a vector
32:34 representation of these images to a
32:35 principal component analysis and this is
32:38 what the
32:38 picture looks like and what the plot
32:41 looks like now and
32:42 and the colors now uh represent
32:45 different objects
32:46 and the dots of the same color are then
32:52 pictures of the objects at different
32:54 angles and you can see
32:56 that if you do this dimensionality
32:58 reduction you can see some structure
33:00 right but it’s not really uh doesn’t
33:03 really work well in discriminating the
33:05 different
33:06 objects and you would expect that maybe
33:09 you can
33:09 reveal this discrete kind of order
33:11 different kinds of objects
33:15 or that’s maybe what you want by doing a
33:18 dimensionality reduction
33:23 here’s another example that that i made
33:26 up
33:26 which is uh a dimensionality reduction
33:29 of a swiss roll so what is this it’s
33:31 actually a swiss roll with a hole so
33:33 what i did here was to say okay i
33:36 um uh randomly place
33:39 uh points on a on a two-dimensional
33:43 surface
33:43 but there’s a hole in the middle and
33:45 then i wrap this up
33:47 in 3d like this um
33:50 i transform it like this okay it’s
33:53 it’s this roll and it’s it’s called it’s
33:56 it’s a sweet actually in in switzerland
33:58 that’s why it’s called a swiss roll and
34:00 there you can still see this uh
34:02 this uh hole in the middle now what will
34:04 happen
34:05 if you do a principal components
34:07 analysis of this
34:10 the example is shown here and this is
34:13 the same
34:14 data shown in principal component space
34:21 actually in 3d and so
34:24 what you can see is that
34:28 not much happened right it’s pretty much
34:30 the same as before
34:31 just in principal component space and
34:34 the reason for this is
34:35 that what we do is a linear
34:38 transformation of the data right
34:40 but what i did before was to take two
34:42 dimensional data and
34:43 non-linearly
34:47 projected it into 3d well used a
34:50 non-linear function
34:51 and so um projecting this back to 2d
34:54 in a way cannot work with a linear
34:58 dimensionality reduction technique and
34:59 that’s why
35:02 non-linear dimensionality reduction
35:04 techniques are also very useful
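
One way to build such a swiss roll with a hole and to see that pca cannot unroll it; the exact construction used in the lecture notebook may differ, this is only an illustrative sketch in base R:

```r
# sample points on a 2d sheet, cut a hole, then roll the sheet up in 3d
set.seed(1)
n <- 3000
u <- runif(n, 0, 4 * pi)          # coordinate along the sheet (gets rolled up)
v <- runif(n, 0, 10)              # coordinate across the sheet
hole <- u > 1.5 * pi & u < 2.5 * pi & v > 3 & v < 7
u <- u[!hole]; v <- v[!hole]      # remove the points inside the hole

swiss_roll <- cbind(x = u * cos(u), y = v, z = u * sin(u))  # non-linear map to 3d

# pca is linear, so the principal components are just a rotated view
# of the same rolled-up object
pca <- prcomp(swiss_roll)
pairs(pca$x, labels = c("PC1", "PC2", "PC3"))
```
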
35:07 so this leads me to the second part of
35:10 dimensionality reduction
35:13 which is about nonlinear dimensionality
35:15 reduction techniques and
35:17 as it’s usually the case with non-linear
35:20 things
35:21 there’s many of them and then actually
35:23 about one year ago
35:24 i had a look at wikipedia at the
35:28 like a list of some prominent algorithms
35:32 for dimensionality reduction and there
35:34 were 29 i’m sure now
35:36 there’s more and this this doesn’t claim
35:39 to be complete right so
35:41 um this makes very clear that depending
35:43 on what you want
35:44 there’s different things that you can do
35:47 today i will concentrate on uniform
35:49 manifold approximation a rather
35:51 recent one introduced i think in 2018 so
35:54 three years ago
35:58 and it’s used a lot in single cell
36:00 sequencing data in single cell genomics
36:03 nowadays
36:05 um so what is this
36:08 technique about so don’t be scared
36:11 there’s a lot of mathematics on this
36:13 page and you don’t have to understand
36:15 all of that right away i won’t also i
36:18 also won’t have the time to fully
36:20 explain this in detail but i will give
36:22 you an intuition
36:23 of what this does uh
36:26 the basic idea is
36:30 that
36:32 [Music]
36:36 so the basic idea is that we have a
36:39 higher dimensional data set
36:41 but we hypothesize that there’s a low
36:43 dimensional manifold so think of
36:45 a for means for instance there could be
36:47 a line in a two-dimensional manifold
36:50 right and then we want to discover this
36:53 dimensional
36:54 manifold so this
36:57 there’s a couple of assumptions that are
36:59 made in umap one that the data is
37:01 distributed uniformly on a riemannian
37:04 manifold
37:05 that the riemannian metric is locally
37:07 constant and that the manifold is
37:09 locally connected
37:11 and what umap then gives you is
37:14 it models this manifold with a fuzzy
37:16 topological structure
37:18 and it finds a low dimensional
37:20 projection of the data
37:22 that has a very similar fuzzy
37:25 topological structure
37:26 to your original data so what does all
37:29 this mean

white board 1


37:30 um let’s let’s start
37:33 by i will start by by introducing
37:36 briefly very briefly introducing an an
37:40 important idea from topological data
37:42 analysis because
37:43 uh this algorithm is or this idea is
37:47 heavily based on topological data
37:50 analysis and this is simplices
37:53 a zero simplex is a point one simplex is
37:56 a connection of two zero simplices which
37:58 is a line then there’s a two simplex
38:00 which is a triangle
38:02 a three simplex which is a tetrahedron
38:05 and you can see that you can easily
38:07 extend this
38:09 concept to higher dimensions so you
38:12 can introduce a four simplex five
38:13 simplex and so on
38:16 and these objects are rather easy
38:19 to construct and the nice thing that you
38:22 can do is you can glue together such
38:24 objects such simplices
38:26 and by gluing them together
38:30 you can basically approximate um
38:33 manifolds so um you might all have
38:37 seen already some point where you
38:39 uh triangulate the surface right and
38:41 this is nothing
38:42 other than approximating this manifold with
38:45 simplices
38:51 now let’s look at this example which i
38:54 took from the umap homepage actually
38:56 where you can see points which are
38:59 uniformly sampled actually equidistantly
39:02 sampled i think
39:03 on a not equidistant but uniformly
39:06 sampled on a
39:10 on a line
39:14 in a two-dimensional space right so we
39:16 have a one-dimensional manifold
39:18 in a two-dimensional space and
39:22 uh um so this would be could be our data
39:25 points that were measured
39:26 right and we would be interested in
39:28 finding the topological
39:30 structure of this manifold um so what
39:33 has been done
39:34 here is to draw some spheres around
39:36 these dots
39:38 and the union of all these spheres i
39:40 mean all these spheres are sets
39:42 and the union of all this is what’s
39:44 called an open cover
39:45 open cover is just the thing that uh is
39:49 this is a um is a collection of sets
39:52 that span the whole manifold right so
39:54 the line
39:55 that we see here is inside this open
39:57 cover
39:58 and now you can do some heavy
40:00 topological and mathematics
40:02 and you can then prove that if you now
40:05 contract
40:05 uh construct simplices by uh all
40:09 groups of sets that in that that have a
40:12 non empty intersection which in this
40:14 case basically would mean
40:15 you construct a one simplex here line
40:18 here
40:19 and one simplex here and so on you i
40:21 think you get the picture
40:23 then if you do this and this would look
40:25 like
40:26 no um then this
40:29 structure this graph-like structure that
40:32 you get out of
40:33 this captures the topology of the
40:35 manifold
40:37 the information is encoded in there and
40:38 i think in this uh
40:40 two-dimensional example this this is uh
40:44 quite intuitive that the connection of
40:46 all these points
40:51 tells you about uh tells you that this
40:54 is a connected line
40:56 and there’s no loops for instance
41:01 um so that’s pretty cool and now we
41:04 would like to use this concept to
41:06 uh to learn something about our high
41:09 dimensional data
41:12 is the criterion for the connection
41:14 always like the
41:15 shortest distance between the points
41:18 because i mean here i could
41:20 see it but if i have more complex
41:22 structure
41:23 exactly exactly i’m actually coming to
41:25 this point in a second in this case
41:27 it was a bit like okay let’s just guess
41:30 uh the size of the sphere
41:31 make it a good guess such that it works
41:33 right but if you have high dimensional
41:35 data
41:36 what would be the size of this sphere
41:37 that would be super unclear
41:39 so um i’ll come back to this question in
41:42 a moment
41:46 it’s also connected to a situation with
41:48 like this
41:49 right because um
41:52 our points might not actually be
41:54 uniformly distributed
41:55 okay so um if we would do the same thing
41:59 here
42:00 drawing these circles and
42:04 uh construct the simplices out of that
42:07 we would actually get disconnected parts
42:10 okay
42:11 and this is not what we would
42:13 intuitively like to get right
42:15 because here we
42:19 we saw this line um
42:24 now and this is also this kind of breaks
42:26 the assumption
42:28 of umap which is the uniform
42:30 distribution of uh
42:32 of data points on the manifold so
42:35 how can we circumvent this or how can we
42:37 solve this problem we can
42:38 kind of turn the problem upside down and
42:41 say
42:42 no these data points are uniformly
42:46 distributed we see different distances
42:50 if we apply a constant metric so this
42:52 has to be wrong the
42:53 metric is not constant actually we
42:56 could
42:57 try to construct the metric like this
42:58 and this comes back to your question
43:01 we might say that the distance in this
43:02 case to the fifth
43:04 neighbor should always on average give
43:06 us a distance of five
43:08 right such that we say um distances
43:10 between neighboring data points should
43:12 on average be one
43:15 um and this would mean that that a
43:18 distance of
43:19 five for instance for this uh data point
43:22 would be rather large
43:23 and a distance of 5 for this data
43:26 point would be rather small
43:28 so we define
43:34 we now use nearest neighbors k nearest
43:37 neighbors
43:38 to define a local metric
43:42 and we then turn this local metric
43:47 again and i’m skipping a lot of details
43:49 which are important here into a
43:53 construction of simplices into a
43:54 construction of graphs where the
43:56 edges now are weighted and the reason
44:00 that they are weighted now is that the
44:02 uh the distances between
44:04 uh points are different and also that uh
44:06 we actually have two measures of
44:09 um of distance one
44:12 using the local metric from here to
44:15 there and one using the local metric
44:16 from here to there because they might
44:18 not be the same
44:22 um but if you properly do the math
44:25 behind all that
44:26 you can show that this construction now
44:29 um captures essential parts of the
44:32 what’s now called the fuzzy topology
44:35 of this manifold okay so we have a graph
44:39 representation now
44:41 that captures a topological structure
44:44 of our of the manifold we’re interested
44:47 in
44:49 um how can we do that how can we use
44:50 this now to find low dimensional
44:52 representation
44:54 well the idea is that now you start
44:57 um you have this this graph defined and
45:01 now you start
45:02 with a low dimensional representation
45:05 you just place your points randomly
45:07 basically
45:08 and then you construct a similar
45:12 graph representation of your low
45:14 dimensional representation and then you
45:15 shuffle your points around
45:17 until the the topology the graph
45:20 representations kind of match
45:26 so yes you have a graph representation
45:30 of your high dimensional
45:31 data and of your low dimensional data you
45:33 interpret the edge weights
45:35 as probabilities and then you minimize
45:38 the cross
45:39 entropy given here where basically
45:42 the important thing is that you try to
45:44 match the
45:46 edge weights in your higher dimensional
45:48 representation with the edge weights in
45:50 the low dimensional representation like
45:52 when this
45:53 ratio gets one the term vanishes and the
45:57 same happens over here
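
The cross-entropy itself is only on the slide, not in the transcript; written out as in the umap paper (McInnes et al. 2018), with v_ij the edge weights of the high-dimensional graph and w_ij those of the low-dimensional one, it reads:

```latex
% cross-entropy between the two fuzzy graph representations
C(V, W) \;=\; \sum_{i \ne j} \left[
    v_{ij} \,\log\!\frac{v_{ij}}{w_{ij}}
    \;+\; \left(1 - v_{ij}\right) \log\!\frac{1 - v_{ij}}{1 - w_{ij}}
\right]
% each term vanishes when the corresponding ratio equals one, i.e. when the
% low-dimensional edge weights match the high-dimensional ones
```
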
46:01 now how does this work in practice like
46:05 if you want to read more about this
46:06 there’s a nice home page
46:11 i just wanted to show an animation here
46:15 which is called understanding umap there
46:18 for instance here you have an example
46:20 in 3d you have um dots
46:23 which run out from a center in a star
46:26 like fashion right and that’s a 3d
46:28 represent this is a three-dimensional
46:30 data
46:31 and well actually it’s
46:34 10 dimensional data in this case i think
46:37 and this is a three-dimensional
46:38 projection maybe
46:40 and now if you run umap
46:44 these dots are shuffled around randomly
46:47 and you stop
46:48 when you have a topological
46:50 representation which is very similar and
46:51 in this case this is also
46:53 this star-shaped pattern but
46:56 now in two dimensions
46:58 and if you want there’s many more
47:00 examples here and you can
47:02 play around with this
47:09 um you can of course do the same thing
47:12 in
47:12 r um i i don’t know
47:16 like i know this was a lot of
47:18 information and i’ve
47:19 i didn’t go into the details are there
47:21 questions about umap at this point
47:28 otherwise i just show you how to do this
47:31 in r
47:35 it’s essentially again a one-liner
47:38 where you have a function umap and what
47:41 i feed in here
47:43 is the swiss roll that i created
47:45 previously
47:47 okay and then um
47:50 i just i i do the um
47:54 dimensionality reduction here with a one
47:56 liner
47:57 then i convert the whole thing in a data
47:59 table object
48:00 to uh to have a nice plotting
48:03 and this is what the result looks like
48:05 now in umap coordinates
48:08 the the swiss roll was kind of unrolled
48:11 but it
48:12 heavily lost its shape but what you can
48:15 see is that the topology is preserved
48:17 right
48:18 um so so the hole in the center is still
48:21 there
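
A minimal sketch of that one-liner, assuming the umap package (the uwot package provides a similar umap() function); the lecture notebook on GitHub may use different settings. swiss_roll is the matrix built in the pca sketch further up:

```r
library(umap)
library(data.table)

emb <- umap(swiss_roll)                      # the dimensionality reduction

dt <- data.table(umap1 = emb$layout[, 1],    # convert to data.table for plotting
                 umap2 = emb$layout[, 2])
plot(dt$umap1, dt$umap2, pch = 19, cex = 0.3,
     xlab = "umap 1", ylab = "umap 2")       # the sheet is unrolled, the hole remains
```
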
48:25 um i also want to show you a umap
48:29 representation of the coil 20 data that
48:31 i showed you above so remember this was
48:33 the data set
48:36 where we had pictures which are rotated
48:42 and where the the pca didn’t look very
48:46 informative
48:48 now if we do the same thing with a umap
48:50 and we feed in these pictures this is
48:52 what we get
48:54 so most of the images nicely separate
48:57 right different colors mean different
48:59 objects
49:01 they nicely separate and what the umap
49:03 also gives us
49:04 is this kind is is the topology of this
49:07 data in the sense of that we can see
49:09 that this is a rotation
49:10 right so neighboring
49:14 pictures are similar and you
49:18 rotate the object until you come back to
49:20 the other side
49:27 okay i think yes that’s it for um
49:31 dimensionality reduction just one
49:35 example
49:36 for linear dimensionality reduction
49:37 which was principal components
49:39 and one for non-linear topology
49:42 preserving dimensionality reduction
49:44 which was umap and i now wanted to talk
49:47 a little bit about clustering
49:50 um so
49:53 you can as you already saw in some of
49:56 the
49:56 reduced dimensionality in in the some of
49:59 the
50:00 representations of the data and reduced
50:01 dimensions uh clusters can appear in
50:04 your data which
50:05 essentially mean you have qualitatively
50:08 different
50:09 objects in your data different different
50:11 samples and
50:15 you sometimes you you
50:18 and and it’s good to have a way to group
50:22 different kinds of different classes of
50:24 objects together
50:26 and clustering is nothing
50:29 which is um there is no one definition
50:33 of clustering
50:34 the idea is to find clusters such that
50:37 the data within clusters are similar
50:40 while data from different clusters are
50:42 dissimilar
50:43 but what you define a similarity or
50:46 dissimilarity that’s
50:48 basically up to you and so a lot of
50:50 different clustering algorithms exist
50:54 and i just wanted to introduce three of
50:56 them today
50:58 um and you use this a lot in or at least
51:01 i use this a lot in trying to understand
51:03 high dimensional data
51:05 i’m just going to share my
51:10 ipad again
51:27 and i wanted to show you first i want to
51:29 give you three examples and the first
51:30 one would be hierarchical clustering
51:32 which you
51:33 which uh
51:36 we’re here i just plotted some
51:40 some data in two-dimensional space to
51:42 illustrate the concept of clustering
51:44 and what you need for hierarchical
51:47 clustering
51:48 is first of all a metric
51:54 which defines distance between your
51:57 objects
51:57 that you have here and this could just
52:00 be for instance the euclidean metric or
52:02 manhattan metric or a correlation
52:04 metric you’re basically free to choose
52:06 and then what you do
52:08 is you um you group together
52:12 the two objects which are closest
52:15 together by the metric so with the
52:16 euclidean metric for this example
52:18 this would be those two guys right
52:30 so
52:35 what you then create is a kind of tree
52:37 where where um
52:39 okay as i said we group
52:41 together those two guys so that’s the
52:42 first thing to do
52:48 um and now next we need another thing
52:52 which is called the linkage criterion
53:03 um and this tells you how to now treat
53:06 these clusters because
53:08 uh now you need a way to uh
53:11 to give get the distance between the
53:14 clusters
53:21 and this could for instance be the
53:23 minimum distance
53:27 so this would mean if you want to now
53:29 get the
53:30 the distance between uh this cluster
53:33 and uh this object you would
53:36 take the distance between b and d and c
53:39 and d
53:39 and take the minimum distance or you
53:43 could also
53:44 for instance use the centroid there’s
53:46 many of the clusters there’s many
53:48 choices right
53:51 um now if we kind of go on we maybe take
53:54 the centroid here then the next cluster
53:56 would probably be a grouping of d
54:00 and e
54:06 then i’d say i would add
54:09 those three in one cluster
54:18 then group all these b c d
54:21 e and f until finally we arrive
54:25 at one at an object which
54:29 contains all the elements
54:35 and now if we want to define clusters in
54:38 our data we
54:39 need to define a precision threshold
54:42 along this axis
54:46 and we could for instance put it here
54:53 um and then
55:01 this would cut the tree in this
55:05 in in separate subtrees and we would
55:08 have a
55:08 subtree here a subtree here and
55:11 a subtree here and then this would be
55:12 our three different clusters so we would
55:14 have a cluster one
55:15 cluster two and a cluster three and if
55:18 we uh
55:20 increase the precision like up here then
55:22 we would get more and more clusters
55:28 and as this basically
55:31 just needs some uh
55:34 distance computations um this
55:37 is rather easily done
55:39 also in higher dimensions
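
A minimal sketch of these steps in base R (illustrative, not from the lecture): dist() provides the metric, hclust() the linkage criterion, and cutree() the threshold that cuts the tree into clusters:

```r
X <- as.matrix(iris[, 1:4])

d  <- dist(X, method = "euclidean")   # pairwise distances: the metric
hc <- hclust(d, method = "average")   # linkage criterion between clusters

plot(hc)                              # the tree (dendrogram)
clusters <- cutree(hc, k = 3)         # cut the tree into three clusters
table(clusters, iris$Species)         # compare to the known species
```
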
55:46 um
55:48 next i wanted to show you
55:58 um another important one which is a
56:02 centroid-based clustering called k-means
56:04 clustering and you
56:07 one comes across this quite often as
56:09 well when working with high-dimensional
56:10 data
56:11 so here in this case
56:17 um you pre-specify the number of
56:19 clusters k
56:20 so that’s what you have to start with
56:23 how many clusters do you want to get
56:26 you might try out different values of k in
56:28 practice
56:29 and then you partition your data such
56:31 that the square distance to the cluster
56:34 mean position so that that would be the
56:37 centroid is minimal
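
Written out, the partitioning criterion just described is the usual k-means objective (S_1, ..., S_k are the clusters and mu_i their mean positions):

```latex
\underset{S_1, \dots, S_k}{\arg\min} \;
\sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^{2},
\qquad
\mu_i = \frac{1}{|S_i|} \sum_{x \in S_i} x
```
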
56:41 um so how would this look like
56:45 um actually maybe let’s look at the
56:49 final result
56:50 first we’re kind of uh here you see an
56:53 algorithm converging but what you try to
56:55 do is to
56:55 do a grouping of the data such
56:59 that oh no it doesn’t stop there
57:06 sorry so so how do you how do you get
57:08 this partitioning of your data
57:11 one way would be the lloyd algorithm you
57:13 start with random centers that you place
57:15 somewhere so in this case
57:17 this is an example taken from wikipedia
57:20 there’s a center here a center there in
57:22 the center there and then
57:23 for each of your data points you assign
57:26 the data point to the cluster
57:28 where the center is nearest right so the
57:30 yellow center is here so everything here
57:32 would be assigned to the yellow
57:34 cluster now once you did this you
57:39 update the centers of your clusters
57:42 now you take the centroid of this yellow
57:45 cluster for instance
57:46 and the yellow center ends up here
57:50 and now you do a new assignment of your
57:54 data points to the nearest center in
57:57 this case all the
57:58 yellow ones up here would be assigned to
58:01 the blue center actually
58:03 so they are assigned to the blue and
58:04 then you iteratively go on and do this
58:07 go on and go and go on until this
58:10 algorithm converges and you end up
58:12 having three clusters defined by this
58:16 criterion of course there’s other more
58:17 complicated more sophisticated
58:19 algorithms but this approximates the
58:20 solution
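
A minimal sketch of k-means in base R (illustrative, not from the lecture); algorithm = "Lloyd" selects the iteration scheme described above, and nstart repeats it from several random initialisations:

```r
X  <- as.matrix(iris[, 1:4])
km <- kmeans(X, centers = 3, nstart = 20, iter.max = 100, algorithm = "Lloyd")

km$centers                            # final cluster centroids
table(km$cluster, iris$Species)       # cluster assignment vs. known species
```
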
58:27 a third algorithm that i wanted to show
58:29 you and i want to show you this because
58:31 i
58:31 work with it every day comes from social
58:34 systems which is a graph based
58:36 clustering
58:38 when you start with your high
58:39 dimensional data you first need a graph
58:41 representation of your data and we
58:43 already saw this
58:44 in the umap example you could for
58:47 instance
58:48 do a k-nearest neighbor network on your
58:50 data and then
58:51 you partition your data such that the
58:53 modularity
58:54 q is optimized um
58:59 modularity in this case is defined as
59:02 this object where where e i j are the
59:05 fraction of
59:06 links between cluster i and j
59:09 right and what you try to do then in
59:12 this object is that
59:14 the links between
59:17 cluster i and i which is the internal
59:19 links this
59:20 it’s this eii right this is it’s the
59:22 links inside the cluster
59:24 you try to maximize this
59:27 and uh you in in this ai term you have
59:30 the
59:30 e i j from i to another cluster
59:34 so this is links out out of the cluster
59:36 you minimize this right so
59:38 a
59:42 cluster in this case
59:45 means that it’s
59:46 it’s a
59:49 part of your graph which is densely
59:51 connected inside but only sparsely
59:53 connected to the outside
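
The modularity on the slide is not reproduced in the transcript; the standard form that matches the description of e_ij and the a_i term above is:

```latex
% e_{ij}: fraction of edges between clusters i and j
Q \;=\; \sum_i \left( e_{ii} - a_i^{2} \right),
\qquad
a_i \;=\; \sum_j e_{ij}
% e_{ii}: edges inside cluster i (to be made large);
% a_i: total fraction of edge ends attached to cluster i
```
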
59:56 there’s a particular algorithm which is
59:59 used a lot nowadays which is the louvain
60:01 algorithm so how do you solve this
60:03 because this is not
60:05 uh you can’t solve this brute force
60:08 um you start with a network like this
60:12 for instance uh this would be your
60:15 nearest neighbor graph that you created
60:17 from your data and then you do copy
60:18 attempts you
60:19 start with node zero so all these now
60:23 you start with all these being separate
60:25 clusters
60:26 and then you copy cluster identity zero
60:29 to neighboring nodes
60:30 and you again compute the modularity and
60:32 if the modularity is increased you
60:34 accept this copy step
60:36 and if it’s decreased you uh you reject
60:39 the copy step and you
60:40 do so um until you arrive
60:45 at a steady state like this where copy
60:47 attempts
60:49 don’t increase your modularity anymore
60:52 would be in this case something like
60:54 that where you end up with four clusters
60:56 and then you do an aggregation and this
60:59 means
61:00 now that you
61:03 define a new graph representation of
61:05 your data where this
61:07 cluster here the blue cluster is now a
61:09 single node
61:12 it gets four internal connections
61:14 because there’s
61:15 one two three four internal connections
61:19 and there is
61:20 four connections to cluster green
61:24 one two three four
61:28 and one connection to the light blue
61:30 cluster
61:31 and so you aggregate this you do the
61:33 copy attempt again
61:35 and so on and so on until you finally
61:38 arrive at a picture where you
61:40 where you um at a state where
61:44 you can’t optimize modularity anymore
61:46 and this is um
61:48 how you this is how the algorithm works
61:50 to solve the problem
61:52 of modularity optimization and this
61:54 algorithm
61:56 works very well with many kinds of data
61:58 again for instance it’s
62:00 used a lot to identify
62:03 cell types in single cell genomics
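
A minimal sketch of graph-based clustering, assuming the igraph package; the k-nearest-neighbour graph is built by hand here for illustration, real single-cell pipelines use dedicated neighbour searches:

```r
library(igraph)

X <- as.matrix(iris[, 1:4])
k <- 10
D <- as.matrix(dist(X))

# connect every sample to its k nearest neighbours
edges <- do.call(rbind, lapply(seq_len(nrow(X)), function(i) {
  nn <- order(D[i, ])[2:(k + 1)]      # skip the point itself (distance zero)
  cbind(i, nn)
}))
g <- simplify(graph_from_edgelist(edges, directed = FALSE))  # drop duplicate edges

cl <- cluster_louvain(g)              # louvain modularity optimisation
table(membership(cl), iris$Species)
```
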
62:12 all right that’s it already i realize
62:14 i’ve been
62:15 quite fast we would have some time but
62:19 i’m i’m sorry
62:22 good timing one hour um
62:27 so just as a summary um the question
62:30 i wanted to talk about is how do you
62:32 explore high dimensional data
62:34 to identify order in this high
62:36 dimensional data set and i showed you
62:39 a couple of dimensionality reduction
62:41 techniques to visualize your
62:42 high dimensional data
62:44 and that allow you to identify a small
62:46 set of
62:47 observables to describe the data um
62:51 i showed you a couple of approaches for
62:53 clustering to this
62:54 to identify discrete order in high
62:56 dimensional data and
62:58 i think one important point that i want
63:00 to make is the methods to use to work
63:03 with high dimensional data
63:05 depend on the problem at hand so really
63:08 depends on what you’re looking for and
63:09 often it’s just
63:10 trying out a lot of things to see what
63:12 makes sense
63:14 to understand your data and with this um
63:17 i’m at the end and
63:20 there’s some references in the back
63:22 which i will upload uh
63:24 with the slides and with this yes i’m
63:26 i’m happy to take questions
63:29 okay perfect uh thanks a lot fabian yeah
63:32 thanks for taking the time and
63:35 especially since
63:36 uh data people like you are so in such
63:39 high demand these days and very busy so
63:41 it’s great that you took the time to
63:43 to show us some of your stuff and
63:46 so uh i think what we usually do is that
63:49 we hang around a little bit
63:51 yeah on zoom and then if you have any
63:53 questions you just stay online
63:56 and ask your questions
63:59 yeah and uh other than uh that
64:02 yeah so then see you all next week and
64:05 there’s already a question in the chat
64:13 there were questions in the chat during
64:15 the lecture and i didn’t
64:16 see the chat so yes you can’t see that
64:18 very well if you’re sharing the screen
64:20 yes but i think they were answered
64:22 already by other people
64:24 so so now the latest question is how
64:26 computationally intensive is umap
64:29 that’s that’s a good question and it’s
64:31 actually that’s another uh
64:33 advantage of umap it’s pretty fast
64:38 um it’s
64:40 it also it’s you don’t you don’t need a
64:42 lot of memory
64:43 um it scales very well with the
64:46 dimension of your data and also the
64:47 amount of data points so i think
64:49 it’s it’s um i mean just like
64:56 it’s one of the fastest things around at
64:58 the moment um
65:01 i mean if you look at the typical data
65:02 sets uh that that you’re looking at
65:05 where you have i mean 15 thousand
65:07 dimensions
65:08 and then maybe a few hundred thousand or
65:11 ten thousands of uh samples yeah you you
65:14 end up
65:15 with a few seconds yes yes on my machine
65:18 like the with the largest data sets i
65:20 have which are
65:21 of the order that you described um it
65:24 takes about
65:25 30 seconds maybe to compute the uh
65:29 compute the umap which is much faster
65:32 for instance there’s
65:33 t-sne i don’t know whether you know
65:36 about it but this was used a lot before
65:38 and
65:38 this is much slower usually
65:41 and actually there’s um i didn’t try
65:44 this out but there’s also gpu
65:46 of implementations of umap nowadays and
65:48 this again
65:50 you can easily you can nicely
65:51 parallelize it and so this should be
65:53 much faster again
65:59 you’re welcome
66:04 great thanks so are there any other
66:07 questions
66:08 um i have a question it’s about
66:12 umap
66:15 so when we uh in all the examples of
66:18 umap we saw approximating a high
66:20 dimensional data
66:21 is there any isn’t there any parameter
66:23 in the algorithm that quantifies
66:25 the uh
66:28 the acceptability of the approximation
66:30 whether the approximation is cur
66:32 how how much is it acceptable or not
66:34 because
66:35 or will it approximate any uh high
66:37 dimensional data
66:40 there didn’t seem to be any parameter
66:41 that quantified or helped us to
66:44 understand if the result of the umap
66:47 approximation is
66:48 good or not uh obviously concerning the
66:51 problem at hand
66:53 i’m not fully sure whether i completely
66:56 understand the question but here are a
66:57 few thoughts on this um
67:02 so the approximation in umap is that you
67:05 you approximate
67:07 that um
67:10 your your approximation is that the data
67:12 is uniformly distributed
67:14 on the manifold right and the question
67:17 is
67:18 and maybe the question is is this a good
67:19 approximation or is it not a good
67:21 approximation
67:22 the problem is that we that we hardly
67:25 know the real metric
67:27 of the underlying space so
67:30 if you for instance think of these
67:33 examples with
67:34 pictures right what is a distance
67:36 between two pictures
67:37 what is this metric um
67:41 it’s it’s um i i think this is the you
67:45 you can’t actually like
67:46 unless you have a unless you have a good
67:49 microscopic model
67:50 which in this case we don’t have it’s uh
67:53 not really possible to
67:55 to define a good metric and therefore to
67:58 judge whether whether the
68:00 whether the approximation is a good one
68:02 or not
68:04 um it’s not
68:08 um yes i mean the way here that you you
68:12 do it is
68:13 there’s no there’s no uh right rigorous
68:16 way of doing that and that’s why people
68:17 like fabian exist
68:18 who have experience and have consistency
68:21 checks
68:23 that with some experience can check
68:25 whether you map
68:26 makes sense or not yeah or whether it’s
68:29 so one of the standard problem is that a
68:31 umap can be dominated by noise by
68:33 technical artifacts for example
68:35 yeah and then you need to have some
68:37 insights into data and some experience
68:40 that tell you whether uh whether uh
68:43 uh what you see actually represents
68:46 something that’s real in the data
68:48 or whether that’s something just just an
68:50 artifact you know so that’s
68:52 uh that’s basically one of the the jobs
68:55 that bioinformaticians or data
68:58 scientists do
68:59 because there’s just no rigorous way
69:02 where you just push a button and it
69:03 tells you whether
69:04 whether what you’ve done is is right or
69:06 not so
69:07 it needs some experience and some some
69:09 insights and understanding
69:11 of the data and for example the
69:14 biology
69:15 uh that is underlying the the data
69:18 so typically with the typical um
69:22 method would be to use multiple
69:24 reduction methods and see which one is
69:26 giving us
69:27 an order which is much more perceptible
69:29 in our in our plots
69:31 you know i guess mean you also have to
69:34 to know the strengths and weaknesses of
69:36 each method
69:38 yeah for example what uh fabian told
69:41 about
69:41 previously about t-sne that had a
69:45 tendency
69:45 uh to perform to to give you clusters of
69:49 data points
69:50 yeah it was not very good at
69:52 representing global structures of data
69:54 sets
69:55 yeah and once you know that you can
69:57 interpret
69:58 interpret what you see and then you can
70:00 be careful
70:01 in interpreting uh basically clusters as
70:05 real physical or biological objects
70:09 yeah umap what
70:12 you’ve seen now
70:12 overcame this limitation that it’s
70:15 basically
70:16 also represented the global structure of
70:18 data quite well
70:20 but i guess also in umap you can have
70:22 artifacts
70:23 especially if your data has noise yeah
70:25 technical noise
70:27 you can have very weird artifacts so you
70:30 need to treat the data
70:32 before you can give it to umap in in
70:34 typical instances
70:36 yeah so for example log you have to log
70:38 transform data very often
70:40 if your data say spans huge orders
70:44 of magnitude
70:46 in values yeah then then
70:49 sometimes you do the log
70:50 transformation so that not the
70:52 the very few very large data points
70:54 dominate everything
70:57 and so you have to treat the data in a
70:59 way to make it
71:00 say digestible for these methods
71:04 but then maybe i mean
71:07 the the other part of the question was
71:09 in in a way parameters and i didn’t talk
71:10 about parameters at all and there are a
71:12 couple of parameters in the algorithm
71:14 um and i i skipped a lot of details but
71:17 the most important parameter
71:19 is the number of neighbors you look at
71:22 in your
71:24 when you construct the graph if you take
71:26 the first neighbor right you you
71:28 kind of um what you do is you
71:32 um estimate your metric very locally
71:36 um so so this is very this is very fine
71:38 but it’s also very noisy so you often
71:40 will estimate the wrong metric if you
71:42 if you uh increase the number of
71:45 neighbors in there
71:46 um you will get a more robust estimate
71:50 of the local metric but you will
71:52 miss some fine structure of the data and
71:54 maybe actually there’s a there’s one
71:55 example that i can show
71:57 uh from here where if you look at
72:01 i don’t know whether you can see this
72:02 here this is a
72:04 um maybe you can can you see this or is
72:07 this too small
72:08 i’m sorry you’re not sharing the screen
72:12 ah sorry i don’t share the screen so you
72:14 can’t see anything
72:16 um let me show
72:20 let me show this example um
72:27 kind of if you that’s maybe maybe a
72:32 actually a good example for for your
72:33 question where kind of okay here we have
72:35 a circle
72:37 with uh with dots and they on um
72:41 and then like like the distance between
72:43 the dots is not always the same right
72:46 um and so actually if we
72:49 take just this example and we say okay
72:51 here we know the euclidean metric
72:54 uh we can think of how how well does the
72:56 approximation now
72:58 or or how does umap behave if um
73:03 uh in this case right and
73:06 one thing that you can see here is that
73:08 if i increase the number of neighbors
73:13 that this will
73:16 usually lead to a
73:19 actually not
73:25 so you can see that sometimes you manage
73:28 to
73:28 for for some uh there is kind of an
73:31 optimum
73:33 um for looking at the number of
73:35 neighbors which allows you to actually
73:37 capture
73:39 um the the
73:42 circular topology the hole in the middle
73:44 right but
73:46 if you if like if you go too big you’d
73:48 kind of lose
73:49 to start to lose the information about
73:52 the order of these dots
73:54 of these uh measurements right they
73:57 start to get more and more noisy
73:59 if you look at less neighbors
74:03 you get nice ordered lines but you start
74:06 to miss
74:08 things where you have these gaps and you
74:10 start to create clusters
74:11 where there shouldn’t be clusters so
74:14 this is the
74:15 most important
74:19 parameter of the of the method
74:22 i’d say there is like if you if you uh
74:27 the umap home page the like from the
74:29 umap algorithm
74:31 they uh feature a nice section of what
74:33 parameters do you have for umap and
74:36 and uh
74:37 what influence do they have but it’s
74:40 it is generally relatively robust also
74:44 compared to other methods
74:45 if you go for typical like
74:49 it just but this is just empirical kind
74:51 of if you go with
74:52 about 15 to 30 neighbors
74:56 or something you usually derive good
74:59 representations
75:00 of the topology of your data if you have
75:02 enough data points
75:04 um so it’s relatively robust uh to the
75:08 choice of
75:08 of parameters you don’t have to play
75:10 around with it too much
75:14 thank you
75:19 okay perfect very nice do we have any
75:21 other questions
75:22 yeah there’s a question from kai muller
75:25 from the chat which i will read
75:26 which is uh he asked for high
75:28 dimensional data how does
75:30 umap determine the manifold dimension
75:36 um
75:38 i don’t think that it does
75:41 and
75:47 i don’t think that it does and this
75:51 might be related to that it only
75:54 in the umap algorithm what you do is
75:56 when you when you construct
75:59 this uh the graph representation
76:02 um from the intersecting
76:06 spheres you only look at the
76:10 one simplices so you only construct a
76:12 graph and you don’t look at
76:18 at higher order simplices i’m not sure
76:20 what this could be used to
76:22 to say something about the
76:23 dimensionality
76:26 um of your manifold i’m i’m unsure there
76:31 but you might by default it doesn’t tell
76:33 you the dimensionality
76:40 okay um any other question a lot of
76:44 questions today
76:55 okay so if there are no more questions
76:57 you know then uh
76:58 let’s end the meeting you know thanks
77:00 all for joining and thanks again for
77:02 fabian thank you well thank you for
77:05 having me
77:06 yeah thank you okay perfect
77:10 yeah see you all next week bye