.mpipks-transcript | 11. Data Visualization

MountAye

May 18, 2021


Review of last lecture


00:01 Just to let you know that the lecture is actually being recorded, so you will be visible on the recorded lecture at the end, if that doesn't bother you. Okay, great.

00:16 So, hello everyone, welcome back to our lecture. Last time we had a special lecture, a guest lecture, by one of our local data scientists, Fabian. Fabian explained to us, from a hands-on perspective, because that's his job, how you can detect order in non-equilibrium systems. The non-equilibrium systems that Fabian works on are, of course, biological systems, and what he basically showed is how biology, how order in non-equilibrium systems, manifests itself in low-dimensional structures within high-dimensional data sets. He showed you some methods, which will also appear on the website (I didn't get to uploading the slides and the video yet), methods for how to reduce dimensionality, or how to extract hidden dimensions in these high-dimensional data sets.

introduction & slide 1


01:27 So today I will start by giving you a little bit more of an introduction to data science, and to some of the things that we need for the next lecture; that's the first part of the lecture. In the second part I will give you another hands-on experience, a practical example, from start to finish, of how to go through such a data science pipeline.

01:55 To start the lecture, we'll go back to our New York City flights data set. There's a little gap, because we had to find dates with Fabian last time, so this lecture now connects to what I told you two lectures ago. Let me just share the slides and give you a brief introduction to data visualization, just a short one, because the slides have already been on the website for two weeks, so maybe some of you have already looked at them.

02:40 Okay, great, perfect. So I'll give you a quick introduction before we go on to a hands-on example. Today we'll have a hands-on data science example, and next week we'll have a hands-on field theory combined with data science example. So today I still need to introduce you to some methods that we will need. Can somebody confirm that you can see my slides? Yes? Okay, perfect.

03:15 So I'll just give you a quick introduction to how to visualize data. Of course, in your work you're all visualizing data all the time, but if your data is high-dimensional and complex in structure, it actually matters what you use for visualization.

03:41 Two lectures ago, when we talked about this New York City data set, about the flights departing from the New York City airports, I showed you all kinds of ways to do very efficient computations on these data sets. But all these computations didn't really give us any real insight, and the reason was that we were dealing with plain numbers; we never had anything to look at. Today, in the first part of this lecture, I'll quickly show you a plotting scheme, so to say, that is very powerful for visualizing high-dimensional data in general.

slide 2


04:27 Before that, just a quick reminder: before we do anything, we always want to make our data set tidy. Typically we collaborate with experimentalists, who give us the data in a very messy format, and the first step is to tidy the data. That means we need to bring it into a form where every column is an observable or variable, and every row is an observation or a sample.

04:56 (Sorry to interrupt, did you change your first slide? Yes. Oh, you can't see that? No, we cannot.) I don't know, this always happens in Zoom now; there was a Zoom update. Maybe I have to share the entire screen; let's try this. The problem is just that I have about 100 windows on my desktop. Okay, let's share the desktop. You should be seeing my desktop now. Okay, a lot of windows; now you know what I've been working on. Can you see this messy-and-tidy slide, and when I change the slides now, can you see that? (Right now it works.) Okay, perfect.

06:24 So, just a reminder that the first step we always take is to make the data tidy. If we have the data in this tidy format, then we can perform column-wise operations, which in most programming languages are highly optimized and very efficient to run and to program. I introduced you to a very simple R package, data.table, that allows you to implement all of these operations in data science, and there are of course many other packages, in other languages too.

slide 3


07:09 Typically such an operation consists of three steps. You filter the data, which in this data.table package is the first parameter; you group the data by some condition, which is the third parameter; and then you perform an operation independently on each group, which is the one in the middle. This is a typical step in such a data science calculation, and I showed you how you can then combine these steps along pipelines to perform complex operations.
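As a minimal sketch of this filter-compute-group pattern (assuming a data.table named flights, as in the examples later in the lecture):

```r
library(data.table)

# DT[i, j, by]: filter in i, compute in j, group in by
flights[month == 1,                                      # i: keep January only
        .(mean_delay = mean(dep_delay, na.rm = TRUE)),   # j: operation per group
        by = origin]                                     # by: one group per airport
```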

slide 4


07:50 Today I want to show you a way to interact with the data more intuitively. You are all familiar with plotting, of course, and the way we typically do it is that we have a plot in mind, and this plot has a name: a scatter plot, a surface plot, a histogram. Then you look for the function, in Matlab for example, that gives you this specific kind of plot.

08:21 Another way to do it, introduced by Leland Wilkinson in a book, is that you don't give the plots names; instead, you construct these plots with a grammar. That means you have a set of rules that allows you to construct, step by step, almost any visualization of your data. Once you have that, you don't have to remember long names for different kinds of plots; you just add, bit by bit, like in a sentence, word by word, to make the sentence richer in information. The only thing you need to know is the grammar itself, and this allows you to create very different kinds of visualizations from a simple grammar.

09:12 This idea that you have a grammar of graphics, a set of rules that allows you to construct visualizations, is implemented in R in the ggplot2 package, which is quite famous. In Python, and this is a little bit newer, I don't know how well it works, it's called the plotnine package; it realizes the same idea. In R we just load ggplot2 using the library command, and then we are able to use all of the commands

slide 5


09:47 in this package. So the basic idea is that we start with a tidy data table, or data frame in R, and we assign different visual characteristics of our plot to different columns of this data table. The first thing we have to do is to say what to plot, a point or a line; that's called the geometry: point, line, bar, circle, whatever. Then we have the mapping that I mentioned, where we map different aesthetic properties of our plot to different columns in the table. For example, we could say that the position on the x coordinate should be what is in column a, the y coordinate should be what is in column b, the size of our dots should reflect whatever is in column c, and the color should be what is in column d.

11:12 How these values are then translated to specific colors or specific sizes is a different question. So now we have the aesthetic properties of our dots: where they are located and what they look like. Then we just have to define a coordinate system to define where they appear on the screen. And if we have these things together, we have the simplest version of a plot, on the right-hand side.

slide 6


11:48 The way this works in practice is that you have these little building blocks that you just put together, line by line. In R it looks like this: you first create an object with the ggplot command, where you tell the plot what data to use as the first argument, and the second argument is how to map different columns of your table to different visual properties of your plot. Then you add a geometry, for example point or line, and you have your first plot already.

12:35 You can of course also add more properties: you can add more geometries, or more detailed properties of your plot. For example, if you are not happy with Cartesian coordinates, you can set your own coordinate system, say polar coordinates or whatever. You can have subplots by adding a further rule called facets. You can change how values in the data table map to different properties, so which color represents which value in your table. You can change themes, for example the way lines and so on are drawn, and you can of course also save your file. These different building blocks of your plot are connected via plus signs: you put as many of these aspects after each other as you want, and by this you construct more and more complex plots. For everything that you see here below there are sensible defaults, for example Cartesian coordinates, that in many cases you don't need to touch.

13:54 Now let's have a little look at our New York City data set. Here we go.
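A minimal sketch of this layered structure (the data frame and its columns a and b are placeholders):

```r
library(ggplot2)

df <- data.frame(a = 1:10, b = (1:10)^2)   # a tidy table

g <- ggplot(df, aes(x = a, y = b)) +   # data + aesthetic mapping
  geom_point() +                       # a geometry: points
  coord_cartesian() +                  # coordinate system (the default anyway)
  theme_minimal()                      # a theme, added as just one more layer

ggsave("simple_plot.pdf", g)           # saving is one more building block
```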

slide 7


14:02 We already discussed this two weeks ago. In this New York City data set we have information about flights departing from New York City airports. For each flight we have different pieces of information: for example, the time when the flight departed, the origin airport, the number of the plane that was used, the carrier, and so on, and the delay of this specific flight. As we already discussed, you can connect this table, which you can download from our GitHub, from the website, to other sources of information: for example, the weather information for a given point in time and a given location; airport information; information about the planes; and also information about the airlines, if you want.

15:12 And that's what we do: just like last time, we load all of these different files using the function fread, and then we merge them together using these merge commands. When we merge them together we sometimes have to specify the key columns. For example, when we merge with the airports, we have to say that here the airport identifier is in the column origin, and there the airport identifier is in the column faa. So we merge all of these things together, and we have a huge data set, a data table containing all of this information, line by line, in a tidy format. We already did that two weeks ago.

16:10 Now let's have a simple look at these plots; let's start with a simple plot.
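A sketch of that loading-and-merging step (the file names are assumptions; the key columns origin, faa, and tailnum are the ones used in the New York City flights tables):

```r
library(data.table)

flights  <- fread("flights.csv")
weather  <- fread("weather.csv")
airports <- fread("airports.csv")
planes   <- fread("planes.csv")

# weather shares the origin and time columns with flights
flights <- merge(flights, weather,
                 by = c("origin", "year", "month", "day", "hour"))

# airports identify themselves by 'faa', flights by 'origin'
flights <- merge(flights, airports, by.x = "origin", by.y = "faa")

# planes are identified by their tail number in both tables
flights <- merge(flights, planes, by = "tailnum")
```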

slide 8


16:16 The first thing we can do, just like last time, is calculate the average delay for each month: we group the data by month, and for each month we take the average over all departure delays and save that average in the column mean_delay. What we then get is, for each month in the first column, a mean departure delay in the second column. Below you can see the simple plot you can make: you tell this ggplot function to take this table, a very simple table, and map the month to the x-axis and the delay to the y-axis, and then you just add a geometry, which is just a point. Then you get what you see on the right-hand side, and you see that something is happening in the summer months, and something is happening over Christmas, apparently.

17:31 (Okay, there's someone in the waiting room.) So, something is happening over Christmas. Let's go on.
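As a sketch (column names as in the merged flights table above):

```r
library(data.table)
library(ggplot2)

# average departure delay for each month
delays <- flights[, .(mean_delay = mean(dep_delay, na.rm = TRUE)), by = month]

ggplot(delays, aes(x = month, y = mean_delay)) +
  geom_point()
```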

slide 9 & 10: different types


17:40 We can of course also add different geometries to a plot; so far we just used the geometry of a point, or of a line. For the sake of simplicity, what I'm doing here, using the tools that I introduced two weeks ago, is to do all of these things in one line: we take the flights data set, calculate for each month the average delay, and send everything with this pipe operator to the ggplot. In the ggplot we just need to define the aesthetic mapping, that the x coordinate is the month and the y coordinate is the delay, and we save everything in an object g on the left-hand side.

18:36 Now we can take this g and add different things to it. We can add different geometries: at the top left we have the point, as before; we can add the geometry line, and then we get a line; we can add a bar, which is called column; or we can add all of them together to the plot, and then we have all of them together. The information about what happens with the data is not contained in the geometry; we have specified that once at the beginning, and now we can just operate on this object, add different things, and change the plot the way we like it.

19:18 There are also geometries that involve analysis. For example, if you have a background in biology, then you know your favorite box plot, on the right-hand side, which summarizes different properties of a statistical distribution. Here, on the left-hand side, I take the flights, all this combined information, use the carrier as the x coordinate and the logarithm of the departure delay as the y coordinate, and I add this box plot, where I automatically get the median, the interquartile range, and, I always forget what this part means, probably the range of the data without outliers. In some disciplines these box plots are used to characterize distributions. Another way to characterize distributions are violin plots, which essentially give you a plot of the probability distribution, just in a vertical manner: the thicker the violin is, the higher the probability of finding a data point there.
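A sketch of reusing one base object with different geometries (the pipe |> requires R 4.1 or later; negative delays produce NaN under the logarithm, which ggplot warns about and drops):

```r
g <- flights[, .(mean_delay = mean(dep_delay, na.rm = TRUE)), by = month] |>
  ggplot(aes(x = month, y = mean_delay))

g + geom_point()                             # points, as before
g + geom_line()                              # a line
g + geom_col()                               # bars ("column")
g + geom_point() + geom_line() + geom_col()  # all of them together

# geometries that involve analysis
ggplot(flights, aes(x = carrier, y = log10(dep_delay))) + geom_boxplot()
ggplot(flights, aes(x = carrier, y = log10(dep_delay))) + geom_violin()
```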

slide 11 & 12: aesthetic


20:37 Of course, we can also play with how these plots look. For example, and if you look at the red part, I'm doing the same operation: I'm calculating the average departure delay for each month, each airport, and each carrier, but now, just for simplicity, I only take the big three carriers: United Airlines, Delta, and American Airlines. Now I create this plot again. I have this aesthetic mapping, the month should be the x coordinate and the delay the y coordinate, and now I have another aesthetic, which is the color: I say the color should be the origin airport, and the line type should correspond to the carrier. Then I just add the geometry, the line, and I get the plot that you see at the bottom here.

21:36 You can see that all carriers have a problem in the summer months and also over Christmas. Except, so something is going on around March; no idea what this is. No wait, that's not American Airlines, that's Newark; American Airlines has a problem in March. You see, the plot is not perfect yet.

22:06 Okay, so we can go on; we can change other aspects of the plot. For example, here I say that the fill should be the airport, and then I use a box plot, and then I get an overview of how the different airports compare to each other for each carrier. What you can see is that JFK is doing well for some of them but not for all: for American Airlines, for United, and, where is it, Delta, it is doing well. But there is no clear trend here, of course.
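A sketch with color and line type as additional aesthetic mappings (UA, DL, and AA are the carrier codes for United, Delta, and American in this data set):

```r
big3 <- flights[carrier %in% c("UA", "DL", "AA"),
                .(mean_delay = mean(dep_delay, na.rm = TRUE)),
                by = .(month, origin, carrier)]

ggplot(big3, aes(x = month, y = mean_delay,
                 color = origin, linetype = carrier)) +
  geom_line()
```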

slide 13: subplots


22:52 Something that's more interesting is if you plot these delays for the big three carriers as a function of the hour of the day. Here the x coordinate is the hour, and I turn it into a factor, from numeric to something discrete, just for plotting purposes; the fill color is the origin airport; and I added a subplot here, which is called a facet, by carrier. If you remember from two lectures ago, this is a formula; we can use a formula in R to specify how plots are distributed across different subplots. And you can see that for Delta and United Airlines you can nicely see how these delays add up during the day. It even looks a little bit like... let's have a look at the next slide.
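A sketch of such a faceted plot (the box-plot geometry is my assumption; the slide may use a different one):

```r
ggplot(flights[carrier %in% c("UA", "DL", "AA")],
       aes(x = factor(hour), y = dep_delay, fill = origin)) +
  geom_boxplot() +
  facet_wrap(~ carrier)   # one subplot per carrier
```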

slide 14


23:58 We can also have more complicated subplots: for example, we can have a grid, by using a more complicated formula, where the y direction should be the origin and the x direction in this grid should be the carrier. Then we get these plots, and you can actually see how these delays add up during the day. And it seems, and this is a speculation, but because we have a logarithm on the y-axis and we see a linear increase in these delays over the course of the day, that you have an exponential build-up of delays. That's quite interesting.
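A sketch of the grid version (again assuming a box-plot geometry, plus the logarithmic y-axis described above):

```r
ggplot(flights[carrier %in% c("UA", "DL", "AA")],
       aes(x = factor(hour), y = dep_delay)) +
  geom_boxplot() +
  scale_y_log10() +             # logarithmic y-axis
  facet_grid(origin ~ carrier)  # rows: origin airport, columns: carrier
```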

slide 15: plot of 3 variables


24:49 Okay, so we can do all kinds of other fancy things if we have more than two variables. For example, when we calculate the average delay as a function of the month, the hour, and the origin airport, then we have even more variables that we want to visualize. We can do that, for example, with something called a heat map. In this heat map, the fill, and also the color, of these tiles is given by the mean delay, while the month and the hour are plotted on the axes; we add the geometry of the tile to get these heat maps. Then you can visualize relationships between two variables, namely month and hour, and it seems like this build-up of delays is specifically drastic in the summer months, while it's not that evident in other months.
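A sketch of the heat map (the facet per origin airport is an assumption based on the discussion that follows):

```r
heat <- flights[, .(mean_delay = mean(dep_delay, na.rm = TRUE)),
                by = .(month, hour, origin)]

ggplot(heat, aes(x = hour, y = month, fill = mean_delay)) +
  geom_tile() +            # the tile geometry gives the heat map
  facet_wrap(~ origin)
```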

question on assigning facet


26:02 (Excuse me, I have a question on syntax. Where you've written facet_wrap, the first argument tells us the x argument, so origin will be plotted along the x scale?)

26:18 Yes, so this facet_wrap is just to say: okay, take one column of the data table, in this case origin, group the data according to this column, and then make one plot for each of these origin airports and put them next to each other, as many as fit on the screen; and if they don't fit on the screen, go to the next line. That's basically just this "wrap": you get a one-dimensional, so to say, line of plots, compared to this grid here. Where was that? Yeah, this grid is basically the same thing, but here we have these two coordinates, two directions: the origin airport in the y direction and the carrier in the x direction.

27:20 This is the formula notation in R. It's a little bit counter-intuitive, but you give the package a formula in order to tell it how these plots should be distributed on your screen, or in the PDF file that you export. The reason I used a formula here is that you could do something like carrier + month: with a more complicated formula you can say that a combination of carrier and month, of these two columns, should be in the x direction, and in the y direction you have origin. So you can construct more complicated grids of plots if you want to; very often that's not very useful.
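A sketch of the three facet variants just described (g stands for any base plot built as above):

```r
g + facet_wrap(~ origin)           # a wrapped line of plots, one per origin

g + facet_grid(origin ~ carrier)   # grid: rows from origin, columns from carrier

g + facet_grid(origin ~ carrier + month)  # columns from carrier and month combined
```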
28:46 (That was my question, because on the next slide the origins have been plotted along the x scale, whereas in this particular case the origin airport has been plotted along the y scale.)

28:59 Exactly. So now, once I get my mouse cursor back and can change the slide: here I just left out the first argument, the left one, which was originally the y direction, and now I only have the x direction left, going from left to right. And I use wrap and not grid here because, if I had not three but, say, 15 different groups, I wouldn't want them all in the same line, because I wouldn't be able to see them on the screen. Wrap means: once the screen is full, go to the next line. Nothing else; it's not a complicated thing, it just makes one plot for each origin.
29:57 (Excuse me, on this heat map the minimum value of the color bar is not zero. Does that mean there are flights that departed earlier than scheduled?)

30:08 Yes, exactly, they departed earlier. It's not very funny for the passengers, but sometimes that happens, and it's also a question of how the data is recorded. This specifically affects the early mornings and the very late times. So these negative departure delays can sometimes happen; the question is what is recorded: it's probably not the time when the gates close, it's probably the time when the airplane takes off or something like this. And as you know, it happens quite often that once boarding is completed, the airplane leaves a little bit earlier than scheduled.

31:18 (Okay, thanks.)

31:18 But as always, it's good to know how the data was actually collected. You think that a delay is well defined, but you can measure it in different ways; that's always a very important aspect. And in this data set there are also a lot of missing numbers, and that's when an airplane started somewhere but didn't end up at its arrival location, but at some other airport. That's also possible.

slide 16: error bar


31:54 Okay, so let's go on. We can also use ggplot to do statistical computations on the fly, and this is particularly useful for computing fancy error bars; that's how I use it. What we can do, for example here at the top, is to put the hour on the x-axis, the departure delay on the y-axis, and color and fill as the origin airport. For each of these combinations, because on the left we take the raw data, we have many different values, many different flights, and we can then take a statistical summary function from ggplot, tell this function to calculate the mean, and use the geometry of a line; it does the computation for us. For simple things like calculating a mean, that's already pretty good, but the nice thing is that we can also use summary functions that are more complicated. For example, here we have bootstrapped confidence intervals, which is basically a fancy way of calculating confidence intervals, and we use the geometry of a ribbon to visualize these confidence intervals. If you have dealt with confidence intervals, you know they're quite complicated to calculate, and then you somehow have to bring them into your plot. Here you don't have to worry about any of this: you get the fanciest methods in just one line, and a nice visualization of the uncertainty of your data.

33:45 Another thing we can do: in this upper case our x-axis is discrete, because it's an hour, from 5 to 23, but sometimes we have real-valued variables, and then we need to tell these functions which values to put together; that means we need to bin the data. An example here is the temperature, in Fahrenheit, on the x-axis. We cannot calculate a separate mean for every value of the temperature, because temperature is a real-valued quantity; we need to define bins to summarize values of the temperature. We can do that automatically with these stat_summary_bin functions, where we just tell the function to bin the data, calculate the mean for each of these bins, and plot the result with the geometry of a line. And we can do the same fancy error bars and confidence interval calculations as before; you probably know that these kinds of procedures are quite complicated if you have to do them yourself.

35:10 Here the message is probably that it's not good if it's too cold or too hot. But then there are correlations: when it's hot, here on the right-hand side, that's also when the holidays take place, July and June, the months where we had those delays in the earlier heat maps. So it's not quite clear whether it's the temperature that's bad for the engines or something like this, or whether it's just the number of people that go on holiday, block the airport, and lead to delays.
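A sketch of these on-the-fly summaries (mean_cl_boot is ggplot2's bootstrap helper, which needs the Hmisc package installed; temp is the Fahrenheit temperature from the merged weather table):

```r
# mean departure delay per hour, computed by ggplot itself
ggplot(flights, aes(x = hour, y = dep_delay, color = origin, fill = origin)) +
  stat_summary(fun = mean, geom = "line") +
  stat_summary(fun.data = mean_cl_boot, geom = "ribbon",  # bootstrapped CI band
               alpha = 0.3, color = NA)

# continuous x: bin the temperature first, then summarize per bin
ggplot(flights, aes(x = temp, y = dep_delay)) +
  stat_summary_bin(fun = mean, geom = "line", bins = 30) +
  stat_summary_bin(fun.data = mean_cl_boot, geom = "ribbon",
                   alpha = 0.3, bins = 30)
```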

slide 17: interpolation


35:53 Now we can do even more fancy things: we can do interpolation. On the left-hand side I have the same plot as before; we have the month, wait, okay, here's a little error on the slide, the hour on the x-axis. And now we can add an interpolation line, just in one line, with this smoothing function, and if we wanted to, we would be able to do non-linear interpolation, linear interpolation, or anything we want, just with an argument, and we get our usual nice error bars for free.

36:38 Of course, if we can do non-linear fits we can also do linear fits, so we can fit linear models to check for correlations, and that's what we do on the right-hand side. What is actually quite interesting here is that on the x-axis we have the month, on the y-axis we have the number of seats in an airplane, and the color is the origin airport. What we find is that there's an almost perfect linear relationship between the month and the number of seats, which is positive for Newark and JFK and negative for LGA, which is, I think, LaGuardia. No idea where this comes from, but apparently you have a higher likelihood of sitting in a smaller plane in December if you are departing from LaGuardia, while for some reason the planes get linearly larger throughout the year.

37:56 That's one of those things where you should be suspicious and check what the underlying reason for this data actually is. That's something I will also show you later: it's very important to check, when you do statistical computations, whether they actually make sense or not. Big data gives you every result that you want if you just look for it: just because you have so many dimensions, so many samples, you can find every hypothesis you want in these data sets if you just keep looking for it.
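A sketch of both fits (geom_smooth defaults to a non-linear smoother; method = "lm" switches to a linear model, and the shaded confidence band comes for free):

```r
# smooth interpolation of the delay over the day
ggplot(flights, aes(x = hour, y = dep_delay, color = origin)) +
  geom_smooth()

# linear fit: number of seats versus month, per origin airport
ggplot(flights, aes(x = month, y = seats, color = origin)) +
  geom_smooth(method = "lm")
```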

slide 18: scales


38:33 We can also play around with scales; that means we can change how our plots look. For example, this plot here at the top is something we have seen before, the box plot, and we save it in a variable p. And now you see this weird arrow assignment operator in R, and why the R community likes it: you can assign in the other direction. I build the plot and then assign the result to a variable p; it's an asymmetric assignment. So now we have our plot, and we can add different color scales, different ways of how our data values map to visual characteristics of the plot. For example, we can add a new color scale, say a Brewer scale, and then we get different blue tones; or we can add a manual mapping, where we say that we want to have black, gray, and white as the colors for our airports.
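A sketch, including the rightward arrow assignment (scale_fill_brewer with the "Blues" palette is my guess for the blue tones mentioned):

```r
ggplot(flights, aes(x = carrier, y = log10(dep_delay), fill = origin)) +
  geom_boxplot() -> p    # rightward assignment: build the plot, then name it

p + scale_fill_brewer(palette = "Blues")                      # blue tones
p + scale_fill_manual(values = c("black", "grey", "white"))   # manual mapping
```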

slide 19: positions


39:48 So we can change visual characteristics, and we can of course also change how things are positioned relative to each other. I'll go over this quickly, because it's a little bit of a detail. For example, we can create this plot here on the right-hand side, where for each month, origin, and carrier we calculate the average delay for the three carriers, and then we make a plot where we assign the month to the x-axis, the delay to the y-axis, the fill color to the origin airport, and the transparency of this color to the carrier. We can then plot all of this using a bar plot, and if we have a bar plot we can decide how to put these bars relative to each other.

40:51 I'll just give you three examples. We can stack them on top of each other; that's on the right-hand side. We can dodge them, which means we put them next to each other; that's in the middle. And we can use a fill position, which means we always fill them up to one, so we look at the fraction that a certain carrier and origin airport contribute to the total delays. You can see here, for example, that a large fraction of the delays in March actually comes from this Newark airport, while in other months, for example the summer months, the larger fraction of the delays actually comes from the other airports, JFK and LGA.
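A sketch of the three position adjustments (big3 is the per-month, per-origin, per-carrier table of average delays computed earlier):

```r
b <- ggplot(big3, aes(x = month, y = mean_delay,
                      fill = origin, alpha = carrier))

b + geom_col(position = "stack")  # bars stacked on top of each other
b + geom_col(position = "dodge")  # bars next to each other
b + geom_col(position = "fill")   # bars normalized to one: fractions
```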

slide 20: coordinate system


41:56 We can also change the coordinate system. So far we always assumed that we have Cartesian coordinates, but we can of course have any other coordinate system. Here, for example, we are plotting the wind direction as the x coordinate and the departure delay as the y coordinate, and then we just calculate the average delay again, using the summary function, for certain intervals of the wind direction. We can then plot that in different ways. We can plot it in Cartesian coordinates, of course, but when we talk about directions, something more instructive is to use polar coordinates, and you can see that I do that by adding just one line, one more rule, to the plot.

42:58 And now I can add more aesthetic mappings: for example, I can separate these different contributions from the wind direction by airport, and this is what I've done here; I just added one more aesthetic mapping. I said that these bars should be next to each other and not on top of each other, and I have the polar coordinates as before. Then you get the plot that you see on the right-hand side, and in this plot you see that there is a relation between the wind direction and the departure delays, specifically when the wind comes from, what is that, southwest, west-southwest. It's actually there for all of the airports, but it's specifically strong for LGA and Newark, and if you look at the location of New York, that's where the sea is; that's probably also where a lot of the strong winds come from.
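A sketch of the polar version (wind_dir comes from the merged weather table; the 30-degree bin width is an assumption):

```r
ggplot(flights, aes(x = wind_dir, y = dep_delay, fill = origin)) +
  stat_summary_bin(fun = mean, geom = "col",
                   binwidth = 30,          # sectors of wind direction
                   position = "dodge") +   # airports next to each other
  coord_polar()                            # one added rule: polar coordinates
```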
44:16 Okay, this was just playing around with the data, and you already get some insights just from visualizing it. These insights are of course much harder to get if you just look at data tables on the console, as we did two weeks ago. What you can also see as you create such plots is that you can make more and more complicated plots, but the complexity of your code never explodes; it increases only linearly, because you're adding just one bit at a time, one layer at a time, to your plot. You can make plots as complicated as you want without adding more and more complexity to your code, and without requiring more and more specialized functions. That is the advantage of having such a grammar of graphics: a set of simple visual rules that allows you to add more and more components to a plot.

slide 21


45:32 And then, of course, we can make these plots look nice: we can add things like axis labels for all of our columns. Typically you get a data table where some experimentalist has used their own notation for things, which doesn't make much sense most of the time; you want your own names for the axes, for the colors, and for the legends, specifically including units if you want to publish it. You can do that easily with this labs command; there's also a title if you want, so you can add a title and annotate your plot as much as you want.
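A sketch of the labeling layer (g is the base plot object from before):

```r
g + geom_line() +
  labs(x     = "Month",
       y     = "Mean departure delay (min)",
       color = "Origin airport",
       title = "Departure delays from New York City airports")
```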

slide 22: extensions


46:17 And then you can get as complicated as you want: you can download extensions. For example, some nice extensions add new geometries and new coordinate systems to these plots. Here, these plots that are used in anatomy add the human body and the mouse body as coordinate systems, and you can then, without any more complexity than what I already showed you, visualize your data, for example imaging data, on the mouse or human body, or whatever you want to do. There are a ton of different extensions to this. Okay, so this is a very efficient way of visualizing data.

slide 23: there is also a python implementation


47:03 It relies on a grammar, a set of rules. I showed you an R implementation, but there is also a Python implementation. The Python implementation is rather new, so I don't know what quality it has. And what we now want to do is: I want to show you how to use the tools that we've seen in the last couple of lectures in a specific data science project.

The following slides are not among the slides shared on the course website. The lecturer went through a Jupyter notebook about an RNA sequencing project.


47:37 For this, we'll just go through the code of a real data science project. This is a project that Fabian actually did while he was in the group, and the starting point of this project is a so-called sequencing experiment. I've already shown you this table: the rows in such an experiment, in the matrix that the experimentalists would send you, are different genes, and every column is a different cell. We have maybe twenty thousand or thirty-seven thousand cells, and for each of these cells we have roughly ten thousand measurements. These measurements correspond to how strongly a certain gene, the one in that row, is expressed in a particular cell; the numbers correspond to how many products of these genes the experimental technique found in a given cell. These genes, as you might have heard, tell us a lot about what cells are doing, how they are behaving, and what kind of cells they are, so they are very important molecular measurements of what's going on inside cells. For example, this gene here, with this cryptic ID, in row four, is basically not expressed: it has a little bit of signal in this particular cell but not in the other cells, while other genes, like this one here, have very high expression values; very high counts of products from these genes were detected in the experiments.

49:42 What I have to tell you is that these experiments are extremely messy. In particular, there is a step where the data is exponentially amplified, and that exponentially amplifies the errors in these data sets, so it's a big mess. And now we have to find some structure in this high-dimensional data set, in these genomics experiments. To show you how this works, I'll share another file, another screen.
50:31 Here we go. I'll just give you a hands-on look at how this actually works; I won't tell you too much about the biological background of this project, because it's not yet published. You should be able to see the browser, and you can see that this here is actually a combination of Python, in the first block, and R: here he's loading some R packages, and here he's loading Python packages, and all of this is in a Jupyter notebook, combining R and Python to take the best of both worlds.
51:23 Then there's a lot of data loading going on. I'll just go through it; of course, we don't have to look in detail at how the data is loaded, or at some of the biological background information about what different genes are doing, and so on. And now we start with the pre-processing of the data. As I told you, this data is messy: something like 80 percent of it is nonsensical information, in other words it is dominated by technical noise. The technical noise is extremely strong, and it gives rise to very weird results. So the first step we always have to take, in this particular example in genomics but also in other data sets, is to look at the data and polish it in such a way that we are actually, in principle, able to detect information.
52:34 For example, this plot here at the top shows you basically what percentage of all the information in a cell goes to certain genes. They have these weird names, completely irrelevant here, but you see that this gene at the top, with an even weirder name, comprises eighty percent of the information in some cells. That does not make any biological sense, because if you have thirty thousand genes in a cell, it can't be that the cell is completely packed with products from a single gene; that cannot happen in real life. And that's why we see that there are a lot of cells here, everything with more than maybe 30 percent, where we don't have any reasonable information. That means we need to do quality control: we need to keep the cells that actually carry meaningful information, and filter out the cells that don't.
53:46 What we do is look at histograms like these: we calculate probability densities over all cells, all columns of this matrix. On the x-axis here is the total amount of information that we have for a cell, the total number of molecules that we detected for a single cell, and you can see this follows a distribution. The first thing you see is that there are two bumps: some cells are worse and some cells are better, so there is already some variance in the data simply because the quality of our measurement differs between two groups of cells. But all of these are actually good values, and we just take out the cells below the vertical line, those below one million of these counts; these we throw away.

54:51 We can also look at other counts, for example how many genes we detect, and here we also cut off cells that are low quality, in these tails; we just remove them from the data set. We know that if we kept them in the data set, in the long run we would have problems with dimensionality reduction: these cells would end up dominating anything based on machine learning, clustering, or dimensionality reduction. So we remove them from the data set.
55:34 (Sorry, excuse me, is there a systematic way to set the threshold for filtering out?)

55:46 In this case it's a matter of experience; there isn't a systematic way. Normally, in such a data set, you would set the threshold here, in the middle between the two peaks, but because the two peaks are both at reasonable values, and they're both of about the same height, we would lose 50 percent of the data, and that's a little bit too much. They're all reasonable values, but we have to check later that, if we find two groups of cells in the data, these two groups are not just representing these two peaks in the quality of the measurement. We now go on with the analysis, and if something looks suspicious, we go back to this stage, and we might have to be more rigorous with this cut-off. So in this particular case there's no rigorous way of doing it; it's a matter of how much you expect, of what a good measurement is. And this is actually a pretty good example in terms of these counts; this is actually zebrafish, so we have fewer counts than in other animals in total. Sometimes you have another peak here, at very low values, and that we would then cut off completely.
57:26 So here the bigger problem is this plot: we have a lot of cells with a high percentage of mitochondrial counts, that is, genes from the DNA that sits in the mitochondria. This DNA does not produce many gene products, so it is suspicious if you have too much of it in these cells, and here we take out roughly, I guess, 20 or 30 percent; we lose about 20 percent of the cells in this step.

58:05 We can also plot both quantities against each other: for example, here on the x-axis we have these different values that represent the quality of our data, and then we just draw, in this scatter plot, the vertical lines that we had in the histograms, and see visually which kinds of cells we lose here.
58:33 Okay, so now we got rid of the bad stuff, the things that are totally crazy. The next thing we need to do is to make the cells comparable. Cells still have different measurement qualities; they still differ for technical reasons: for some cells we have a lot of information, a lot of detected counts, and for other cells we have less. But we want to make them comparable to each other, and that's why we have to normalize the data. There are fancy ways of doing this in genomics, and you can see we're doing all of that here, but the point is: we have to normalize the data to make the cells comparable; that's something you always have to do.

59:27 What we also do here: these data counts, as you could see in the matrix that I showed you, contain very large numbers and very small numbers. These counts live on an exponential scale; their distributions are very skewed: a few cells, or a few genes, have a huge share of these counts. That's not something that works very well with dimensionality reduction or clustering methods, so we take the logarithm: we log-transform the data for any further processing. That's also something you should do if your data is too spread out, or comes from some exponential process: you log-transform it because you want something that is roughly normally distributed, symmetric, and rather compact as a distribution.
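A minimal sketch of this normalize-then-log step (counts is assumed to be a genes-by-cells matrix of raw counts; real genomics pipelines use fancier normalizations than this):

```r
totals <- colSums(counts)                          # total counts per cell
norm   <- t(t(counts) / totals) * median(totals)   # make cells comparable
logcounts <- log1p(norm)                           # log(1 + x) handles the many zeros
```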
60:30 Okay, let's go on. Here we do a variance-stabilizing transformation, and we do some more things to the data. And then we can start to understand the data. The first thing we need to do is to see whether what we have actually makes sense: are we actually looking at biological, real information, or are we just looking at technical aspects of the experiments? As a first step, we plot a little PCA; Fabian showed you last week what a PCA is. On these principal component analysis plots the data looks like this, and here I plot, for example, the total amount of these counts in a cell. That's technical; it's just measuring the quality of the measurements. You can see there is some variability here, these cells have a little bit more, these cells a little bit less, and some of this technical variability is captured by the principal component analysis. But here we're fine with it: it's not extreme, we don't have disconnected clusters, so it's already in good shape. And we also know that cells can have these differences in the total number of molecules for biological reasons.
62:13 Then we can look at these plots: what percentage of the variance is actually explained by certain principal components. (The y-axis here is actually something else, but we can order the principal components by how much of the total data they explain.) And what we do, and this is sort of the professional way of doing things, is to run all further calculations not on the raw data but on the principal components of the data; that's an intermediary step that we take just to get cleaner results in the end. So we take the first 20 or 25 or so principal components, which constitute something like 99 percent of the total variance, and we say the rest is noise; that's a way of getting rid of the noise in the data. And now we go on to further dimensionality reduction.
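A sketch of this denoising step (assuming the logcounts matrix from above; prcomp expects the samples, here the cells, as rows):

```r
pca <- prcomp(t(logcounts))                     # PCA over cells
var_explained <- pca$sdev^2 / sum(pca$sdev^2)   # ordering by explained variance
pcs <- pca$x[, 1:25]   # keep the first ~25 components, treat the rest as noise
```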
63:16 let me make this larger
63:24 now i hope you can see these plots
63:28 so this is a u map yeah so this is the
63:31 this is a umap
63:32 uh that is a non-linear way of reducing
63:36 the dimensions that
63:37 fabian showed to you last week
63:40 and you can see once we do the
63:42 non-linear dimensionality
63:44 reduction our data looks already much
63:47 more structured so these cells here
63:51 are from actually from the brain these
63:53 are brain cells
63:55 and of course there are different kinds
63:57 of cells in the brain
63:59 and because they’re different kinds of
64:01 cells in the brain we also
64:03 expect here a structure
64:06 to pop up in these low-dimensional
64:09 representations
64:11 typically these clusters correspond to
64:13 different kinds of cells
64:16 yeah so we don’t know that’s just a gray
64:18 bunch of cells here
64:19 of dots and the two dimensional planes
64:21 we don’t know what the x’s are
64:23 and we don’t know what these cells are
64:25 and now we have to dig a little bit
64:26 deeper
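In scanpy, this step might look as follows — a sketch assuming the ~25 retained principal components from before (the parameter values are illustrative, not from the lecture):

```python
import scanpy as sc

# k-nearest-neighbor graph on the retained principal components,
# then the non-linear UMAP embedding on top of that graph
sc.pp.neighbors(adata, n_neighbors=15, n_pcs=25)
sc.tl.umap(adata)
sc.pl.umap(adata)  # at this point: still just a gray cloud of dots
```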
64:28 So we do clustering. This is one of these community-based clustering algorithms, and we ran it at several resolutions.
64:45 In clustering, most of the time you have to tell the algorithm how many clusters you want; that is essentially the resolution parameter you give to these algorithms.
65:00 What you see here are different clusterings with different resolutions. If you say, give me — what is it — 15 clusters, you get the plot on the bottom left; if you say, give me eight clusters, you get the clusters on the top left; and you can have more clusters if you want.
65:28 These are different stages, but we don't know yet which one makes sense: we don't know how many real clusters there are in the data. But we can keep all of them and go one step further.
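The lecture does not name the algorithm; Leiden is a common community-detection choice on the neighbor graph, so a sketch of the multi-resolution step might look like this (resolution values are illustrative):

```python
import scanpy as sc

# community detection at several resolutions;
# a higher resolution yields more, smaller clusters
resolutions = [0.25, 0.5, 1.0, 2.0]
for res in resolutions:
    sc.tl.leiden(adata, resolution=res, key_added=f"leiden_{res}")

sc.pl.umap(adata, color=[f"leiden_{res}" for res in resolutions])
```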
65:41 How do we know that such a cluster is a real biological cluster? We know it if the cells in the cluster all share some property that is not shared by the other cells.
65:59 Then we know that this cluster is something real, something that is really going on in the brain.
66:09 And the way we do that — let's scroll down — is to look at the literature. We look at papers, and in these papers we see that there are different kinds of cells in the brain, and that people have done experiments, genetic experiments for example, where they found that stem cells express a certain gene.
66:36 Take this gene here, for example, which is expressed by stem cells: we now plot the UMAP with the color representing how much of this gene's product we found in each cell.
66:53 And now this starts to make sense: here in this corner, on the top left, these are our stem cells.
67:01 Then we can go further: there are also neurons in the brain, and many other things — let's see what's going on. There is another gene that is expressed by the more advanced cell type derived from stem cells; it lights up in these cells here, the next step in the lineage.
67:18 And so you can go on and identify all of these clusters, step by step, and work out what kinds of cells you have in the data.
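A sketch of such an overlay plot in scanpy (`stem_cell_marker` is a placeholder for the actual gene name, which the transcript garbles; `leiden_1.0` refers to the hypothetical clustering key from the earlier sketch):

```python
import scanpy as sc

# color the embedding by the expression of a known marker gene from the
# literature, next to the cluster labels for comparison
sc.pl.umap(adata, color=["stem_cell_marker", "leiden_1.0"])
```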
67:34 You can do that with even more genes. There are lots of genes, like these here, that identify neurons and different kinds of neurons; here, for instance, is a small feature of this plot that is picked out by this gene. If you talk to biologists, all of these names are associated with different shapes or different functions of cells.
67:59 We can also do fancier things: look at gene scores for whole groups of genes and do statistical computations.
68:08 Once we have done that, we decide: for this set of genes, each of these clusters fulfills the condition that it represents a certain biological function — because we found a gene in the literature that corresponds to a certain cell type in the body and that is expressed in that one cluster but not in the others.
68:41 That is very important. Then we can give these clusters names — for example radial glia cells, oligodendrocyte precursor cells, neurons, and so on.
68:59 And then you find: these orange ones here are the neurons, and here were the stem cells. Then you can start thinking: these stem cells somehow turn into these neurons — they mature, getting more and more mature, and at the end they turn into the neurons here.
69:18 And then we have other cell types in the brain, like microglia and so on, that we can also find here.
69:27 Now remember: the UMAP keeps the global topology of the data intact. That means that cells that are close to each other in this plot are actually also very similar in the high-dimensional space.
69:49 So it is tempting to think about these paths here as trajectories that cells take while they go from stem cells to neurons in the brain.
70:04 Now let's go on — there are a lot of consistency checks. You have to check all kinds of genes, a lot of them, and discuss a lot with the people who actually know: people who do nothing else in their lives but look at these cells and these brains, and who know all of these genes and all of the papers on these genes. And you do more fancy stuff on top.
70:32 Then, comparing the different clusterings, we arrive at an identification. This one, as I said, is the one we can live with, with these eight clusters, while at the higher resolutions we have several clusters representing the same cell type; in principle we can come back to those finer clusterings later.
70:56 And typically the biologists want you to get as many clusters as you can.
71:05 So now we have these classes, and we can have some measure of how good these clusters actually are. There are specific plots for this — for example, these are called dot plots.
71:19 What these plots show is: on the x-axis are gene names, and on the y-axis are the cluster names. The color represents how strongly a gene is expressed on average in a cluster, and the size of each dot tells you the fraction of cells in that cluster in which this gene is switched on.
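A sketch of such a dot plot in scanpy (the gene names are placeholders; `leiden_1.0` is the hypothetical clustering key from the earlier sketches):

```python
import scanpy as sc

# candidate marker genes (placeholder names)
marker_genes = ["gene_a", "gene_b", "gene_c"]

# genes on the x-axis, clusters on the y-axis; dot color = mean expression
# within the cluster, dot size = fraction of cells expressing the gene
sc.pl.dotplot(adata, marker_genes, groupby="leiden_1.0")
```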
71:49 Now, what we want to see is that the genes we have here are switched on in only one cluster, but not in the other clusters.
72:02 A good example is this one here, the OC cluster: the genes we identified for this cluster — such genes are called marker genes, and there are computational tools to detect them — are found only in this cluster and not in any other.
72:28 The same holds for this MG cluster, the microglia: we find these genes only in this one cluster and in no other.
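The "computational tools to detect them" are typically differential-expression tests; a minimal scanpy sketch, again using the hypothetical `leiden_1.0` key (the lecture does not say which tool was actually used):

```python
import scanpy as sc

# for each cluster, rank genes by differential expression against all other
# cells; the top-ranked genes are candidate marker genes
sc.tl.rank_genes_groups(adata, groupby="leiden_1.0", method="wilcoxon")
sc.pl.rank_genes_groups(adata, n_genes=10)
```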
72:40 Whereas if we go to a messy one — the OPCs, or these neurons here — then it is less clean and not so clearly defined.
72:54 If we now go back and look at these plots: the MGs, the microglia, were very clean in the plot I just showed you, and that is also reflected in this UMAP — they are a different cell type that is presumably not produced by the same stem cells as the neurons.
73:21 The same holds for these OCs — the oligodendrocytes, I think: they are separated from the rest, there is no overlap; these are distinct cell types.
73:33 For other clusters we had an overlap, for example between these mature neurons and the sixth cluster, the one labeled six here: there we found an overlap in the marker genes — they express the same genes — and that probably means that cluster six is an artifact we cannot take too seriously.
73:56 On the right-hand side you can see that in the next step we went a little broader and merged cluster six into the interneurons here.
74:08 Now, of course, I showed you essentially a finalized version. Normally you go back and forth between these steps and the earlier ones, again and again — back to the quality control — until your plots no longer show any trace of experimental, technical parameters.
74:25 You also go back and forth between these clusterings, your experimentalist friends who did the experiments, and the literature, until you find something that really corresponds to what makes biological sense.
74:43 It is not that these techniques automatically hand you something, as if there were a mathematical criterion telling you what makes sense — you always have to work that out yourself.
74:59 It is not that you push a button and everything works automatically, and that is why people who can do this kind of analysis are very much sought after on the job market.
75:12 Okay, let me just see if there is anything else interesting. You could go on and on forever: you can give the experimentalists lists of which genes are on and off in which cells, and once they have these lists they can do more experiments — they can create, for example, new animals that lack these genes.
75:40 What I want to show you now is something further down — you can see that the analysis is very lengthy, and many of its aspects are relevant mainly for the biology.
76:01 Scrolling through, there is a lot of material — that is what the biologists are interested in: all of these genes, more and more calculations, more heat maps to check which genes are on where, and so on. These are all consistency checks.
76:35 And one thing I do want to show you — let me see if we have it —
76:43 — is something called trajectory inference. These are cells from a brain, and what cells do is divide and produce other cells that get more specialized over time: they start as stem cells and mature until, at some point, they are neurons.
77:11 What was done here, and what you do in many cases, is the following. We have snapshot data — these fish were killed for the measurement — so there is no time course, just a single measurement. But we do have different cells sitting at different stages of this dynamic process of cell maturation.
77:36 So the question people then ask is: can we get the temporal information back? That is trajectory inference: working out how these different clusters and cell types relate to each other.
77:48 From that you can calculate rates — the flux that leads from one cell type to another — and in principle, based on this, you can build stochastic models that you can compare to other experiments or to theoretical work.
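The lecture does not say which trajectory-inference method was used; PAGA plus diffusion pseudotime is one common scanpy route, so purely as an illustration (taking cluster "0" as the stem-cell root is an assumption):

```python
import numpy as np
import scanpy as sc

# PAGA: a coarse-grained graph of how the clusters connect to each other
sc.tl.paga(adata, groups="leiden_1.0")
sc.pl.paga(adata)

# diffusion pseudotime: order single cells along the maturation process,
# rooted in an (arbitrarily chosen) cell from the assumed stem-cell cluster
sc.tl.diffmap(adata)
adata.uns["iroot"] = int(np.flatnonzero(adata.obs["leiden_1.0"] == "0")[0])
sc.tl.dpt(adata)
sc.pl.umap(adata, color="dpt_pseudotime")
```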
78:10 That is what I wanted to show you at the end. Let me just scroll through and see if there is anything else worth showing — a lot of material. You can compare to humans and other animals and see where there are similarities.
78:27 These fish we are looking at are very interesting because they can regenerate their brain: they can build new neurons, which is something we cannot do. So there is a lot we want to learn: what are the similarities and differences, and why are we as humans not able to do that?
78:48 And with that we are already at the end of this very lengthy analysis. This is a typical data science project from start to finish, and you can see it uses a mixture of R and Python.
79:05 The important thing — as we showed you in the last lectures — is that you cannot just take the data and throw a UMAP at it or do some machine learning on it.
79:16 A large part of this pipeline is actually cleaning up the data and thinking about which part of the data makes sense and which part does not.
79:33 For example, think about the New York City flights: does it make sense to have a negative departure delay? That was a very good question — and it actually does make sense, since a flight can leave early.
79:44 That data set is used a lot for teaching, so it is already quite cleaned up, but typically you should expect a lot of nonsensical measurements in your data. Sometimes you have a departure delay of 10 billion years or so — that is what a typo somewhere looks like in real data — and then you have to filter it out.
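A sketch of that kind of sanity filtering with pandas (the file name, column name, and thresholds are all hypothetical):

```python
import pandas as pd

flights = pd.read_csv("flights.csv")  # hypothetical file with a dep_delay column

# negative delays (early departures) are legitimate, but absurd values are
# typos: keep only rows within a plausible range; rows with missing delays
# fail the mask and are dropped as well
plausible = flights["dep_delay"].between(-60, 24 * 60)
flights = flights[plausible]
```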
80:11 This happens all the time, and if you are not taking care of it, then all of these nice plots I showed you won't work. They only work because we cleaned up the data, normalized it, and transformed it statistically to make it well behaved — no huge outliers and so on. Only then do methods like UMAP work on the data.
80:42 You always have to do these preprocessing steps first. And the next point is: you have these fancy methods, but taken alone they don't make sense either.
80:50 but taking a loan they don’t make sense
80:53 so you can see here how this order has
80:56 ordered this
80:56 trajectories cells move here along in
80:59 time
81:00 move along this line and turn into
81:02 neurons here
81:04 and of course we have many different
81:05 neurons in the brain
81:07 yeah and but to understand that what’s
81:09 happening here this data to make sense
81:11 of this
81:12 now you have to come up with hypothesis
81:15 you have to connect these hypotheses
81:16 with
81:17 what is already out there in the
81:18 literature and then step by step you can
81:21 construct
81:22 an understanding about what are actually
81:24 here the degrees of freedom
81:26 that you see in this data set you know
81:29 So this is an example — an iterative process that you improve over time — of a purely data science project; there is very little physics in it.
81:45 Next time we will show you how all of this connects to something we actually did in the first part of this lecture: field theory, phase transitions, criticality, and so on.
81:58 Okay, great. I'll stay online in case there are any questions; otherwise, see you next week.
82:08 Bye!
82:21 Thank you, it was very interesting. So, when are you going