.mpipks-transcript | 09. Data Wrangling


Apr 27, 2021

intro and slide 1

00:07 so
00:11 okay so you think i made you co-host
00:19 it’s not working
00:42 okay
00:56 so
01:05 okay i think there are no more people in
01:08 the waiting room
01:09 so we have a slightly different setting
01:11 this time
01:12 uh can somebody confirm that you can
01:15 hear me
01:17 yes okay perfect great so we have a
01:20 slightly different setting because
01:22 i um today we start a new topic and i
01:25 need my computer for that
01:27 and um so
01:31 in in the previous lectures before
01:33 christmas i gave you
01:35 an intro introduction uh about the
01:38 methods
01:39 that we need the theoretical methods
01:40 that we need in order to understand how
01:43 order emerges in non-equilibrium systems
01:47 and we also discussed how
01:50 order manifests in non-equilibrium
01:54 systems so now in the new
01:57 year we will look at the other side
02:01 we’ll now introduce methods
02:04 that allow us actually to identify order
02:08 in experimental data
02:12 and of course i’m not talking about a
02:15 few
02:16 data points here we’re talking about
02:18 large
02:19 high dimensional data sets as have
02:21 become common
02:23 in many fields of science like social
02:25 science but also biology now
02:28 so to begin let me share the screen
02:37 so i hope that works
02:49 so actually that was a lecture that i
02:50 gave last year already
02:53 you know and um so
02:56 can you see you can’t you can see the
02:58 screen i hope right
03:00 okay so just trying to get the meeting
03:02 controls
03:11 okay it doesn’t matter okay so i i
03:14 assume you can see the slides
03:15 now let me just uh make them full screen
03:20 now that’s actually from a course that i
03:22 gave last year together with
03:24 fabian ross a course on data science for
03:26 physicists
03:28 and today we’ll discuss the methods that
03:30 we need
03:31 that actually allow us to identify
03:35 order in high dimensional data sets
03:38 and what do i mean by high
03:40 dimensional and large data sets

slide 2

03:42 this is one of the data sets that i
03:45 showed you in the very
03:46 beginning of the lecture that motivated
03:49 partially this lecture and this is a
03:52 data set that
03:53 was produced by collaborators in
03:55 cambridge
03:57 and in this data set on the x-axis you
04:00 basically have part of the dna
04:03 so in the dna you have maybe a billion
04:06 or three billion base pairs
04:08 three billion base pairs roughly in
04:10 mouse and human
04:12 and in these experiments for each of
04:14 these base pairs on the dna
04:16 or each of these positions for each of
04:18 these three billion
04:20 uh sites of the dna we can take a
04:24 measurement
04:25 and measure whether there is a chemical
04:28 modification
04:29 on the dna or not yeah and
04:34 on the y axis we have different cells so
04:38 we can do that
04:39 in individual cells and on the y axis we
04:41 have roughly 600 cells
04:43 from a mouse embryo so from each of for
04:47 each of these
04:50 cells we can take roughly 3 billion
04:53 measurements here along the sequence of
04:56 the dna
04:58 that means that we have a data set here
05:00 that is typically a few terabytes in size
05:03 and that has three billion dimensions
05:06 of course not all of these dimensions
05:08 are really meaningful
05:10 but to start with we have something that
05:12 is very
05:13 huge in terms of size and now we need
05:17 something we now need some tools some
05:19 computational tools
05:21 to process such data sets
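The size estimate above can be checked with a quick back-of-the-envelope calculation. This is only a sketch: the cell count and the number of sites are the rough figures quoted in the lecture, and the storage cost of one byte per measurement is an assumption made purely for illustration.

```r
# rough figures quoted in the lecture
n_cells <- 600    # cells from the mouse embryo
n_sites <- 3e9    # base pairs along the dna

# total number of measurements in the full table
n_measurements <- n_cells * n_sites

# assumption for illustration: one byte per measurement
bytes_per_measurement <- 1
size_tb <- n_measurements * bytes_per_measurement / 1e12

n_measurements
size_tb
```

Under this assumption the table holds about 1.8e12 entries, which is consistent with the "few terabytes" mentioned in the lecture.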
05:26 yeah so what we want to do is we want to
05:29 start with
05:31 these measurements that i just showed
05:32 you now for example we use
05:35 we measure that that’s a biological
05:37 figure it’s not so important to
05:38 understand the details
05:40 i’m sorry because i think i only can see
05:44 your first
05:45 slide which slide can you see
05:48 data science oh now it’s uh
05:52 okay okay okay okay so
05:55 how can we do that now it’s working i
05:58 think
05:58 now it’s working uh go to the next slide
06:00 is that working
06:01 it is the second slide sequencing cells
06:03 yes
06:04 no it’s not okay okay let me stop
06:08 sharing
06:11 let me share the desktop maybe
06:15 um okay can you see
06:20 the title slide yes now the first slide
06:24 is that when i think so yes yes and now
06:27 uh
06:28 another slide yes okay great great
06:31 so i was talking all the time probably
06:32 you didn’t see this cell this slide
06:34 before here
06:36 yeah so this is what was on the x-axis
06:38 here
06:39 on the x-axis we have these three
06:41 billion base pairs
06:43 and for these base pairs we can make
06:44 measurements on the x-axis let me
06:48 get a pointer
06:51 so we have here the x-axis and this
06:53 x-axis
06:54 has these three billion dimensions
06:57 these three billion measurements for
06:59 every base pair of the dna
07:01 that we can take on the y-axis
07:04 each row on the y-axis is a different
07:07 cell
07:08 and for each of these cells that are
07:09 taken from a mouse embryo
07:11 we can take these measurements these
07:15 hugely dimensional measurements uh
07:18 in this case in a living embryo now
07:22 these are breakthroughs that happen in
07:23 biology but also similar breakthroughs
07:25 with similar detail we can measure
07:28 social
07:29 systems for example if you think about
07:31 the social sciences social networks
07:33 and so on we have huge amounts of of
07:36 data

slide 3

07:37 that we would like to understand yeah
07:39 this particular example
07:42 of biology we would like to understand
07:46 how we can transform the top part of
07:49 this big picture here
07:51 where we take measurements that have
07:53 lots of different dimensions
07:55 for this case many different cells you
07:58 can do that with
07:59 millions of cells if you want if you
08:00 have the money now
08:02 how can we transform these measurements
08:05 into a physical
08:06 theory of what’s generating these
08:09 measurements
08:11 now that’s what’s that’s what we want to
08:13 do and

slide 4

08:14 uh to do this we need to identify
08:18 order in these structures in order to
08:20 build a hypothesis
08:22 now this is the rule this is how this
08:24 this kind of
08:26 data sets arrive on our desk
08:30 so here you see tables that contain
08:33 different bits of information
08:36 for example on the
08:40 top right here you have a table that
08:43 tells us
08:44 where on the dna we have chemical
08:46 modifications
08:47 now that has in this case 50 million
08:50 rows
08:51 we have another table here on the top
08:55 left that tells us something about
08:58 the same position on the dna it tells us
09:02 which what is the topological structure
09:05 of the dna is it
09:07 is it compact is it is it like a
09:09 spaghetti ball is it more open
09:11 or not tells us about the something
09:13 about how this dna
09:14 lives in real space now then we have
09:18 other kinds of information and we can
09:20 ask for example
09:21 how
09:24 how is this particular bit on the dna
09:29 what else do we know about that is there
09:32 a gene in this region
09:34 is there some other interesting stuff
09:35 going on in this region
09:37 now we can ask what happens by the to
09:40 these genes what are the
09:42 genes doing that’s on the topic of
09:44 different experiments again
09:46 in these regions of the dna yeah and we
09:49 then have for example information about
09:51 these cells where are they located in
09:52 the embryo
09:54 uh what were the culture conditions of
09:56 these cells for example and so on
10:00 now so now we have these uh
10:02 high-dimensional data sets from
10:03 different sources and we now
10:05 need to combine them to create a big
10:08 picture that allows us as physicists
10:11 to generate a hypothesis

slide 5

10:14 yeah so
10:18 and this is an overview of this lecture
10:20 today and i have to uh
10:22 make a shocking uh confession that we’ll
10:25 actually be using
10:26 r in this lecture that’s something we’re
10:28 not using typically in physics
10:30 but in this case it actually makes sense
10:32 it’s actually suited but the syntax of
10:34 r is a little bit uh not what we’re used
10:37 to in
10:37 physics so i’ll give you a quick
10:40 introduction
10:41 to r to the language of r and
10:44 i assume that most of you have already
10:47 learned some programming in
10:49 some kind of other language like matlab
10:52 or python or c
10:55 so i’ll just give you a quick
10:56 introduction to the syntax of
10:58 r and then i’ll show you what are the
11:01 tools
11:02 that we have available to deal with
11:06 these large
11:07 data sets that are coming up in science
11:11 yeah so so what are the tools that we
11:13 have available what are the
11:14 computational tools and how do we select
11:17 the tools that we actually want to use
11:21 now then the very important thing is to
11:23 bring the data that has many dimensions
11:26 in a shape that we can actually deal
11:29 with
11:30 in a computationally efficient way now
11:33 that’s called tidying the data to bring
11:35 it into a specific format
11:36 so that we can vectorize our
11:39 computational or numerical operations on
11:42 these data sets
11:43 as i’ll show you once we’ve done that
11:46 how we can then perform
11:48 very easily computations on
11:51 these data sets and uh and finally i’ll
11:55 end with showing you how we can combine
11:57 this
11:59 these steps to produce data processing
12:01 pipelines
12:03 and of course we want to do all of this
12:05 in an elegant way
12:07 that is not so to say heavy in terms of
12:09 syntax we want to focus on the structure
12:11 of the data
12:12 but not on what is actually
12:15 on writing things in c or so where you
12:18 have like hundreds of lines of code for
12:20 a simple
12:21 operation yeah so this is the structure
12:24 and if we have time in the end i’ll
12:26 uh show you something more
12:30 on data visualization

slide 6

12:34 okay so this is r so most of you
12:36 won’t have used r before so we
12:39 don’t use that in physics very much uh
12:42 r is um it’s not better than python or
12:47 anything it’s very similar to python
12:49 in that it’s an interpreted language
12:52 it’s extremely flexible
12:54 so nothing is fixed in r it’s like you
12:56 can have a function
12:58 that rewrites itself while
13:01 you execute this function it’s called
13:03 extreme dynamism
13:06 um r is very popular in statistics
13:09 now you probably have heard in this
13:10 context
13:12 and uh it’s very easy to include c
13:15 code in r so if you worry about speed
13:17 you can always
13:18 write things in c and that’s also
13:20 what we’re usually doing
13:22 the main advantage of r is that there’s a huge
13:25 repository
13:27 of packages available particularly in
13:29 data science
13:31 genomics biology and so on
13:34 and one particular thing that is
13:37 actually
13:38 of the most practical relevance is that
13:41 it has
13:42 extremely convenient and high quality
13:45 plotting
13:46 functions yeah
13:49 so so these are the benefits of r there
13:51 are some downsides the syntax is very
13:54 difficult for us physicists to swallow
13:57 i’ll show you later why and typically
13:59 because
14:00 nothing is fixed in r it’s very dynamic
14:03 it’s typically slower than python
14:05 in general terms so you wouldn’t use
14:08 r to write a stochastic simulation or
14:11 something like this
14:13 but for the tasks that we’re doing today
14:15 it’s actually uh
14:17 actually a very good choice now so it’s
14:20 slow
14:21 r is typically slow but the core
14:23 functions
14:24 are written in c so once you once you
14:27 rely on
14:28 core functions on vectorization and
14:30 things like that
14:31 then it’s very fast but you know you have
14:33 to know how to use it
14:35 yeah but taken together there’s no
14:37 particular reason
14:38 for for not using python for this
14:42 the only reason is that for the
14:45 for the um for the tables and for the
14:48 data i showed you the previous slides
14:50 r has been so to say the standard
14:53 language
14:54 in these experiments in genomics and
14:57 there’s a lot of packages and a lot of
15:00 libraries developed for these genomics
15:03 data
15:05 that are collected in huge repositories
15:08 in r
15:10 meanwhile python is catching up a little
15:11 bit uh
15:13 in this context so that’s why
15:15 we’re using r in
15:16 our group uh it’s the context of
15:18 genomics these particular experiments
15:20 that i showed you on the previous slide

slide 7

15:25 so typically when you use r you use it
15:28 in a
15:30 development environment an interactive
15:32 environment
15:34 that you can download from rstudio.com
15:36 and then you have
15:37 something that looks very familiar
15:39 to you if you use matlab
15:41 or python you have a notebook here
15:44 a little bit like a jupyter notebook if
15:46 you use
15:47 python for example now where you can
15:50 write code and where you then have also
15:52 then directly the output of your code
15:54 in the same uh in the same
15:57 window in the same file you can
15:58 export html documents
16:03 if you want you have a console here to
16:06 type in
16:07 code now and then you have your
16:09 workspace like in matlab where you see
16:11 your variables
16:13 and you have a window that that is made
16:16 for looking at help
16:17 pages and looking at plots and other
16:20 stuff
16:22 now and um i’ll show you
16:25 later getting a little bit more
16:26 interactive i’ll show you later how this
16:28 works and
16:29 practice here but this is just how how
16:32 working in r looks like looks exactly
16:34 like working in any other language
16:36 actually

slide 8, 9, 10, 11

16:38 yeah so so this is some basic r
16:41 so this this is just just to show you
16:44 that
16:44 r looks very familiar to uh
16:48 other languages actually in many in many
16:51 respects
16:52 and before i do that let me just show
16:54 you how these
16:55 what these boxes here mean and i’ll
16:58 share
16:58 for that a little
17:02 r window
17:05 so let’s go here
17:09 um
17:12 let me show the screen again
17:16 okay now you should see an
17:19 r you know you should see this rstudio
17:22 window
17:23 now so this that i just showed you and
17:26 here we have the code
17:28 now i can run the code and click on this
17:30 uh
17:31 little arrows here and then i run a
17:34 certain
17:34 chunk of code i can run this if i want
17:38 and load some data and i have a console
17:41 down here
17:43 that allows me to type commands if i
17:45 don’t want to use this notebook here
17:47 and if you look at the bottom i can now
17:50 type assign a variable
17:51 here a and set it to one with these
17:54 weird assignment operators
17:57 as in set a to one and if i press a
18:01 i just type a and enter then i’ll get
18:04 this
18:04 output the the value that is stored in
18:08 a now i can also output more complicated
18:13 values like like this table here that i
18:16 already loaded yeah and then i can look
18:19 at this in the console
18:21 and this console output is now what you
18:23 see in the slides
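What was just typed into the console can be sketched as follows. This is a minimal illustration that assumes the "weird assignment operator" on the screen was the arrow `<-`; the variable name `b` is added here only for comparison and is not from the lecture.

```r
# assign the value 1 to the variable a using the arrow operator
a <- 1

# typing the name of a variable and pressing enter prints its value
a

# the equals sign also works for assignment at the top level
# (b is a hypothetical second variable, added for comparison)
b = 2
b
```

The arrow form is the one most r code uses, which is why it shows up in the console examples on the slides.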
18:26 let me open the slides again can you see
18:27 the slides
18:32 are we back are we back with the slides
18:36 yes okay great perfect so it’s working
18:38 yeah
18:39 so so here we have this i mean i’ve made
18:41 this fake console here and
18:43 in powerpoint now this is just to show
18:46 you
18:47 it’s basically the same as in
18:49 many other languages like python
18:51 so here i call the function yeah and
18:54 the arguments are given in these
18:56 brackets uh
18:57 let me give you um
19:04 let me give you a pointer
19:09 here we go so somebody wrote in the chat
19:11 that the text is very
19:13 tiny was that related to rstudio to
19:15 this
19:17 development environment
19:22 yeah probably yes and so so we don’t
19:25 rely on that very much
19:27 um yes i see
19:30 okay okay so that’s good to know
19:34 okay so just to just show you if i type
19:38 like a function
19:39 if i call a function in r
19:42 i do it in the usual way i can give
19:44 different arguments to this function in
19:46 this case
19:47 i take a normally distributed
19:50 random variable
19:52 and i tell r to give me five of them
19:55 and then i have some parameters that i
19:58 can identify by names
20:00 now so so the parameter mean i set to
20:03 one
20:04 and the parameter standard deviation i
20:06 also set to one
20:07 now these parameters have names
20:09 sometimes they have names
20:10 and i can call them by their names
20:12 it’s very convenient if you have a
20:14 long
20:15 list of parameters and you don’t want to
20:18 give them all
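The function call described here is most likely `rnorm`, the base r function for normally distributed random numbers. The sketch below assumes that, and adds a fixed seed only so the example is reproducible; the seed is not part of the lecture.

```r
# assumption: fixed seed, only to make this sketch reproducible
set.seed(1)

# five normally distributed random numbers,
# with the named parameters mean and sd both set to one
x <- rnorm(5, mean = 1, sd = 1)
x

# named parameters with default values can be left out:
# rnorm uses mean = 0 and sd = 1 if nothing else is given
y <- rnorm(5)
y
```

Naming the arguments means their order does not matter, so `rnorm(5, sd = 1, mean = 1)` would draw from the same distribution.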
20:19 i also can write my own functions that
20:22 looks a little bit like
20:24 mac and c so here i define a function
20:28 that’s called my sum and this function
20:31 has three parameters
20:32 a b and c and c has
20:36 a default value 1.
20:39 now and then this function returns a
20:42 value that is just
20:43 equal to the sum of these three
20:46 arguments here
20:47 a plus b plus c now if i call this
20:50 function
20:50 i set a to 1 and b to 1
20:53 and c is 1 by default so i don’t have to
20:56 state that
20:58 i only have to state that if if i want c
21:00 to be a different value
21:02 then this function returns the value of
21:04 three now just like in
21:06 any other programming language
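The function just described can be written down as a short sketch. The name `my_sum` is a guess at how "my sum" was spelled on the slide.

```r
# a function with three parameters; c has the default value 1
my_sum <- function(a, b, c = 1) {
  return(a + b + c)
}

# c is not given, so the default c = 1 is used
my_sum(1, 1)

# c only has to be stated if it should differ from the default
my_sum(1, 1, c = 5)
```

The first call returns three, as in the lecture, and the second returns seven.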
21:09 now we also have loops of course in r
21:12 and so this is a for loop
21:14 where you can for example have a loop
21:17 that goes from one to five
21:19 and then i can print something out or also
21:21 i have a while loop
21:23 now so while some condition is
21:26 fulfilled we want to print
21:28 something
21:30 now typically in r you don’t want to
21:33 have loops if you have a loop in
21:34 r uh that means that you’re doing
21:37 something wrong
21:38 so these loops are slow because r is
21:41 slow
21:42 and if you have such a loop then it
21:45 means that you didn’t vectorize
21:47 your operation that you’re doing
21:49 something bit by bit that you could also
21:51 do
21:52 in one step now typically if you have a
21:55 loop
21:55 then there’s something wrong with your
21:58 code
22:00 or you’re extremely
22:01 inefficient
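To make the point about loops concrete, here is a small sketch that computes the same thing bit by bit in a loop and then in one vectorized step. The task (squaring the numbers one to five) and the variable names are chosen here for illustration and are not from the slides.

```r
x <- 1:5

# for loop: handle the elements one at a time
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

# while loop: repeat as long as the condition is fulfilled
i <- 1
while (i <= 3) {
  print(i)
  i <- i + 1
}

# vectorized version: the whole vector in one step, no loop needed
squares_vec <- x^2

# both approaches give the same result
all.equal(squares_loop, squares_vec)
```

The vectorized line hands the whole vector to r's compiled core functions at once, which is where the speed difference mentioned in the lecture comes from.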
22:03 and so i guess i have written
22:06 like many
22:07 thousands of lines of r code
22:10 uh in my life and i have had uh exactly
22:13 one loop and uh in this
22:17 in these many thousands of lines and
22:19 several years
22:20 and this is one loop i use for a
22:22 stochastic simulation
22:25 now the mistake was that you would never
22:26 write a stochastic
22:28 simulation in r but i did that because i
22:31 was very lazy
22:32 now so if you use these loops then
22:34 there’s something wrong
22:35 because these loops are very you can
22:38 typically replace them with much more
22:40 efficient operations now so these are
22:44 the standard constructs that you have in
22:45 any programming language
22:47 you also have the if clauses here like
22:50 if
22:50 the value of i is smaller than five then
22:54 print hello and otherwise print not
22:56 hello yeah so that’s
22:58 that’s just also like in any other
23:00 language and you use
23:01 these curly brackets like in c to
23:04 define the scope
23:06 of a certain uh frame of the of the code
23:11 now that should all be very
23:14 familiar
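The if clause from the slide looks roughly like this. The value assigned to `i` and the variable `msg` are assumptions added so that the sketch runs on its own.

```r
i <- 3  # assumption: any example value

# curly brackets delimit the two branches, as in c
if (i < 5) {
  msg <- "hello"
} else {
  msg <- "not hello"
}
print(msg)
```

One detail that differs from c: at the top level, `else` has to be on the same line as the closing bracket of the if branch, otherwise r treats the if statement as already finished.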
23:16 now where things get a little bit
23:18 different different
23:19 are the data types of r
23:22 now so r has standard data
23:25 types
23:26 now so for example here we can define
23:29 vectors in principle everything is a
23:30 vector
23:31 or that’s the most simple data type here
23:34 a
23:35 for example the vector
23:38 a
23:39 we define with this c here
23:42 now that’s a little bit strange so in
23:48 matlab
23:58 was that a question probably not
24:03 i’ll just just repeat it if it was a
24:05 question okay
24:06 so this is uh this is just a vector
24:09 how can i go back in the code
24:16 yeah okay so this is a typical vector we
24:19 we use the the letter c
24:22 to create such a vector that means
24:24 combine
24:26 and this vector has the elements one two
24:28 three
24:29 and we can also put other stuff in this
24:31 vector like this NA
24:33 that’s for example a measurement that
24:36 is not available that didn’t work for
24:38 example yeah but it’s very convenient to
24:40 have
24:41 a representation on the computer for
24:43 measurements that
24:44 didn’t work for example we can also have
24:47 a vector of
24:48 other types so here this is a string or
24:50 character vector
24:52 of m and f’s so we can define this in
24:55 the very same way
24:57 and we can access elements of such
24:59 vectors
25:00 with these square brackets yeah like in
25:03 c
25:03 for example but unlike in c
25:07 we start counting from one and not from zero
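A short sketch of the vectors described here. The exact contents of the character vector are an assumption based on the values read out later in the lecture.

```r
# c() combines values into a vector; NA marks a missing measurement
a <- c(1, 2, 3, NA)

# a character vector is defined in the very same way
# (assumed contents, following the m and f values read out in the lecture)
b <- c("m", "f", "f", "m", "f")

# square brackets access elements, counting from one
a[1]
b[2]

# NA is a proper value that functions know how to handle
is.na(a)
mean(a, na.rm = TRUE)  # average over the available measurements only
```

Without `na.rm = TRUE` the mean of `a` would itself be `NA`, which is r's way of saying that the result depends on a measurement that is not available.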
25:11 the next element is a list then the next
25:14 data type is a list
25:15 and then now it gets more interesting a
25:18 list
25:19 can contain several elements of any type
25:23 for example elements of a
25:26 list
25:26 can be vectors now for example if i want
25:30 to make a list
25:31 and the first element of the list is our
25:34 vector
25:35 a and the second element of the list
25:38 is our vector b now they have
25:42 different types but you can nevertheless
25:44 put them
25:45 together in a list and then i can access
25:49 these uh then i can access these
25:53 lists here these list elements by the
25:56 name so i gave
25:57 gave it a name number and gender
26:01 and if i want to access the first
26:03 element number by name then i just use
26:05 these dollar signs here
26:07 and i can also access this first
26:10 element
26:11 by its number then i use the double
26:14 square brackets that’s not so important
26:17 just to show you that these data types
26:18 exist
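The list described here can be sketched as follows, reusing the two vectors from before. The contents of the vectors are assumed for illustration; the element names `number` and `gender` are the ones mentioned in the lecture.

```r
a <- c(1, 2, 3, NA)               # numeric vector
b <- c("m", "f", "f", "m", "f")   # character vector (assumed contents)

# a list can hold elements of different types and lengths
l <- list(number = a, gender = b)

# access an element by name with the dollar sign
l$number

# or by position with double square brackets
l[[1]]
```

Single square brackets also work on lists, but `l[1]` returns a list of length one rather than the vector itself.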
26:20 where it gets more important for
26:22 statistics is that we also have
26:25 categorical variables and that’s
26:26 something probably you don’t know from
26:28 from matlab
26:29 or c i don’t know about python
26:32 so
26:33 so we have here a vector
26:36 that saves the gender of something
26:40 of somebody yeah so we have a long and
26:43 then
26:43 so suppose we do these measurements like
26:45 um 100 million
26:47 times yeah and then we have males and
26:50 females
26:51 and in principle we could store
26:54 the words male and female 100 million
26:57 times
26:58 in memory but that would be very
27:01 inefficient
27:02 what we could do instead is to say
27:05 okay i have a variable that can only
27:07 take two values
27:10 yeah so i need this one byte at most
27:14 uh to store these which of these two
27:17 values
27:18 a given element has and then
27:23 in addition to that i save
27:26 labels to these two values
27:28 and that’s what a factor is doing it’s a
27:30 categorical variable
27:32 that can only take limited number of
27:35 values
27:36 for example uh this vector b
27:39 here can only take two values
27:41 and i tell r that this can only take two
27:44 values
27:45 by making this a factor now this
27:48 categorical variable is called a factor
27:51 and when i type then look at this vector
27:56 f here then it gives me the output
27:59 m f and f m f yeah and it tells me these
28:03 levels here
28:05 these are the typical values these
28:06 elements can take
28:08 yeah and if i want to make these these
28:11 levels
28:12 of these labels of these two values
28:14 different
28:16 i can call these two female and male
28:20 you know by changing these levels and
28:23 then
28:23 i have i can output this again
28:26 and have male female female male and so
28:29 on
28:30 so what happens here in r is that still
28:33 internally i don’t use any more memory
28:36 although the strings are longer
28:38 what i changed here is a lookup table of
28:40 r
28:41 where r looks up where these two values
28:45 how these two values my vector can take
28:47 are called
28:49 for any plotting or any printing
28:51 purposes
28:52 now that’s a that’s a factor and it’s
28:54 very efficient
28:55 if you think about for example biology
28:58 for these measurements that i showed you
29:02 in the beginning you have these hundreds
29:04 of millions or billions of measurements
29:06 and you have 200 million measurements
29:10 for chromosome 1. now you could either
29:13 write in your memory have a vector
29:15 that has the element chromosome 1 200
29:18 million times
29:21 or you just save a number an identifier
29:24 for chromosome one
29:26 and give it a label like this one here
29:30 with a real name in a separate table
29:33 then you don’t have to save the string
29:36 chromosome one 200 million times
29:38 but only once in this table where you
29:41 look up what is actually the name
29:42 of this factor level so this is a very
29:44 efficient way of saving
29:47 variables that have that appear
29:50 very often but can only take a limited
29:54 number of values and that’s called
29:56 a factor
29:58 now that’s the data type that you’re
30:00 probably not familiar with
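The factor example can be reproduced in a few lines. The contents of `b` are again an assumption based on the values read out in the lecture.

```r
b <- c("m", "f", "f", "m", "f")   # assumed contents

# turn the character vector into a categorical variable
f <- factor(b)
f

# the lookup table: the labels of the allowed values
levels(f)

# what is stored per element: an integer code pointing into that table
as.integer(f)

# renaming the levels only changes the lookup table, not the stored codes
levels(f) <- c("female", "male")
f
as.integer(f)
```

In r the codes are stored as integers, so each element costs a fixed few bytes no matter how long the label is. For the chromosome example this means the string is stored once in the levels and each of the 200 million entries holds only a code.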
30:04 yeah and then we can go on yeah and we
30:07 can
30:07 now go to data types that can really
30:10 store
30:11 the data that we’re looking at for
30:13 example experimental data
30:15 and the simplest way you can do that is
30:17 called in r a data frame
30:20 a data frame python also has data frames
30:25 and as far as i know and so these data
30:28 frames are nothing but lists
30:30 of vectors remember the lists
30:34 now this is a list it can save any kind
30:37 of element that you want
30:39 and if each element of
30:42 this list here
30:43 has the same length then we can
30:46 interpret this list as a table
30:51 yeah and this is what we do in these
30:52 data frames so we here we have
30:54 three vectors x y and z
30:58 and they have different types so this is
30:59 a numerical vector
31:01 this is a character vector and this is a
31:04 boolean or logical vector
31:06 and we now can create this data frame
31:09 and say that the first element is x
31:12 the name of the first element is x and
31:16 there’s the second element the second
31:18 column should be y
31:19 and the third one should be z sorry i
31:22 don’t know why this has happened
31:23 all the time um
31:26 the third one should be z and what
31:28 you can see here
31:30 is how this is then represented if you
31:33 look at the output
31:35 of such a data frame so here is the
31:37 first vector
31:39 that’s now the first column in our table
31:41 the second column of our table
31:44 is this one and the third column of our
31:46 table is the boolean vector
31:48 now this is a data frame it is
31:50 essentially the
31:52 r version of a table and the only
31:56 and internally this data frame is
31:58 basically a list
32:00 of vectors that have the same length
32:07 okay so this is a data frame so and this
32:10 data frame looks close to what we could
32:13 be
32:14 dealing with or is actually what we
32:15 could be dealing with or
32:17 is sufficiently general to allow us to
32:20 store any amount of uh
32:24 experimental data now like a table is a
32:26 general way of storing data
32:28 and these data frames are essentially
32:31 tables
32:32 and they can they allow us to store any
32:35 kind of
32:36 data that we might measure
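The data frame from the slide can be sketched like this. The values in the three vectors are made up for illustration; only the names x, y and z and their types follow the lecture.

```r
x <- c(1, 2, 3)            # numeric vector
y <- c("a", "b", "c")      # character vector
z <- c(TRUE, FALSE, TRUE)  # logical (boolean) vector

# each vector becomes one named column of the table
df <- data.frame(x = x, y = y, z = z)
df

# internally a data frame is a list of equal-length vectors
is.list(df)
nrow(df)
ncol(df)

# so columns can be accessed like list elements
df$x
df[[3]]
```

Because the columns must all have the same length, `data.frame` stops with an error if the vectors cannot be made to line up, which is exactly the condition that turns a list into a table.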

slide 12

32:40 now the problem is now we have this data
32:42 table but
32:43 as i told you in the beginning
32:48 we actually have
32:50 data that is terabytes in size
32:55 so now we need a way of performing
32:59 operations on these tables in a very
33:02 efficient way now so we need the right
33:05 methods
33:06 and how important it is to pick the
33:09 right methods is shown here
33:11 on the left hand side so here you can
33:15 see these bars
33:16 and these bars are time measurements
33:19 that it takes to perform certain
33:22 operations on data of a
33:24 certain size yeah and
33:30 so so here this is 500 megabytes of data
33:33 so
33:34 pretty small data set and then you
33:36 measure this the length
33:37 of such an operation here that’s denoted
33:39 by the bar
33:41 and you can compare different versions
33:44 different packages that allow you to
33:48 to look at this data so for example
33:52 one popular method in r is
33:55 called
33:56 dplyr dplyr is an extremely popular and
34:00 very easy to learn
34:01 way of managing these tables
34:05 another one pandas
34:08 is basically the python version
34:12 of such a data frame and then we have
34:16 here data
34:17 table on the top the blue one
34:20 now so this is a very fast c
34:23 implementation of these
34:24 operations on these data frames it’s
34:27 very fast and memory
34:28 efficient and then you have some
34:31 things that are more
34:32 used in companies
34:35 uh also in this list now let’s but this
34:38 all seems very reasonable so we have
34:41 here 12 seconds
34:43 20 seconds 90 seconds
34:46 nothing of that stops us from doing from
34:48 getting results
34:50 now we increase the size of our data set
34:54 to 5 gigabytes or 50 gigabytes still
34:57 very small compared to what i’ve been
34:59 talking about
35:00 so here look let’s let’s have a look at
35:04 these 50
35:05 gigabytes now suddenly
35:08 there’s a huge difference here
35:11 a lot of these packages a lot of these
35:13 methods do not produce
35:15 any results at all yeah
35:18 they they run out of memory for example
35:22 and some of them like this very popular
35:25 one
35:26 takes just a huge amount of time
35:30 while others especially this data table
35:34 method here
35:34 performs very well so we have here
35:39 what is that 30 minutes so that sounds
35:42 reasonable
35:45 no no that’s seconds there are 13
35:46 seconds so 13 seconds so this was
35:49 sorry so this was uh 0.2 seconds here at
35:52 the top
35:53 and now we’re at uh at 13.56 because
35:56 that sounds pretty reasonable
35:58 but if you go down here to these methods
36:00 that we have here so the 13 seconds is
36:02 something that you can
36:03 wait in front of the computer
36:07 and still have some interactive way of
36:09 working
36:10 now if you go down here of course a lot
36:12 of these methods just fail
36:14 but also some of them just take
36:16 so long
36:17 that any reasonable working like this
36:20 dplyr here
36:21 takes so long that any reasonable way of
36:23 working with data
36:24 uh is just not possible yeah
36:28 so that’s why choosing the right method
36:31 here
36:32 is uh important
36:35 and also what’s important is to look at
36:37 how these methods
36:39 scale with the size and the complexity
36:41 and the dimensions of your data
36:44 so what this tells us here is there’s a
36:49 huge difference
36:50 yeah and actually what i did when i
36:54 was a postdoc for example like every
36:56 physicist i used matlab i was used to
36:58 using
36:58 matlab yeah and then very quickly almost
37:01 immediately
37:02 matlab failed on such operations
37:05 yeah and then when these genomics
37:07 methods came up
37:09 yeah i used the red one here dplyr
37:13 this is extremely popular and very easy
37:16 to get into it’s very flexible
37:18 and very well documented and so on and i
37:21 used that
37:22 but then experimental progress moved on
37:25 exponentially now so when i started i
37:28 was looking at like data
37:29 a few hundred megabytes in size and
37:32 at the beginning of this talk i showed
37:35 you data that were
37:36 like a terabyte in size yeah and very
37:38 quickly i went into this problem here
37:41 that i wasn’t able to get any results at
37:44 all in
37:44 meaningful times so at some point i had
37:46 to rewrite all my code
37:48 and what then turned out to be a
37:51 reasonable choice for really large data
37:53 is this data table package its r
37:56 version
37:57 and also the python version as you see
38:00 here
38:01 is not much slower than the r
38:04 version
38:05 this data table is written in c is very
38:07 fast
38:08 and uh if you
38:12 stick strictly to it then you can expect a
38:15 performance that is not much worse
38:17 than actually c but with orders of
38:19 magnitude less
38:22 programming work so
38:25 today i’ll use this data table package
38:28 here
38:29 and i will also introduce some concepts
38:33 here
38:34 that are applicable to any other of
38:37 these methods or any other way
38:39 of dealing with data actually

slide 13

38:43 okay so let’s use this data table
38:46 package
38:47 now this is very fast and and works for
38:50 extremely large
38:52 uh data sets the downside is that this
38:54 syntax is not very friendly for
38:56 beginners
38:57 it’s a little bit cryptic it’s very
38:59 condensed and very compact
39:01 uh but it turns out that this at least
39:04 for me was the only
39:07 way that i could uh deal with data
39:10 like in an experimental field where the
39:13 sizes of data sets are exponentially
39:15 increasing
39:16 this was the only way that made me work
39:19 in an efficient way yeah so we start
39:22 this data table
39:23 we load this package just by calling
39:26 library
39:26 data table now and then if we have a
39:30 data frame we can just convert it to a
39:32 data table
39:33 by this command as data table that’s
39:36 very simple
39:37 nothing else is necessary
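A minimal sketch of these two calls in R, assuming the data.table package is installed. The toy data frame and its column names are made up for illustration and are not the lecture data.

```r
library(data.table)

# A small ordinary data frame (invented values, for illustration only).
df <- data.frame(tailnum = c("N101", "N202", "N303"),
                 engines = c(2L, 4L, 2L))

# as.data.table() returns a data.table version of the data frame.
planes_dt <- as.data.table(df)

is.data.table(planes_dt)  # check that the conversion worked
class(planes_dt)          # a data.table is still also a data.frame
```

Because a data.table is still a data.frame, functions that expect a data frame generally keep working on it.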

slide 14

39:41 so in this lecture uh i would like to
39:44 look at some real
39:45 data yeah so and this real data i could
39:48 now load like one terabyte of data of
39:51 experimental data
39:52 and uh but that would not be very
39:55 efficient actually for this lecture
39:56 because it would be very
39:59 slow and also very difficult
40:02 to follow so i’ll look here at a simple
40:05 data set
40:06 and we’ll do some operations on this
40:08 very simple data set and this is
40:10 actually data set that any of you can
40:11 download
40:12 and i’ll also upload or share a link
40:15 where you can download the code of this
40:18 lecture that i’m
40:19 using in this lecture and the data
40:22 itself
40:24 okay so this is the data set this is
40:26 called
40:27 new york city flights and these are all
40:31 flights
40:32 that departed
40:35 or arrived in new york city airports
40:39 in 2013.
40:42 yeah so this
40:45 data set consists of
40:46 several files several tables
40:50 one is the actual information about
40:53 these flights here
40:55 yeah so we have for each flight we have
40:57 information about the year
40:59 that’s 2013. but we also have
41:02 information about the month
41:03 the day the hour we have information
41:07 about the flight number an identifier
41:09 of the flight
41:10 we can uh know
41:14 where this flight came from
41:17 now we have a tail number
41:19 that means that we can identify
41:21 the airplane that was used the carrier
41:25 so is it united airlines american
41:27 airlines
41:28 lufthansa and then we have some
41:30 information about the delays
41:32 in these flights so delays in the
41:34 departure delays in the arrival
41:37 how long was the airplane in
41:40 air and many other bits of information
41:46 now this is the information for all
41:48 flights
41:49 now we want to interpret this
41:51 information that means that we need to
41:52 connect this information to other
41:55 uh other uh if you want to understand
41:59 for example where these
42:00 delays come from in this data set
42:04 we want to connect it to additional
42:05 information the one thing you might be
42:08 looking at
42:09 is the weather now if you ask
42:12 why is a flight delayed we can
42:14 look at the weather
42:16 yeah and this weather is a different
42:19 source of data
42:20 and for this weather we can also get the
42:22 time
42:23 you know the month and day and hours we
42:26 can
42:27 look at the weather at a certain
42:29 airports
42:30 you know at certain locations we have
42:33 temperature humidity wind speed and
42:36 other information
42:37 about the weather at a certain location
42:39 at a certain time
42:42 now we have information about the planes
42:44 so we have this
42:45 tail number here so that it identifies
42:48 the planes
42:49 and then we can also download some
42:50 information about these airplanes
42:52 we can for example look
42:56 by what manufacturer the airplane was made
42:59 we can look at the model we can look at
43:02 the
43:03 the year this airplane was built number
43:05 of engines the number of seats
43:08 the type of airplane and so on
43:11 we can get some information about the
43:13 airport
43:14 where we have uh airport information
43:17 an airport identifier the name of the
43:20 airport
43:20 the longitude and latitude and altitude
43:23 of this
43:24 airport and so on and finally we can
43:27 also get some information about the
43:28 airline
43:29 now that corresponds to a certain flight

slide 15

43:35 yeah so and now one thing you need to
43:37 notice
43:38 is that these tables here they share
43:41 information
43:43 yeah for example if we want to learn
43:47 uh what the weather was for a given
43:49 flight
43:50 then we can look at the year and the
43:52 month and the day
43:54 at the origin airport of this flight
43:58 and look and compare that to the same
44:00 columns
44:01 in this other table here
44:04 on the left hand side that contains the
44:06 weather information
44:08 now if we want to learn about the
44:10 airplane that was used
44:13 we can take this column here this bit of
44:15 information the tail number
44:18 and compare that to
44:22 this the corresponding tail number in
44:25 this planes data set
44:26 and look what kind of manufacturer this
44:28 plane was uh made by
44:30 uh which year it was produced and so on
44:34 yeah so these bits of information are
44:36 connected
44:38 uh by different variables and we can use
44:41 these overlaps between these datasets to
44:43 bring them all together later
44:46 but first let me show you how to load
44:48 data and loading data is actually also
44:50 not a trivial task
44:52 now the loading data can take hours or
44:55 minutes
44:56 depending on what function you use for
44:58 that
44:59 and in the scope of r the fastest
45:02 functions
45:03 are the ones from the data table package
45:07 they’re called
45:07 fread now there’s also fwrite that
45:10 allows you to write data
45:12 and they’re basically parallelized
45:14 versions of reading
45:16 huge amounts of text data of ascii files
45:20 and it’s very simple you just use the
45:23 command
45:23 fread it’s called fast read to load a
45:26 certain file here for example the text
45:29 file
45:30 and fread will do all of the rest
45:34 uh that you need to do
45:35 it will identify the columns
45:37 will identify the data types and so
45:39 on typically you don’t need to do
45:41 anything
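A small sketch of reading and writing with fread and fwrite, assuming the data.table package is installed. The table contents and the temporary file path are invented for illustration; the lecture itself reads the real flights file.

```r
library(data.table)

# Toy table standing in for a large text file (invented values).
toy <- data.table(year = 2013L,
                  month = c(1L, 1L, 2L),
                  carrier = c("UA", "AA", "UA"),
                  dep_delay = c(2, NA, 15))

path <- tempfile(fileext = ".csv")  # hypothetical file location
fwrite(toy, path)                   # fast write to a delimited text file

# fread() detects the separator, the header and the column types itself.
flights_toy <- fread(path)
str(flights_toy)

unlink(path)  # clean up the temporary file
```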
45:43 for example we can read in this flights
45:46 data sets
45:47 here that contains information about
45:50 flights
45:51 and it’s actually the flights that
45:52 departed from new york city
45:54 and this is then how this data set looks
45:58 like
45:58 if we just print what is in once we load
46:01 these
46:02 flights we can we can print what is here
46:05 and see what’s in there
46:07 we have these different columns they are
46:08 like 2013 that’s the year
46:11 then we have different months we have
46:12 different days of the months
46:15 and we have departure times
46:19 and we have real departure times we have
46:22 scheduled departure times and we have
46:25 delays
46:27 now we have arrival times and so on we
46:30 have here carrier we have flight numbers
46:33 the airplane the tail number of the
46:34 airplane and so on
46:37 origin destination and so on yeah
46:41 and we have a timestamp here a time
46:43 signature
46:45 of when this flight actually took place
46:49 now so let’s go uh
46:53 to rstudio and see how this looks
46:55 like
46:56 [Music]
46:58 um
47:01 here we go now this is rstudio i hope
47:04 you can see rstudio now it should be
47:06 still
47:07 sharing yeah and uh
47:11 so here the top rows so okay so this was
47:14 a little bit small right
47:16 um let me zoom in
47:25 okay so i hope this is better um
47:29 so here we’re loading
47:33 all of these files actually the same
47:35 thing that we did
47:38 [Music]
47:39 that i had on slides so for each of
47:41 these different bits of information
47:43 we now load these files into memory
47:47 now so i had already done that before so
47:49 here they are
47:50 in the workspace and now we can
47:52 click on these files so we can
47:53 either now just type flights
47:57 yeah and get this huge table out
48:00 or you know so we see we have here
48:05 300 more than 300 000 flights that we
48:08 have information about
48:10 or i can just click on this file here
48:14 and get a little nice table view of
48:17 the data that we have now this is the
48:20 flight information
48:21 it’s also high dimensional not as
48:23 extremely high dimensional
48:25 as the example i gave you but it has
48:27 still enough
48:28 information that we cannot easily detect
48:31 patterns
48:32 in this data set yeah so this is
48:36 this is the data we have 300 000 flights
48:40 and now we can try to detect some
48:44 structures
48:45 in this data set
48:49 let’s get back to powerpoint
48:55 okay here we go now we’ve loaded the
48:58 data we’ve got all of these different
49:00 data tables now or data frames in
49:03 memory

slide 16

49:04 and now we’re
49:06 trying to make sense out of these
49:08 data sets
49:09 the first thing that we always need to
49:12 do
49:13 is to bring data into a format
49:16 that we can deal with
49:19 and this format is typically called
49:22 tidy data or long data or long format
49:27 now there are two simple rules that you can
49:31 use to decide whether
49:32 a data table or a data frame or
49:35 table is tidy
49:36 so one thing you need to make sure is
49:38 that every column
49:41 is a different observable for example
49:44 different types of measurements
49:47 every row is then a different
49:50 observation of these measurements of
49:54 these observables or
49:56 variables
49:57 yeah once you stick to these rules your
50:00 data is tidy
50:02 and when your data is tidy so every
50:05 column
50:06 is a different observable and every row
50:09 is an observation
50:11 then we can use vectorized operations
50:15 in r or python to perform operations
50:19 on entire columns of these
50:22 data tables yeah and that
50:26 first is extremely fast you know because
50:28 it’s vectorized we perform these
50:30 operations
50:31 one column at a time and it drastically
50:34 reduces the complexity
50:36 of the code now the
50:39 data that i showed you is one of
50:41 the standard
50:43 data sets in data science
50:46 that you look at for
50:49 teaching
50:50 and uh this data is already
50:53 tidy
50:54 so let’s have a look now at these
50:56 flights
50:57 so every column is a different
51:00 observable
51:02 yeah for example a different
51:04 measurement
51:05 a different departure time a different
51:08 arrival time a different flight number
51:11 a different time points
51:15 and every row is a different observation
51:19 so a different flight
51:21 so this data table is already in a tidy
51:24 format
51:26 and it’s already in a format that we can
51:28 deal with so we have nothing to do here

slide 17

51:31 a standard example of a messy
51:34 data set
51:36 that we also often find in physics but
51:38 also here in this case in
51:43 in in genomics is that we have
51:46 matrices that we share data as a matrix
51:50 for example in this matrix here that’s a
51:52 typical matrix here
51:54 so this is a measurement one of these
51:55 genomics measurements so here on each
51:58 row
51:59 is a different gene now so and the first
52:02 column here is the name of this gene
52:05 that’s a very cryptic name these genes
52:07 have cryptic names
52:10 each row is a gene and each column is a
52:13 different cell
52:16 yeah so this data set
52:19 is not tidy
52:23 because the same measurement so sorry
52:25 that these numbers
52:26 mean how many products of this given
52:30 gene
52:31 we have measured in a given cell
52:35 so these are the numbers it’s not
52:37 important to understand the biological
52:38 context
52:40 and but this data set is not tidy
52:43 because the same
52:45 observable namely the number of these
52:48 these gene products is repeated in
52:51 different columns
52:54 and you can easily see that by having
52:57 this format
52:59 we’re not very very flexible so if i now
53:01 for the same gene
53:03 and for the same cell make another
53:05 measurement
53:06 what do i do where do i put that in this
53:08 matrix so we have
53:10 then we have to have a separate matrix
53:12 and i have to live with separate
53:14 matrices
53:15 in parallel so this data set is not

slide 18

53:18 it’s not tidy and
53:22 i cannot perform easily parallelized
53:25 operations on this data table and now we
53:27 want to make this data table
53:29 tidy now there are special functions
53:33 that allow us to do that and
53:37 these functions are
53:41 basically i just give you the names and
53:43 they have similar names basically in all
53:45 contexts to make such a data table tidy
53:48 like the one that i have here this
53:50 function is called
53:52 melt it’s called you also say you melt
53:54 a data table
53:56 and that brings us from this matrix
53:58 format
54:00 to a format where every
54:03 column
54:10 represents an observable
54:11 like a measurement and each row
54:14 corresponds
54:15 to an observation yeah
54:18 so
54:18 what you see on the right
54:20 hand side that would be
54:22 the tidy version of the data that i
54:25 showed you on the
54:26 last slide so the first column
54:28 would be
54:30 the gene the name of the
54:32 gene the second column would be the name
54:34 of the cell
54:35 and the third column would be the value
54:38 that i measure
54:39 for the gene products now so that would be
54:42 the tidy format
54:43 and now i can have a long i have a long
54:46 vector here
54:47 of numbers in this
54:50 one column and i don’t have 200 columns
54:54 as in the previous slide
54:57 okay so how do we do that in r
55:00 in this data table package the
55:02 command is called
55:03 melt we just give it the name of our
55:06 data table
55:08 and then we identify id variables
55:11 so these are variables uh
55:14 that need to remain so to say intact
55:19 uh once we once we reshape this matrix
55:22 now in our case this is the id column
55:26 yeah and the value name is
55:30 uh how we want to call
55:33 the values that are in these matrix
55:36 elements here now and we just call this
55:39 expression now that’s what we call
55:42 and the variable name is the cell that
55:44 we have here already
55:46 that tells us that this variable name is
55:48 distributed
55:49 now into this column now and then if you
55:52 look at these colors here
55:54 you can see how this command distributes
55:56 what is actually now distributed
55:58 into several columns
56:01 how this reshapes this data table into
56:04 something that has a longer format it’s
56:06 called actually also called long format
56:09 uh where we have uh the cell name
56:12 this column and then the corresponding
56:14 measurements in this column
56:17 for the products yeah you can look at
56:19 this i’ll upload the slides of course so
56:21 you can look at this
56:22 uh in detail afterwards
56:25 it’s a little bit abstract but this is
56:27 so to say how we reshape
56:29 data tables to make them tidy
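The reshaping can be sketched in R as follows, assuming the data.table package is installed. The gene and cell names and the counts are invented; only the argument names id.vars, variable.name and value.name belong to melt itself.

```r
library(data.table)

# Toy expression matrix in wide form: one row per gene, one column per cell.
wide <- data.table(id    = c("geneA", "geneB"),
                   cell1 = c(0L, 3L),
                   cell2 = c(5L, 1L),
                   cell3 = c(2L, 0L))

# id.vars stays intact, the former column names go into "cell",
# and the matrix entries go into "expression".
long <- melt(wide,
             id.vars = "id",
             variable.name = "cell",
             value.name = "expression")
print(long)
```

The long table has one row per gene and cell pair, so a repeated measurement for the same pair is simply one more row.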

slide 19

56:33 yeah and we can also uh make them messy
56:37 again now so i’ll go quickly over that
56:39 because that’s not something we use
56:41 but sometimes you just want to have your
56:43 good old matrix back
56:44 because some functions some other
56:46 package wants to have a matrix
56:48 yeah and this
56:50 function is called
56:52 dcast and that just reverses
56:55 the operation that i
56:58 showed you on the last slide
56:59 this function
57:02 takes the data table
57:04 uh the name of the data table variable
57:06 as the first
57:07 argument and then a formula of how the
57:11 rows and columns
57:13 should be now separated in this new
57:17 matrix that we want to have now it’s not
57:20 so
57:20 important so i’ll not go into it it’s
57:22 not so we
57:24 rarely use that and
57:28 now we have some mess some tidy data
57:30 yeah so
57:31 we have some tidy data now every column
57:34 is an observation every row sorry i
57:38 always mess it up
57:39 every column is an observable or a
57:41 variable
57:42 every row is an observation
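For completeness, a sketch of going back from the long format to a matrix-like table with dcast, assuming the data.table package is installed. The toy values are invented; the formula reads "rows ~ columns".

```r
library(data.table)

# Long (tidy) toy table: one row per gene and cell pair (invented values).
long <- data.table(id = rep(c("geneA", "geneB"), times = 3),
                   cell = rep(c("cell1", "cell2", "cell3"), each = 2),
                   expression = c(0L, 3L, 5L, 1L, 2L, 0L))

# Left of ~ are the row identifiers, right of ~ become the new columns,
# value.var names the column that fills the cells.
wide <- dcast(long, id ~ cell, value.var = "expression")
print(wide)
```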

slide 20

57:45 once we’re in this format we can now
57:47 have a very simple syntax
57:50 now and this syntax actually captures
57:53 ideas
57:54 that uh we also have in other
57:57 maybe more popular languages like sql
58:01 so it
58:04 captures operations on this
58:07 data table
58:08 that are quite generic so so this is the
58:11 general syntax that we use in this data
58:14 table package
58:15 so we have three now we take this data
58:18 table
58:18 d and in square brackets
58:22 we have three arguments
58:25 yeah and the way you read these three
58:28 arguments
58:29 is that you take the data table d
58:32 and then you have something at the first
58:34 argument that operates
58:36 on the rows yeah you have here a
58:39 condition that
58:40 tells you which kind of rows you want to
58:42 take
58:44 the second one operates on the columns
58:47 yeah and for example you can in the
58:51 second
58:51 argument you can perform a calculation
58:55 on the columns and the third ones the
58:58 third argument
58:59 is a grouping variable now
59:02 it’s like a typical matrix
59:05 statement with i and j and the third one is a
59:07 grouping variable
59:09 where we can group rows together by
59:11 certain conditions
59:12 and perform these calculations here
59:16 group by group now so the way to read
59:19 that is to take
59:20 the data table d subset the rows using i
59:24 that can be some expression some logical
59:27 statement for example
59:28 we calculate what is in j
59:32 and do this calculation group grouped
59:36 by whatever is in this third argument
59:39 and i’ll now show you step by step
59:41 how this looks in detail
59:44 and these uh arguments here these three
59:48 arguments are
59:48 a very very general way to so to say
59:52 parametrize operations on these
59:56 large tables now so with these just
59:58 three arguments we can do
60:00 a lot of stuff already and if you think
60:03 about it it’s actually
60:04 abstract but it’s very simple
60:09 so now first let’s have a look at the
60:11 very first argument
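The three-argument form can be sketched on a toy table in R, assuming the data.table package is installed. The table d and its columns group and x are invented for illustration.

```r
library(data.table)

d <- data.table(group = c("a", "a", "b", "b", "b"),
                x = c(1, 2, 3, 4, 50))

# d[i, j, by]:
#   i  = which rows to keep        (here: x < 10)
#   j  = what to compute           (here: the mean of x)
#   by = how to group the rows     (here: by the column group)
res <- d[x < 10, .(mean_x = mean(x)), by = group]
print(res)
```

Read aloud: take d, subset the rows using i, calculate j, grouped by the third argument.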

slide 21

60:15 yeah so let’s just say we have a table
60:18 and we just want to get
60:19 a subset of rows from the table
60:22 now for example we want to look at all
60:26 planes
60:26 that have four engines
60:30 now the way we do that is we just
60:33 write this logical statement here as the
60:36 first argument
60:38 so engines equals four
60:41 and if we do that then we get back a
60:43 data table that only has the planes
60:48 that have four engines
60:51 in it now we
60:53 filtered
60:54 the rows of these data tables for the
60:57 rows that have
60:58 the value of the engines column equal to
61:02 four
61:04 now we can also select a subset of
61:07 columns here on the right hand side
61:10 and this we do by this notation here
61:12 this dot is actually a shortcut
61:15 for a list yeah so
61:18 what we do is we give a list of the
61:21 names of
61:22 columns that we want to extract
61:26 as the second argument so for example we
61:30 take this planes data set
61:32 and we want to get the tail number and
61:34 the year
61:35 this airplane was built and then we get
61:38 a shorter
61:39 a smaller data table back that only has
61:42 these two columns
61:44 in them yeah so this is all something
61:46 that you could also do with any other
61:48 package or any other programming
61:50 language of course
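Both operations, sketched in R on a toy planes table, assuming the data.table package is installed. The rows are invented; the column names tailnum, year and engines are assumed to match the planes table described above.

```r
library(data.table)

planes <- data.table(tailnum = c("N101", "N202", "N303"),
                     year    = c(1999L, 2004L, 1987L),
                     engines = c(2L, 4L, 4L))

# First argument: keep only the rows where the condition is true.
four_engines <- planes[engines == 4]

# Second argument: .() is shorthand for list(), here used to pick columns.
tail_year <- planes[, .(tailnum, year)]
```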
61:54 now we want to get a little bit more
61:58 complicated

slide 22

61:59 now we want to do an operation on these
62:01 data tables on these
62:02 large data sets and these operations in
62:05 general have a quite general scheme
62:09 yeah so and this general scheme is
62:11 already
62:12 in this syntax notation that i showed
62:14 you a few slides ago with the i
62:17 and j and the group is already
62:21 so to say in there so the steps
62:26 that we take
62:27 are called the group aggregate combine
62:29 scheme
62:30 so what we do the general operation on
62:33 such a data table looks like that we
62:36 first group our observations
62:40 by some meaningful way so for example we
62:43 want to group
62:44 all flights that happen in the same month
62:50 yeah and then we aggregate aggregate
62:52 means that we
62:53 extract all of these groups all
62:56 flights from the same month
63:00 at a time and aggregate means at the
63:04 third step that we
63:05 perform some kind of summary function
63:08 on these flights that departed in a
63:11 certain month for example we want to
63:13 then calculate
63:14 the average temperature or the average
63:16 delay
63:17 now this is the third thing here where
63:19 we aggregate
63:21 all of these different flights that belong
63:23 to the same group
63:25 and calculate a single result from that
63:30 and i’ll also give you now a specific
63:32 example of this

slide 23

63:34 so what we can do for example
63:37 is that we can group by carrier so we
63:40 want we might want to
63:42 ask are all carriers equally bad
63:46 or equally good or is there one carrier
63:49 that is worse than other carriers
63:51 now the way we do that is we
63:55 first group that’s the last column
63:59 here we group all flights
64:03 and we take the flights data table and
64:06 then we group the flights
64:08 all flights together that have the same
64:10 carrier
64:13 and for each carrier we then calculate
64:17 the average delay using the mean
64:20 function
64:21 now we take the mean of the departure
64:24 delay and this third argument
64:27 just tells us that we should remove
64:30 nas so
64:30 missing values for example where
64:32 no proper measurement was taken or where
64:34 this information is missing
64:36 now we just just ignore this yeah so for
64:39 every
64:40 carrier we take all flights
64:45 and calculate the average delay time
64:50 now so this is what we do here now all
64:52 carriers
64:54 that’s here the grouping is the third
64:55 argument and then we perform this
64:58 operation in the second argument
65:00 because it operates on the columns
65:02 and the operation we do is we calculate
65:04 the average
65:05 of the delay and we save it in a new
65:08 column
65:09 that’s called mean delay
65:13 now what we get from that is for every
65:15 carrier
65:18 we get one number which is this average
65:20 delay
65:23 now and you can see that basically that
65:26 confirms what you
65:27 expected already that united airlines is
65:30 not performing very well
65:34 now we can also have more complicated
65:36 procedures now we can for example
65:39 uh combine that with only
65:42 with a subsetting in the rows we don’t
65:44 want to take all flights
65:46 we always want to look we only want to
65:48 look at flights
65:49 in the evenings you know very late
65:51 flights
65:53 and then we can for example do the same
65:55 thing we uh
65:57 yeah and we can we can also have a more
66:00 fine grade
66:02 fine-grained or we can look at different
66:06 combinations of variables here for
66:08 example
66:09 in this case here we
66:12 take all flights that
66:15 depart after 8 pm
66:21 and the remaining flights we group by
66:24 month and the origin airport
66:29 and we again calculate the average
66:32 departure delay
66:35 and now for every combination of month
66:38 and airport
66:40 we get a value for the average delay
66:42 here at the bottom
66:44 now for example in january uh the
66:47 newark airport had an average delay
66:50 of 14 minutes
66:55 yeah and jfk was doing much better
66:58 yeah and so you can see that this also
67:02 kind of is something that that people
67:05 expect to see here actually
67:07 yeah um okay so this is how this
67:12 group aggregate combine paradigm
67:16 can be fit into very compact syntax
67:20 on basically arbitrarily complex tables
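A sketch of both grouped summaries in R, assuming the data.table package is installed. The five flights below are invented, so the resulting numbers say nothing about real carriers; the column names and the HHMM encoding of dep_time (2000 meaning 8 pm) are assumptions about the flights table.

```r
library(data.table)

flights <- data.table(
  carrier   = c("UA", "UA", "AA", "AA", "DL"),
  origin    = c("EWR", "EWR", "JFK", "JFK", "LGA"),
  month     = c(1L, 1L, 1L, 2L, 2L),
  dep_time  = c(2105L, 830L, 930L, 2015L, 1200L),  # assumed HHMM format
  dep_delay = c(20, 5, NA, 10, 0)                  # minutes, NA = missing
)

# Group by carrier, aggregate with mean(), ignoring missing delays.
by_carrier <- flights[, .(mean_delay = mean(dep_delay, na.rm = TRUE)),
                      by = carrier]

# Same summary, but only for departures after 8 pm,
# grouped by every combination of month and origin airport.
evening <- flights[dep_time > 2000,
                   .(mean_delay = mean(dep_delay, na.rm = TRUE)),
                   by = .(month, origin)]
```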
67:25 now we can also so this was
67:29 now a summary statistics where we
67:31 performed where we
67:32 summarized several rows yeah all flights
67:36 that have the same carrier
67:37 or all flights in a different in a
67:39 certain month and a certain
67:41 airport we summarize and summarize
67:44 all of these flights into just one
67:47 quantity for example the average delay
67:51 so we perform the summary here what we

slide 24

67:54 can also do
67:56 is we can calculate we can add a new
67:59 column to our table
68:02 that has one new value for each
68:06 row that is in the original table so we
68:09 don’t perform
68:10 a summary it’s also sometimes called
68:12 windowing
68:13 for whatever reason and so we have
68:16 one new value for each row that we had
68:20 in our original data set for example
68:23 here the top row
68:24 if i don’t want to have the flights
68:27 uh if i wanted to have this the speed in
68:30 kilometers per hour
68:32 now i can for example then get uh
68:35 calculate the distance over the air time
68:39 in minutes
68:40 now that’s the speed in miles per
68:42 minute
68:44 and then i multiply by 60 and convert to
68:48 kilometers
68:49 and then i have the speed for every
68:51 flight i have the average speed in
68:53 kilometers
68:53 per hour i can also rescale for example
68:57 i can combine
68:59 now these simple computations here with
69:01 the grouping
69:02 for example i can rescale all distances
69:07 by the average distance
69:11 of a certain carrier for example i
69:14 can ask
69:15 is this flight much further
69:18 or much shorter than a typical flight
69:21 conducted by the same carrier yeah so
69:25 then i can calculate a rescaled
69:27 distance that is just the distance
69:30 divided by the average distance
69:33 of this carrier here and then again i
69:36 get a new
69:37 variable a new value for each row that i
69:41 had in my original table
69:44 and the way i do these operations is by
69:46 these symbols here these define
69:48 symbols colon and equal sign
69:52 now the way this looks like is now that
69:54 we have here
69:56 our original 300 000 rows
70:00 but for each row we have calculated now
70:03 a new
70:05 column a new value that we save in a new
70:07 column
70:08 speed uh kmh uh
70:11 that is the speed in kilometers
70:15 and we also have the rescaled
70:17 distance
70:19 now for example this flight here was a
70:21 little bit ten percent shorter
70:23 than the average flight of this carrier
70:27 now and uh this also is saved into a new
70:31 column
70:32 that we’ve given these names here
70:35 now this is something that is that is
70:38 also happening very fast
70:40 and memory efficient
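A sketch of adding per-row columns with the := operator in R, assuming the data.table package is installed. The three flights are invented, and the new column names speed_kmh and resc_dist are hypothetical; the factor 1.609344 converts miles to kilometers.

```r
library(data.table)

flights <- data.table(carrier  = c("UA", "UA", "AA"),
                      distance = c(1000, 3000, 500),  # miles
                      air_time = c(120, 360, 60))     # minutes

# := adds a column in place, with one new value per existing row.
# miles per minute * 60 = miles per hour, then convert miles to km.
flights[, speed_kmh := distance / air_time * 60 * 1.609344]

# Combined with by: rescale each distance by the mean distance
# of the same carrier. The number of rows does not change.
flights[, resc_dist := distance / mean(distance), by = carrier]

print(flights)
```

Because := modifies the table by reference, no copy of the table is made, which is one reason such updates tend to be fast and light on memory.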
70:45 so now comes now comes more complicated
70:49 things so this was in principle easy
70:52 step we add more columns
70:53 we perform a summary and this data table
70:58 package allows us to write a very simple
71:01 syntax
71:02 in order to do that now if you try to do
71:05 these groupings
71:06 in c you write many many lines of
71:08 code
71:09 to do that now

slide 25, 26, 27, 28, 29

71:12 in the next step
71:16 to get some more
71:19 understanding about these delays here
71:22 in new york city we want to make use of
71:24 the fact that we have
71:26 different bits of information
71:28 and these different bits of information
71:31 share
71:33 columns or share information that allows
71:35 us to link them
71:38 yeah for example we can link the weather
71:42 to a certain flight by matching the year
71:45 the month the day and the hour
71:49 and the airport yeah
71:52 and with this we can connect the flight
71:56 to the weather that that occurred when
71:58 this flight
71:59 departed yeah we can also
72:05 connect the airport information for
72:07 example
72:08 so this faa is an identifier that has a
72:11 different name in this table
72:13 but it’s essentially the same as in the
72:15 origin airport
72:16 and we can link we have the information
72:18 to link these tables together
72:21 now for the planes the tail number is a
72:24 unique
72:24 identifier that allows us to look up the
72:27 plane
72:28 of our flight
72:31 in this data table this data set of all
72:34 planes
72:36 and also the same with an airline we can
72:38 look up the carrier
72:40 identifier in another data table and get
72:43 the name of the carrier
72:45 now so these these tables are
72:48 interlinked
72:49 and now we need efficient ways to merge
72:53 these these big different bits of
72:55 information together
72:58 yeah and this merging happens
73:03 in a way that is also often called joins
73:05 now this is a general principle that you
73:07 do basically
73:08 in any kind of of data handling uh
73:11 task independent of the
73:14 programming language that you’re using
73:17 and
73:18 you can perform
73:21 these joins or these mergings of these
73:23 data tables
73:24 in different ways and these different
73:27 ways
73:28 differ in the way which information you
73:31 keep in case you only find it in one
73:34 table but not in the other table
73:37 now for example so let’s begin with one
73:41 kind of join
73:43 it’s called the left join
73:47 that’s called the left join and what you
73:49 do
73:50 is that you keep all rows
73:53 that are in the left data table now in
73:56 this a here
73:58 now we want to combine in this case the
73:59 information in a
74:01 and b yeah and then we can look up
74:06 okay so here we have a value x1
74:09 that corresponds to a
74:12 and we have a value of x one a and now
74:15 we can look up
74:16 what is the other columns x two and x
74:19 three
74:20 now so x two is one and x three is this
74:23 is true
74:25 yeah and then we can combine these two
74:27 columns in the right way
74:29 that they correspond to the value of x1
74:32 a same for b now we can look up b
74:36 in both columns so they don’t need to be
74:37 in the same position
74:39 in both tables we can look up the b and
74:42 get the values of the remaining columns
74:45 and sort them here to the end of this
74:47 table
74:48 and now we have the three the c
74:52 now c is in table a and the first one
74:55 on the left one but not in the right one
74:59 and a left join means that we take c
75:05 from the left one we don’t have
75:08 information
75:09 from the right one so x3 is empty
75:13 and we disregard all rows that are in
75:16 the right
75:17 table but not in the left table so
75:20 that’s a left
75:21 join and in r or in this data table
75:24 framework this is done by the function
75:27 merge that takes two tables
75:31 and the left join is then specified by
75:33 saying that
75:34 all.x is true so all rows in
75:37 the first
75:38 table should be kept
75:42 there’s also a short notation for this
75:44 joining
75:45 and that’s just when you index one data
75:47 table
75:48 by another data table so it’s a very
75:50 compact syntax
75:51 for performing a very complex operation
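
The left join described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the data.table code shown on the slides; the function name `left_join` is hypothetical, and apart from the row for a (x2 = 1, x3 = true) the table values are assumptions.

```python
# Toy tables. Only the row for "a" (x2 = 1, x3 = True) is stated in the
# lecture; the remaining values are assumptions for illustration.
A = [{"x1": "a", "x2": 1}, {"x1": "b", "x2": 2}, {"x1": "c", "x2": 3}]
B = [{"x1": "b", "x3": False}, {"x1": "a", "x3": True}, {"x1": "d", "x3": True}]


def left_join(left, right, key):
    """Keep every row of `left`; right-hand columns are None when unmatched."""
    right_cols = sorted({c for row in right for c in row if c != key})
    lookup = {row[key]: row for row in right}  # assumes unique keys in `right`
    joined = []
    for row in left:
        match = lookup.get(row[key], {})  # position in `right` does not matter
        out = dict(row)
        for col in right_cols:
            out[col] = match.get(col)
        joined.append(out)
    return joined


result = left_join(A, B, "x1")
for row in result:
    print(row)
```

Row c is kept with an empty x3, and row d, which exists only in the right table, is dropped.
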
75:56 and these joins are
75:57 also something that is not trivial
75:59 computationally think of doing that with
76:01 something that has terabytes
76:05 in size yeah so in principle you have to
76:07 look up
76:08 all rows in both data tables and find the
76:10 corresponding matches
76:12 now you need to have uh very
76:15 efficient algorithms
76:16 to do that and depending on what you
76:19 choose here
76:20 which package you choose to perform this
76:22 algorithm you can wait for days
76:24 or you can wait for minutes now that
76:26 makes a huge difference
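
Why the choice of algorithm matters can be seen in a small sketch. The two hypothetical functions below find the same matching row pairs, once by comparing every pair of keys and once by indexing one table first. This is one common strategy, written in plain Python as an assumption; it is not a description of what any particular package does internally.

```python
def nested_loop_matches(left_keys, right_keys):
    """Compare every pair of keys: about len(left) * len(right) comparisons."""
    return [(i, j)
            for i, lk in enumerate(left_keys)
            for j, rk in enumerate(right_keys)
            if lk == rk]


def hash_matches(left_keys, right_keys):
    """Index the right keys once, then look each left key up in the index."""
    index = {}
    for j, rk in enumerate(right_keys):
        index.setdefault(rk, []).append(j)
    return [(i, j) for i, lk in enumerate(left_keys) for j in index.get(lk, [])]


left_keys = ["a", "b", "c", "b"]
right_keys = ["b", "a", "d", "b"]
pairs_slow = nested_loop_matches(left_keys, right_keys)
pairs_fast = hash_matches(left_keys, right_keys)
print(pairs_slow == pairs_fast)
```

For tables with millions of rows the first variant needs on the order of rows-times-rows comparisons, while the second needs roughly one pass over each table.
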
76:29 so if there’s a left join there’s also a
76:30 right join and right join just means
76:32 that we
76:33 keep all rows from the right table
76:38 but we disregard rows from the left
76:40 table if they’re not
76:42 in the right table that’s the right join
76:45 and
76:45 in r this works uh
76:48 also by the merge command
76:50 and we just give the argument all.y
76:53 yeah that says that we should
76:55 keep all
76:57 uh rows in the second the y data set
77:01 and there’s also then a short notation
77:03 for these right joins
77:07 there’s a so-called inner join that’s the
77:10 most restrictive join
77:12 and this
77:16 inner join just keeps values
77:19 keeps rows where we have information
77:22 in both tables now if we don’t have
77:26 information in both tables
77:28 then this row will not end up in the
77:30 final
77:31 data set for example c only is in the
77:34 left
77:35 table d is only in the right table
77:40 yeah and then uh
77:44 none of these ends up in the final
77:46 result
77:48 that’s the inner join yeah and uh
77:52 so again here there’s a merge command
77:54 for that we just say all
77:56 equals false and there’s also a
77:58 shorthand notation
78:00 with an additional argument for that
78:04 so in all of these commands here i
78:05 didn’t specify
78:07 the column that we should use for
78:09 joining i didn’t specify
78:11 that x1 is the column we should use for
78:14 comparing
78:15 now i didn’t specify for example if you
78:18 look here
78:20 which column we should actually use to
78:22 match the tables
78:23 together and i didn’t do that
78:27 because this column has the same
78:30 name
78:32 in both data sets so x1 pops up
78:35 both in a and b and then r automatically
78:40 takes these columns these columns that
78:42 have the same names
78:44 uh uses them to match these different
78:46 tables
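
The rule of matching on columns with the same name can be sketched as follows. This is a plain-Python illustration under the assumption that tables are described by lists of column names; the helper `shared_columns` is hypothetical.

```python
def shared_columns(left_columns, right_columns):
    """Return the column names that occur in both tables, in left-table order."""
    return [name for name in left_columns if name in right_columns]


columns_a = ["x1", "x2"]
columns_b = ["x1", "x3"]
join_keys = shared_columns(columns_a, columns_b)
print(join_keys)
```

If the matching columns carry different names in the two tables, this default finds nothing useful and the key has to be given explicitly.
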
78:48 so if there’s an inner join there’s also
78:50 an outer join
78:52 or a full join and this full join
78:55 retains
78:55 all values from all rows
78:59 so what happens if we just now join
79:02 or merge these two tables
79:04 then we get our column x1 that is shared
79:07 between these tables
79:09 and our column x2 from this first table
79:13 where we don’t have a value for d
79:17 and the column x3 from the right table
79:21 where we don’t have a value for c
79:25 now so this retains the most information
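
The four kinds of join differ only in which unmatched rows are kept. The sketch below makes that explicit with two flags that play the role of the all.x and all.y arguments mentioned above. It is plain Python rather than R; the function `join` and the flag names are hypothetical, and the table values other than the row for a are assumptions.

```python
A = [{"x1": "a", "x2": 1}, {"x1": "b", "x2": 2}, {"x1": "c", "x2": 3}]
B = [{"x1": "a", "x3": True}, {"x1": "b", "x3": False}, {"x1": "d", "x3": True}]


def join(left, right, key, keep_left=False, keep_right=False):
    """Join two non-empty lists of row dicts on `key` (keys unique per table)."""
    left_by_key = {row[key]: row for row in left}
    right_by_key = {row[key]: row for row in right}
    columns = ([key]
               + [c for c in left[0] if c != key]
               + [c for c in right[0] if c != key])
    # Matched keys always survive; unmatched ones only if the flag says so.
    keys = [k for k in left_by_key if k in right_by_key or keep_left]
    if keep_right:
        keys += [k for k in right_by_key if k not in left_by_key]
    rows = []
    for k in keys:
        merged = {**left_by_key.get(k, {}), **right_by_key.get(k, {})}
        rows.append({c: merged.get(c) for c in columns})  # None marks a gap
    return rows


inner = join(A, B, "x1")                                  # neither flag
right = join(A, B, "x1", keep_right=True)                 # like all.y
full = join(A, B, "x1", keep_left=True, keep_right=True)  # like all
for name, table in [("inner", inner), ("right", right), ("full", full)]:
    print(name, [row["x1"] for row in table])
```

Setting only `keep_left=True` gives the left join again.
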
79:32 okay so now we can do some

slide 30, 31

79:36 merging so now we can for example ask
79:39 are these delays affected by bad
79:43 weather now so the first thing we need
79:45 to do
79:46 is to merge the flights
79:49 and the weather data sets because they
79:52 share columns with the same names
79:56 for example time hour or
79:59 hour
80:03 or the origin airport now since i
80:05 didn’t plot that here they share columns
80:07 with the same names that the ones that
80:08 are connected with these arrows
80:10 that works automatically so i merged
80:12 them
80:14 and now i have a new table that
80:19 contains information about the flight
80:22 the carrier
80:23 and also the delay but also at the same
80:27 time
80:27 the columns from the weather data set
80:30 that have information about
80:32 wind speed and precipitation and about
80:34 the temperature
80:37 now and now we can go on and ask for
80:40 example
80:42 let’s group this combined
80:45 table here let’s group that
80:49 on the one hand by the rows that have a
80:52 wind speed larger than 30
80:55 and by wind speeds smaller than 30.
81:00 yeah and then for each of these two
81:04 groups
81:05 calculate the average departure delay as
81:08 before
81:11 now so if the wind speed is smaller or
81:14 equal to 30
81:15 here then the departure delay was 12
81:19 minutes
81:21 if the wind speed was larger than 30
81:24 then the departure delay was 28 minutes
81:27 so much larger
81:29 and if i couldn’t evaluate
81:33 this condition here for example if i
81:35 don’t
81:36 have information about wind speed yeah
81:39 then i get
81:39 also a value for these nas for these
81:42 non-available data
81:44 so that means that if it’s very
81:47 windy there’s a
81:48 higher chance or there’s typically a
81:50 higher average delay
81:52 in these flights in new york city
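
The grouping by a condition, including the extra group for missing values, can be sketched like this. It is a plain-Python illustration with made-up numbers, not the lecture's data or its data.table code; the column names `wind_speed` and `dep_delay` and the helper `windy` are assumptions.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical merged rows (made-up numbers, not the lecture's data set).
rows = [
    {"wind_speed": 10.0, "dep_delay": 5.0},
    {"wind_speed": 35.0, "dep_delay": 40.0},
    {"wind_speed": None, "dep_delay": 12.0},
    {"wind_speed": 20.0, "dep_delay": 15.0},
    {"wind_speed": 32.0, "dep_delay": 20.0},
]


def windy(row, threshold=30.0):
    """Return True or False, or None when the wind speed is missing."""
    speed = row["wind_speed"]
    return None if speed is None else speed > threshold


# Split the rows by the value of the condition, then aggregate each group.
groups = defaultdict(list)
for row in rows:
    if row["dep_delay"] is not None:
        groups[windy(row)].append(row["dep_delay"])

mean_delay = {flag: mean(delays) for flag, delays in groups.items()}
print(mean_delay)
```

Rows where the condition cannot be evaluated end up in their own group, which corresponds to the NA row in the result discussed above.
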
81:57 now we can also ask we can also merge
82:00 the flights table with the planes table
82:09 oh so is anybody stuck at the join
82:11 slide
82:18 now so which what is the slide that
82:19 you’re seeing at the moment
82:24 are planes equally reliable
82:28 are planes equal so that’s the right so
82:30 matthew matthew is having a problem
82:32 um but at least some of you are seeing
82:36 the
82:37 the right slide okay um
82:42 okay so we can merge the flight
82:44 information
82:46 with information about planes
82:50 and now i give this additional column
82:52 here this additional
82:53 argument i tell the merge command to
82:56 use this column tail number
82:59 to to match the information to match the
83:01 rows
83:03 now if i do that then i get also my
83:05 flight information about the delay
83:08 about the carrier and so on but i now
83:11 also get information about the
83:12 manufacturer
83:14 about the model number and about
83:18 the year this airplane was created
83:21 and because we had already a year column
83:24 namely the year of the flight
83:26 we now have two different columns year.x
83:29 and year.y
83:30 year.x is the year of the flight when
83:32 did the flight take place
83:34 and year.y is the year when
83:37 the airplane was built
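
How the two year columns come about can be sketched for a single pair of matching rows. This is a plain-Python illustration; the helper `merge_with_suffixes`, the tail number and all values are hypothetical, and only the suffixes .x and .y follow what is described above.

```python
def merge_with_suffixes(left_row, right_row, key, suffixes=(".x", ".y")):
    """Combine two matching rows; rename non-key columns that exist in both."""
    merged = {key: left_row[key]}
    for name, value in left_row.items():
        if name != key:
            merged[name + suffixes[0] if name in right_row else name] = value
    for name, value in right_row.items():
        if name != key:
            merged[name + suffixes[1] if name in left_row else name] = value
    return merged


# Hypothetical example rows; the tail number and values are made up.
flight = {"tailnum": "N12345", "year": 2013, "dep_delay": 4}
plane = {"tailnum": "N12345", "year": 1999, "manufacturer": "BOEING"}
combined = merge_with_suffixes(flight, plane, "tailnum")
print(combined)
```

Because the key is given explicitly, the shared name year is not used for matching and both values survive under different names.
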
83:41 okay and now we can just
83:44 do our calculation now we can ask are
83:48 certain manufacturers uh more
83:51 or less reliable are the planes by
83:53 certain manufacturers more or less
83:55 reliable
83:56 yeah so for that we do the same thing
83:59 yeah so we group by manufacturer
84:04 now by this column here we group now the
84:07 rows
84:08 and then for each group we calculate
84:11 the average departure delay
84:16 here and save it into a new column mean
84:18 delay
84:20 and in this case we also calculate the
84:21 standard error now which is
84:23 just the standard deviation divided by
84:25 the square root
84:26 of the sample size now we can have two
84:29 computations or as many computations as
84:31 we want
84:32 in the same row yeah and here we see
84:36 that okay so we have here airbus
84:38 industrie i don’t know what
84:39 i forgot what is actually the uh the
84:42 outcome of this
84:43 boeing is a little bit later than airbus
84:47 and then we have here two two times uh
84:49 airbus
84:51 and uh the problem now here is that
84:54 airbus apparently changed its name and
84:56 one of them is older than the others and
84:58 that’s why they have different delays
85:01 now these are smaller airplanes here
85:03 like
85:04 embraer and bombardier these
85:07 are smaller planes
85:08 and these smaller planes have higher
85:12 delays now i can ideas like
85:15 uh i’m not kind of there there’s also a
85:18 smaller
85:19 yeah and some of these smaller planes
85:21 seem to have
85:22 higher delays
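
The per-manufacturer summary described above, with the mean delay and its standard error (standard deviation divided by the square root of the sample size), can be sketched as follows. This is plain Python with made-up numbers; the result names `mean_delay` and `se_delay` are assumptions.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev

# Hypothetical (manufacturer, departure delay in minutes) pairs.
rows = [
    ("BOEING", 10.0), ("BOEING", 14.0), ("BOEING", 18.0),
    ("AIRBUS", 8.0), ("AIRBUS", 12.0),
]

by_maker = defaultdict(list)
for maker, delay in rows:
    by_maker[maker].append(delay)

# Several summaries per group are computed in one pass over the groups.
summary = {
    maker: {
        "mean_delay": mean(delays),
        "se_delay": stdev(delays) / sqrt(len(delays)),  # needs >= 2 values
    }
    for maker, delays in by_maker.items()
}
for maker, stats in summary.items():
    print(maker, stats)
```

The standard error gives a sense of whether a difference between two group means is larger than the uncertainty of each mean.
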

slide 32

85:25 yeah and then we can also do a trick now
85:28 so
85:29 typically you do many many operations
85:32 in such a processing of such data sets
85:34 you do many many operations
85:37 after each other yeah and
85:40 what we typically do in programming we
85:43 do one operation
85:44 and save the result in a new variable
85:47 now then we do the next operation and
85:49 save the result in a new variable now we
85:51 do another operation and save the result
85:53 in a new variable
85:55 now this is very inefficient if you have
85:57 long pipelines
86:00 of maybe hundreds of different steps in
86:02 your analysis
86:03 and how you transform the data
86:06 yeah and in r
86:10 uh and also in other languages there are
86:13 ways
86:14 to chain operations after each other
86:18 now to make these pipelines in a very
86:21 efficient
86:22 and in condensed
86:25 condensed way in terms of syntax now so
86:28 there’s a
86:29 there’s a way that is implemented in
86:31 this data table package
86:33 now there you just chain you just put
86:36 the square brackets directly after
86:39 each other
86:40 for example here we create a new column
86:45 wind speed in kilometers per hour and
86:48 here we just
86:49 multiply the wind speed in miles per
86:51 hour by
86:52 1.61 to get the kilometers per hour
86:57 and then directly in the next step we
87:00 group by month
87:01 and calculate the average wind speed in
87:04 kilometers per hour so these are two
87:07 steps and we don’t have to save the
87:09 results in intermediate variables
87:11 we just put them down we chain them
87:13 directly after each other
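
The idea of chaining two steps without intermediate variables can be sketched with a small table class whose methods return new tables. This is a plain-Python analogue, not data.table's square-bracket syntax; the class `Table`, its methods and the numbers are hypothetical, and only the factor 1.61 and the two steps follow the lecture.

```python
from collections import defaultdict
from statistics import mean


class Table:
    """A tiny list-of-dicts table whose methods return new tables."""

    def __init__(self, rows):
        self.rows = list(rows)

    def with_column(self, name, func):
        """Add a column computed from each row."""
        return Table({**row, name: func(row)} for row in self.rows)

    def mean_by(self, group, column):
        """Group rows by `group` and average `column` within each group."""
        groups = defaultdict(list)
        for row in self.rows:
            groups[row[group]].append(row[column])
        return Table({group: g, "mean_" + column: mean(v)}
                     for g, v in groups.items())


# Made-up weather rows with wind speed in miles per hour.
weather = Table([
    {"month": 1, "wind_speed": 10.0},
    {"month": 1, "wind_speed": 20.0},
    {"month": 2, "wind_speed": 5.0},
])

# Two steps chained directly, with no intermediate variable.
monthly = (weather
           .with_column("wind_kmh", lambda row: row["wind_speed"] * 1.61)
           .mean_by("month", "wind_kmh"))
print(monthly.rows)
```

Each method returns a table, so the next step can be attached directly to the previous one.
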
87:16 so there’s a more common way to
87:20 chain or to make these pipelines and
87:22 this is
87:23 in another package that’s called
87:24 magrittr
87:26 now we load this package just at the
87:29 beginning of our code
87:31 and this package does one thing it
87:33 provides us with an
87:34 operator this one here
87:38 and everything this operator does is
87:41 it takes whatever is on the left of it
87:47 and passes it as the first
87:50 argument
87:51 to the function that is on the right of
87:52 this operator
87:55 now so this is very very simple we take
87:58 what is on the left
88:00 and pass it to the function on
88:02 the right as the first argument
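
What such a pipe operator does can be imitated in a few lines. The sketch below is plain Python, not the R operator itself: the wrapper class `Pipe`, the use of `|` and the helper `head` are all hypothetical stand-ins chosen for illustration.

```python
class Pipe:
    """Wrap a value so that `Pipe(x) | f` calls f(x) and wraps the result."""

    def __init__(self, value):
        self.value = value

    def __or__(self, step):
        # A step is either a function or a tuple (function, extra arguments).
        func, *extra = step if isinstance(step, tuple) else (step,)
        # The left-hand value always becomes the first argument.
        return Pipe(func(self.value, *extra))


def head(items, n=5):
    """Return the first n items."""
    return items[:n]


result = (Pipe([3, 1, 2, 5, 4]) | sorted | (head, 2)).value
print(result)
```

The code reads top to bottom or left to right as a list of steps, which is the main point of the operator.
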
88:05 now and with this we can then have
88:08 longer chains if we want
88:10 now for example we take weather
88:12 calculate the wind speed
88:14 in kilometers per hour and then
88:18 use this chaining operator this pipe
88:21 operator here
88:22 from this package and pass it
88:25 to the next step here we
88:29 have here an updated data table with
88:32 this new
88:33 column and that’s what we take
88:36 and we pass it to the next step of our
88:39 analysis
88:40 where we calculate the average wind
88:42 speed
88:44 per month and we take the result again
88:48 and pass it to basically any function
88:51 that we want and here we pass it to for
88:53 example the the head
88:55 function and this head function just
88:57 gives us the first five rows or so
88:59 of our data table now so with these
89:02 operators here you can make very very
89:04 long pipelines
89:05 and you still have your code uh in a
89:08 very
89:10 basically you have a list then in your code
89:11 you have a list of steps
89:13 that you perform one after another
89:16 and so your code is still
89:17 in a very convenient form that you can
89:20 still

slide 33

89:22 understand yeah so this is a more
89:25 uh more complex example of such a
89:28 pipeline here
89:29 that we take the flights data set
89:34 now merge it with the weather data set
89:39 now merge it with the planes data merge
89:41 the result
89:42 now with the planes data set
89:45 merge the result with the airports
89:48 data set
89:49 and here i have to specify the columns
89:52 yeah that because they have different
89:53 names in both data sets
89:56 and we merge it with the airlines data
89:58 set at the very end
90:00 yeah and now we have a data set that has
90:02 all the different information
90:04 all the different sources of information in the
90:06 same table
90:08 yeah and now we can for example remove
90:11 flights that don’t have information
90:13 about the departure delay
90:15 and we can do the steps that we did
90:18 previously for example we can get the
90:20 speed of the plane the average speed
90:23 in kilometers per hour we get the
90:25 rescaled speed so are they faster or slower
90:30 than what is typical for a certain
90:32 airplane model
90:34 and a certain carrier now and we can
90:38 also calculate things like correlations
90:41 now so for example here
90:43 we calculate the correlation between the
90:46 temperature
90:49 and the rescaled speed of the airplane
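
The correlation mentioned here can be written out explicitly. The sketch below computes a Pearson correlation coefficient in plain Python, assuming that this is the kind of correlation meant; the function name `pearson` and the numbers are made up for illustration.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)  # undefined if either input is constant


# Made-up values: temperature and rescaled speed for four flights.
temperature = [0.0, 10.0, 20.0, 30.0]
rescaled_speed = [-1.0, 0.0, 0.5, 1.5]
r = pearson(temperature, rescaled_speed)
print(round(r, 3))
```

A value close to 1 means the two quantities rise together, a value close to -1 means one falls as the other rises, and a value near 0 means no linear relation.
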
90:54 now and here what do we do here
90:59 ah so what do we do here so we calculate
91:03 the difference in speed of the airplane
91:08 for flights that have a delay larger
91:12 than 20 minutes
91:14 and where the delay is zero
91:18 or smaller than zero yeah so here’s the
91:21 difference in the speeds that have a
91:23 large delay
91:24 and that have a negative delay or no
91:26 delay
91:28 and we calculate that by carrier that’s
91:30 just an example here
91:32 and uh what we see here is that
91:35 uh now so here this this correlation
91:38 here
91:39 is typically positive now so this
91:42 correlation is positive
91:44 that means for some reason the warmer
91:46 the temperature
91:48 the higher the speed of the airplane
91:51 now there’s a positive correlation yeah
91:56 now so whatever this means yeah and
91:59 also what we see here is that this
92:02 difference in speed
92:04 is very often positive now for american
92:08 airlines also it’s positive
92:09 that means that these airplanes fly
92:11 faster
92:13 when they have a delay compared to when
92:16 they don’t have a delay
92:18 and you can also compare the the
92:20 carriers
92:22 united airlines has a little bit more
92:25 speed
92:26 you know than than american airlines
92:29 and you can get all kinds of insights
92:31 here from that now that’s of course not
92:33 extremely insightful yeah but it
92:35 illustrates
92:36 how you can do simple operations on
92:38 complex data sets
92:40 in just a few lines of code and i don’t
92:44 have to tell you that in c
92:45 or matlab you would spend a lot of
92:47 time and a lot of lines of code to
92:49 implement that

slide 34

92:52 yeah just to summarize this part and uh
92:55 if we have time i’ll show you a little
92:56 bit about plotting no we don’t have time
92:59 um so i showed you that these data
93:03 frames are efficient ways of storing
93:06 high dimensional data of different kinds
93:10 and
93:13 depending on
93:16 where your data is stored and how large
93:19 it is
93:20 you can choose different tools yeah if
93:23 your data is stored locally
93:25 and it’s very large and you rely on
93:27 speed then data table
93:29 is a good way to go if your data is
93:34 is for example on a server on the
93:36 internet
93:37 yeah and it’s not that large then you
93:40 would use
93:40 other tools that are better at
93:42 interacting with network uh
93:44 resources
93:45 yeah for example like this dplyr package
93:48 yeah and the key step is actually always
93:52 to bring it to clean up the data to
93:53 bring it to this tidy format
93:56 and once you’ve done that we have this
93:59 split
94:00 aggregate combine paradigm where you
94:03 group
94:04 your data by different conditions
94:07 by basically any condition you want and
94:10 then for each group
94:12 you extract the corresponding rows of
94:14 the data table
94:16 and perform operations on this for
94:18 example to perform summaries
94:20 on this data on this on these rows
94:23 and the package that i showed you here
94:25 allows you to do all of these steps
94:27 in a single very short line of code
94:31 yeah and finally i showed you how uh
94:34 pipelines actually help you to create
94:38 to structure your code uh if it gets
94:41 very complex which is usually usually
94:43 the case
94:44 yeah and uh what i will do is i will
94:47 upload some slides
94:48 that show you how to visualize
94:51 data
94:51 now that’s just i mean you all have your
94:54 favorite plotting packages
94:56 i’ll upload some slides that show
94:58 you how to basically
95:00 uh how people developed a grammar for
95:03 graphics that allows you basically to
95:05 make infinitely complex
95:09 plots or data visualizations using a
95:12 very simple
95:13 grammatical construct so i’ll upload
95:16 this on the website
95:17 and then let me know if you have any
95:20 questions
95:21 and otherwise see you all next week
95:25 so next week i should say so the next
95:26 week we go into machine learning and
95:28 we’ll have a guest
95:29 speaker which is our local which is
95:33 basically the chief data scientist
95:37 from one of the dresden institutes here
95:40 from the
95:41 center for regenerative therapies
95:44 that’s fabian and he’ll
95:46 share some of his insights
95:48 into how to use machine learning to
95:51 detect
95:51 actually low dimensional structures
95:56 in high dimensional data sets now as i
95:59 see as you probably realize now we’re
96:01 able to
96:02 do computations fast computations on
96:06 on huge data sets but what i showed you
96:10 was a little bit tedious you know
96:12 this was i actually didn’t show you
96:13 actually how to come up
96:15 with uh low dimensional structures or
96:17 with order in data
96:19 yeah so and this is something we’ll deal
96:21 with uh next week
96:22 and we’ll have a guest lecturer then
96:25 which is uh which is fabian here from
96:27 from dresden
96:28 okay see you all next week bye
96:32 i’ll stay online for a while in case of
96:34 any questions
96:43 uh i was wondering if um
96:47 if you dealt with like temporal data
96:50 or are most of the data that’s coming
96:53 out of these experiments just kind of
96:55 you know there there’s some static
96:58 information about the
96:59 genomics and proteomics and all that yes
97:02 so that’s typically static data
97:04 but this although it’s static it
97:06 contains dynamic information
97:09 yeah um indirectly so you have
97:12 measurements so of course like very
97:15 often in biology you have to kill the cells
97:17 that you actually want to
97:19 want to measure or you have to kill the
97:21 animals that you actually want to
97:23 look into and but you have some
97:26 indirect dynamical information i’ll
97:29 actually
97:30 share you sent me an email about a paper
97:33 we just
97:34 uploaded something
97:39 that actually answers the question about
97:41 how this field theory
97:43 and data science approaches work
97:46 together and this is actually an example
97:48 from genomics
97:49 i share it with you in the chat um
97:58 just see where this is here we go
98:09 okay so i’ll share it with you in the
98:11 [Music]
98:14 chat all right thank you
98:16 there you go and that that that that
98:18 actually answers the question in your
98:20 email
98:20 it just took we needed
98:22 the christmas holidays
98:24 some time to to finish and upload it
98:27 and um this is actually an example where
98:30 you have
98:31 static data in genomics you can only
98:34 have static data
98:36 but indirectly you have dynamic
98:39 information
98:40 so actually actually actually although
98:42 the map in this
98:43 this specific example here we have
98:46 static data because of course you have
98:48 to kill the cells you have to kill the
98:49 embryo
98:50 but you can still have time courses
98:54 with different embryos or different
98:56 cells now so that you
98:59 kill one embryo at one stage that’s of
99:02 course mouse non-human
99:03 yeah so you kill one embryo at one stage
99:06 and then next embryo one day later
99:09 yeah and the next embryo one day later
99:11 so then that way you get some implicit
99:13 temporal information and actually in
99:16 this
99:16 same paper we also conducted a time
99:19 course
99:20 with a very high temporal resolution
99:23 but
99:24 you always have at each time when you
99:26 kill you look at different cells and
99:28 that’s that’s always the uh so
99:31 that’s the best you can hope for is
99:33 something semi-static
99:35 yeah but but nevertheless yes so we
99:37 still have dynamic information
99:40 indirectly via the measurements that we
99:43 can make for example
99:44 along the dna yeah and
99:48 so although we cannot measure direct
99:51 dynamic information we can
99:54 indirectly generate hypotheses
99:57 that we then can indirectly basically
100:00 test using
100:01 static data if you have a look at this
100:04 at this manuscript and you’ll see how it
100:06 works and let me know if you have any
100:07 questions
100:08 it’s very a little bit more logical and
100:10 this method
100:11 also has a supplement with all the field
100:13 theory in it
100:15 okay thank you
100:18 okay great perfect
100:29 okay so if there are no more questions
100:31 then see you all next week
100:34 bye