IU X Informatics Unit 4 Lesson 1 Data Deluge Science & Research I


Now we come to a new set of examples. That was the end of the business set; this set is about big data in science, or more generally in research applications. Here is a slide from The Economist, which had a special issue in 2010 that is already out of date. It starts with the Sloan Digital Sky Survey and points out that it collected only 140 terabytes of information in total. The successor to the Sloan Digital Sky Survey, the Large Synoptic Survey Telescope, which will start in 2016, will collect 140 terabytes every five days. That is 28 terabytes a day, so that's obviously a big increase.
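As a rough back-of-the-envelope check, here is a minimal Python sketch of that rate, using only the two figures quoted above (140 terabytes for the whole Sloan survey, and 140 terabytes per five days for the LSST):

# Rough data-rate comparison using the figures quoted above.
SDSS_TOTAL_TB = 140          # entire Sloan Digital Sky Survey archive
LSST_TB_PER_5_DAYS = 140     # quoted LSST collection rate

lsst_tb_per_day = LSST_TB_PER_5_DAYS / 5
print(f"LSST daily rate: {lsst_tb_per_day:.0f} TB/day")          # 28 TB/day
print(f"Days for LSST to match all of SDSS: "
      f"{SDSS_TOTAL_TB / lsst_tb_per_day:.0f}")                  # 5 days

In other words, the new telescope gathers as much data every five days as its predecessor did in total.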
The slide also gives examples we have already discussed: Walmart handles 1,000,000 transactions every hour, and its database is about 2.5 petabytes. That is actually smaller than the eBay example, but you must remember this is 2010; I'm sure Walmart's databases are much larger now. The slide points out that 2.5 petabytes is very much larger than the amount of information in the Library of Congress. It also points out that, even in 2010, Facebook had 40 billion photos, a number that is surely much larger now.
Another example is the human genome, which took ten years to analyze the first time it was done, finishing in 2003. Now it can be achieved in one week, and soon it will be achievable in hours or days. This is just pointing out that the devices and processes that produce data are getting far more effective every day.
Here is an example related to the genome. It plots the cost per genome: in 2012 the cost per genome is around $10,000, and it started off at around $100 million not so long ago, in 2001. Another important feature is that the plot compares the cost of sequencing a genome with the performance of computers, which is Moore's law. You can see that, starting in 2008, there was a dramatic decrease in the cost of genome sequencing compared to the reduction in the cost of computing, and by now the difference is effectively a factor of more than 100. This points to a well-known observation: if you start asking how much computing it takes to process these genomes, the computing cost is getting more and more significant, because the cost of getting the data is going down so much faster than the cost of computing.
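To see how big that gap is, here is a minimal Python sketch using the numbers quoted above (roughly $100 million per genome in 2001, roughly $10,000 in 2012) together with the common rule-of-thumb reading of Moore's law as a halving of computing cost about every two years; that two-year halving time is an assumption, not something taken from the slide:

# Compare the fall in sequencing cost with a Moore's-law fall in computing cost.
cost_2001 = 100e6      # approximate cost per genome in 2001 (dollars)
cost_2012 = 10e3       # approximate cost per genome in 2012 (dollars)
years = 2012 - 2001

sequencing_drop = cost_2001 / cost_2012          # factor of 10,000
moores_law_drop = 2 ** (years / 2.0)             # ~45x, assuming cost halves every 2 years

print(f"Sequencing cost fell by a factor of {sequencing_drop:,.0f}")
print(f"Moore's law alone would give a factor of about {moores_law_drop:,.0f}")
print(f"Ratio: roughly {sequencing_drop / moores_law_drop:,.0f}x")   # well over 100, as noted above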
Here is a sort of number you can estimate. If you had a world where everybody had their genome sequenced, say, every two years, which some people might think is a possible world of the future, where individual genomes are used for personalized medicine in new approaches to health, that would give you about 3 petabytes of data per day.
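You can reproduce that estimate with a small Python sketch; the world population of about 7 billion and the few hundred megabytes assumed per stored genome are rough inputs chosen here purely to show how a number of this size arises:

# Back-of-the-envelope estimate: everyone sequenced every two years.
population = 7e9                    # assumed world population
resequence_interval_days = 2 * 365  # everyone sequenced every two years
bytes_per_genome = 300e6            # assumed ~300 MB stored per genome

genomes_per_day = population / resequence_interval_days
bytes_per_day = genomes_per_day * bytes_per_genome

print(f"Genomes per day: {genomes_per_day:,.0f}")        # ~9.6 million
print(f"Data per day: {bytes_per_day / 1e15:.1f} PB")    # ~2.9 PB, i.e. about 3 PB/day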
Here is another slide about radiology, or medical imagery. It says that almost 70 petabytes of data per year come from medical imagery, which is a very big number, and it divides that total among the various sources. It also points out that including cardiology takes the total to over 1 exabyte, which is 1,000 petabytes. And there is a paper arguing that the total amount of data produced every year should actually be measured not in exabytes but in zettabytes, where 1 zettabyte is 1 million petabytes.
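Since the units start to pile up here, a tiny Python sketch of the scale, using only the figures above, may help keep petabytes, exabytes, and zettabytes straight:

# Decimal byte units, and the medical-imaging figure quoted above.
PB = 1e15                 # petabyte
EB = 1000 * PB            # exabyte   = 1,000 petabytes
ZB = 1_000_000 * PB       # zettabyte = 1,000,000 petabytes

imaging_per_year = 70 * PB            # ~70 PB/year of medical imagery
print(imaging_per_year / EB)          # 0.07 exabytes
print(EB / PB, ZB / PB)               # 1000.0  1000000.0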
Here is an example which we will do in more detail later on as our first use case: the Large Hadron Collider, which has become an important and newsworthy item because in July they announced the probable discovery of the Higgs. In fact, most people think they have discovered the Higgs boson. We say probable because, for a scientific discovery to become accepted, it must be checked in various ways; it is always possible either to get false data or to misinterpret the data. This particular discovery is pretty solid because it essentially consists of results that come from two totally different experiments running at the same accelerator. If you look at the ring on the top left of the slide, that is the actual LHC accelerator: a tunnel 27 kilometers in length and, I think, about 300 feet below the ground. Marked on that circle are the various experiments ALICE, ATLAS, CMS, and LHCb. The ones relevant for the Higgs are CMS and ATLAS, which are, at this crude level, relatively similar apparatuses, and they both saw the Higgs.
It's worth noting that at the top right there is an actual picture of the ATLAS instrument, with some actual people in it for scale. The instrument is giant, which is sort of interesting because the particles it is studying are very small: protons have a size of around one fermi, which is 10 to the minus 13 centimeters. As well as this apparatus, the tunnel contains magnets which bend the protons so that they travel in the circle and accelerate properly to reach the energies at which they collide.
This is probably currently the largest science experiment measured in terms of data: it has 15 petabytes of data of various types, and that data involves a lot of computing. It is suggested there are around 200,000 cores, and those cores are scattered over the world. Some are at CERN, the so-called Tier 0 facility. Then CMS has seven Tier 1 facilities, essentially situated in different countries; there is one in the USA. And then there are 50 Tier 2 facilities. ATLAS has a similar arrangement, and Indiana University hosts an ATLAS Tier 2 facility. So this is big science: a single instrument with a few experiments, and each experiment, CMS and ATLAS, has around 3,000 people devoted to it. They publish papers as a group, and those papers announcing the discovery of the Higgs have been submitted.
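As a very naive illustration of those numbers, here is a small Python sketch; the per-core figure is just the quoted 15 petabytes divided evenly across the quoted 200,000 cores, which is not how the data is actually distributed:

# Naive sketch of the LHC computing numbers quoted above.
total_data_pb = 15          # ~15 PB of data
total_cores = 200_000       # ~200,000 cores worldwide

# CMS tier structure as described: Tier 0 at CERN, 7 Tier 1 sites, 50 Tier 2 sites.
cms_tiers = {"Tier 0": 1, "Tier 1": 7, "Tier 2": 50}

data_per_core_gb = total_data_pb * 1e6 / total_cores    # PB -> GB
print(f"Even split: ~{data_per_core_gb:.0f} GB of data per core")    # ~75 GB
print(f"CMS sites: {sum(cms_tiers.values())} across {len(cms_tiers)} tiers")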
Just to show you what a Higgs looks like, here is some actual data from ATLAS. The fundamental technology here is actually simple; it is quite unsophisticated compared to some of the ways that data is analyzed at, say, eBay and Google. You take the raw data and pass it through extremely sophisticated analysis that converts it into information, into cleaned-up data. Again, there is a gray area as to what is information and what is knowledge, but we can say that the information is the set of particles. We take the raw data, which is energy deposits, charged-particle detections, and things like that, or so-called Cherenkov light that helps identify the type of a particle, and we convert that into a whole set of particle momenta and possibly particle types.
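Here is a toy Python sketch of that raw-data-to-information step; the hit format, the function name, and the classification rule are all invented purely for illustration, since the real reconstruction software is far more sophisticated:

from dataclasses import dataclass

@dataclass
class Hit:              # made-up "raw data" record: one detector measurement
    energy: float       # energy deposited (GeV)
    px: float           # crude direction information
    py: float
    pz: float

@dataclass
class Particle:         # the "information": a reconstructed particle
    momentum: tuple
    particle_type: str

def reconstruct(hits):
    """Toy stand-in for the real analysis chain: turn raw hits into particles."""
    particles = []
    for h in hits:
        # Invented rule, just to show the shape of the output.
        ptype = "charged" if abs(h.px) + abs(h.py) > 0 else "neutral"
        particles.append(Particle(momentum=(h.px, h.py, h.pz), particle_type=ptype))
    return particles

raw = [Hit(25.0, 10.0, -3.0, 40.0), Hit(5.0, 0.0, 0.0, 60.0)]
print(reconstruct(raw))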
Actually, I should go back to the previous slide. If we look at the middle of that page, on the right there is an actual picture of an event. The two protons come in from the bottom corner and the top corner and collide in the middle, and the lines coming out and the blue energy deposits represent what is produced in this event. The types of events you want to look at, the Higgs particle events, correspond to highly disruptive collisions where the protons are torn apart; those deposit a lot of energy and many particles traveling perpendicular, or transversely, to the initial protons. In a typical collision, by contrast, the protons just graze off each other and you get a spray of particles along the directions of the original protons. This is an extreme event, a candidate for a Higgs, and we will come back to looking at it later on.
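To make that selection idea concrete, here is a toy Python sketch that keeps events carrying a lot of transverse (perpendicular) momentum and drops grazing collisions; the threshold and the event format are invented purely for illustration:

import math

def transverse_momentum(px, py):
    """Momentum perpendicular to the beam (taken here as the z axis)."""
    return math.hypot(px, py)

def is_higgs_candidate(event, threshold=100.0):
    """Toy selection: keep events whose particles carry lots of transverse momentum."""
    total_pt = sum(transverse_momentum(px, py) for (px, py, pz) in event["particles"])
    return total_pt > threshold     # invented threshold, in GeV

grazing = {"particles": [(1.0, 0.5, 900.0), (-0.8, 0.2, 950.0)]}       # spray along the beam
disruptive = {"particles": [(60.0, 45.0, 20.0), (-55.0, -40.0, 10.0)]}  # lots of transverse activity

print(is_higgs_candidate(grazing))      # False
print(is_higgs_candidate(disruptive))   # True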
