3-1-1-2-data-deluge.mp4


All right, fellows,
now we come to the data deluge. This is just the enormous amount of
data that’s going on all around us. And this is the second
lesson on the motivation for this big data analytics course. We will see actually that these
slides tend to be on the way out, because people no
longer focus on the data deluge, they focus on what you
get from the data. So no longer are these joyous
descriptions of how many petabytes, exabytes or
zettabytes of data you’re getting. They’re telling you
all about how many decisions they’ve
made using big data. But still, it’s very amazing and important to document
the amount of data there is. So let’s get going,
thank you very much. So this is the KPCB chart, they update this every year. They don’t always put
the same slides, but they have this type
of slide every year. And at the end of 2014,
which is around here, we're up to nearly 9 zettabytes. At the beginning of 2016, 13 zettabytes,
[COUGH] and around 16 zettabytes towards the end of the year. Alright, notice that the largest
science is around 100 petabytes, which is an amazingly small fraction,
.000025, of the total. If we want to understand what
a zettabyte is, suppose you have 100 gigabytes on your laptop, a typical
amount today, maybe of that order. Then 10 to the 10
laptops make a zettabyte. Remember, we have nine of them. An exabyte is a pretty big number, that is one-thousandth
of a zettabyte. A petabyte is also
a pretty big number, and that’s one-thousandth of an exabyte. And finally, we get down
to terabytes, and remember, that’s the size of disks,
and you can buy this 2 terabyte disk for
about $90 today. And a terabyte is 1,000 gigabytes,
remember, 100 gigabytes is a typical
piece of laptop storage. Note an interesting
feature of 7 zettabytes: it's about
a terabyte per person. Pretty interesting, there’s a terabyte per person
on the internet today. So this explosion
of huge numbers is what's driving the data deluge,
data science, the fourth paradigm,
Cloud computing and so on. Okay, here is 2016’s chart
that updates the previous one. I actually think the data is wrong. It measures billions
of petabytes of data. I think it is millions
of petabytes of data, as otherwise it won't agree with
the previous slide, where the total is measured in zettabytes. A zettabyte is a million petabytes. Anyway, here is the amount, which is
getting up to 10 zettabytes here in 2015, but this slide also
has another thing going on. It's showing exactly the opposite
curve which is the cost of storage, which is going down from $0.18
here for a gigabyte of storage, back down here, to $0.06 in 2015. Actually the cost of storage
seems to be decreasing faster than the amount of data is growing, meaning that
we can actually store a larger fraction of the available data
today than we could in the past. Anyway, it’s quite striking
how technology and storage technology is advancing so
fast. Because again, this is the type of thing in the
past people said would not happen, that we couldn't do this type of thing. They said
technology would slow, Moore's law would stop, and then we
couldn't store as much data, and we would have a challenge as
the amount of data increased and our ability to store it lagged behind. Here's just a minor detail from a
nice service, Business Intelligence. This has the total cloud
traffic in zettabytes per year. It’s sort of interesting you know,
this is the macro unit. And we have mobile, Cloud. This is, of course, pointing
out that mobile Cloud traffic is increasing more
rapidly than the total. It’s still a small part. Many people think that not so
far in the future mobile will be the dominant use of all personal
traffic, as this chart points out. But multicore is making
clients smaller, and the only thing that's keeping
these personal devices from getting even smaller is things like my poor
old eyes, which don't like such
terribly small screens. Meanwhile, they're going
to have smart watches, and Google glasses, and heart sensors,
and feet sensors or what have you. All helping me in
providing the wearable me. So, here we are. Cloud traffic increasing
significantly every year. This is standard Moore’s
law exponential. Here is this 1.8 billion
photos per day uploaded and the remarkable thing is you
can’t see poor old Flickr, which is probably the best known
of these various services. There’s a tiny yellow
sliver here you can see. Who is dominant? Facebook, and then suddenly in 2013, up comes WhatsApp, and
then here comes Snapchat. With Instagram, which is
part of Facebook, I believe, hanging in there. So amazing number of photos
uploaded every year. And as I mentioned, Kodak once was the dominant film supplier, and
people took photos lovingly, and they filled
shoeboxes full of photos. And that was the unit of number
of photos owned by a person. But now, with the Cloud as is, you
can fill many digital shoeboxes or virtual shoeboxes, or
electronic shoeboxes in the sky. This slide here updates
the previous one with another year. Well, it actually has 2 solid years,
cuz this comes from 2016, 2 complete years. Going up to sort of, I don’t know,
3.3 billion photos per day, quite an impressive number, showing a rapid rise in WhatsApp,
the light blue. Snapchat is possibly
slowing a little. Messenger is bursting
onto the scene. And Instagram is not doing so well. And Facebook is plugging away,
the core Facebook. And cuz it’s really quite
striking to see how the world of photography has changed due to
the incredibly impressive Cameras they now put on smartphones and the
way anybody uses their smartphones. And so, previously,
you needed your single lens reflex or what have you, camera. It was a large chunky thing. Now you can get comparable quality
with as many pixels on a large thing sorta this size, as you could on a
[INAUDIBLE] smartphone. Implicit in this image growth
is incredible technology growth. The IoT for this being
the cameras on smartphones. And as a companion to
the previous slide on photos from the social media site, here we
have the actual messages themselves. I recently got a WeChat log in
because I was told everybody in China uses WeChat. And so, as I was in China
they wanted to talk to me. I had to install WeChat, which actually works perfectly
smoothly, a fine system. But if we look here you can
see WeChat is somewhat below, Facebook Messenger has
a more rapid increase, and WhatsApp, which is
the largest of these systems. To an old-fashioned
person like me, the interest in messaging as opposed
to email is just sort of curious, because email is not that
different from messaging. Yet, for email, I've actually drawn up a graph of the
growth of email, and I do not think it's growing as significantly as
the messaging systems are. I'm not quite certain why you
couldn't have an e-mail system that was as effective as these
messaging systems are at messaging. So anyway, that's a service
[INAUDIBLE] anyway. This is just an example of
the digital technology and the digital transformation changing
the way people live their lives. Here was a corresponding
YouTube figure. We have 20 hours, 6 years ago, that's around here, 2009. We have 100 hours around 2013,
and actually 2014 seems to be similar. I don't quite understand
whether it’s leveled off or this prediction here was
a little inaccurate. But if you go to this YouTube
press release site it still says 100 hours of video
uploaded per minute. It would be actually
interesting if it leveled off. Because YouTube, although it
has many competitors, is still a pretty important, dominant
service with many people, including this course, using
YouTube to deliver their message. Here's an interesting thing
from KPCB about music. So here we have streaming. Remember,
this course is streamed. We don't have music,
we have my harsh coughing voice. And here we have digital tracks
presumably sold online, but digital. And here we have
physical music sales. So everything is declining,
except streaming music. People don’t even want to
store their music locally, on their iPod or
whatever they used to store it on. They just want to have their
smartphone, or phablet, or tablet, or maybe a window on
their PC, and just listen to the music, or
look at digital TV and web TV, and that's it. That's presumably the future, and this is all part of
digital disruption. The way of doing
business is changing. We know that that was a huge fight
in the music business about this. Probably both sides won, in the sense that the digital
version actually dominated. And effectively one
[INAUDIBLE] is actually going down, but replaced by this. But, in the end,
artists did get paid. Not everything is free, you have
to go to iTunes or Google Play and buy a lot of the things you want,
but just some things are for free. This course is offered for free,
but I’m not a rock music star. Here’s our first of some
slides about sizes. So in every 60 seconds, this is
actually still a little old, but in every 60 seconds you have
whatever it is, over 300 new Twitter accounts and 100,000 tweets. Tumblr is actually
increasing a lot recently, so that's probably wrong. iPhone applications. Flickr, which we know is measurable. [COUGH]
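Just to give a feel for how these per-minute figures compound, here is a minimal Python sketch scaling the two rates just quoted to daily totals. The numbers are simply the ones read out from the slide (300 stands in for "over 300"), so treat them as illustrative rather than current statistics.

```python
# Scale the "every 60 seconds" figures read from the slide to daily totals.
# The per-minute rates are illustrative, taken as read out, not current data.
MINUTES_PER_DAY = 60 * 24  # 1,440 minutes in a day

per_minute = {
    "new Twitter accounts": 300,  # "over 300" per minute on the slide
    "tweets": 100_000,
}

for name, rate in per_minute.items():
    print(f"{name}: {rate:,}/minute -> {rate * MINUTES_PER_DAY:,}/day")
```

At these rates, 100,000 tweets a minute already means 144 million tweets a day, which is how such totals climb toward the petabyte and zettabyte scales quoted earlier.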
1,700 Firefox downloads, 700,000 searches. 168 million emails
are sent every minute. And 60 blogs, and 70 new
domains registered, and so on. These are just
fascinating numbers here. The trouble is actually to keep
up to date, to keep this slide, which is beautiful. I don’t know any place that
keeps these slides up to date. They’re produced in
some heroic effort and then they're not kept up to date. So, that's life. Still very impressive. Here's another version focused on,
actually, U.S. Government data. Here's U.S. Geological Survey. Here's the global IP
traffic from Cisco. And, it’s sorta interesting, it’s
measured in 225 exabytes per month. Over a year,
you have zettabytes of IP traffic. And that goes nicely with
the zettabytes of data stored. Here's the Department of Energy. And here, remember, we have 15
petabytes a year as some measure of the amount of data from the Large
Hadron Collider [COUGH]. Here are the global wearables. Remember, this is meant to reach
25 to 50 billion by 2020, and according to this at the moment
[INAUDIBLE]. See these numbers
are not consistent. Cisco has a larger number,
but anyway, for 2020 they make it 20 billion. So what do we have here? EOSDIS, that's the classic Earth
science storage from satellites. Here we have the NIH explosion
coming from gene sequencing. We have, that’s here and here,
this is NIH Cancer Institute. Here we have Internet of Things. Here’s the Internet of Things,
so this is wearables, which is a subset of
the Internet of Things. Obviously Internet of
Things lives in cars, lives in street monitors and
so on, lives in homes. Here we have a NOAA,
the climate or weather people. And here’s that data increasing. And here we have DoD with
drones pouring out data. What’s it say,
43 terabytes of data per drone. That’s because cameras are digital,
they take their data, and these
little CCDs get smaller every year, and you get more pixels, and therefore more
data from your drone. So this is a pretty interesting
[INAUDIBLE] from scanity.com. [MUSIC] So now we come to an analysis
of a book called, Taming the Big Data Tidal Wave,
a variant of the data deluge. A tidal wave of data,
and this is a 2012 book, already old, sort of, from a chief
analytics officer from Teradata, which is a major database company. And he covers the following
industry, business oriented applications, web data, which
he calls the original big data. Where we know that data from
how people browse websites is actually gathered and
used in all sorts of ways: to motivate marketing,
to redesign those sites, to do recommender engines,
to suggest, from your browsing, what else you might want to look at,
and so on. Auto insurance was discussed here, we can have sensors to
monitor the driver, or actually less obtrusively, the cars. And the information about what
the cars are suffering from the bad driving can be automatically
recorded and analyzed, maybe to set insurance rates, and maybe also
to help drivers drive better. And it could also mend cars
by identifying errors before they actually happen. There’s a lot of forms of text data,
of which email is the most simplest example, sentiment
analysis of tweets is text data. Here we have Natural Language
processing, and also there’s an example which we
do in the introduction from eBay. Where they analyzed how
people purchased lamps. And were able to
improve the interface, from just looking at what people
did and what they ended up doing. So they can more quickly
point you at what you wanted. One of the biggest, and actually
this is one of the 51 use cases, [COUGH] big data applications is logistics, or analyzing truck fleets,
or delivery fleets. So military and commercial and
consumer applications, where GPS data can
track all these things, vehicles, trucks, people, and so on. That’s GPS, we have RFID, which is
sort of a tag which allows you to do better manufacturing logistics,
cuz you know where everything is. You have the smart grid, which is
the electrical utility industry. And the smart grid can analyze
local usage in a building, and tell that building how
better to do what it’s doing. And it can also monitor
the actual power as it transfers through the network,
maybe make the network better used. There’s the gaming industry, where
you can look at RFID tracking chips, and that can help
you identify fraud. We have, just after this, analysis
from GE of industrial engines, which is sort of
the industrial Internet. And the sensor data and how that’s
an enormous amount of data, can allow you a far more reliable
monitoring of large equipment. Video games is a little like
the web data example, and even the auto insurance example. You have telemetry, which is monitoring what’s
going on in the case of users. And this is monitoring
how people play games, to actually design better games,
to see what people like. And there’s a whole broad
social media example, which is the telecommunication
industry and Facebook and so on. And this linkage of people to people
allows you to find new customers automatically from, again,
recommender engine-like technology. Here we have the GE
example I mentioned, so in 2012,
who knows if that’s still accurate, GE were gathering seven
times the data of Twitter. And so GE engines
are a big data source, and that data is used just to monitor
the health of the equipment. There’s a very nice example
of the industrial Internet, which there are several. You can find quite a lot of
interesting discussions about, cuz it’s not just GE. GE has set up, in fact, a whole new
software endeavor in Silicon Valley, because of the importance
of the industrial Internet. Here’s a little more detail,
they have 25,000 engines, they have 3.6 million flight
records per month, and each of them has 200 parameters,
18 million parameters per month. And this allows fuel efficiency, aircraft space capacity analysis,
and produces all sorts of improvements
to the aircraft and the engines. And it basically improves
the whole maintenance process. Hopefully it will make customers happy,
cuz they identify various faults. And this is a pretty nice talk, I
have a link down here from a talk at Berkeley, which I
strongly recommend. So here is a summary of some sizes
from science and technology. We’ve already done the 15
petabytes per year for Large Hadron Collider, radiology
is big, 69 petabytes per year. There’s a thing called the Square
Kilometer Array Telescope, SKA, which is half a zettabyte
per year of raw data around 2022. Usually these are optimistic, Earth
observing is still quite small, only a few petabytes per year. Earthquake science is now
actually terabytes or more. PolarGrid is hundreds of
terabytes per year, and is growing very rapidly, this is due
to the improvement of the radars. Every time you improve
your radar and add more channels, you get better resolution,
you get a lot of data. And the important application
is just looking at the results of simulations, so you have these
computers doing simulations. You can take a collection of
simulations, a collection of results, and analyze those results to
get much more insight; that's actually a pretty good
example of machine learning. And that's around a tenth
of a zettabyte per year from an exascale simulation. Here's a nice example from NIH,
and this is a famous one. Here’s the cost to sequence
a genome, here we have $100 million for the first genome
in 2001, not so long ago. And now we are down here to around
$1,000 at the budget genome
sequencing organizations. And what's striking is here's
this cost to analyze the genome, and now we compare it with
Moore's Law. And so the cost to sequence
[INAUDIBLE] has drastically decreased by many orders of magnitude,
maybe three orders of magnitude. It has decreased by three orders
of magnitude more than the cost of computing. Which says, then, that
to analyze a given amount of data, it's gonna cost 1000 times as
much relative to what it did in 2001. So something which was dominated
by the effort of gathering the data might now be dominated
by the actual processing, the actual analytics
run on the data. And this is gonna lead to 100
petabytes or so per year if we, for instance, decided to
sequence everybody. Here’s an interesting curve
highlighting an important issue, which we’ll come at from
several points of view. Here’s called the long tail
of science, and that says, there are some fields
like particle physics, that’s the LHC, astronomy, the
Square Kilometer Array, and LSST. And here's biology, and
these are plotted up here; these are a few experiments, each of which
is very large, that’s this up here. And then over here we have the long
tail, economics, social science, some biology, where you have individuals gathered
doing lots of experiments. But they’ve not got a lot of people
involved, and not a lot of data. So we have a few large data things,
and then we have a lot of small data things, the lot of small data
is the so-called long tail. Long tail is very suitable to
clouds, cuz clouds are very effective at analyzing lots of
things, each of which is not so big. They’re not so effective at
analyzing individually huge things, because then you need to
use parallel computing. And for some things, like search and
recommender engines, clouds are very effective
parallel computing engines. There are other things like
clustering that are not so effective. So this type of graph is also
seen when you look at books sold. There are a few books for
sale for an enormous amount, and lots of books for
sale for a small amount. And if you were to go around
a physical bookstore, you see the cutoff line here. And you only have space
to hold this, and the person going to the physical
bookstore never sees this. This is why the Internet allows,
it’s sort of more democratic, it allows the long
tail to be accessed. And using recommender engines,
you can actually suggest which part of the long tail people should
look at, pretty interesting. So here are some of data
intensive activities, from my point of view, I gave you the
[INAUDIBLE] fellow Francis view. Particle physics,
which is a bag of events, information retrieval
is a bag of words. I’m trying to point out there’s
always a space attached to each of these activities. E-commerce, a bag of items to be
sold, or users trying to buy things. Social networking, a bag of
people with links and properties. Health informatics, a bag of health
records or a bag of gene sequences. Sensors, lots of pixels,
bag of pixels, and these applications here use
statistics, deep learning, image analysis, recommender engines,
or anomaly or outlier detection. And they do this on clouds, and this
slide here really gives you a nice example of a rich set of fields
with a different set of spaces, with a range of tools. All running on clouds, and
they’re using variants of MapReduce. And this comes to our famous
summary of the course, the big data ecosystem
in one sentence. We’re using clouds,
we’re running data analytics, we’re doing it collaboratively,
so we’re all working together. We’re processing big data, and we’re
solving problems in X-informatics, or e-X. And X-informatics is
a superset of X-analytics, and here are the values of
X we discovered on the web. And we noted that some
fields, like physics, weren't actually defined; we hadn't used the term informatics
before, but we should, because we're doing data science, and that's
what this course is all about. And it’s an exciting new academic
area, which captures all of this. And here is the final
slide of this lesson, remember, this is
lesson two of unit two. And this is from the web,
what I collected: these definitions, where I found people had introduced
the concept of X-informatics before. Originally around when I first
came to Indiana University, I taught my first class
called X-informatics, and got attacked because people said,
this is not a good idea. And then I stopped for about ten
years, and gave up that course. Well, sometimes you have good ideas
and you have them too early, or you give them up too early,
you’re too sensitive, and so on. Here we have Earth science
informatics, pathology informatics. Lots of other medical informatics
here, health informatics, biomedical informatics,
medical informatics, biochemistry, chem informatics,
biology and so on, bio informatics. Here we have energy informatics,
lifestyle informatics, which isn’t quite the same
lifestyle as I use it. But at least there’s a university in
the Netherlands that can study that. Environmental informatics. And a much bigger field,
social informatics, all right, there we are, informatics and
X-informatics. You can get rich,
you can cure cancer, do whatever you want
with X-informatics. And all you have to do is
learn a bunch of algorithms, buy a few clouds, and lo and behold, you have everything
you need to do X-informatics. And of course, you better get
a degree in data science, cuz that's the qualification. So here I am, signing out
of lesson two of unit two, the motivation. Thank you.
