X Informatics Unit 3 Data Deluge 8 Business III


So here’s an example Williams gives of these patterns, the so-called query rewrite example. It shows how eBay takes a particular query and generalizes it to make it more effective, based on previous use of that query. As the slide says, just three years ago, in 2010, their search engine was a very straightforward one: it took what you typed, matched it against what it knew, and gave you a response. Unfortunately, many search engines still do that. In fact it’s a rather traditional approach, built around keywords and exact matches, and exact matching is the standard approach of many front ends to databases. eBay points out that they can improve on that by learning, from what previous users typed, what those users really meant. So the idea is to understand what you intended and give you what it thinks you wanted, and it does that by mining what they call extreme data, which is effectively big data: they look for patterns and map the words in user queries to synonyms and to structured data associated with items. That’s what query rewrite is: the user types something in, eBay rewrites it, and that rewritten query is the one that actually gets executed.
Here’s an example with a lamp. It’s known to some people as a Pilzlampe (a mushroom lamp), coming from a company in Germany, and eBay gives this as a simple example of what they learnt. They analyzed what people who purchased the Pilzlampe actually typed. Some typed pilzlampe, correctly spelled. Some put a space between pilz and lampe. Some added an n to make it lampen, with or without the space. All of those queries correspond to people who wanted to buy Pilzlampes. You can learn that by seeing what people type and what they then buy, and eBay effectively determined that these variants are identical in intent: whether or not there’s a space between pilz and lampe, and whether or not there’s the extra n, the rewriting treats them as the same query, and that rewritten query is the one that actually gets given to the engine. So this is all learnt from customer patterns, from what actual customers do. It’s a very good example of data discovery: you might have a general principle that whenever there’s [INAUDIBLE] you insert a space, but that might not be very accurate here. Instead, we are using the data itself to show that for pilz followed by lampe, you should include the space as a possibility.
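As a minimal sketch of the idea (not eBay’s actual system; the rule table and query syntax here are illustrative assumptions), in Python:

    # A toy query-rewrite table learned by joining query logs with purchases:
    # all of these variants led to purchases of the same item, so they are
    # treated as the same intent.
    OBSERVED_VARIANTS = {"pilzlampe", "pilz lampe", "pilzlampen", "pilz lampen"}

    # The learned rewrite makes the space and the trailing "n" optional,
    # so a single executed query covers every variant.
    REWRITTEN_QUERY = '"pilzlampe" OR "pilz lampe" OR "pilzlampen" OR "pilz lampen"'

    def rewrite(user_query: str) -> str:
        """Replace a raw user query with the broader learned query, if a rule exists."""
        if user_query.strip().lower() in OBSERVED_VARIANTS:
            return REWRITTEN_QUERY
        return user_query  # no rule learned: pass the query through unchanged

    print(rewrite("Pilz Lampe"))  # broadened to the learned query
    print(rewrite("table lamp"))  # unchanged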
Here are some comments, from eBay but about commerce in general. In 2008 online e-commerce was 4% of total commerce; now it’s 6%, but of the total offline purchases, 37% are actually influenced by the web. Actually, the graph looks like more than 37%, so I’m not certain the 37% number is right; from the pie chart it looks more like 50%, because the yellow web-influenced segment is clearly more than half. So we will have to reserve judgement on this particular figure. Anyway, it’s still large, whether it be 37% or 57%. And they’re anticipating that everything will get integrated, and this integrated online and offline commerce in 2013 is $10 trillion in total.
Now we come to a new set of examples; that was the end of the business set. This is a set of big data examples in science, or more generally in research applications. Here is a slide from The Economist, which had a special issue on big data in 2010, already out of date. It starts with the Sloan Digital Sky Survey, and it points out that they’ve only got 142 TB of information. The successor of the Sloan Digital Sky Survey, the Large Synoptic Survey Telescope, which will start in 2016, will get 140 TB every five days, which is 28 terabytes a day, so that’s obviously a big increase. It also gives figures we’ve already discussed: Walmart handles a million transactions every hour, and its database, at 2.5 petabytes, is actually smaller than the eBay example. But this was 2010; I’m sure Walmart’s database is much larger now. The slide points out that this is very much larger than the amount of information in the Library of Congress. It also points out that, even in 2010, Facebook had 40 billion photos, and that is surely much larger now. Another example is the human genome: it took ten years to analyse a genome the first time it was done, finishing in 2003, but now it can be achieved in one week, and soon it will be achievable in hours or days. So this is just pointing out that the devices and processes that produce data are getting far more effective every day.
Here is an example related to the genome. It plots the cost per genome over time: in 2012 the cost per genome is around $10,000, and it started off, of course, a long time ago at $100 million. Well, not so long ago; that was 2001. Another important feature of this graph is that it compares the cost of getting a genome with the cost of computing, which falls more slowly. You can see that starting in 2008 there was a dramatic decrease in the cost of genome sequencing compared to the reduction in the cost of computing, and by now that’s over a factor of 100 difference. This points to a well-known remark: if you start asking how much computing it takes to process these genomes, the computing cost is getting more and more significant, because the cost of getting the data is going down so much faster than the cost of computing.
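As a rough check on that factor of 100: the $100 million (2001) and $10,000 (2012) figures are from the graph, while the assumption that computing costs halve every two years is mine, used only for illustration.

    # Where sequencing cost would be in 2012 had it only tracked computing
    # (assumed to halve every two years), versus where it actually is.
    cost_2001 = 100e6                 # ~$100 million per genome in 2001
    cost_2012 = 10e3                  # ~$10,000 per genome in 2012
    years = 2012 - 2001

    projected = cost_2001 / 2 ** (years / 2)        # ~$2.2 million
    print(f"Computing-tracked projection: ${projected:,.0f}")
    print(f"Gap versus actual cost: {projected / cost_2012:.0f}x")  # ~220x, i.e. over 100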
Here’s a sort of number which you can estimate. If you had a world where everybody had their genome sequenced, say, every two years, which some people might think is a possible world of the future, where individual genomes are used for personalized medicine and new approaches to health, that would give you three petabytes of data per day.
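Here is a minimal sketch of that estimate. The population and the two-year cadence follow the lecture; the roughly 0.3 GB per stored genome is an assumption (about the size of a compressed finished genome; raw sequencing reads would be far larger).

    # Back-of-the-envelope: everyone sequenced once every two years.
    population = 7e9                  # rough world population
    years_between = 2
    bytes_per_genome = 0.3e9          # assumed ~0.3 GB per stored genome

    genomes_per_day = population / (years_between * 365)        # ~9.6 million
    petabytes_per_day = genomes_per_day * bytes_per_genome / 1e15
    print(f"{petabytes_per_day:.1f} PB/day")                     # ~2.9, i.e. about 3 PB/day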
Here’s another slide, about radiology and medical imaging. It says that you have almost 70 petabytes of data per year coming from medical imaging, which is a very big number, and it divides that between the various sources. It also points out that if you add in cardiology, that takes the answer to over an exabyte, which is a thousand petabytes. These sound big, but the total amount of data produced every year is actually measured not in exabytes but in zettabytes, and a zettabyte is a million petabytes.
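To keep the prefixes straight, here is a small sketch of the decimal byte units used above, with the 70 PB/year imaging figure plugged in for scale.

    PB = 10**15                       # petabyte
    EB = 10**18                       # exabyte   = 1,000 PB
    ZB = 10**21                       # zettabyte = 1,000,000 PB

    imaging_per_year = 70 * PB        # figure quoted from the slide
    print(imaging_per_year / EB)      # 0.07   -> well under an exabyte
    print(imaging_per_year / ZB)      # 7e-05  -> a tiny fraction of a zettabyte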
