X-Informatics Unit 3: Data Deluge 8: Business III


So here’s an example Williams gives of patterns, the so-called Query Rewrite example. It shows how eBay takes a particular query and generalizes it to make it more effective, based on previous use of that query. As he says, just three years ago, in 2010, their search engine was a very straightforward one: it took what you typed, matched it to what it knew, and gave you a response. Unfortunately, many search engines still do that. It is a rather traditional approach built on keywords, and that kind of exact match is a very standard approach in many front-ends to databases.

[pause] eBay points out here that they can improve on that by learning, from what previous people typed, what they really meant. So the idea is to actually understand what you intended and give you what it thinks you wanted. It does that by mining that very big… what they call extreme data, which is effectively Big Data, looking for patterns, and then mapping the words in user queries to synonyms and structured data associated with items. So that’s what query rewrite is: the user types something in, eBay rewrites it, and the rewritten query is the one that actually gets executed.

Here’s an example with a lamp. It seems there’s something called a Pilzlampe, coming from this company in Germany, and eBay gives this as just a simple example. If you look at what they learned… well, they analyzed what people who purchased the Pilzlampe actually typed. Some typed ‘pilzlampe’, correctly spelled. Some put a space between ‘pilz’ and ‘lampe’. Some added an ‘n’ to ‘lampe’. And some put the gap between ‘pilz’ and ‘lampen’. All of those queries correspond to people wanting to buy Pilzlampes, so you can learn the equivalence by seeing what people type and what they buy. They effectively determined that various forms are identical: you can have an ‘e’ on ‘pilz’ or not, you can have an ‘n’ on ‘lampe’ or not, and you can have a space between ‘pilz’ and ‘lampe’ or not. You do this rewriting, and that rewritten query is the one that actually gets given to the search engine. So this is all learnt from customer patterns, from what actual customers do. And this is a very good example of data discovery, because you could adopt a principle that whenever an ‘e’ is followed by an ‘l’ you insert a space, but that may not be very accurate. Here, we’re actually using the data to show that when ‘pilz’ is followed by ‘lampe’, you should allow the gap as a possibility. A small sketch of this kind of learned rewriting appears at the end of this segment.

[pause] Here’s some comment, from eBay but about commerce in general. In 2008, online e-commerce was 4% of total commerce. Now it’s 6%, but of total offline purchases, 37% are actually influenced by the Web. [pause] Actually, the graph looks like more than 37%, so I’m not certain the 37% number is right. It looks more like 50% from the pie chart, because clearly the yellow Web-influenced slice is more than half. So we will have to reserve judgment on this particular figure. Anyway, it’s still large, whether it’s 37% or 57%. [pause] They’re anticipating that everything will get integrated, and this integrated online and offline commerce in 2013 is 10 trillion dollars.
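Before moving on, here is a minimal sketch of how such a learned rewrite could work. This is not eBay’s actual system; the variant table, the query, and the function name are hypothetical, and the only point is that the equivalence classes come from purchase data rather than from a hand-written spelling rule.

```python
# A toy illustration of query rewriting learned from purchase behaviour.
# The variant groups below are hypothetical; in a real system they would be
# mined from logs pairing what users typed with what they actually bought.

# Spellings that purchase data showed all lead to the same item ("Pilzlampe").
LEARNED_EQUIVALENTS = {
    "pilzlampe": {"pilzlampe", "pilz lampe", "pilzlampen", "pilz lampen"},
}

# Lookup from every observed variant to its canonical form.
VARIANT_TO_CANONICAL = {
    variant: canonical
    for canonical, variants in LEARNED_EQUIVALENTS.items()
    for variant in variants
}

def rewrite_query(user_query: str) -> set[str]:
    """Expand a raw user query into the set of equivalent queries to execute."""
    normalized = user_query.strip().lower()
    canonical = VARIANT_TO_CANONICAL.get(normalized)
    if canonical is None:
        # No learned pattern: fall back to traditional exact-match behaviour.
        return {normalized}
    # Learned pattern: search for every spelling known to mean the same thing.
    return LEARNED_EQUIVALENTS[canonical]

if __name__ == "__main__":
    print(rewrite_query("Pilz Lampen"))
    # -> {'pilzlampe', 'pilz lampe', 'pilzlampen', 'pilz lampen'}
```

In a real system the equivalents would presumably carry weights, cover many items, and be combined with the structured data associated with those items, but the data-driven flavour is the same.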
[pause] Now we come to a new set of examples. That was the end of the business set; this is a set on Big Data in Science, or more generally in research applications.

Here is a slide from The Economist, which had a special issue in 2010, already out of date. It starts with the Sloan Digital Sky Survey, and it points out that the survey only collected 140 terabytes of information. Its successor, the Large Synoptic Survey Telescope, which will start in 2016, will collect 140 terabytes every five days. That is 28 terabytes a day, so obviously a big increase. It also gives things we’ve already discussed: Walmart handles a million transactions every hour, and its databases hold 2.5 petabytes. That is actually smaller than the eBay example, but remember this is 2010; I’m sure Walmart’s databases are much larger now. And it points out that this is very much larger than the amount of information in the Library of Congress.

[pause] It also points out that even in 2010, and this is surely much larger now, Facebook had 40 billion photos. And as another example, the human genome: it took ten years to analyze the genome the first time it was done, in 2003, but now it can be done in one week, and soon it will be achievable in hours or days. [pause] So this is just pointing out that the devices and processes that produce data are getting far more effective every day.

[pause] Here is an example related to the genome; it plots the cost per genome. In 2012 the cost per genome is around $10,000, and it started off, of course, a long time ago at $100 million. Well, not so long ago; that was 2001. Another important feature of this graph is that it compares the cost of sequencing a genome with the cost of computing, which follows Moore’s Law. You can see that starting in 2008 there was a dramatic decrease in the cost of genome sequencing compared to the reduction in the cost of computing, and that gap is now over a factor of a hundred. This points to a well-known remark: [pause] if you start asking how much computing it takes to process these genomes, the computing cost is getting more and more significant, because the cost of getting the data is going down so much faster than the cost of computing.

Here’s a sort of number you can estimate. If you had a world where everybody had their genome sequenced, say, every two years, which some people might think is a possible world of the future, where individual genomes are used for personalized medicine and new approaches to health, that would give you 3 petabytes of data per day. Two rough back-of-the-envelope checks on these genome numbers are sketched at the end of this segment.

[pause] Here’s another slide about radiology, or medical imaging. It says that almost 70 petabytes of data per year come from medical imaging, which is a very big number. [pause] It then divides that between the various sources, and it points out that if you added cardiology, that would take the total to over an exabyte, which is a thousand petabytes. These numbers sound big, but the amount of data produced every year is actually measured not in exabytes but in zettabytes, and a zettabyte is a million petabytes.
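To make that factor-of-a-hundred divergence concrete, here is a minimal back-of-the-envelope check, using only the two cost figures quoted above ($100 million per genome in 2001, about $10,000 in 2012) plus one assumption: Moore’s Law is idealized as a halving of computing cost every two years.

```python
# Back-of-the-envelope comparison of genome-sequencing cost with Moore's Law,
# using only the figures quoted in the lecture: roughly $100M per genome in
# 2001 and roughly $10,000 per genome in 2012.  Moore's Law is idealized here
# as a halving of computing cost every two years (an assumption).

cost_2001 = 100_000_000   # dollars per genome, 2001
cost_2012 = 10_000        # dollars per genome, 2012
years = 2012 - 2001

sequencing_drop = cost_2001 / cost_2012   # ~10,000x cheaper
moore_drop = 2 ** (years / 2)             # ~45x cheaper over 11 years

print(f"Sequencing cost fell by a factor of ~{sequencing_drop:,.0f}")
print(f"Moore's Law alone would predict   ~{moore_drop:,.0f}x")
print(f"Gap between the two: ~{sequencing_drop / moore_drop:,.0f}x")
# The gap comes out on the order of a few hundred, consistent with the
# "over a factor of a hundred" divergence read off the graph.
```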
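The 3 petabytes per day estimate, and the imaging and zettabyte figures, can be sanity-checked in the same spirit. The world population value below is an assumption chosen for illustration, not a number from the slides, and the per-genome data size is simply back-solved from the quoted rate.

```python
# Rough sanity check of the "everyone sequenced every two years" estimate.
# Assumption (not from the slides): ~7 billion people; the stored data per
# sequenced genome is then whatever makes the quoted 3 PB/day rate hold.

PB = 10 ** 15  # bytes in a petabyte

population = 7_000_000_000                 # assumed world population
genomes_per_day = population / (2 * 365)   # everyone sequenced every 2 years
print(f"Genomes per day: ~{genomes_per_day / 1e6:.1f} million")

quoted_rate = 3 * PB                       # 3 petabytes per day, as quoted
bytes_per_genome = quoted_rate / genomes_per_day
print(f"Implied data per genome: ~{bytes_per_genome / 1e6:.0f} MB")
# A few hundred MB per genome, a plausible size for a compressed genome.

# The other volumes quoted above, in consistent units:
imaging_per_year = 70 * PB                 # medical imaging, per year
print(f"Medical imaging: ~{imaging_per_year / PB:.0f} PB/year "
      f"(~{imaging_per_year / 365 / 10**12:.0f} TB/day)")
print(f"1 exabyte   = {10**18 // PB:,} petabytes")
print(f"1 zettabyte = {10**21 // PB:,} petabytes")
```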
