IU X Informatics Unit 5 Lesson 3 Features of Data Deluge I


So now let’s look at some features of the data deluge, which give us a little more detail on the overall process. This is a slide from Bina Ramamurthy, who teaches an undergraduate class on big data at the University at Buffalo. She stresses some features we have already mentioned, in terms of the pipeline from raw data to knowledge and wisdom, and she uses the term intelligence, which is another way of summarizing the final result of big data. As she points out, information is the cleaned-up form of raw data, and intelligence is, in a sense, the final result. And as the square-root-of-N analysis I did earlier suggests, if you want information, you had better have quite a bit of data; good intelligence or good wisdom needs lots of data. We are getting lots of data, whether in readings from single experiments, such as the Large Hadron Collider’s ATLAS and CMS detectors, or from the long tail, which is lots of smaller instruments.
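To make that square-root-of-N point concrete, here is a minimal sketch (my own illustration, not something from the lecture slides): the statistical error of a simple average shrinks like 1/√N, so each factor of 100 more data buys only a factor of 10 in accuracy.

```python
# Sketch: the error of a sample mean shrinks like 1/sqrt(N).
# Illustrative only; the distribution and trial counts are arbitrary.
import random
import statistics

random.seed(42)

def typical_error(n_samples, trials=200):
    """Average absolute error of a sample mean over repeated trials."""
    errors = []
    for _ in range(trials):
        xs = [random.gauss(0.0, 1.0) for _ in range(n_samples)]
        errors.append(abs(statistics.fmean(xs)))
    return statistics.fmean(errors)

for n in (100, 10_000):
    print(f"N = {n:>6,}  typical error ~ {typical_error(n):.4f}")
# 100x more samples -> roughly 10x smaller error (1/sqrt(N) scaling).
```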
So this model of intelligence coming from data is the data-intensive model of science, or equally the data-intensive model of commodity business. She also notes, and we have seen this in the examples, that a lot of the data comes from the web. eBay’s data comes largely from the web, for example, as does Amazon’s; Amazon is an e-commerce site, and so on.
So here is our discussion of some of the aspects of these data systems. We have the raw data, and we have algorithms, which is what I call data analytics. The algorithms are going to go into the programs that process the data and convert raw data to data, data to information, and information to knowledge and intelligence. She has knowledge down here as one of the components, and she also notes that we obviously need a lot of infrastructure and data structures. These components of the previous slide are discussed here.
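As a toy sketch of that pipeline, you can picture the stages as a chain of transformations; the stage functions below are my own invented placeholders, not anything from the slides.

```python
# Raw data -> data -> information -> knowledge/intelligence, as a
# chain of small transformations (placeholder logic, illustrative only).
def clean(raw_records):
    """Raw data -> data: drop empty entries, normalize the rest."""
    return [r.strip().lower() for r in raw_records if r.strip()]

def summarize(records):
    """Data -> information: reduce the records to counts."""
    counts = {}
    for r in records:
        counts[r] = counts.get(r, 0) + 1
    return counts

def decide(information):
    """Information -> knowledge/intelligence: act on the summary."""
    return max(information, key=information.get)

raw = ["Apple", " apple", "", "Banana", "APPLE"]
print(decide(summarize(clean(raw))))  # -> 'apple'
```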
Then she also points out that we have new data structures, which we already mentioned; these come in at the platform-as-a-service level. One feature of data that has recently become important is that it is WORM: write once, read many times. Most data is not rewritten many times. That probably differs from typical transaction data, where you keep rewriting your bank account balance. But if you look at the data we use for e-commerce and similar applications, that data is essentially written once; we gather it from the web. We do update it every now and then, but we read it an enormously greater number of times than we write it, because every search uses the web data, while that web data is only updated from the crawler every now and then. So it is not literally written once, just written a few times; the number of writes is small compared to the number of reads, and that suggests a different data structure.
This comment is one from me, about the difference between the semantic web and big data, and how they relate. The semantic web is an important concept coming from Tim Berners-Lee, who made the incredible contribution of inventing HTML, which effectively enabled the web to explode in the way we have seen. The idea of the semantic web is that we would annotate or curate web pages by adding additional metadata. Metadata is data about data, and it allows the web browser, whether a machine or a person, to understand the real meaning of a page. For, say, a doctor’s page, we might add metadata telling us the doctor’s opening hours, specialty, and whether he or she has space for new patients, and things like that.
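For concreteness, here is roughly what such metadata could look like, sketched as schema.org-style JSON-LD generated from Python. The field names and values here are illustrative, not an exact vocabulary.

```python
# A sketch of semantic-web metadata for a doctor's page; the fields
# below are illustrative schema.org-style terms, not a checked schema.
import json

doctor_page_metadata = {
    "@context": "https://schema.org",
    "@type": "Physician",
    "name": "Dr. Jane Example",            # hypothetical doctor
    "medicalSpecialty": "Cardiology",
    "openingHours": "Mo-Fr 09:00-17:00",
    "acceptsNewPatients": True,            # illustrative field name
}

print(json.dumps(doctor_page_metadata, indent=2))
```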
But if we look at what actually happened, not that much was done with such metadata; rather, search engines were successful because they used the big data approach. There we have a purely data-driven approach: they took the chaotic data you find on the web, and they present and analyze it in ways that allow you to find the real meaning of a web page without the semantic annotation. And although semantic annotation is important, and in some cases critical, it is surprising how far we have gone with just the big data approach, just analyzing data and finding information without a lot of curation. So I think people are somewhat surprised, and possibly not all of them have realized how important the data-only approach to big data is.
So here is an example I found in the very good set of lectures on data science by Jeff Hammerbacher of Berkeley. He quoted this fellow here, Anand Rajaraman, who is at Walmart. Anand taught a class, and that class got involved in a competition; one interesting feature of the Internet is that there are competitions. Netflix ran a competition to try to find a better algorithm for predicting movie ratings. Anand notes that he had two teams that looked at this, or rather, he had several teams, but he looked at two of them in particular. One came up with a brilliant new algorithm, and the other used a simple algorithm but added in additional data. The team that added data to a simpler algorithm actually did better than the team with the sophisticated algorithm. I’m sure that is not always true, but Anand makes the statement that more data usually beats better algorithms.
So this is an important feature of the data deluge and the big data approach: the size and richness of the data are very great. Notice what they did here: they did not add more of the same data, they added different data. That is a very important principle that has been known for a long time; it is sometimes called data fusion. In fact, you can do very well by joining many data sets together in one analysis, and that is data fusion.
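As a minimal sketch of data fusion (my own toy example, with invented data), joining a ratings data set with a separate genre data set lets even a trivial analysis answer a question neither set can answer alone.

```python
# Data fusion sketch: join ratings with a second, different data set
# (genres) on movie_id, then analyze the combined data. Toy data only.
ratings = [(1, 5), (1, 4), (2, 2), (2, 3), (3, 4), (3, 5)]  # (movie_id, rating)
genres = {1: "sci-fi", 2: "drama", 3: "sci-fi"}             # separate data set

totals = {}
for movie_id, rating in ratings:
    genre = genres.get(movie_id)
    if genre is not None:                  # join on movie_id
        s, n = totals.get(genre, (0, 0))
        totals[genre] = (s + rating, n + 1)

for genre, (s, n) in sorted(totals.items()):
    print(f"{genre}: average rating {s / n:.2f} from {n} ratings")
```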
