Arm Up for the Big Data Deluge (ft. Anastasia Ailamaki)


Big Data is an effort to squeeze larger and larger amounts of data through narrower and narrower bottlenecks. It has nothing to do with size. People say that Big Data is now relevant because we have a lot of data. Yes, of course, this is important, but guess what? We've always had a lot of data. Twenty years ago, kilobytes were a lot of data, because even with kilobytes there were things we could only do too slowly: people could not get an answer to their questions in what they considered an acceptable response time. So Big Data is a problem, it's construed as a problem, people perceive it as a problem, because it's what they feel is impeding them from getting an answer to a question in the time they consider acceptable.

This is Professor Anastasia Ailamaki of the IC School at EPFL, and what she's talking about is probably every student's reaction to having to read a huge pile of documents: this is going to take a while. Dealing with Big Data is a huge challenge, especially as, these days, we tend to get upset quickly when our technological devices are not responsive enough. Come on! But where does all of this Big Data come from?

It used to be just user-generated data that grew. Now we also have a ton of machine-generated data, and that grows at unprecedented rates, because machines can just do that, 24/7, 52 weeks a year. They're just generating data; that's what they do.

This is going to be even more the case as the Internet of Things takes over the world. Soon enough, your watch will be measuring your pulse every second, your fridge will be measuring everything you eat, and your house will be optimizing its energy consumption. Managing all of these data in real time is only part of the challenge. Another big challenge is to interpret all these data in a standardized manner, and this has turned out to be more difficult than you may think. One might expect machine-generated data to need less elaborate algorithms: since it's generated by machines, there should be some canonical features in it.

Interestingly enough, user-generated data seem to be more structured, well, except for text, obviously. The data that we generate and collect, the user-generated data, seem to be more structured than machine-generated data: because of the way the machine data are collected, because of the normalization differences among the different machines collecting them, and because interpreting the data is a different process depending on what the interpretation serves. Machine-generated data therefore need more elaborate algorithms to be processed efficiently, and we find ourselves in a pickle trying to combine machine-generated data with user-generated data, structured and unstructured, and to link them to the interpretations needed by the applications.
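To make that pickle a bit more concrete, here is a minimal sketch of the kind of normalization such a combination requires. Everything in it, the device formats, field names, and schema, is invented for illustration and is not from the talk:

```python
import json
from datetime import datetime, timezone

# Structured, user-generated data: a profile row as it might come
# from a relational table (hypothetical schema).
user_profiles = {
    "u42": {"name": "Alice", "resting_hr": 60},
}

# Machine-generated readings arrive in device-specific formats:
# each device normalizes differently, as the talk points out.
reading_device_a = '{"uid": "u42", "ts": 1700000000, "hr": 72}'
reading_device_b = "u42;2023-11-14T22:13:20Z;heart_rate=71.5"

def normalize_a(raw: str) -> dict:
    """Device A sends JSON with a Unix timestamp and integer bpm."""
    msg = json.loads(raw)
    return {
        "uid": msg["uid"],
        "time": datetime.fromtimestamp(msg["ts"], tz=timezone.utc),
        "heart_rate": float(msg["hr"]),
    }

def normalize_b(raw: str) -> dict:
    """Device B sends a semicolon-separated line with an ISO-8601 time."""
    uid, ts, field = raw.split(";")
    return {
        "uid": uid,
        "time": datetime.fromisoformat(ts.replace("Z", "+00:00")),
        "heart_rate": float(field.split("=")[1]),
    }

# Only after normalization can the machine-generated readings be joined
# with the structured user data and interpreted for one application.
for reading in [normalize_a(reading_device_a), normalize_b(reading_device_b)]:
    profile = user_profiles[reading["uid"]]
    delta = reading["heart_rate"] - profile["resting_hr"]
    print(f'{profile["name"]} at {reading["time"]}: {delta:+.1f} bpm vs resting')
```

Two devices, two parsers, and this is the easy case: the interpretation (here, deviation from resting heart rate) would look entirely different for another application reading the very same bytes.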
Big Data is thus a cause of many headaches for computer scientists. However, Big Data is also a blessing, not just a problem, because Big Data is the reason why we can deduce so much useful information. In fact, Big Data is showing us a new way to inquire into the world we live in.

The scientific method has traditionally been theory and experimentation. You have a theory; how do you corroborate that theory? Well, you use mathematics, which is number one. And then you run experiments, if you're lucky, if you can get your hands on an experimental setup, and you run experiments to corroborate the theory. Later, computers came along, so people could use simulations. You don't have to set up all your infrastructure within math, which is hard; you can actually go to a computer, program the whole infrastructure, and study many levels of it through simulations. That gives you a huge advantage, because you don't only study reality, you can also study what-if scenarios, which you can't really represent in reality. So that's where we were before data.

Now that data is so abundant, and our ways of processing it are progressing so rapidly, we can add to the scientific method. Not change it, really. Many people say that we change it, but really what happens is that we add to it. We complete it with the following: we ask the data to give us the hypotheses that we should be corroborating, again, using the data. Before, humans had to generate the hypotheses. Now we can ask the data: what are the interesting questions to ask? This obviously has to do with machine learning and artificial intelligence, but it also pertains to us computer scientists, database people, because it has to do with data. Supporting these kinds of questions efficiently is what gives rise to what we call the 4th paradigm, which is the new scientific method.

This is extremely exciting. Combined with great computational power and clever algorithms, Big Data may give us new insights into the world we live in, insights that our limited brain capacity would never be able to foresee. However, according to Anastasia Ailamaki, there is still another challenge that Big Data raises and that we will need to face more efficiently.

Before 2007, according to EMC, we had enough storage to store all of the data that was produced. That meant we could store multiple copies of the data and actually keep them coherent, to the extent possible, as the applications needed. This is not the case anymore: we can't store multiple copies of the data. So whatever you do, you need to do with the original storage of the data. So you'd better be careful there! You'd better be conscious of how the data is going to be used. The problem is: how do you know how the data will be used? Storing and organizing the data efficiently remains, for me, a very important question. No matter how expensive and sophisticated your algorithms are, whether you do data mining or data analysis, your initial physical organization of the data is how you made your bed in the morning: if you made it well, you'll sleep well at night. If you didn't organize your data in a thoughtful way, you're not going to be able to harness its full potential, because performance is going to drop; when performance drops, efficiency drops; and when efficiency drops, user productivity drops. And that's where everything ends: in user productivity.

This is something extremely important to understand: in the age of Big Data, the way your data is stored has strong implications for how easy it will be to retrieve information from your huge data set. In particular, the optimal way to store your data depends on how you intend to use it.
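A classic illustration of this dependency is the choice between row-oriented and column-oriented layouts. The following is a minimal sketch; the sizes and byte counts are rough assumptions, not measurements from the talk:

```python
# A minimal sketch of why physical layout matters.
n = 100_000

# Row-oriented layout: every record keeps all its attributes together.
rows = [{"id": i, "amount": float(i % 100), "note": f"comment #{i}"}
        for i in range(n)]

# Column-oriented layout: one array per attribute.
columns = {
    "id":     [r["id"] for r in rows],
    "amount": [r["amount"] for r in rows],
    "note":   [r["note"] for r in rows],
}

# The same analytical query, SELECT SUM(amount), in both layouts.
total_from_rows    = sum(r["amount"] for r in rows)   # visits every record
total_from_columns = sum(columns["amount"])           # visits one array
assert total_from_rows == total_from_columns

# On disk, the difference is bytes read: with roughly 8-byte ids,
# 8-byte amounts and 40-byte notes, a row store scans about n * 56
# bytes for this query, while a column store scans about n * 8.
print(f"row store reads ~{n * 56:,} bytes, column store ~{n * 8:,}")
```

Conversely, a workload that fetches whole individual records, say a point lookup of one sale, favors the row layout, which is exactly why the best organization depends on the intended use.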
The problem is: how do you know how the data is going to be used? That question lives several levels of abstraction higher than where you are when you collect the data and organize it. So there, very smart, dynamic data-processing, query-processing, and transaction-processing algorithms take place, in order to dissociate a big part of the efficiency at the user level from the initial data organization.

In other words, a powerful data abstraction can allow great performance for many different tasks. But data organization is still important, and so is choosing which data is interesting to store and keep. This is particularly true for Big Data generators like the LHC at CERN, which has to throw most of its data away because it generates more data than it can store. You then have to be very wise about what you're going to store.

Now, going up the stack, query-processing and data-processing algorithms need to really change face. What we're going through, at this point in the data management community, is a move from static query operators and query-processing algorithms to more and more dynamic ways of processing queries. Wait, don't think yet about how we'll optimize this query; let's first see what the user wants to do, and then we decide how to use the computer's resources. This is clever: since the optimal data storage depends on how the data will be used, we should probably first learn how it is being used. So we're going in a more adaptive, less static, less predetermined, more post-deciding direction when building data management software.
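As a toy illustration of this post-deciding style, in the spirit of adaptive indexing techniques such as database cracking, here is a sketch of a table that defers building an index until the observed workload justifies it. The policy and the threshold are invented for the sketch:

```python
class AdaptiveTable:
    """Toy table that picks its access path only after watching the
    workload: it scans at first, and builds a hash index on an attribute
    once that attribute has been probed often enough."""

    INDEX_AFTER = 3  # build an index after this many lookups on a key

    def __init__(self, records):
        self.records = records   # list of dicts
        self.lookup_counts = {}  # attribute -> how often it was probed
        self.indexes = {}        # attribute -> {value: [records]}

    def lookup(self, attr, value):
        # If we already invested in an index, use it.
        if attr in self.indexes:
            return self.indexes[attr].get(value, [])

        # Otherwise, observe the workload...
        self.lookup_counts[attr] = self.lookup_counts.get(attr, 0) + 1
        if self.lookup_counts[attr] >= self.INDEX_AFTER:
            # ...and only now decide the index is worth building.
            index = {}
            for rec in self.records:
                index.setdefault(rec[attr], []).append(rec)
            self.indexes[attr] = index
            return index.get(value, [])

        # Cheap default until then: a full scan.
        return [rec for rec in self.records if rec[attr] == value]

table = AdaptiveTable([{"id": i, "city": "Lausanne" if i % 2 else "Geneva"}
                       for i in range(10_000)])
for _ in range(5):            # repeated probes trigger index construction
    hits = table.lookup("city", "Lausanne")
print(len(hits))              # 5000
```

The point is not the particular data structure but the ordering of decisions: the system commits resources only after the workload has revealed what the user actually wants.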
A lot of people are also talking about adaptive storage and moving data across different levels of storage. Yes, because old data may still have value, but the way it is used may have evolved over time. To adapt to how the data is going to be used, we may want to re-store already stored data in a more optimal way. That is also a very promising area in terms of research and products. But it has to be used with a lot of care, and the reason is that it involves data movement, and data movement is a big problem.

Today, we are working on moving data from cold storage to hot storage, across several layers in a hierarchy, seamlessly to the applications. These ideas come together and agree with the adaptivity ideas I spoke about before. The problem, however, is that they incur data movement, and data movement is a big problem today. People avoided moving data before, because it was expensive; but today it is more expensive than ever.

Wait. Has the cost of data movement increased? It was costly before, but the differences are bigger now. So it's not more costly in absolute numbers; it's more costly because there is more data. Moving the data you need is going to cost you more because the sheer size is larger: the technology is better, but the data has grown much more than the technology has improved. So that's one thing. The other thing is that everything else is getting a lot faster while moving data is not, so the gap increases. If one operation of the system relies on moving data around, that is going to cost you a lot, and you're burdening the system much more than in the past.

In other words, data movement has, more often than not, become the bottleneck in the exploitation of Big Data, and that's why improving an IT system often boils down to improving how data is moved. So if you have a component which, in order to work, relies on data movement (and data movement today is much slower than everything else, just because there is so much more data), it's important to think about all these synergies when you're designing the system from bottom to top.
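To see why the gap can widen even as hardware improves, here is a back-of-envelope calculation. All the numbers are illustrative assumptions, not figures from the talk:

```python
# Illustrative only: hardware got faster, but the data grew faster
# still, so end-to-end movement time went up, not down.
GB = 1e9

# Circa 2005: a "large" dataset on a ~100 MB/s disk.
old_data  = 100 * GB
old_speed = 0.1 * GB        # bytes per second

# Today: a much larger dataset on a ~2 GB/s NVMe drive.
new_data  = 100_000 * GB
new_speed = 2 * GB

old_hours = old_data / old_speed / 3600
new_hours = new_data / new_speed / 3600
print(f"then: {old_hours:.2f} h to move it all")   # ~0.28 h
print(f"now:  {new_hours:.2f} h to move it all")   # ~13.9 h
```

Under these assumptions, bandwidth improved 20x while the data grew 1000x, so moving everything takes 50 times longer than it used to. That is why designs that avoid moving data tend to win.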
To sum up, Big Data is opening new doors to explore and interact with the world we live in. However, opening these doors raises new kinds of challenges that we need to arm ourselves for.

So it's a good thing, and it's a bad thing, but it's definitely an interesting thing. Data is so voluminous that it overwhelms today's technology, and it challenges us to create the next generation of data storage tools and techniques. The human project, for me, is a Big Data management problem.