Thursday, March 13, 2014

The rise of computation power

With the rise of data, processing power has risen significantly. From the early multi-tonne computers, capable of less computation than your washing machine, to supercomputers and power-packed cloud applications, the journey has been wonderful.
Dell recently published an interesting article that illustrates this contrast, at least in terms of storage capacity:
http://techpageone.dell.com/history-of-data-storage/. It is an interactive page that lets you browse through the technological advances in the field of data storage.
Another infographic, not as graphically appealing but still nice, covers the growth of computation and can be found at this link: http://www.rsvlts.com/2013/12/06/a-visual-history-of-computers-infographic/

The links only provide context for what lies ahead. Most statistical and analytical concepts have been around for a long, long time. These concepts were previously established on far smaller data sets by hard-working mathematicians who worked manually to establish the credibility of their hypotheses. In fact, most proofs were done by mathematical induction to avoid computation. Now almost all of these theorems and techniques have been put to the test on a much larger scale.

The most famous saying in the world of statistics is "Correlation does not imply causation". It is confusing, is it not? To explain it I will take a very simple example. When the streets are wet, you will see a lot more people carrying umbrellas. Now the question is, what is causing people to carry umbrellas? Are they afraid of the wet roads? The correlation only says that one happens when the other does. Does it imply that wet roads lead to people carrying umbrellas, or vice versa? From our prior knowledge we know that is not the case. There is a common cause behind both: the falling rain. Hence, correlation does not imply causation.
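The umbrella example can be played out with a small simulation. This is only a sketch with made-up probabilities (the 30% chance of rain, the umbrella counts and the function names are all my assumptions): rain drives both wet streets and umbrellas, and the two end up strongly correlated even though neither causes the other. Once we look only at rainy days, the link between them largely disappears.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def simulate_days(n_days=5000, seed=42):
    """Each day: rain is the only cause of wet streets and of umbrellas."""
    rng = random.Random(seed)
    days = []
    for _ in range(n_days):
        rain = rng.random() < 0.3                      # assumed chance of rain
        wet = 1 if rng.random() < (0.95 if rain else 0.05) else 0
        umbrellas = rng.gauss(60, 10) if rain else rng.gauss(5, 2)
        days.append((rain, wet, umbrellas))
    return days

days = simulate_days()

# Correlation across all days: high, because rain drives both.
overall = pearson([w for _, w, _ in days], [u for _, _, u in days])

# Correlation on rainy days only: the common cause is held fixed.
rainy = [(w, u) for rain, w, u in days if rain]
within_rain = pearson([w for w, _ in rainy], [u for _, u in rainy])

print(f"all days:   {overall:.2f}")
print(f"rainy days: {within_rain:.2f}")
```

The second number being close to zero is the point: wet streets tell you nothing extra about umbrellas once you already know whether it rained.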

However, with the current amount of data and computation power, do we need to establish causation at all? Consider this: if you are selling Coca-Cola and cranberries, and for some reason they are selling together and in larger quantities, you would want to stock up on both. It may be the case that there is a group nearby preaching the benefits of taking Coca-Cola with cranberries (what great examples I come up with!), and you are unaware of it. But you find that the sales are going up. What will you want? You would definitely want more stock of the same. You might want to know about the cause, but you would not worry about it as long as the sales are going up. There is now a correlation between Coca-Cola and cranberry sales which did not exist before. This correlation does not imply causation, but it does imply that people are buying both together often enough that you need to stock up more.
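To make the shop example concrete, here is a minimal sketch that measures how much more often two items sell together than chance would suggest, a ratio usually called lift. The baskets below are invented for illustration, and `pair_lift` is a hypothetical helper, not part of any library.

```python
from collections import Counter
from itertools import combinations

def pair_lift(baskets):
    """Lift for every pair of items that appears together at least once.

    lift = P(both in basket) / (P(first in basket) * P(second in basket))
    A value above 1 means the pair sells together more often than chance.
    """
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))          # ignore duplicates within a basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    return {
        (a, b): (count / n) / ((item_counts[a] / n) * (item_counts[b] / n))
        for (a, b), count in pair_counts.items()
    }

# Invented sales data: each inner list is one customer's basket.
baskets = [
    ["cola", "cranberries"],
    ["cola", "cranberries", "bread"],
    ["bread", "milk"],
    ["cola", "cranberries"],
    ["milk"],
    ["bread", "cola"],
]

lift = pair_lift(baskets)
for pair, value in sorted(lift.items(), key=lambda kv: -kv[1]):
    print(pair, round(value, 2))
```

On this toy data the cola and cranberries pair has a lift of 1.5, while bread and cola sit at 1.0 (no better than chance). The shopkeeper learns what to stock together without ever learning why.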

Patterns like the one in the example above are often hidden deep within the data. Without large computational power, the techniques for finding them are simply not feasible. With enhanced computation power, however, machine learning provides an excellent opportunity to find such patterns. Even though you are not aware of the cause, just being aware of the pattern helps you make better decisions. Sometimes a correlation does turn out to be due to causation. If you find that, jackpot; otherwise you still have a lot more information than before.

It is computation power that can simulate millions of future scenarios with thousands of different variables to suggest the best alternative. Sometimes elegant methods work best; at other times, brute force takes you through. The net result remains that with increased power comes more and more empowerment. Geniuses have existed throughout human history. Some of them, in their pre-computer era, thought up methods that can only be put to use now. I am avoiding listing algorithms in an attempt to stay accessible to the layperson.
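For readers who do want a concrete picture of "simulate many scenarios and pick the best alternative", here is a deliberately tiny brute-force sketch. Everything in it is an assumption made for illustration: the demand distribution, the price of 5, the cost of 3, and the candidate stock levels.

```python
import random

def expected_profit(stock, n_scenarios=10000, seed=0):
    """Average profit of a stock level over many simulated demand scenarios."""
    rng = random.Random(seed)        # same seed, so every candidate faces the same scenarios
    total = 0.0
    for _ in range(n_scenarios):
        demand = max(0, round(rng.gauss(100, 20)))   # assumed demand: mean 100, spread 20
        sold = min(stock, demand)
        total += sold * 5.0 - stock * 3.0            # assumed price 5, cost 3 per unit
    return total / n_scenarios

# Brute force: try every alternative and keep the one with the best average.
candidates = [60, 80, 100, 120, 140]
best = max(candidates, key=expected_profit)

for stock in candidates:
    print(stock, round(expected_profit(stock), 1))
print("best stock level:", best)
```

Real planning problems have thousands of variables rather than one, which is exactly why they had to wait for today's computation power.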

Everyone knows of Leonardo da Vinci. It is known that he made sketches of what could have evolved into a helicopter. However, real helicopters came centuries after his death. Technological advances have turned his dreams into reality. There have been many such geniuses who dreamed of a lot of things and tried to provide solutions. Those solutions are now being emulated, simulated and implemented across the world to help people have a better life.

Who would have thought, years back, that Google would tell you what to do? Google Now makes your life easy because your phone has more computational power than a desktop from around 5 years back. Google has the data and definitely the computational means to give you guidance in advance. You see the current weather before leaving home, it even tells you if the traffic is bad on your route, and so on. How is it able to do this for so many people remotely? The answer is simple: it collects a lot of data about your usage and also possesses the computational power to make sense of that data.

What companies need to realize is that this alone is not sufficient. The increase in computational power has definitely had far-reaching consequences for the world, but if everything could be shown with numbers alone, the artists of the world would feel betrayed.

Wednesday, January 1, 2014

Why is Big Data big nowadays?

In my previous post I hinted that most techniques currently used in Big Data have been in existence for a long time. So why has Big Data grown in prominence recently? There are primarily two reasons: data capture and the power of computation.

In this post I will focus primarily on the abundance of data in today's world. The changes that have brought Big Data into prominence are:

1) Social Media: The growing culture of sharing personal opinions and content on Facebook, Twitter, Pinterest, LinkedIn, etc. means that the data available on social media is much more than could ever have been imagined. In a recent disclosure, Facebook revealed its monthly active users to be above a billion, while Twitter had more than 237 million active users. These users regularly share comments, events, photographs, experiences, etc. with the world. The number of users hence becomes a sufficiently large sample to make sense of the population.
The proliferation of social networks is evident from the fact that people have taken to sharing their sneezes, rashes and abnormal physical conditions on these sites. Since these posts are mostly geotagged, it is possible to segregate them by location. Incidentally, many locations provide a sufficient volume of posts to justify statistical analysis.

2) RFID: Most companies keep track of their inventory using either RFID tags or regularly scanned barcodes. Each scan becomes a data point and adds to the volume of data. It improves not only the volume but also, heavily, the quality of the data available, which allows for meaningful analysis. This technology passively records a lot of data and adds extensively to volumes. Later, it is possible to find different trends and patterns in this data. Combining this data with other factors opens the floodgates for analysis.



3) Scanners: Most outlets and industries use their IT infrastructure to record everything. Scanners like barcode readers, QR code readers, etc. have made it almost painless to record such transactions. Totals, billing, transfers, etc. become a lot more accurate and fast. In the process, a lot of usable data is generated on which we can work. For bigger organizations, the sheer volume of this data might classify it as Big Data.

4) Mobile: This is the biggest change behind the upsurge of Big Data on the consumer side. People's usage of and dependence on mobiles allows them to capture a lot of pictures (a form of data), text messages, instant messages, etc. in real time, without having to wait until they get home to write things down, develop a negative or send out letters by mail. Mobile is ubiquitous, and its impact on the life of an average human has increased at a rate of knots. For example, people keep checking their social media (Facebook, Twitter) on their phones and reply then and there. This gives a near real-time feed about events at a place or about a global phenomenon. The information is spread across the globe at lightning speed using the viral model. Each piece of this data might not seem large, but taken together it is extremely voluminous, not to mention unstructured. First making sense of this data in isolation, and then combining it with external data sets, is what most big data scientists do.

5) Location: Building on the last point, most mobile phones have location settings switched on by default. This leads to companies collecting immense amounts of data about users' locations, movements, etc. This in itself is a huge amount of data, but when combined with other behaviors of the user, it provides an immense understanding of user behavior. For companies, it is like mining gold and then crafting the right ornaments from it.

6) Internet: I am trying to cover a lot of as yet unexplained things under the very broad category of Internet. Most of what was explained earlier also falls under the internet, but that label is not very specific. Users generate immense data just by visiting websites or searching on Google/Bing. Google handles approximately 6 billion searches per day. Each of these searches is logged. When you start typing part of a search query, Google suggests the most probable complete query. It can do that because a user is rarely unique, and the chances that someone has already searched for the same thing are extremely high. In short, a simple search entered by one user adds to the data, and there are always millions doing the searching. The data is big.
People's lives are so intertwined with the virtual world that they are always on the web. On average, a person visits anywhere between 0 and 100 websites a day, depending on how heavily their life depends on the internet. Each of these hits is collected as a data point somewhere, either by the name server, by the website itself, or by both. The combined browsing history of all users alone, as collected by a large company like Google, can run into terabytes per day.
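The search suggestion idea from the Internet point can be sketched in a few lines. This is a toy illustration of the principle that "someone has already searched for it", and certainly not how Google actually does it; the query log and the `suggest` function are invented for the example.

```python
from collections import Counter

def suggest(query_log, prefix, k=3):
    """Return up to k past queries starting with prefix, most frequent first."""
    counts = Counter(q.strip().lower() for q in query_log)
    prefix = prefix.lower()
    matches = [(q, c) for q, c in counts.items() if q.startswith(prefix)]
    # Most frequent first; ties broken alphabetically so the result is stable.
    matches.sort(key=lambda qc: (-qc[1], qc[0]))
    return [q for q, _ in matches[:k]]

# Invented log: every search ever typed becomes a data point.
query_log = [
    "weather today",
    "weather tomorrow",
    "weather today",
    "west end",
    "weather today",
    "weather tomorrow",
    "web mail",
]

print(suggest(query_log, "wea"))
print(suggest(query_log, "we"))
```

With seven logged queries the suggestions are crude; with billions of queries a day, the most common completion is very likely the one you were about to type.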

This is all data not seen at this scale previously. The techniques were available, but neither the tools nor the resources (in terms of data to work on) were, so the field lay dormant. With the rise of data rose Big Data analytics.
 
The list and explanations are in no way exhaustive, only indicative. I will try to compile a more exhaustive list when time permits. I just wanted to make sense of why Big Data is a hit now when the basic techniques were already there.

Saturday, December 28, 2013

Big Data: What it entails

Big Data is a term being thrown around a lot these days. This post is my attempt to at least let the readers make sense of what it means and its potential.

One does tend to think at a very large scale when it comes to big data. The basic concepts revolving around Big Data are age-old. What has changed recently is not the techniques for looking at the data but the data itself. With people around the world generating data at the rate of at least terabytes per minute, and most of it being shared online, the growth in computation power has provided humans with a paradigm never experienced before: access to an unlimited stream of data and the computational ability to make sense of it.

Some people argue that anything too big for Excel or Access is big data. Personally, I disagree. My belief is that big data means working not only with large data sets but also with varied data sets, most of which are unstructured. Making sense of the combination of structured and unstructured data available in the world is what makes Big Data really BIG and gives it so much potential.

The world has known a lot of great mathematicians and statisticians who gave us theorems and algorithms that were hard to verify without today's computational capacity. It has been shown that the chain of acquaintances linking any two humans in this world is extremely short, at most on the order of ten links. It was a long-standing theory, but it could only be verified with the advent of two things: social media and the computation power to analyze social media.
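Measuring the length of such a chain is itself a simple computation once the friendship data exists: a breadth-first search finds the shortest chain between two people. The sketch below uses a tiny invented friendship graph with made-up names; running the same idea over a real social network is where the computation power comes in.

```python
from collections import deque

def chain_length(friends, start, goal):
    """Length of the shortest chain of friendships from start to goal.

    friends maps each person to the people they know.
    Returns None when no chain exists.
    """
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, dist = queue.popleft()
        for other in friends.get(person, ()):
            if other == goal:
                return dist + 1
            if other not in seen:
                seen.add(other)
                queue.append((other, dist + 1))
    return None

# Invented friendship graph (each friendship listed in both directions).
friends = {
    "asha": ["bilal", "esha"],
    "bilal": ["asha", "chen"],
    "chen": ["bilal", "dev"],
    "dev": ["chen"],
    "esha": ["asha"],
    "farah": [],
}

print(chain_length(friends, "asha", "dev"))    # asha -> bilal -> chen -> dev
print(chain_length(friends, "asha", "farah"))  # no chain at all
```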

Companies like UPS and FedEx ship millions of consignments per day. The amount of data generated just by tracking each shipment is phenomenal. To optimize the flow of packages, these companies run algorithms on petabytes of such data, coupled with factors like weather, traffic, etc., and arrive at optimal routes. This approach has saved these companies millions, if not billions.

This Forbes article http://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/ shows how strong the power of data is. The available amount of data, together with the capacity and willingness to process it, can lead to such powerful models. Similar approaches, when applied to natural disasters, can lead to early prediction of earthquakes and tsunamis. ISRO was able to successfully predict cyclonic storms in India and saved thousands of lives. Such natural disasters usually cost thousands of lives in India.

There is a saying in India, "Ati Sarvatra Varjayet". In Chinese it goes something like "Wu Ji Bi Fan" (refer to Jackie Chan's Karate Kid if I made a mistake here). Roughly translated into English, it says that too much of anything is bad. Similarly, going too deep into big data may be counterproductive. Sometimes data provides an insight and the organization is not flexible enough to implement it in time. They then implement a half-baked, duct-taped solution and find themselves in a bigger soup.

Privacy is a big issue when it comes to making better use of data. Going to a more granular level of data makes sense, but there comes a point when the analysis might intrude on the privacy of individuals. Traditionally, store owners and workers had a personal touch with customers and offered discounts based on personal interactions. In most places, this has been replaced by data-based offers. The personal touch is lost, and the example of Target above can serve as a classic borderline case of privacy intrusion.

Big Data is a great boon for business if used properly, but it is a territory best trodden carefully, with the help of legal and technical departments.