[Editor’s Note: This is part 3 of a 3-part article.]
So far, we have learnt the basic definitions of Big Data from a Broadband Service Provider’s perspective and have understood what kinds of actionable insights may be obtained from such data.
However, before such insights can be extracted, the data must be stored and processed. There are some well-accepted technologies that allow us to do so, and we shall look at those next.
Storing and Processing Big Data #
In the RGU [Revenue Generating Unit] decline example from the previous article, it is usually not possible to determine the pattern behind a symptomatic decline without looking at the data, and often the answer is not obvious even then. It is important to extract as much customer behavior information as is available, combine it with local, social, or other sources of information, and then perform analysis to understand the root cause behind the loss of RGUs. It may also be possible to build a machine learning system that “learns” these behavioral patterns over time.
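To make this concrete, the sketch below combines a few behavioral signals into a single risk score for losing an RGU. The feature names and weights here are purely hypothetical illustrations; in practice, a Data Scientist would learn such weights from historical churn data (for example, with logistic regression) rather than hand-pick them.

```python
# Hypothetical sketch: scoring RGU-loss risk from behavioral signals.
# Feature names and weights are invented for illustration; a real system
# would learn them from historical data rather than hard-code them.

WEIGHTS = {"support_calls": 0.3, "usage_drop_pct": 0.5, "late_payments": 0.2}

def churn_risk(customer):
    # Weighted sum of normalized (0..1) signals, clamped to [0, 1].
    score = sum(WEIGHTS[k] * customer.get(k, 0.0) for k in WEIGHTS)
    return min(max(score, 0.0), 1.0)

# A customer with many support calls and a sharp usage drop scores high.
at_risk = churn_risk({"support_calls": 1.0, "usage_drop_pct": 0.8,
                      "late_payments": 0.0})
```

Even a crude score like this lets a Service Provider rank customers for proactive outreach; a learned model simply replaces the hand-picked weights.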
But before we start acting on the data, we must first store it in a way that allows us to act on it, using technologies that offer scalable storage, reliable and scalable processing, and an affordable price.
While there are many technologies from reputable providers on the market, not everything fits every problem. For instance, there are problems that can be solved in a batch-processing mode, and there are other problems that require a real-time solution.
For problems that require batch-processing, Apache Hadoop is the most common solution in the market today. Hadoop is a framework that allows for the use of commodity hardware to store and process large data sets across clusters of computers. Each computer in Hadoop offers compute power as well as storage. Hadoop is built to scale from a single computer to thousands of computers, and best of all, it is Open Source software – with a very tempting price point – it is free!
Apache Hadoop is an implementation of MapReduce, a programming model for processing huge datasets by parallelizing the work across clusters of computing devices. The search giant Google is credited with developing MapReduce and using it to solve the decidedly Big Data problems of web search.
MapReduce involves two steps:
- Map: A compute device takes a problem, splits it into multiple sub-problems and hands them over to other compute devices to solve in parallel. This can happen recursively (other devices can split the problem and hand over to other compute devices to solve the sub-sub-problems).
- Reduce: The results from solving all the sub-problems are combined to provide the answer to the original problem.
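The two steps above can be sketched with the classic word-count example. This is not Hadoop’s Java API, just a minimal single-machine illustration of the model: each “node” maps its chunk of input to partial counts, and a reduce step merges the partial results into the final answer.

```python
from collections import Counter
from functools import reduce

# Map: each node turns its chunk of text into partial word counts.
def map_chunk(chunk):
    return Counter(chunk.split())

# Reduce: partial results are merged into the final answer.
def merge_counts(a, b):
    return a + b

chunks = ["big data big value", "data drives value", "big wins"]
partials = [map_chunk(c) for c in chunks]   # in Hadoop, these run in parallel
totals = reduce(merge_counts, partials)     # totals["big"] == 3
```

In a real Hadoop job, the chunks would be blocks of a file in the distributed file system, and the map tasks would run on the machines that already hold those blocks.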
While details of MapReduce in general and the Hadoop architecture specifically are outside the scope of this article, suffice it to say that Hadoop offers a distributed file system, as well as the ability to perform parallel processing of extremely large data sets and all the tools needed to store and process the information that most Broadband Service Providers (or other enterprises) may require.
For real-time problem solving, there isn’t really a single solution that works well for all cases, though multiple companies are attempting to solve this problem. Twitter open-sourced a solution called “Storm”, which is simple to use and works with almost any programming language. Storm is often used with a message broker called Kafka [open sourced from LinkedIn], and the combination is scalable and fault-tolerant.
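The key difference from batch processing is that events are folded into a running result as they arrive, rather than collected and processed later. The sketch below illustrates that idea with a rolling error rate over a stream of status events; it is a plain-Python illustration of the streaming model, not Storm’s or Kafka’s actual API, and the event format and window size are assumptions for the example.

```python
from collections import deque

# Illustrative sketch of stream processing: each event is processed as it
# arrives, updating a sliding-window aggregate (here, an error rate).
def rolling_error_rate(events, window=5):
    recent = deque(maxlen=window)          # sliding window over the stream
    for event in events:                   # one tuple at a time, as in Storm
        recent.append(1 if event["status"] == "error" else 0)
        yield sum(recent) / len(recent)    # up-to-date rate after every event

stream = [{"status": s} for s in ["ok", "error", "ok", "error", "error"]]
rates = list(rolling_error_rate(stream))   # final rate: 3 errors in 5 events
```

In a production deployment, the event source would be a Kafka topic and the computation would be distributed across a Storm topology, but the per-event processing pattern is the same.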
A word of caution here – Hadoop and the technologies around it may appear easy to work with (you can download the software and try it for free on your own home computer!), but for any meaningful processing of this precious data, it is highly recommended that Data Scientists be engaged and allowed to build and maintain the Service Provider’s Big Data system with help from the Service Provider’s engineering staff.
Of course, there is no dearth of companies that would be happy to help with installing Hadoop, setting it up, and maintaining it down the road.
Monetizing Precious Data #
As the reader must have gathered from this series of articles, good data is precious. But monetizing Big Data requires a serious commitment and sustained effort on the part of the Service Provider with help from qualified Data Scientists.
To realize the value of Big Data, an Independent Broadband Service Provider must:
- Collect and sanitize all useful data
- Ensure access to powerful Big Data processing systems
- Work with experienced Data Scientists who can help identify the data to collect, the infrastructure to use, and possibly create machine learning systems to extract value from such data
However small a Broadband Service Provider may be, its customers are likely generating enough data that it can be analyzed to provide actionable, monetizable insights. It is early days in the Big Data game, and those Service Providers who recognize data’s importance and prepare now stand to be the big winners over time.