This article is an attempt to provide a quick glimpse of
Big Data and Hadoop, two of the top trending words in the field of information
science. Consider it an introduction to Big Data and Hadoop for newbies.
To get the hang of these two concepts, we should first know
a few things about data. The first thing to note is that "data" is plural; its
singular form, "datum", is not used very often. So data are a
collection of raw facts and figures which, when processed within a context, can
be used to derive meaningful information. For instance, the following figures
don't really convey much until we give them a context and perhaps a
graphical representation.
1 | 2 | 3 | 4 | 1 | 2 | 3 | 4 | 1 | 2 | 3 | 4
1500 | 1200 | 1000 | 1395 | 1690 | 800 | 1100 | 1000 | 1555 | 1200 | 1000 | 850
Let's say, in the table above, the values in the first row signify a week
number. So '1' corresponds to the first week of the month, '2' to the second
week, and so on. And say the values in the second row signify expenses (in any
currency of choice). So now we can say that we have three months' worth of
weekly expense data, which could be represented as shown in the graph below.
Thus, the raw set of data in the table above turns into a
piece of information when we know what each figure means. And we can infer that
every month the expense in the first week is the highest. We could analyze
these data in many different ways to gather many more bits of information: for
instance, what the total expense for a month has been, what the average expense
per month is, in which month the expenses were at their lowest and in which the
highest; and with a larger data set one might even gain a deeper insight that
could help one predict what the expenses in the following weeks or months could
be.
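To make that concrete, here is a small sketch in plain Java that computes the total and average weekly expense for each month. It assumes nothing beyond the figures in the table above, which are simply hard-coded into an array for illustration.

```java
public class ExpenseSummary {
    public static void main(String[] args) {
        // Weekly expenses copied from the table above: 3 months x 4 weeks
        int[][] monthlyExpenses = {
            {1500, 1200, 1000, 1395},   // month 1
            {1690,  800, 1100, 1000},   // month 2
            {1555, 1200, 1000,  850}    // month 3
        };

        for (int m = 0; m < monthlyExpenses.length; m++) {
            int total = 0;
            for (int weekly : monthlyExpenses[m]) {
                total += weekly;
            }
            double averagePerWeek = (double) total / monthlyExpenses[m].length;
            System.out.println("Month " + (m + 1)
                    + ": total = " + total
                    + ", average per week = " + averagePerWeek);
        }
    }
}
```

A handful of lines like these already turn the raw figures into information: totals, averages, and (with a small extension) the highest and lowest months.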
So, to reiterate: data are a collection of raw facts and
figures. When these are processed against the backdrop of a context, we derive
meaning, which is information. Good, so far?
Now, with the current phase of technological advancement
and the millions and millions of computers in use all over the world, we are
generating an immense amount of data. To form an idea, think about the digital
pictures that are being clicked, shared, uploaded, posted and stored by
billions of people across the globe. The innumerable web pages and posts put on
the web, including this one. The mobile call logs produced across the world by
people calling and talking to one another. The log messages that are generated
when running an application. The emails which are being composed, sent, read or
archived. Or the data on temperature, rainfall and so on that are being
collected from across the different patches of the Earth and even space. The
bottom line is that we are generating a vast amount of data every second that we
just want to, or have to, keep.
These data are valuable to us, and not just from a sentimental, security, legal
or logging perspective. They can be analyzed to retrieve amazing bits of
intelligence: they can reveal patterns of behavior, offer predictive analysis,
and these digital footprints can help identify individual or group behavior in
terms of buying and selling, help decide what's trending, or just predict human
behavior in general; all of which could feed into future designs of applications
and machines, and what not! These data, which are being produced at such high
velocity, in such huge volumes and with such mind-boggling variety, are called
Big Data. Just remember the three V's that make an impactful definition of Big
Data: Velocity, Volume and Variety. Since Big Data is pretty complex and
unstructured, and we seek near real-time analysis of and response from it,
processing and management of Big Data is recommended to be done using the BASE
(Basically Available, Soft state, Eventual consistency) principles of databases
rather than the conventional ACID (Atomicity, Consistency, Isolation,
Durability) principles.
So this should bring us to an understanding of the basic concepts
of data and Big Data in general. Now that we have these data, where should
they be stored? And how should they be stored so that they can later be
retrieved and analyzed conveniently? It is at this point that Big Data
processing tools make their pitch. MongoDB, Cassandra, Aerospike and Hadoop are
just some of the well-known players in the market. Their forte is dealing with
unstructured data: data not organized according to the traditional concept of
rows and columns or normalization. These are NoSQL databases. Some of these are
better known for managing Big Data and some for their capacity to analyze Big
Data. That said, Hadoop seems to be gaining a little more momentum in terms of
its fan-following and the accompanying popularity.
Hadoop was created by Doug Cutting. It is an open-source
Apache project. But as it happens with most open-source stuff, two competing
organizations have currently taken over the onus of parenting and the upbringing
of Hadoop, namely Cloudera and Hortonworks. They also provide certifications which are in
pretty good demand in the market. The certification examination from Cloudera,
as stated on their official website, is mainly MCQ (Multiple-Choice Question)
based, while the HDP (Hortonworks Data Platform) certification is more hands-on
and based on executing certain tasks. More details can be found on their
respective official websites.
Ok, now, as promised, a quick look into Hadoop. Hadoop is a
big-data analytics tool that is linearly scalable and works on a cluster of
nodes. At a minimum, a cluster might have just one node and, at a maximum, it
could span thousands of nodes. The architecture of Hadoop is based on two
concepts: HDFS and MapReduce. But before we plunge into understanding them, we
need to be familiar with a bit of jargon.
1. Node: a computer or machine.
2. Cluster: a group of nodes.
3. Commodity Hardware: machines that aren't too expensive to buy or maintain and don't boast of powerful processing capabilities.
4. Scalability: the ability to cope with increasing load while maintaining performance and delivery.
5. Horizontal or Linear Scalability: when scalability is achieved by adding more commodity hardware that is not very resource intensive.
6. Vertical Scalability: when scalability is achieved by adding more powerful machines with greater processing power and resources.
The concept of file systems in Hadoop revolves around the
idea that files should be split into chunks and their storage distributed
across a cluster of nodes with a certain replication factor. Benefits? The
chunking allows parallel processing and the replication promises protection
against data loss. HDFS, which stands for Hadoop Distributed File System, is just
one such file-system implementation for Hadoop. In simple terms, it is a file
system in which the data or files are partitioned into chunks of 128MB (the
default chunk size) and stored across a number of nodes in the cluster. Each
chunk has a certain number of replicated copies, called its replication factor,
which is ideally three. The HDFS blocks are kept at 128MB by default for two
reasons. First, it ensures that an HDFS block always spans an integral number of
disk blocks (a typical disk block is 512 bytes in size). Second, although
storage technology has made good progress, the latency due to disk seek time is
still high when compared to reading time; hence large chunks mean that less time
is spent seeking relative to reading.
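To make the block size and replication factor a bit more tangible, below is a minimal sketch using the Hadoop FileSystem Java API to print both values for a file already stored in HDFS. The file path is purely hypothetical, and the snippet assumes a working client configuration (core-site.xml and hdfs-site.xml) along with the Hadoop client libraries on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        // Picks up the cluster configuration (core-site.xml, hdfs-site.xml) from the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path; replace with any file already present in HDFS
        Path file = new Path("/user/demo/books/moby-dick.txt");
        FileStatus status = fs.getFileStatus(file);

        System.out.println("Block size  : " + status.getBlockSize());   // typically 134217728 (128 MB)
        System.out.println("Replication : " + status.getReplication()); // typically 3
        fs.close();
    }
}
```

Both values can also be overridden per file or per cluster, which is why reading them back like this is handy when debugging storage behavior.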
All the nodes participating in a Hadoop cluster are arranged in a master-worker pattern, comprising one name-node (master) and
multiple data-nodes (workers). The name-node holds information on the file-system
tree and the meta-data for all the files and directories in the tree. This
information is locally persisted in the form of two files- edit log and the
namespace image. The data-nodes store and retrieve the chunks of data files
assigned to them by the name-node. Now, to keep track of all the active nodes
in the cluster, all data-nodes are expected to
send regular heartbeat messages to the name-node together with the list of file
blocks they store. If a data-node is down, the name-node knows in which
alternate data-node to look for a replicated copy of the file block. Thus, data-nodes
are the workhorses of the file-system but the name-node is the brains of the
system. As such, it is of utmost importance to always have the name-node
functioning. Since a single name-node could create a precarious situation, as of
Hadoop 2 there is a provision for two name-nodes, one in active and the other in
standby mode. Again, there are different approaches through which the data
between the two name-nodes are continually shared to ensure high availability;
two among these are using an NFS filer or the QJM (Quorum Journal Manager).
Anyway, in this article we just stick to the basics; we will talk about HA (High
Availability) name-nodes in another article. Good so far with the concepts of
HDFS, file blocks, name-node and data-node? These form the skeleton of HDFS. Ok,
let's move on to MapReduce.
MapReduce is the method of executing a given task across
multiple nodes in parallel. Each MapReduce job consists of four phases:
Map -> Sort -> Shuffle -> Reduce. The following terms are important when
talking about MapReduce jobs:
1. JobTracker: the software daemon that resides on the name-node and assigns Map and Reduce tasks to other nodes in the cluster.
2. TaskTracker: the software daemon that resides on the data-nodes and is responsible for actually running the Map and Reduce tasks and reporting the progress back to the JobTracker.
3. Job: a program that can execute Mappers and Reducers over a dataset (a minimal driver sketch follows this list).
4. Task: an execution of a single instance of a Mapper or Reducer over a slice of data.
5. Data locality: the practice of ensuring that a task running on a node works on the data that is closest to that node, preferably on the same node. To explain a little more: from the above definition of HDFS we know that data files are split into chunks and stored on data-nodes. The idea of data locality is that the task executing on a node works on the data stored on that same node. But sometimes the data on a data-node might be corrupted or might not be available for some reason. In such a scenario, the task running on that node tries to access the data from the data-node that is closest to it. This is done to avoid the cluster bottlenecks that might arise owing to data transfers between data-nodes.
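As promised in the 'Job' entry above, here is a minimal sketch of how a job is described and handed to the cluster using the Hadoop MapReduce Java API. To keep it short it plugs in the library's base Mapper and Reducer classes, which simply pass records through unchanged; a real job would supply its own subclasses. The input and output paths are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PassThroughJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "pass-through");
        job.setJarByClass(PassThroughJob.class);

        // The base Mapper and Reducer just echo their input; a real job plugs in its own classes here
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);

        // With the default text input format, keys are byte offsets and values are lines of text
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Hypothetical HDFS paths, purely for illustration
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

        // Submits the job to the cluster and blocks until it finishes
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```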
In simple terms, what happens in a MapReduce job is that the
data to be processed is fed into Mappers, which consume the input and generate a
set of key-value pairs. The output from the Mappers is then sorted and shuffled
and provided as input to the Reduce tasks, which finally crunch the key-value
pairs given to them to generate a final response.
For example, consider a scenario where we have a huge
pile of books. We want a count of all the words that occur in all of these
books, and this task is assigned to Hadoop. Hadoop would proceed to do the task
somewhat like this.
The name-node in HDFS, let's call it the 'master', splits the
books (the application data) into chunks and distributes them across a number of
data-nodes; let's call them 'workers'. Thereafter, the master sends over a bunch
of mappers and a reducer to each of these workers. This part is taken care of by
the sentinels called the JobTracker and the TaskTrackers, who oversee the work
of the mappers and reducers. All the mappers must finish their job before any
reducer can begin. The task of a mapper is simply to read the pages from the
chunk stored with its worker and write down every word it encounters with a
count of 1 against it, irrespective of the number of times it encounters the
same word. So the output from each mapper is a bunch of key-value pairs in which
each word is a key and every key has a value of 1. The output would be something
as shown below:
Apple -> 1
Mango -> 1
Person -> 1
Person -> 1
Apple -> 1
Person -> 1, etc.
Next, the JobTracker directs the output from all the
mappers to be read by the reducer, which sums up the values against each key.
So, continuing with the above sample, the reducer would generate something like
this:
Apple -> 2
Mango -> 1
Person -> 3
So the output from the reducer gives us the result of the
task. Simple enough to grasp the concept? Hope it is.
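For readers who prefer code to prose, here is a minimal sketch of what the mapper and reducer for this word-count scenario could look like with the Hadoop MapReduce Java API. It is a bare-bones illustration (no combiner, no tuning) and would be wired into a driver much like the one sketched earlier, with TokenizerMapper and SumReducer set as the mapper and reducer classes and Text/IntWritable as the output key/value classes.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Mapper: emits (word, 1) for every word it sees, regardless of repetition
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);   // e.g. (Apple, 1), (Person, 1), ...
            }
        }
    }

    // Reducer: sums up the 1s for each word after the sort/shuffle phase
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : values) {
                sum += count.get();
            }
            result.set(sum);
            context.write(key, result);     // e.g. (Apple, 2), (Person, 3), ...
        }
    }
}
```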
Presenting the following picture to offer a quick summary
of the concepts explored here. Most often, end-users just interact through a
POSIX-style (Portable Operating System Interface) interface on client machines,
without needing to know all the nitty-gritty stuff that Hadoop does.
Fig. 2: Hadoop