Tuesday, 14 February 2017

What is MapReduce in Big Data?

MapReduce is pivotal to big data because it allows the processing of massive datasets that the previously preferred technology for data storage and analysis, the RDBMS, was not capable of handling.

In big data analytics, the open-source Hadoop framework has been a game changer. It enables storage of large datasets, running up to petabytes, across a cluster of machines, and faster processing of this distributed data. An essential feature of Hadoop is its capacity for parallel processing, which is executed via MapReduce.

MapReduce is a distributed data processing and querying programming model that splits the computation to be performed on a dataset and spreads it across a wide range of servers, collectively known as a cluster. A query that has to run through a huge dataset may take hours on a single server; run in parallel across many servers, it can complete in minutes.

The term MapReduce refers to the two critical tasks it performs over data in the Hadoop Distributed File System (HDFS) – the Map job and the Reduce job. The Map function takes the input data elements and processes them into intermediate output expressed as key-value pairs. The Reduce function then aggregates the outputs that share a key, combining them quickly and reliably to produce the required end result.
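
To make the two jobs concrete, below is a minimal sketch of the classic word-count example written against the Hadoop Java MapReduce API. The class names (WordCountMapper, WordCountReducer) are illustrative rather than part of Hadoop, and the two classes are shown together for brevity although each would normally live in its own source file. The mapper emits a (word, 1) pair for every word it reads, and the reducer sums those counts per word.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map job: turn each line of input into (word, 1) key-value pairs.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);          // emit (word, 1)
                }
            }
        }
    }

    // Reduce job: all values for the same key arrive together; sum them up.
    public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable count : counts) {
                total += count.get();
            }
            context.write(word, new IntWritable(total)); // emit (word, total)
        }
    }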

Structurally, MapReduce has a single master, the JobTracker, and several slave TaskTrackers, one per node in the cluster. The master distributes and schedules tasks to these slaves and keeps track of the assigned jobs, re-executing any that fail. Each slave TaskTracker ensures that its assigned tasks are executed and reports progress back to the master.
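
For completeness, a small driver class (again with illustrative names) shows how such a job is handed to the framework: it wires up the mapper and reducer sketched above, points them at input and output paths in HDFS, and submits the job, after which the master schedules the individual map and reduce tasks across the cluster and re-runs any that fail.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Driver: describes the job and submits it; the master takes over from here.
    public class WordCountJob {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountJob.class);
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
            System.exit(job.waitForCompletion(true) ? 0 : 1);        // block until the job finishes
        }
    }

Packaged into a jar, this would typically be launched with a command along the lines of hadoop jar wordcount.jar WordCountJob /input/dir /output/dir.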

There are a number of benefits of MapReduce that have made it an important element of Hadoop.
  • It allows developers to write applications in a range of languages, such as Java or C++, although Java is the most commonly used. 
  • MapReduce can handle all forms of data, whether structured or unstructured. 
  • Because of how the work is split up, MapReduce can comfortably work through petabytes of data stored in a data center. 
  • The framework is also highly resilient to failures. If one copy of the data is unavailable but a replica exists on another machine, the framework can locate and use the alternate copy.  

Today, there are higher-level tools such as Pig and Hive that can be used to query and process data in a Hadoop cluster. These hide some of the complexity of writing raw MapReduce jobs and make it easier to generate insights.

We will discuss MapReduce in more detail in an upcoming post.
Learn more about Pig and Hive here.

Thursday, 26 January 2017

Overview of MongoDB 3.4: New Features


MongoDB has been wildly popular ever since its introduction, for plenty of reasons. The biggest is that it largely did away with object-relational mapping, which had been a source of trouble for programmers for years. Even today, it is the fifth most popular database. However, MongoDB's popularity has dipped somewhat over the years with the introduction of newer, simpler NoSQL databases. That might change with MongoDB 3.4, released late last year. According to the company, it aims to enable "digital transformation" with this release.

The clear message of this release is that the company is aiming to simplify life for the large enterprises that have long depended on MongoDB. Like Python, MongoDB is evolving so that it alone suffices for tasks that previously required multiple technologies. Since we have seen this formula succeed more than once, we have to admit it is a very smart move by the company.

Graph support had been needed in MongoDB for quite some time. More than three years in the making, it is arguably the biggest addition in the new version, delivered through the new $graphLookup aggregation stage. While it does not seem to pose any threat to established graph databases like Neo4j, the graph support is sure to simplify things for existing MongoDB users. The feature should have a large impact, as it lets companies explore avenues they had previously been hesitant about, such as deep analytics, the Internet of Things and artificial intelligence. This is further aided by Atlas, MongoDB's cloud database service released earlier last year.
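
As a rough illustration of the new graph capability, the sketch below runs a $graphLookup aggregation through a recent MongoDB Java driver to walk the chain of managers above each employee. The connection string, the hr database, the employees collection and its name and reportsTo fields are all hypothetical, chosen only for this example.

    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;

    import java.util.Collections;

    public class GraphLookupExample {
        public static void main(String[] args) {
            MongoCollection<Document> employees = MongoClients.create("mongodb://localhost:27017")
                    .getDatabase("hr")
                    .getCollection("employees");

            // $graphLookup recursively follows reportsTo -> name links,
            // collecting each employee's whole management chain into "managementChain".
            Document graphLookup = new Document("$graphLookup",
                    new Document("from", "employees")
                            .append("startWith", "$reportsTo")
                            .append("connectFromField", "reportsTo")
                            .append("connectToField", "name")
                            .append("as", "managementChain"));

            for (Document doc : employees.aggregate(Collections.singletonList(graphLookup))) {
                System.out.println(doc.toJson());
            }
        }
    }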

E-commerce websites built on MongoDB have long toiled to provide decent search functionality to their customers. That struggle ends with the faceted navigation feature, which uses filters to narrow down query results, giving faster and more relevant searches. Read-only views were also introduced, which can expose an application's data while preventing any modification. Another significant feature is geo-distributed MongoDB zones, which address the problem of data sovereignty by grouping shards under the higher-level abstraction of "zones", so data can be kept in particular geographic regions.
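
To give a feel for faceted navigation, here is a hypothetical $facet pipeline, again via the Java driver, that computes two facets over an invented products collection in a single query: product counts per brand (using $sortByCount) and counts per price range (using $bucket). The database, collection and field names are assumptions for illustration only.

    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;

    import java.util.Arrays;

    public class FacetedSearchExample {
        public static void main(String[] args) {
            MongoCollection<Document> products = MongoClients.create("mongodb://localhost:27017")
                    .getDatabase("shop")
                    .getCollection("products");

            // A single $facet stage computes several independent "facets" in one query:
            // here, product counts per brand and counts per price bucket.
            Document facetStage = new Document("$facet",
                    new Document("byBrand", Arrays.asList(
                            new Document("$sortByCount", "$brand")))
                    .append("priceRanges", Arrays.asList(
                            new Document("$bucket",
                                    new Document("groupBy", "$price")
                                            .append("boundaries", Arrays.asList(0, 50, 200, 1000))
                                            .append("default", "Other")))));

            for (Document facet : products.aggregate(Arrays.asList(facetStage))) {
                System.out.println(facet.toJson());
            }
        }
    }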

The release also has a few things in store for everyday users. The new SQL interface should greatly ease things for users who have long struggled to bring their SQL code and SQL-based tools to MongoDB. MongoDB also introduced the $switch operator, which greatly simplifies complex branching while keeping it readable. Like the familiar switch statement, it tests a number of cases and returns the result of the first one that evaluates to true. Another addition is the $reduce operator, which applies an expression to each element of an array and folds the results into a single value.
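
A short sketch of both operators, once more through the Java driver and with invented collection and field names: $reduce folds a hypothetical prices array into a single order total, while $switch labels each order by a hypothetical qty field, much as a switch statement would.

    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;

    import java.util.Arrays;

    public class SwitchAndReduceExample {
        public static void main(String[] args) {
            MongoCollection<Document> orders = MongoClients.create("mongodb://localhost:27017")
                    .getDatabase("shop")
                    .getCollection("orders");

            // $reduce folds the prices array into one total, starting from 0 and
            // adding each element ($$this) to the running value ($$value).
            Document total = new Document("$reduce",
                    new Document("input", "$prices")
                            .append("initialValue", 0)
                            .append("in", new Document("$add", Arrays.asList("$$value", "$$this"))));

            // $switch evaluates its branches in order and returns the "then" value
            // of the first case that is true, falling back to "default".
            Document sizeLabel = new Document("$switch",
                    new Document("branches", Arrays.asList(
                            new Document("case", new Document("$gte", Arrays.asList("$qty", 100)))
                                    .append("then", "bulk"),
                            new Document("case", new Document("$gte", Arrays.asList("$qty", 10)))
                                    .append("then", "medium")))
                            .append("default", "small"));

            Document projectStage = new Document("$project",
                    new Document("total", total).append("sizeLabel", sizeLabel));

            for (Document order : orders.aggregate(Arrays.asList(projectStage))) {
                System.out.println(order.toJson());
            }
        }
    }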

Apart from these, there is a whole array of other additions whose real importance will only become clear in the long run, including elastic clustering, tunable consistency and enhanced tooling for DBAs.

Overall, this release has been impressive and an instant success. MongoDB has made its intention very clear: it is here to stay and win. With this, other NoSQL providers like Redis and Cassandra, as well as established SQL players like MySQL and Oracle, will have to up their game.