Tuesday, 24 January 2017

Pig vs Hive: Main differences between Apache Pig and Hive

Delving into big data and extracting insights from it requires robust tools that allow flexible data management and querying: filtering, aggregation, and analysis. Typically, MapReduce code is used for this, but the complexity of writing intricate Java code for MapReduce jobs led to the creation of new languages that let users access datasets more easily.

Pig was created by researchers at Yahoo and offers a flexible, multi-query approach. Although its language is in some ways similar to SQL, the traditional language for data analysis, it does not share SQL's declarative nature, nor limitations such as dependence on relational database schemas. Pig is more of a programming language, and is often described as an abstraction over the complicated Java code required for MapReduce. Pig has different semantics from Hive and SQL.

Hive (invented at Facebook), on the other hand, is highly similar to SQL: it uses almost the same commands for data manipulation, making it particularly suitable for those experienced in SQL.

Both tools are components of the Hadoop ecosystem and run on top of Hadoop. Their shared goal is to make it easier to interact with massive datasets within Hadoop without having to write complex MapReduce code.

Understanding the differences between Pig and Hive  
There are several differentiating elements between the two languages, and big data users need to appreciate these differences to make use of the right tool:
  • As Hive adopts a SQL-based declarative approach, it is often preferred for structured data, especially historical data. It is therefore often referred to as a data warehouse platform.
Pig, on the other hand, uses a procedural data flow language and is preferred for semi-structured, unstructured or decentralized data. The flexibility of Pig allows better construction of data flows, and its self-optimization results in fewer data scans.

  • Hive uses its own query language, HiveQL (HQL), whereas Pig uses its own procedural language, Pig Latin.
  • Hive supports partitioning of tables, whereas Pig has no built-in equivalent.
  • In terms of practical usage, Hive is preferred for reporting and operates on the server side of a cluster, while Pig is great for writing programs and operates on the client side.
  • Given its characteristics, Pig is typically used by researchers and programmers, but Hive is preferred by data scientists who work on large quantitative datasets.
  • Hive usually executes quickly but loads slowly, whereas Pig loads faster and more effectively.
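
To make the declarative versus data flow distinction from the list above concrete, here is a minimal sketch in plain Python. It is not Hive or Pig code: SQLite stands in for a SQL-like engine, ordinary Python steps stand in for a Pig-style data flow, and the sample rows and function names are made up for illustration.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (username, country, bytes_served)
rows = [
    ("ann", "IN", 120),
    ("bob", "US", 300),
    ("cho", "IN", 80),
    ("dee", "US", 50),
    ("eli", "DE", 10),
]


def declarative_totals(rows):
    """Hive-like style: describe the result in one SQL query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE logs (username TEXT, country TEXT, bytes INTEGER)")
    conn.executemany("INSERT INTO logs VALUES (?, ?, ?)", rows)
    result = conn.execute(
        "SELECT country, SUM(bytes) FROM logs "
        "WHERE bytes >= 50 GROUP BY country ORDER BY country"
    ).fetchall()
    conn.close()
    return result


def dataflow_totals(rows):
    """Pig-like style: build the result as a sequence of named steps."""
    filtered = [r for r in rows if r[2] >= 50]           # similar to FILTER ... BY
    grouped = defaultdict(list)                           # similar to GROUP ... BY
    for _username, country, size in filtered:
        grouped[country].append(size)
    totals = [(c, sum(v)) for c, v in grouped.items()]    # similar to FOREACH ... GENERATE
    return sorted(totals)                                 # similar to ORDER ... BY


print(declarative_totals(rows))
print(dataflow_totals(rows))
```

Both functions produce the same per-country totals. The first states what is wanted and leaves the plan to the engine; the second spells out each intermediate step, which is the property that makes a data flow language convenient for building pipelines.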

Adopting a one-size-fits-all approach to big data analytics limits the benefits it can deliver. Both Pig and Hive have advantages that make them apt for some situations but not for others. Analysts must carefully examine their insight requirements before deciding which tool to use.

Monday, 19 December 2016

How and Why To Bridge between SQL and NoSQL

SQL has long been synonymous with "database" for us. For any sort of data management, SQL was our instinctive choice. However, the past decade saw the emergence of NoSQL, which gave rise to a fierce competition of preferences.
What haunts the mind of every aspiring database developer today is the question of choice: to SQL or to NoSQL? We want to keep in touch with the latest trends in technology, but don't want the established technologies to slip away either. However, the most basic point that most people seem to miss is this: SQL and NoSQL are not competitors, and most certainly not opposites of each other.

SQL, or Structured Query Language, is the standard language of relational database management systems today. The relational model behind SQL stores data in tables called relations, which consist of tuples and attributes. While this concept was a hugely successful improvement over the data-storage systems of its time, such as flat files, things have changed today.

NoSQL came as a breath of fresh air in an industry that was rapidly changing. The world is going digital, and the digital world is messy. We can never predict the volume, variety or velocity of incoming data. The data, apart from being unpredictable, is also unstructured. Since relational databases are not inherently adept at handling such data, something else was required. At the same time, distributed computing is all the rage today, because most businesses are moving towards the cloud. The expansion of relational databases cannot keep up with the pace; thus, NoSQL entered the scene.

Why to migrate from SQL to NoSQL

Strictly speaking, NoSQL aims to do what SQL cannot. It is not based on relations and it may sometimes even fail to follow the ACID properties! But unlike what you have been taught, ACID properties, though really useful, are not the ultimate necessity. The ultimate necessity is fault tolerance, and NoSQL manages to achieve that anyway.

NoSQL cannot be defined in a single line, as there is no single definition. While all SQL-based databases follow strict guidelines that adhere to SQL standards, NoSQL gives databases free rein. With so little standardization, one might wonder: are the reasons enough to migrate to NoSQL?

Yes, because we have only scratched the surface of the importance of NoSQL in the modern world. The two biggest reasons why NoSQL trumps SQL are agility and scalability.

With the rapid changes that occur daily in the industry, being agile is the only way to survive. However, relational databases could never hope to achieve that, with their rigid schemas and complex development. These rapid changes are also accompanied by growing data sizes, which require rapid scalability. However, scalability was one aspect that was largely ignored in SQL (as it was designed at a time when the web did not exist). To cope with these issues, NoSQL seems like our best bet.
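
The "rigid schema" point can be sketched in a few lines of Python. This is only an illustration under stated assumptions: SQLite plays the relational database, a plain Python list plays a schema-less document store, and the table, field, and function names are hypothetical.

```python
import sqlite3


def relational_insert(record):
    """Insert into a fixed two-column table; unknown fields are rejected."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
    # Column names come straight from the record keys. That is fine for a
    # demo, but not something to do with untrusted input.
    columns = ", ".join(record)
    placeholders = ", ".join("?" for _ in record)
    try:
        conn.execute(
            f"INSERT INTO users ({columns}) VALUES ({placeholders})",
            list(record.values()),
        )
        return "stored"
    except sqlite3.OperationalError as exc:
        # Raised when the record has a field the table does not define.
        return f"rejected: {exc}"
    finally:
        conn.close()


def document_insert(store, record):
    """Append to a schema-less store; any set of fields is accepted."""
    store.append(dict(record))
    return "stored"


old_style = {"name": "Ann", "email": "ann@example.com"}
new_style = {"name": "Bob", "email": "bob@example.com", "twitter": "@bob"}

print(relational_insert(old_style))
print(relational_insert(new_style))

store = []
print(document_insert(store, old_style))
print(document_insert(store, new_style))
```

In the relational case the new field only fits after a schema change (an ALTER TABLE and usually a migration), whereas the document store simply accepts the record. The trade-off is that the document store no longer guarantees every record has the same shape.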

Why to Bridge SQL and NoSQL

Now that we know how NoSQL differs from SQL, the question arises: why bridge them? Why not adopt NoSQL altogether?

Simply because NoSQL doesn't have the same penetration as SQL. A huge number of companies have their entire existing architecture based on relational databases, which would be quite a headache to change. But that doesn't mean that one has to remain stuck with SQL forever. The best option in such scenarios is to bridge the existing SQL framework with a NoSQL database. The benefit? To put it simply, it brings out "the best of both worlds".

As far as the "bridging" goes, there is no single, simple way to do it. The easiest way would be to use third-party drivers such as those from Easysoft, which provide ODBC-like bridging capabilities. However, as they come from a third-party vendor, they might have their own security and licensing issues.

An alternative approach would be to develop languages that extend SQL functionality to NoSQL databases. One example is N1QL, introduced by Couchbase Server, which extends SQL to JSON.
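
To give a feel for what "SQL over JSON" means, here is a rough Python sketch. It is not N1QL and does not use any Couchbase API: it stores JSON documents as text in SQLite and registers a small helper, `json_get` (a name invented for this example), so that ordinary SQL can reach inside each document. The sample documents are also made up.

```python
import json
import sqlite3

# Hypothetical JSON documents, as a document store might hold them.
documents = [
    {"type": "order", "customer": "ann", "total": 40},
    {"type": "order", "customer": "bob", "total": 15},
    {"type": "order", "customer": "ann", "total": 25},
]


def json_get(doc_text, key):
    """Return one top-level scalar field of a JSON document, or None if missing."""
    return json.loads(doc_text).get(key)


def query_documents(docs, sql):
    """Load documents into an in-memory table and run a SQL query over them."""
    conn = sqlite3.connect(":memory:")
    # Expose the Python helper to SQL as a two-argument function.
    conn.create_function("json_get", 2, json_get)
    conn.execute("CREATE TABLE docs (body TEXT)")
    conn.executemany(
        "INSERT INTO docs VALUES (?)", [(json.dumps(d),) for d in docs]
    )
    result = conn.execute(sql).fetchall()
    conn.close()
    return result


totals = query_documents(
    documents,
    "SELECT json_get(body, 'customer') AS customer, "
    "SUM(json_get(body, 'total')) AS total "
    "FROM docs GROUP BY customer ORDER BY customer",
)
print(totals)
```

The query itself is familiar SQL (SELECT, SUM, GROUP BY), while the data stays in JSON form. A real SQL-for-JSON language goes much further than this, for example with nested fields and arrays, but the basic idea of reusing SQL skills on document data is the same.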

The ways to bridge the gap between these two technologies may differ and evolve, but we can all agree that the co-existence of the two is best for the progress of the industry.


Read more on NO SQL- NOT ONLY SQL here - WhatisNoSQL