Wednesday, 3 February 2016

What is Cloudera Impala? Impala vs Hive

Cloudera Impala is an open-source massively parallel processing (MPP) SQL query engine, and one of the leading analytic engines of its kind, that runs natively in Apache Hadoop. The Cloudera Impala project was announced in October 2012 and, after a successful beta test distribution, became generally available in May 2013. Its intended users are analysts doing ad hoc queries over the massive data sets stored in Hadoop.

The main feature of Impala is that it lets us run low-latency ad hoc SQL queries directly on the data stored in a cluster, whether in unstructured flat files in the file system or in structured HBase tables, without requiring data movement or transformation. Performance improves because we need not migrate data sets to dedicated processing systems or convert data formats prior to analysis.

Another important feature of Impala is that it works with the same data formats, metadata, security and resource management frameworks used by MapReduce, Apache Hive, Apache Pig and other components of the Hadoop stack.

Impala also supports the common Hadoop file formats, including the newer Apache Parquet format. Apache Parquet is a columnar storage format for the Hadoop ecosystem, created to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.
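To see why a columnar layout compresses and scans well, here is a minimal sketch in Python. It is not Parquet's actual file format; the sample rows, the column names and the choice of run-length encoding are assumptions made purely for illustration.

```python
from itertools import groupby

# Hypothetical sample rows; the column names are made up for illustration.
rows = [
    {"country": "US", "status": "ok", "bytes": 120},
    {"country": "US", "status": "ok", "bytes": 300},
    {"country": "US", "status": "error", "bytes": 50},
    {"country": "IN", "status": "ok", "bytes": 75},
    {"country": "IN", "status": "ok", "bytes": 90},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


columns = to_columns(rows)

# Values of one column sit next to each other, so repeats compress well.
encoded_country = run_length_encode(columns["country"])

# An aggregate over one column touches only that column's values.
total_bytes = sum(columns["bytes"])
```

Here the five country values shrink to two (value, count) pairs, and summing the bytes column never reads the country or status values at all.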

Impala queries are executed as follows:

  • Queries are submitted using the impala-shell command-line tool, or from a business application through an ODBC or JDBC driver.
  • Impala's distributed query engine builds the query plan and distributes it across the cluster.
  • A separate Impala daemon (impalad) runs on each data node and responds to impala-shell. These daemons can return data quickly without having to go through a whole MapReduce job.
  • Impalad is a process that runs on designated nodes in the cluster. It coordinates and runs queries.
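The flow above can be pictured as a simple scatter-gather. The sketch below is a toy model in plain Python, not Impala code: the node names, the sample rows and the functions `run_fragment` and `coordinate` are all hypothetical, and threads stand in for the daemons on separate machines.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cluster: each "node" holds a local slice of one table
# as (name, age) rows.
NODE_DATA = {
    "node1": [("alice", 34), ("bob", 19)],
    "node2": [("carol", 52), ("dave", 41)],
    "node3": [("erin", 27)],
}


def run_fragment(node, min_age):
    """Work done by one daemon: scan local rows, return a partial result."""
    local_rows = NODE_DATA[node]
    return [name for name, age in local_rows if age >= min_age]


def coordinate(min_age):
    """Coordinator: send the same plan fragment to every node, then merge."""
    with ThreadPoolExecutor(max_workers=len(NODE_DATA)) as pool:
        partials = pool.map(lambda node: run_fragment(node, min_age), NODE_DATA)
        merged = [name for partial in partials for name in partial]
    return sorted(merged)


result = coordinate(min_age=30)
```

Each node filters only the rows it stores locally, and the coordinator's only job is to merge the partial results, which is roughly the division of labour described in the steps above.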

Comparison With Hive 

Compared with Hive and MapReduce, both of which are optimized for long-running batch-oriented tasks such as ETL (read more: What is ETL), Impala is better suited to running interactive analytical SQL queries over small subsets of a huge data set. What makes it different from Hive is that Impala does not rely on MapReduce: it avoids the start-up overhead of MapReduce jobs and instead uses its own set of execution daemons, which need to be installed alongside your data nodes.
Hive is intended as a data warehouse system within the Hadoop ecosystem, supporting easy data aggregation and ad hoc queries over large data sets stored in HDFS, whereas Cloudera Impala is a query engine for data stored in HDFS and HBase.
Both Hive and Impala support the existing HiveQL (a SQL-like language; read more: Apache Hive).

Because Impala and Hive share the same metastore database, their tables are often used interchangeably. This cross-compatibility applies to Hive tables that use Impala-compatible types for all columns.

Partitions in Impala 

Just as large-scale data warehouses use partitioned tables (read more: Partitions in Oracle) to speed up queries, Impala uses partitioned tables too. Data is partitioned based on the values in one or more columns, and instead of looking up one row at a time from widely scattered items, the rows with identical partition keys are physically grouped together. Impala also takes advantage of the partitioning present in Hive tables.
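A small Python sketch shows why grouping rows by partition key speeds up queries. It is only an illustration of the idea: the sales rows, the choice of year as the partition key and the helper names are assumptions, and a real engine keeps each partition in its own directory of files rather than in a dictionary.

```python
from collections import defaultdict

# Hypothetical sales rows: (year, product, amount).
rows = [
    (2014, "widget", 10.0),
    (2015, "widget", 12.5),
    (2015, "gadget", 7.5),
    (2016, "gadget", 20.0),
]


def partition_by_year(rows):
    """Group rows physically by their partition key (the year)."""
    partitions = defaultdict(list)
    for year, product, amount in rows:
        partitions[year].append((product, amount))
    return dict(partitions)


def total_for_year(partitions, year):
    """Return (total amount, rows scanned), reading only one partition."""
    scanned = partitions.get(year, [])
    return sum(amount for _, amount in scanned), len(scanned)


partitions = partition_by_year(rows)
total, rows_scanned = total_for_year(partitions, 2015)
```

A query that filters on the partition key reads two rows here instead of all four; the other partitions are never touched.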

Cloudera Impala makes use of the following two technologies:
  • Columnar Storage: Because data is stored in columnar fashion, it achieves a high compression ratio and efficient scanning.
  • Tree Architecture: The architecture forms a massively parallel, distributed, multi-level serving tree that pushes a query down the tree and then aggregates the results from the leaves.
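The serving tree can be sketched as a recursive function: inner nodes forward the query to their children and combine the partial results, while leaves scan their local data. The tree shape, the values and the `execute` function below are hypothetical and exist only to illustrate the pattern.

```python
# Hypothetical serving tree: leaves hold local column values,
# inner nodes only hold children.
tree = {
    "children": [
        {"children": [{"values": [3, 5]}, {"values": [7]}]},
        {"children": [{"values": [1, 1, 4]}, {"values": []}]},
    ]
}


def execute(node, predicate):
    """Push the predicate down the tree; return a partial (count, sum)."""
    if "values" in node:
        # Leaf: scan the local data and aggregate it on the spot.
        matched = [v for v in node["values"] if predicate(v)]
        return len(matched), sum(matched)
    # Inner node: forward the query to the children, combine their partials.
    partials = [execute(child, predicate) for child in node["children"]]
    return sum(c for c, _ in partials), sum(s for _, s in partials)


count, total = execute(tree, lambda v: v >= 3)
```

Only small (count, sum) pairs travel back up the tree, never the raw rows, which is what keeps the upper levels cheap.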

Impala provides the following benefits:

  • Efficient resource usage: Impala can handle concurrent client requests in a shared workload environment. Each Impala daemon can handle multiple concurrent client requests.
  • Impala does not provide the query fault tolerance that Hive does: if a node fails in the middle of processing, the whole query has to be re-run. Even so, Impala's total runtime is usually short enough that starting over makes up for the lost time.
  • Time savings, because you do not have to move data around and Impala does not write intermediate results to disk.
  • Supports Hadoop Security (Kerberos authentication) and role-based authorization through the Apache Sentry project.
  • Far-reaching accessibility of Hadoop data to the business community.
  • More complete analysis of full raw and historical data, without information loss from aggregations or conforming to fixed schemas.

Tuesday, 26 January 2016

Google BigQuery: An externalized version of Dremel

So what is Google BigQuery? It is a powerful Big Data analytics platform used by all types of organizations to run SQL-like queries against multiple terabytes of data in a matter of seconds. With this cloud-based interactive query service we can handle web-scale amounts of data at blazing speed.

BigQuery (released in 2010) is the external, public implementation of one of Google’s core technologies, Dremel. BigQuery makes the features of Dremel available to third parties while preserving its unparalleled query performance. The two in fact share the same underlying architecture and performance characteristics.

The BigQuery release made it possible to use the power of Dremel and to take advantage of Google’s massive computational infrastructure.

Let’s take a deeper look at the power of Dremel. It is a query service that allows you to run SQL-like queries against very large data sets and get accurate results in mere seconds. You need only a basic knowledge of SQL to query extremely large data sets in an ad hoc manner.

Dremel runs across tens of thousands of servers simultaneously and makes it easy to analyse large amounts of data, such as a collection of web documents, a library of digital books, or even the data describing millions of spam messages.

“According to Google’s paper, this has been used inside Google since 2006, with “thousands” of Googlers using it to analyse everything from the software crash reports for various Google services to the behavior of disks inside the company’s data centers”

The two core technologies that make Dremel and BigQuery so fast are Dremel's tree architecture and columnar storage, which gives a very high compression ratio and scan throughput.

So how do we use data in BigQuery, or import data into it?

  • Upload your data to Google Cloud Storage
  • Import the files into BigQuery using the command-line tool, the web UI or the API, which can typically import roughly 100 GB within half an hour.
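As a rough back-of-the-envelope check, the import figure quoted above can be turned into an average throughput. The sketch below assumes decimal units (1 GB = 1000 MB) and assumes that import time scales linearly with size, which the figure itself does not guarantee; the function names are made up for this example.

```python
def throughput_mb_per_s(gigabytes, minutes):
    """Average throughput in MB/s, using decimal units (1 GB = 1000 MB)."""
    return gigabytes * 1000 / (minutes * 60)


def estimated_minutes(gigabytes, rate_mb_per_s):
    """Minutes needed to import a data set at a constant rate."""
    return gigabytes * 1000 / rate_mb_per_s / 60


# Roughly 100 GB in half an hour works out to about 55.6 MB/s.
rate = throughput_mb_per_s(100, 30)

# Under the linear-scaling assumption, 1 TB would take about 300 minutes.
one_tb_minutes = estimated_minutes(1000, rate)
```

Treat the result as an order-of-magnitude estimate only; real import times depend on file format, file count and how the load is parallelised.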

Other Important Features of Google BigQuery:
  • BigQuery is designed to handle structured data using SQL. Apart from SQL queries, we can easily read and write data in BigQuery via Cloud Dataflow, Spark, and Hadoop.
  • BigQuery provides extremely high full-scan performance for ad hoc queries, and is cost effective compared to traditional data warehouse solutions and appliances.
  • BigQuery is well suited to ad hoc OLAP/BI queries that require results as fast as possible.
  • BigQuery requires no capacity planning, provisioning, 24x7 monitoring or operations, nor does it require manual security patch updates. You simply upload datasets to Google Cloud Storage in your account, import them into BigQuery, and let Google’s experts manage the rest.
