Tuesday, 26 November 2013

ZooKeeper (Apache Hadoop project)

The Apache Hadoop project is a collection of many subprojects. ZooKeeper, now a top-level project in its own right, is one of them and is notable for its wide applicability in building distributed systems.

ZooKeeper is a distributed, open-source coordination service for distributed applications. Very large Hadoop clusters can be coordinated by a group of ZooKeeper servers, which keeps the service available even when an individual server fails. In the Hadoop project it is used to manage master election and to store other process metadata.

For example:
A Hadoop cluster has several types of nodes: a master node and multiple worker nodes. If the master node fails, its role has to be transferred to a different node. ZooKeeper coordinates this handover, so that the tasks are assigned to the new master node.
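
The master handover described above can be sketched in a few lines of Python. This is only an in-memory simulation of one common election recipe (each candidate gets an increasing sequence number, and the lowest surviving number is the master). It does not talk to a real ZooKeeper server, and the class and node names are hypothetical.

```python
from typing import Dict, Optional


class ElectionSim:
    """In-memory stand-in for candidates registered under an election path."""

    def __init__(self) -> None:
        self._next_seq = 0
        self._candidates: Dict[int, str] = {}  # sequence number -> node name

    def join(self, node: str) -> int:
        """Register a candidate; it receives the next sequence number."""
        seq = self._next_seq
        self._next_seq += 1
        self._candidates[seq] = node
        return seq

    def fail(self, node: str) -> None:
        """Simulate a node failure: its registration disappears."""
        self._candidates = {
            seq: name for seq, name in self._candidates.items() if name != node
        }

    def master(self) -> Optional[str]:
        """The candidate with the lowest sequence number is the master."""
        if not self._candidates:
            return None
        return self._candidates[min(self._candidates)]


sim = ElectionSim()
for name in ("node-a", "node-b", "node-c"):
    sim.join(name)

print(sim.master())  # node-a
sim.fail("node-a")   # the current master goes down
print(sim.master())  # node-b takes over
```

In a real cluster the registrations would live in ZooKeeper itself, so that every node sees the same list of candidates.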

Saturday, 16 November 2013

Apache Hive & Hive Query Language

Apache Hive is an open-source data warehouse system based on Hadoop and is used for ad-hoc querying, data summarization and analyzing large datasets stored in Hadoop files.

Hive was initially developed by Facebook to analyze its petabytes of data and to let its SQL developers work with the Hadoop platform by writing Hive Query Language (HQL) statements. It is now used and developed by other companies as well.

HiveQL is a simple language similar to SQL. Hive converts these SQL-like queries into MapReduce jobs that are executed on Hadoop, and HiveQL also supports custom MapReduce scripts.
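
To give a feel for how a SQL-like query can turn into a MapReduce job, here is a small Python sketch. It is not Hive's actual query planner. It only mimics the map, shuffle and reduce steps for one GROUP BY query, and the table, rows and function names are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical rows of an "employees" table: (name, department)
ROWS = [("asha", "sales"), ("ben", "hr"), ("chen", "sales"), ("dee", "it")]

# SQL-like query this sketch mimics:
#   SELECT department, COUNT(*) FROM employees GROUP BY department


def map_phase(rows: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, int]]:
    """Emit (department, 1) for every row."""
    for _name, department in rows:
        yield department, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group the emitted values by key, as the framework does between phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum the values of each key to get the row count per department."""
    return {key: sum(values) for key, values in grouped.items()}


result = reduce_phase(shuffle(map_phase(ROWS)))
print(sorted(result.items()))  # [('hr', 1), ('it', 1), ('sales', 2)]
```

On a real cluster the map and reduce steps run in parallel on many machines. Here everything runs in one process just to show the data flow.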

Hive is intended for queries that run over huge datasets, where the work is spread across the Hadoop cluster. It can be run from a command line interface or from a Java Database Connectivity (JDBC) or Open Database Connectivity (ODBC) application.

All about Informatica Session Properties

This post is just an introduction for Informatica beginners who want to know more about Informatica sessions and their properties.

A session is a running instance of a mapping. We can create one or more sessions for a single mapping. Generally one session per mapping is enough, but for parallel data loading we may create multiple sessions.
Now we will go through the following tabs in the session.

  • General Tab
  • Properties Tab
  • Mapping Tab & Config Object

Sunday, 10 November 2013

Informatica Session Components Tab: Pre-session and Post-session Command

Pre-Session Command:
We can define operating system commands or programs to be executed before the data loading process (session process) starts. They can be defined as reusable commands (in the form of Command tasks) or as non-reusable commands (defined directly in this property). Some uses of such commands or programs are:

  • To enable or disable database users before data loading
  • To make a backup copy of the target tables so that the old data can be restored if the data loading fails
  • To notify users by email that the data loading succeeded so that they can start their analysis

Informatica Session: Sources and Targets Properties and Connection

Source Connection
Define a source data connection for each source qualifier. If the base tables for the source qualifiers come from different databases, choose one of those databases as the connection database for all source qualifiers. The connection database must then be granted SELECT privilege on all the tables in the other databases. This task is done at the database level.

Target Connection
Select Relational Writer if the target is a table, or File Writer if the target is a text file, i.e. a flat file. If File Writer is selected, use the 'Set file properties' button to define the file structure (delimiter, text qualifier, etc.) and to specify the character written in place of nulls in the flat file.
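
To illustrate what the flat file properties mean, here is a small Python sketch that writes rows with a chosen delimiter, text qualifier and null replacement character. It is not Informatica code. The function name, the default values and the sample rows are only assumptions for illustration.

```python
import csv
import io
from typing import Iterable, Sequence


def write_flat_file(
    rows: Iterable[Sequence[object]],
    delimiter: str = "|",
    text_qualifier: str = '"',
    null_char: str = "*",
) -> str:
    """Return the rows as flat-file text using the given file properties."""
    buffer = io.StringIO()
    writer = csv.writer(
        buffer,
        delimiter=delimiter,
        quotechar=text_qualifier,
        quoting=csv.QUOTE_NONNUMERIC,  # wrap every text field in the qualifier
        lineterminator="\n",
    )
    for row in rows:
        # Replace nulls (None) with the configured character.
        writer.writerow([null_char if value is None else value for value in row])
    return buffer.getvalue()


print(write_flat_file([(1, "Smith", None), (2, "O|Neil", "HR")]))
```

The text qualifier matters for the second row: the name contains the delimiter character, and the surrounding quotes keep it from being read as two fields.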

Properties Page
If Relational Writer is selected, set the following properties:
  • Target load type: Normal or Bulk
  • Insert: if this option is unchecked, the target table cannot receive new records.
  • Delete: if this option is unchecked, records cannot be deleted from the target table.

Informatica Session: Properties Tab

Below are the properties defined under the session Properties tab.

  • Enter the session log file name (any name)
  • Define the session log file directory using the $PMSessionLogDir system variable, or give the complete path (/runbatch/session/test_session.log).

Informatica Session Properties: General Tab

Below are some of the session properties listed under the General tab.

  • Rename: defines the session name
  • Fail parent if this task fails: enable this option if the parent (the workflow or worklet that contains this session) should also fail when the session fails. For example: the first session loads data for the Employee table and the second session, connected by a sequential link, loads data for the Department table. If the Department session fails to load data, the whole load should be treated as failed, since an Employee without a Department is not possible.