In this post we will look at the different storage file formats and record formats in Hive.
Before we move forward, let's briefly discuss Apache Hive.
Apache Hive, first created at Facebook, is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a means to project structure onto this data and to query it using a SQL-like language called HiveQL.
TEXTFILE
Among the different storage file formats used in Hive, the default and simplest is TEXTFILE.
The data in a TEXTFILE is stored as plain text, one line per record. TEXTFILE is very useful for sharing data with other tools and for cases where you want to edit the data in the file manually. However, TEXTFILE is less efficient than the other formats, in both storage space and query speed.
CREATE TABLE TEXTFILE_TABLE (
COLUMN1 STRING, -- placeholder columns; the original column list was not preserved
COLUMN2 INT
) STORED AS TEXTFILE;
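Because a TEXTFILE stores one record per line, the record format (how a line is split into fields) is usually declared together with the file format. The following is a minimal sketch, assuming comma-separated input; the table name, column names, and file path are hypothetical and not from the original post.

```sql
-- Hypothetical table: one record per line, fields separated by commas.
CREATE TABLE page_views_text (
  user_id   INT,
  url       STRING,
  view_time STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

-- Load a local CSV file into the table (the path is an assumption).
LOAD DATA LOCAL INPATH '/tmp/page_views.csv' INTO TABLE page_views_text;
```

If no ROW FORMAT clause is given, Hive falls back to its default field delimiter, the Ctrl-A character ('\001'), which is awkward to type by hand, so an explicit delimiter is the usual choice for files produced by other tools.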
SEQUENCE FILE
In sequence files the data is stored in a binary format consisting of binary key-value pairs. A complete row is stored as a single binary value. Sequence files are more compact than text and fit well with the MapReduce output format. They can also be compressed at the record level or the block level to improve their I/O profile further.
SEQUENCEFILE is a standard format supported by Hadoop itself and is a good choice for Hive table storage, especially when you want to integrate Hive with other technologies in the Hadoop ecosystem.
The STORED AS SEQUENCEFILE clause lets you create a table backed by sequence files. Here is an example statement:
CREATE TABLE SEQUENCEFILE_TABLE (
COLUMN1 STRING, -- placeholder columns; the original column list was not preserved
COLUMN2 INT
) STORED AS SEQUENCEFILE;
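The block compression mentioned above is enabled through session settings before the table is populated. The sketch below assumes a MapReduce-based Hive setup and uses the older `mapred.*` property names (newer Hadoop releases also accept `mapreduce.output.fileoutputformat.compress.type` and `mapreduce.output.fileoutputformat.compress.codec`). The table and column names are hypothetical.

```sql
-- Hypothetical source and target tables with the same columns.
CREATE TABLE logs_text (log_time STRING, message STRING) STORED AS TEXTFILE;
CREATE TABLE logs_seq  (log_time STRING, message STRING) STORED AS SEQUENCEFILE;

-- Ask Hive to compress query output, and use block-level compression
-- for the sequence files it writes.
SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

-- Rewrite the text data as block-compressed sequence files.
INSERT OVERWRITE TABLE logs_seq
SELECT log_time, message FROM logs_text;
```

Block compression groups many records together before compressing them, which generally gives a better ratio than compressing each record on its own.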
Due to the complexity of reading sequence files, they are often only used for “in flight” data such as intermediate data storage used within a sequence of MapReduce jobs.
RCFILE OR RECORD COLUMNAR FILE
The RCFILE is one more file format that can be used with Hive. The RCFILE stores columns of a table in a record columnar format rather than row oriented fashion and provides considerable compression and query performance benefits with highly efficient storage space utilization. Hive added the RCFile format in version 0.6.0.
The RCFile format is most useful when tables have a large number of columns but only a few of them are typically retrieved.
RCFile is designed to provide the following features:
- Fast data loading
- Fast query processing
- Efficient storage space utilization
- Adaptivity to dynamic data access patterns
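CREATE TABLE RCFILE_TABLE (
COLUMN1 STRING, -- COLUMN1 to COLUMN3 are placeholders; only COLUMN4 survived in the original
COLUMN2 STRING,
COLUMN3 INT,
COLUMN4 INT ) STORED AS RCFILE;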
A compressed RCFile significantly reduces I/O and storage compared with text, sequence file, and other row-oriented formats. Column-wise compression is more efficient here because the values within a single column tend to be similar.
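To illustrate the case where a wide table is stored as RCFile but only a few columns are read, here is a minimal sketch. All table and column names are hypothetical; the point is the pattern of loading from a text table and then querying a subset of columns.

```sql
-- Hypothetical staging table in plain text.
CREATE TABLE events_text (
  event_id   INT,
  user_id    INT,
  payload    STRING,
  created_at STRING
) STORED AS TEXTFILE;

-- Same schema stored in record columnar format.
CREATE TABLE events_rc (
  event_id   INT,
  user_id    INT,
  payload    STRING,
  created_at STRING
) STORED AS RCFILE;

-- Convert the row-oriented text data to RCFile.
INSERT OVERWRITE TABLE events_rc
SELECT event_id, user_id, payload, created_at FROM events_text;

-- Only user_id is referenced, so the payload and created_at
-- column data does not need to be read.
SELECT user_id, COUNT(*) AS event_count
FROM events_rc
GROUP BY user_id;
```

Data files cannot be converted by loading text files directly into an RCFile table; the INSERT ... SELECT step is what rewrites the rows in columnar form.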
ORC FILE OR OPTIMIZED ROW COLUMNAR FILE
ORC stands for Optimized Row Columnar. It is a newer Hive file format that was created to improve on the RCFILE format when processing data. The ORC file format was introduced in Hive 0.11 and cannot be used with earlier versions.
ORC files include lightweight indexes to improve performance.
ORC also uses type-specific encoders for different column data types to improve compression further, e.g. variable-length encoding for integers.
ORC stores collections of rows (called stripes) in one file, and within each collection the row data is stored in a columnar format. This allows the row collections to be processed in parallel across a cluster.
ORC files generally compress better than RC files, which reduces I/O and helps queries run faster. To use it, just add STORED AS ORC to the end of your CREATE TABLE statement like this:
CREATE TABLE mytable (
COLUMN1 STRING, -- placeholder columns; the original column list was not preserved
COLUMN2 INT
) STORED AS ORC;
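ORC compression can also be chosen per table through table properties. The following is a sketch rather than a recommended configuration: the table and column names are hypothetical, and SNAPPY is just one of the codecs accepted by `orc.compress` (ZLIB is the default, NONE disables compression).

```sql
-- Hypothetical ORC table with an explicit compression codec.
CREATE TABLE sales_orc (
  sale_id INT,
  region  STRING,
  amount  DOUBLE
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");

-- An existing table can be copied into ORC with CREATE TABLE AS SELECT.
-- sales_text is assumed to exist already with the same columns.
CREATE TABLE sales_orc_copy
STORED AS ORC
AS SELECT sale_id, region, amount FROM sales_text;
```

SNAPPY typically trades some compression ratio for faster reads and writes, while ZLIB produces smaller files at a higher CPU cost.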