










Introduction to Apache Hive and HiveQL: creating databases, tables, and queries
Typology: Lecture notes
Introduction to Hive: Introduction to Hive, Hive Architecture, Hive Data Types, Hive File Formats, Hive Query Language (HQL), User-Defined Functions (UDF) in Hive.
Q) What is Hive? Explain features of Hive.
Hive is a data warehousing tool built on top of Hadoop, used to query structured data and to provide data summarization, query, and analysis. Hive provides HQL (Hive Query Language), which is similar to SQL. Hive compiles HQL queries into MapReduce jobs and then runs the jobs on the Hadoop cluster.
Features of Hive:
• Provides an SQL-like query language (HQL), so users need not write low-level MapReduce code.
• Stores table metadata in a metastore and data in HDFS.
• Follows schema on read: data is checked against the schema at query time, not at load time.
• Supports partitioning and bucketing for faster query processing.
• Designed for batch processing and analytical (OLAP) workloads, not for OLTP-style row-level updates.
Q) Explain various Hive Data Units.
Databases: the namespaces for tables.
Tables: sets of records that have a similar schema.
Partitions: logical separations of data based on classification of the given information as per specific attributes.
Buckets (or clusters): similar to partitions, but a hash function is used to segregate the data and determine the bucket into which a record should be placed.
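The four data units above can be seen together in a short HiveQL sketch (the database, table, and column names here are illustrative, not from the notes):

```sql
-- Database: a namespace for tables
CREATE DATABASE IF NOT EXISTS college;
USE college;

-- Table: records sharing a schema.
-- Partition: one directory per value of branch.
-- Buckets: records hashed on id into 4 files per partition.
CREATE TABLE student (
  id   INT,
  name STRING
)
PARTITIONED BY (branch STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
```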
Q) Explain Hive Architecture with a neat diagram.
External Interfaces - CLI, Web UI, and JDBC/ODBC programming interfaces.
Hive CLI: the most commonly used interface to interact with Hive.
Hive Web Interface: a simple graphical interface to interact with Hive and execute queries.
Thrift Server - a cross-language service framework. This is an optional server and can be used to submit Hive jobs from a remote client.
JDBC/ODBC: jobs can be submitted from a JDBC client; one can write Java code to connect to Hive and submit jobs on it.
Metastore - metadata about the Hive tables and partitions. A metastore consists of the metastore service and a database. There are three kinds of metastore:
Embedded metastore: the metastore service and the database both run inside the same JVM as the Hive service (using an embedded Derby database).
Local metastore: the metastore service still runs in the Hive service process, but the metastore database runs in a separate process and can be on a separate host.
Remote metastore: the metastore service runs in its own JVM, separate from the Hive service, and clients connect to it over Thrift.
Fig. Hive Architecture
Driver - the brain of Hive! Hive queries are sent to the driver for compilation, optimization, and execution.
Q) Explain different data types in Hive.
Hive data types are used for specifying the column/field types in Hive tables. They fall into two broad groups, primitive and complex; let's discuss them one by one:
a. Primitive Data Types in Hive
Primitive data types are further divided as follows:
• Numeric types: TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, DECIMAL.
• String types: STRING, VARCHAR, CHAR.
• Date/time types: TIMESTAMP, DATE.
• Miscellaneous types: BOOLEAN, BINARY.
b. Complex Data Types in Hive
• ARRAY: an ordered collection of elements of the same type.
• MAP: a collection of key-value pairs.
• STRUCT: a collection of named fields, possibly of different types.
• UNIONTYPE: a value that may be any one of several specified types.
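A minimal sketch of a table that exercises several of these types (the table and column names are illustrative):

```sql
-- Mixes primitive types (INT, STRING, DOUBLE, BOOLEAN, DATE)
-- with complex types (ARRAY, MAP, STRUCT)
CREATE TABLE employee_detail (
  id        INT,
  name      STRING,
  sal       DOUBLE,
  is_active BOOLEAN,
  joined    DATE,
  skills    ARRAY<STRING>,
  phones    MAP<STRING, STRING>,
  address   STRUCT<city: STRING, pin: INT>
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
```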
RC File: instead of only partitioning the table horizontally like a row-oriented DBMS, RCFile partitions the table first horizontally (into row groups) and then vertically (by column) to serialize the data.
Fig. Hive RCFile Format
Q) What is Hive Query Language (HQL)? Explain various DDL and DML statements in Hive.
Hive Query Language provides basic SQL-like operations. Basic tasks of HQL are:
• Creating and dropping databases and tables (DDL).
• Loading data into tables.
• Querying and managing the data (DML).
HIVE DML Statements: these statements are used to retrieve, store, modify, delete, and update data in the database. The DML commands are as follows:
hive> create table emp(id int, name string, sal double) row format delimited fields terminated by ',';
OK
Time taken: 8.331 seconds

hive> show tables;
OK
emp

hive> create table dept(eid int, dept string) row format delimited fields terminated by '@';
OK

hive> select * from dept;
OK
2 cse
3 mca
4 cse
Time taken: 0.133 seconds, Fetched: 4 row(s)
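The transcripts above show the tables already populated but omit the load step; assuming delimited input files at illustrative paths (the file names are not from the notes), the data could have been loaded as:

```sql
-- emp.txt: lines like "1,chp,50000.0" (comma-delimited, matching emp's delimiter)
LOAD DATA LOCAL INPATH '/chp/emp.txt' INTO TABLE emp;

-- dept.txt: lines like "2@cse" ('@'-delimited, matching dept's delimiter)
LOAD DATA LOCAL INPATH '/chp/dept.txt' INTO TABLE dept;
```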
Queries:
hive> select emp.name from emp left outer join dept on (emp.id = dept.eid) where dept.dept = 'cse' or dept.dept = 'mca';
pnr
kry
Q) Briefly explain joins in Hive.
JOIN is a clause used to combine specific fields from two tables by using values common to both. It combines records from two or more tables in the database and is more or less similar to SQL JOIN.
Inner join: The HiveQL INNER JOIN returns all the rows which are common in both the tables.
3 kry
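The output fragment above ("3 kry") is consistent with an inner join on the emp and dept tables from earlier; a sketch of such a query (the exact query did not survive in the notes):

```sql
-- emp(id, name, sal) joined with dept(eid, dept);
-- dept contains (3, 'mca'), and id 3 maps to name 'kry' in emp
SELECT emp.id, emp.name
FROM emp JOIN dept ON (emp.id = dept.eid)
WHERE dept.dept = 'mca';
```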
Q) Explain about various functions in Hive.
String functions:
Math Functions:
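The list of math-function examples did not survive at this point in the notes; a few standard Hive math functions, sketched on the emp table from earlier:

```sql
-- round, floor, ceil and sqrt are built-in Hive math functions
SELECT sal,
       round(sal) AS rounded,
       floor(sal) AS floored,
       ceil(sal)  AS ceiled,
       sqrt(sal)  AS root
FROM emp;
```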
3. Find the length of the emp names.
hive> select name, length(name) from emp;
chp 3
pnr 3
kry 3
Aggregate functions:
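The aggregate-function examples are missing here; the standard Hive aggregates applied to the emp table would look like:

```sql
-- count, sum, avg, max and min are Hive's basic aggregate functions
SELECT count(*) AS num_emps,
       sum(sal) AS total_sal,
       avg(sal) AS avg_sal,
       max(sal) AS max_sal,
       min(sal) AS min_sal
FROM emp;
```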
Q) Differentiate Sort By and Order By in Hive.
Hive's sort by and order by commands are used to fetch data in sorted order. The main differences between the sort by and order by commands are given below.
Sort by: May use multiple reducers for final output. Only guarantees ordering of rows within a reducer. May give partially ordered result.
Order by: Uses single reducer to guarantee total order in output.
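On the emp table, the two commands differ only when more than one reducer runs; a sketch:

```sql
-- Total order: a single reducer sorts all rows
SELECT * FROM emp ORDER BY sal DESC;

-- Per-reducer order only: each reducer's output is sorted,
-- but the concatenated result may be only partially ordered
SELECT * FROM emp SORT BY sal DESC;
```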
Q) Write a query in hive to find word count in a text file. (or) Give an example for subquery in Hive
hello.txt (input text file):
hello welcome to guntur
hello welcome to vignan
welcome to cse
Creating a table for input: hive> create table t2(st string);
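The notes break off before the actual query; loading hello.txt and counting words uses split, explode, and a subquery (the standard Hive word-count pattern):

```sql
-- Load each line of hello.txt as one row of t2
LOAD DATA LOCAL INPATH '/chp/hello.txt' INTO TABLE t2;

-- Inner query (subquery): split each line on spaces and
-- explode the resulting array into one word per row.
-- Outer query: group the words and count them.
SELECT word, count(1) AS wordcount
FROM (SELECT explode(split(st, ' ')) AS word FROM t2) w
GROUP BY word;
```

For the hello.txt shown above this yields welcome 3, to 3, hello 2, guntur 1, vignan 1, cse 1 (row order is not guaranteed).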
Q) Explain Indexing in Hive.
In a Hive table, there are many rows and columns. If we want to perform queries on only some columns without indexing, it will take a large amount of time because the queries will be executed against all the data present in the table.
Indexes are maintained in a separate table in Hive so that it won’t affect the data inside the table, which contains the data.
It is advisable to build indexes on the columns on which you frequently perform operations.
Building a large number of indexes, however, also degrades the performance of your queries.
Types of Indexes in Hive:
• Compact Indexing
• Bitmap Indexing
Differences between Compact and Bitmap Indexing
Compact indexing stores the pair of the indexed column's value and its block id. Bitmap indexing stores the combination of the indexed column's value and the list of rows as a bitmap.
Creating compact index: Syntax: hive> create index index_name on table table_name(columns,...) as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' with deferred rebuild;
Here, in place of index_name we can give any name of our choice, which will be the index's name.
• In the ON TABLE line, we give the table_name for which we are creating the index, and in brackets the names of the columns on which the index is to be created. We should specify only columns that are available in the table.
• The 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' line specifies that the built-in CompactIndexHandler will act on the created index, which means we are creating a compact index for the table.
• The WITH DEFERRED REBUILD clause should be present in the created index because we need to alter (rebuild) the index in later stages using this statement.
Eg.
hive>create index emp_index on table emp(name,sal) as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' with deferred rebuild;
Creating a bitmap index:
Syntax:
hive> create index index_name on table table_name(columns,...) as 'bitmap' with deferred rebuild;
Eg.
hive> create index dept_index on table dept(eid) as 'bitmap' with deferred rebuild;
Q) Find the different indexes on tables emp and dept.
hive> show formatted index on emp;
idx_name   tab_name  col_names  idx_tab_name              idx_type
emp_index  emp       name, sal  default__emp_emp_index__  compact

hive> show formatted index on dept;
idx_name    tab_name  col_names  idx_tab_name                idx_type
dept_index  dept      eid        default__dept_dept_index__  bitmap
1) Update index emp_index:
hive> alter index emp_index on emp rebuild;
2) Drop index emp_index:
hive> drop index if exists emp_index on emp;
Q) Explain Partitioning in Hive.
In Hive, a query reads the entire dataset even when a where-clause filter is specified on a particular column. This becomes a bottleneck in most MapReduce jobs, as it involves a huge amount of I/O.
So it is necessary to reduce the I/O required by the MapReduce job to improve the performance of the query. A very common method of reducing I/O is data partitioning.
Partitions split the larger dataset into more meaningful chunks. Hive provides two kinds of partitions:
Static partition: static partitions comprise columns whose values are known at compile time.
Dynamic partition: dynamic partitions comprise columns whose values are known only at execution time; the partition a row belongs to is decided while the data is being loaded.
Eg.
std1.txt (input text file):
1001
1002

2) Load data into std_partition from std1.txt with partitioned column branch as cse.
4) Drop the partition branch='cse' from the partitioned table dept_partition:
hive> alter table dept_partition drop partition(branch='cse');
Dropping the partition branch=cse
OK
Time taken: 1.737 seconds
hive> select * from dept_partition;
OK
3 mca
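The numbered example above survives only in part (steps 1 and 3 are missing from the notes); a hedged reconstruction of a complete static-partition walkthrough, reusing the same table and file names:

```sql
-- 1) Create a table partitioned on branch
--    (the partition value is supplied statically at load time)
CREATE TABLE std_partition (id INT)
PARTITIONED BY (branch STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- 2) Load std1.txt into the branch='cse' partition
LOAD DATA LOCAL INPATH '/chp/std1.txt'
INTO TABLE std_partition PARTITION (branch = 'cse');

-- 3) Each partition is a subdirectory of the table directory on HDFS
--    (e.g. .../std_partition/branch=cse/), so this query reads only one directory
SELECT * FROM std_partition WHERE branch = 'cse';
```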
Q) Explain how to create buckets in Hive.
Bucketing is similar to partitioning. However, there is a difference between the two. In partitioning, we need to create a partition for each unique value of the column, which may lead to situations where you end up with thousands of partitions. This can be avoided by using bucketing, in which we can limit the number of buckets that are created. A bucket is a file, whereas a partition is a directory. Eg. Set the below property to enable bucketing:
set hive.enforce.bucketing=true;
Creating a bucketed table on dept:
hive> create table dept_bucket(eid int, dept string) clustered by (dept) into 2 buckets;

Loading data:
hive> from dept insert overwrite table dept_bucket select eid, dept;

To display the contents of the 1st bucket:
hive> select * from dept_bucket tablesample(bucket 1 out of 2 on dept);
Q) Explain about SERDE.
SerDe stands for Serializer/Deserializer. Hive uses a SerDe to read rows from and write rows to HDFS files: the deserializer converts a record's bytes into a row object when reading, and the serializer converts a row object back into bytes when writing. This lets Hive work with data in formats such as XML or JSON.
Input: input.xml
<employee><empid>1001</empid><name>chp</name><designation>Analyst</designation></employee>
<employee><empid>1002</empid><name>praneeth</name><designation>Team Lead</designation></employee>
hive> create table xmlsample(xmldata string);
hive> load data local inpath '/chp/input.xml' into table xmlsample;
hive> create table xpath_table as
select xpath_int(xmldata, 'employee/empid'),
xpath_string(xmldata, 'employee/name'),
xpath_string(xmldata, 'employee/designation')
from xmlsample;
select * from xpath_table;
Output:
1001 chp Analyst
1002 praneeth Team Lead
Q) Explain about User Defined Functions (UDF) in Hive with an example.
In Hive, we can use custom functions by defining user defined functions (UDFs). Eg. a Hive function to convert the values of a field to uppercase:

package com.example.hive.udf;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;

@Description(name = "SampleUDFExample")
public final class MyUpperCase extends UDF {
    public String evaluate(final String word) {
        return word.toUpperCase();
    }
}

Note: we have to package this Java class into a jar.

hive> add jar /chp/hivedemo/UpperCase.jar;
hive> create temporary function touppercase as 'com.example.hive.udf.MyUpperCase';
hive> select touppercase(name) from emp;
output:
CHP
PNR
KRY