HIVE

Introduction to Hive: Introduction to Hive, Hive Architecture, Hive Data Types, Hive File Format, Hive Query Language (HQL), User-Defined Function (UDF) in Hive.

Q) What is Hive? Explain features of Hive.

Hive is a data warehousing tool built on top of Hadoop, used to query structured data and to provide data summarization, query, and analysis. Hive provides HQL (Hive Query Language), which is similar to SQL. Hive compiles HQL queries into MapReduce jobs and then runs the jobs on the Hadoop cluster.

Features of Hive:

  • HQL is similar to SQL
  • HQL is easy to code
  • Hive supports rich datatypes such as structs, lists and maps
  • Hive supports SQL filters, group-by and order-by clauses
  • Custom types and custom functions can be defined.

Q) Explain various Hive Data Units

  • Databases: the namespace for tables.
  • Tables: sets of records that have a similar schema.
  • Partitions: logical separations of data, classifying the given information by specific attributes.
  • Buckets (or clusters): similar to partitions, but a hash function segregates the data and determines the bucket into which each record is placed.
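A minimal HiveQL sketch of how these units nest (the database, table and column names here are illustrative, not from the notes):

hive> create database demo;              -- database: a namespace for tables
hive> use demo;
hive> create table logs(msg string)      -- table: records sharing a schema
    > partitioned by (dt string)         -- partition: one directory per dt value
    > clustered by (msg) into 4 buckets; -- buckets: hash(msg) picks one of 4 files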

Q) Explain Hive Architecture with a neat diagram.

External interfaces - CLI, WebUI, JDBC, ODBC programming interfaces.

Hive CLI: the most commonly used interface to interact with Hive.

Hive Web Interface: a simple graphical interface to interact with Hive and to execute queries.

Thrift Server - a cross-language service framework. This is an optional server and can be used to submit Hive jobs from a remote client.

JDBC/ODBC: jobs can be submitted from a JDBC client; one can write Java code to connect to Hive and submit jobs on it.

Metastore - metadata about the Hive tables and partitions. A metastore consists of the metastore service and a database. There are three kinds of metastore:

  1. Embedded metastore: mainly used for unit tests. Only one process can connect to the metastore at a time. This is the default metastore for Hive and uses an Apache Derby database.
  2. Local metastore: metadata is stored in an external RDBMS, such as MySQL, which allows multiple connections at a time. In this mode, the Hive metastore service runs in the main Hive server process, but the metastore database runs in a separate process and can be on a separate host.
  3. Remote metastore: the Hive driver and the metastore service run in different JVMs.
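A sketch of the hive-site.xml settings that typically distinguish these modes (the host names, database name and port below are illustrative assumptions):

<!-- local or remote metastore: point the metastore at an external RDBMS instead of Derby -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://dbhost/metastore</value>
</property>
<!-- remote metastore only: clients reach the metastore service over Thrift -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastorehost:9083</value>
</property>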

Fig. Hive Architecture

Driver - the brain of Hive! Hive queries are sent to the driver for compilation, optimization and execution.

Q) Explain different data types in Hive.

Hive data types are used for specifying the column/field type in Hive tables. Hive data types are mainly classified into 5 major categories; let's discuss them one by one:

a. Primitive Data Types in Hive

Primitive data types are further divided into three groups:

  • Numeric data types
  • Date/Time data types
  • String data types

TINYINT: 1-byte signed integer, -128 to 127
SMALLINT: 2-byte signed integer, -32,768 to 32,767
INT: 4-byte signed integer, -2,147,483,648 to 2,147,483,647
BIGINT: 8-byte signed integer, -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807
FLOAT: single-precision floating point
DOUBLE: double-precision floating point
DECIMAL: precise decimal type based on Java's BigDecimal
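An illustrative table using several of these primitive types (the table and columns are made up for the example):

hive> create table emp_details(
    >   eid int,               -- 4-byte integer
    >   age tinyint,           -- 1-byte integer
    >   name string,
    >   sal double,            -- double-precision floating point
    >   bonus decimal(10,2));  -- precise decimal with scale 2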

RC File

Instead of only partitioning the table horizontally like a row-oriented DBMS, RCFile partitions the table first horizontally and then vertically to serialize the data.

Fig. Hive RCFile Format
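A sketch of creating an RCFile-backed table, assuming the emp schema used later in these notes:

hive> create table emp_rc(id int, name string, sal double)
    > stored as rcfile;
hive> insert overwrite table emp_rc select * from emp;  -- copy existing rows into the RCFile table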

Q) What is Hive Query Language (HQL)? Explain various DDL and DML statements in Hive.

Hive Query Language provides basic SQL-like operations. Basic tasks of HQL are:

  1. Create and manage tables and partitions
  2. Support various relational, arithmetic and logical operations
  3. Evaluate functions
  4. Download the contents of a table to a local directory, or the results of queries to an HDFS directory

HIVE DDL Statements: These statements are used to build and modify the tables and other objects in the database.

  1. Create/Drop/Alter database
  2. Create/Drop/Truncate table
  3. Alter table/partition/column
  4. Create/Drop/Alter view
  5. Create/Drop/Alter index
  6. Show
  7. Describe

HIVE DML Statements: These statements are used to retrieve, store, modify, delete and update data in the database. The DML commands are as follows:

  1. Loading files into tables.
  2. Inserting data into Hive tables from queries.
  3. Creating and managing databases and tables.

hive> create table emp(id int, name string, sal double)
    > row format delimited fields terminated by ',';
OK
Time taken: 8.331 seconds

hive> show tables;
OK
emp

hive> create table dept(eid int, dept string)
    > row format delimited fields terminated by '@';
OK

  1. Loading data into the tables
hive> load data inpath '/chp/data/faculty.txt' into table emp;
hive> load data inpath '/chp/data/dept.txt' into table dept;

  2. Retrieving data from the tables
hive> select * from emp;
OK
1 chp 10000.0
2 pnr 20000.0
3 kry 30000.0
Time taken: 0.379 seconds, Fetched: 3 row(s)

hive> select * from dept;
OK
2 cse
3 mca
4 cse
Time taken: 0.133 seconds, Fetched: 3 row(s)

Queries:

  1. Display the different departments in dept
hive> select distinct(dept) from dept;
cse
mca

  2. Find the employees who earn 10000
hive> select name from emp where sal=10000;
chp

  3. Find the employees who earn 20000 or more
hive> select name from emp where sal>=20000;
pnr
kry

  4. Find the employee ids whose name is either chp or kry
hive> select id from emp where name='chp' or name='kry';
1
3

  5. Find the employee names whose dept is either cse or mca
hive> select emp.name from emp join dept on (emp.id=dept.eid)
    > where dept.dept='cse' or dept.dept='mca';

pnr
kry

(or)

hive> select emp.name from emp left outer join dept on (emp.id=dept.eid)
    > where dept.dept='cse' or dept.dept='mca';
pnr
kry

  6. Find the first 2 records in dept
hive> select * from dept limit 2;
OK
2 cse
3 mca

  7. Find the no. of employees in each department
hive> select dept,count(*) from dept group by dept;
cse 2
mca 1

  8. Find the no. of employees in dept cse
hive> select dept,count(*) from dept group by dept having dept='cse';
cse 2

  9. Find the names and salaries of employees whose salary is at least 20000
hive> select name,sal from emp where sal>=20000 limit 3;
pnr 20000.0
kry 30000.0

  10. Find the name of the employee who is earning the minimum salary
hive> select name,sal from emp order by sal limit 1;
chp 10000.0

  11. Find the name of the employee who is earning the maximum salary
hive> select name,sal from emp order by sal desc limit 1;
kry 30000.0

  12. Display all employees having sal greater than 10000
hive> select id,name,sal from emp group by id,name,sal having sal > 10000;

Q) Briefly explain joins in Hive.

JOIN is a clause used to combine specific fields from two tables using values common to each. It combines records from two or more tables in the database and is more or less the same as SQL JOIN.

Inner join: The HiveQL INNER JOIN returns all the rows that are common to both tables.
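A sketch of an inner join on the emp and dept tables defined earlier; the output follows from the sample rows loaded above:

hive> select emp.id, emp.name, dept.dept
    > from emp join dept on (emp.id = dept.eid);
2 pnr cse
3 kry mca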

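The creation of emp_view used by the next two queries falls on a page not shown in this preview; a minimal view definition consistent with those queries (the exact definition is an assumption) would be:

hive> create view emp_view as select id, name, sal from emp;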

  1. Find the no. of employees using the above view
hive> select count(*) from emp_view;
3

  2. Drop the view
hive> drop view emp_view;

Q) Explain about various functions in Hive.

String functions:

  1. Display employee names in uppercase
hive> select upper(name) from emp;
CHP
PNR
KRY

  2. Display employee names from the 2nd character
hive> select substr(name,2) from emp;
hp
nr
ry

  3. Concatenate emp id and name
hive> select concat(id,name) from emp;
1chp
2pnr
3kry

Math Functions:

  1. Find the salaries of the employees by applying the ceil function
hive> select ceil(sal) from emp;
10000
20000
30000

  2. Find the square root of the emp salaries
hive> select sqrt(sal) from emp;

  3. Find the lengths of the emp names
hive> select name,length(name) from emp;
chp 3
pnr 3
kry 3

Aggregate functions:

  1. Find the no. of employees in the table emp
hive> select count(*) from emp;
3

  2. Find the total salary of all the employees
hive> select sum(sal) from emp;

  3. Find the average salary of the employees
hive> select avg(sal) from emp;

  4. Find the minimum salary of all the employees
hive> select min(sal) from emp;

  5. Find the maximum salary of all the employees
hive> select max(sal) from emp;

Q) Differentiate Sort By and Order By in Hive

The Hive sort by and order by clauses are used to fetch data in sorted order. The main differences between sort by and order by are given below.

Sort by: may use multiple reducers for the final output. It only guarantees the ordering of rows within a reducer, so it may give a partially ordered result.

Order by: uses a single reducer to guarantee total order in the output.

  1. Display the employees according to their names using sort by
hive> select * from emp sort by name;
1 chp 10000.0
3 kry 30000.0
2 pnr 20000.0

  2. Display the employees according to their names using order by
hive> select * from emp order by name;
1 chp 10000.0
3 kry 30000.0
2 pnr 20000.0

Q) Write a query in Hive to find the word count in a text file. (Or: give an example of a subquery in Hive.)

hello.txt (input text file):
hello welcome to guntur
hello welcome to vignan
welcome to cse

Creating a table for the input:
hive> create table t2(st string);
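The rest of the answer is cut off in this preview. A common way to complete it, assuming each line of hello.txt is loaded into t2, is to split every line into words with explode and then group (the file path is illustrative; output order may vary):

hive> load data local inpath '/chp/data/hello.txt' into table t2;
hive> select word, count(*) from
    >   (select explode(split(st, ' ')) as word from t2) w
    > group by word;
cse 1
guntur 1
hello 2
to 3
vignan 1
welcome 3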

Q) Explain indexing in Hive.

In a Hive table, there are large numbers of rows and columns. If we query only some columns without indexing, it will take a large amount of time because the query executes over all the columns present in the table.

Indexes are maintained in a separate table in Hive, so they do not affect the data inside the table being indexed.

It is advisable to build indexes on the columns on which you frequently perform operations.

Building too many indexes can also degrade the performance of your queries.

Types of indexes in Hive:
  • Compact indexing
  • Bitmap indexing

Differences between compact and bitmap indexing: compact indexing stores the pair of the indexed column's value and its block id, whereas bitmap indexing stores the combination of the indexed column's value and the list of rows as a bitmap.

Creating a compact index:
Syntax:
hive> create index index_name on table table_name(columns,...)
    > as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
    > with deferred rebuild;

  • In place of index_name we can give any name of our choice, which will be the name of the index.
  • In the ON TABLE clause, we give the table_name for which we are creating the index and, in brackets, the names of the columns on which the index is to be created. We should specify only columns that are actually available in the table.
  • The 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' line specifies that the built-in CompactIndexHandler will act on the created index, i.e. we are creating a compact index for the table.
  • The WITH DEFERRED REBUILD clause should be present in the create index statement because we need to rebuild the index in later stages using an alter statement.

Eg.
hive> create index emp_index on table emp(name,sal)
    > as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
    > with deferred rebuild;

Creating a bitmap index:
Syntax:
hive> create index index_name on table table_name(columns,...)
    > as 'bitmap' with deferred rebuild;

Eg.
hive> create index dept_index on table dept(eid) as 'bitmap' with deferred rebuild;

Q) Find the different indexes on tables emp and dept.

hive> show formatted index on emp;
idx_name    tab_name  col_names  idx_tab_name                idx_type
emp_index   emp       name, sal  default__emp_emp_index__    compact

hive> show formatted index on dept;
idx_name    tab_name  col_names  idx_tab_name                idx_type
dept_index  dept      eid        default__dept_dept_index__  bitmap

1) Update (rebuild) the index emp_index
hive> alter index emp_index on emp rebuild;

2) Drop the index emp_index
hive> drop index if exists emp_index on emp;

Q) Explain Partitioning in Hive.

In Hive, a query reads the entire dataset even when a where clause filter is specified on a particular column. This becomes a bottleneck in most MapReduce jobs as it involves a huge degree of I/O.

So it is necessary to reduce the I/O required by the MapReduce job to improve the performance of the query. A very common method to reduce I/O is data partitioning.

Partitions split the larger dataset into more meaningful chunks. Hive provides two kinds of partitions.

Static partition: static partitions comprise columns whose values are known at compile time. (Dynamic partitions, by contrast, have partition column values that are known only at execution time.)

Eg.

1) Create a partitioned table.
hive> create table std_partition(sid int)
    > partitioned by (branch string)
    > row format delimited fields terminated by ','
    > stored as textfile;

std1.txt:
1001
1002

2) Load data into std_partition from std1.txt with the partition column branch set to cse (see the sketch below).
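The load statement itself falls on a page not shown in this preview; the usual form for a static partition (the path follows the style of the earlier examples and is an assumption) is:

hive> load data local inpath '/chp/data/std1.txt'
    > into table std_partition partition(branch='cse');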


4) Drop the partition branch=cse from dept_partition.
hive> alter table dept_partition drop partition(branch='cse');
Dropping the partition branch=cse
OK
Time taken: 1.737 seconds
hive> select * from dept_partition;
OK
3 mca

Q) Explain how to create buckets in Hive.

Bucketing is similar to partitioning. However, there is a difference: in partitioning, we need to create a partition for each unique value of the column, which may lead to situations where you end up with thousands of partitions. This can be avoided with bucketing, in which we can limit the number of buckets created. A bucket is a file, whereas a partition is a directory.

Eg. Set the below property to enable bucketing:
hive> set hive.enforce.bucketing=true;

Creating a bucketed table from dept:
hive> create table dept_bucket(eid int, dept string)
    > clustered by (dept) into 2 buckets;

Loading data:
hive> from dept insert overwrite table dept_bucket select eid,dept;

To display the content of the 1st bucket:
hive> select distinct * from dept_bucket tablesample(bucket 1 out of 2 on dept);

Q) Explain about SERDE.

SerDe stands for Serializer/Deserializer.

  1. Contains the logic to convert unstructured data into records.
  2. Implemented using Java.
  3. Serializers are used at the time of writing.
  4. Deserializers are used at query time (select statements).

The Deserializer interface takes a binary representation or string of a record and converts it into a Java object that Hive can then manipulate. The Serializer takes a Java object that Hive has been working with and translates it into something that Hive can write to HDFS.
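As a concrete illustration, a table can be bound to a specific SerDe when it is created; a minimal sketch using Hive's built-in OpenCSVSerde (the table name and layout are assumptions):

hive> create table csv_emp(id string, name string, sal string)
    > row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde';
-- OpenCSVSerde parses each CSV line into the declared columns (all treated as strings)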

Input:
1001 chp Analyst
1002 praneeth Team Lead

hive> create table xmlsample(xmldata string);
hive> load data local inpath '/chp/input.xml' into table xmlsample;

hive> create table xpath_table as
    > select xpath_int(xmldata,'employee/empid'),
    >        xpath_string(xmldata,'employee/name'),
    >        xpath_string(xmldata,'employee/designation')
    > from xmlsample;

hive> select * from xpath_table;

Output:
1001 chp Analyst
1002 praneeth Team Lead

Q) Explain about User Defined Functions (UDF) in Hive with an example.

In Hive, we can use custom functions by defining user-defined functions (UDFs). Eg. a Hive function to convert the values of a field to lowercase:

package com.example.hive.udf;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;

@Description(name = "SampleUDFExample")
public final class MyLowerCase extends UDF {
  // Hive calls evaluate() once per row with the column value
  public String evaluate(final String word) {
    return word.toLowerCase();
  }
}

Note: We have to package this Java program into a jar.

hive> add jar /chp/hivedemo/UpperCase.jar;
hive> create temporary function tolowercase as 'com.example.hive.udf.MyLowerCase';
hive> select tolowercase(name) from emp;
Output:
chp
pnr
kry