Hive
What is Transactional Processing?
* Small data
* recent data
* updates to data
* real time
* narrow scope - single data source
What is Analytical Processing?
* Mostly batch
* Mostly old data
* Mostly long running
* wide scope - multiple data sources to give a big picture
-----------
RDBMS (mostly for transactional)
* Single m/c with backup, structured data, usually one source
It's hard to do analytics on an RDBMS
Big Data (mostly for analytical using data warehouse)
* distributed data, semi/unstructured, auto replication with fault tolerance, usually different source and different format
Not always...
* HBase is a special DB that can be configured for transactional processing even though it runs on big data
* QlikView can be used for analytical processing even though it runs on a single m/c
---------
Whats a Data Warehouse?
Technology that aggregates data from one or more sources for analytical processing
used for long running jobs; data is lagged, large, and multi-source; storage is read optimized
Eg: Vertica, Teradata, Oracle, IBM - all closed source
Apache Hive is an open source data warehouse
What is Hadoop?
A distributed computing framework for processing millions of records
Google uses the same approach for indexing and search results
Uses MapReduce on multiple machines
Fault tolerant
Basic building blocks of Hadoop
HIVE
---------------------------------------------------------------------------
HDFS               |  MR                               |  YARN
file system        |  parallel processing framework    |  coordinator, resource negotiator
Hive Metastore
Underlying data stored in HDFS
User sees data in the form of tables
The mapping between files in the FS and the tables the user sees is done by the metastore
Metastore keeps metadata - table schema, column names and types, serialization details, how files are converted
Files are mapped to tables, and lines in a file become rows in the table
The metastore is just an RDB with a JDBC driver
Dev environments use the built-in Derby DB metastore, typically one Java process with an embedded metastore, so only one session is available in dev. Prod allows multiple sessions, with an external metastore such as MySQL.
An RDBMS is ACID compliant; Hive is not ACID compliant by default (ACID keeps the order of each user's operations predictable). Some tables can be made ACID, but it's not very efficient.
Large data processed by Google!!
100 petabytes = 100 x 10^15 bytes
Hive data can be read by many technologies (Hadoop, Pig, Spark, etc.), unlike a traditional DB, which is the sole gatekeeper of its data. That's why it's easy in a traditional DB to impose table schema, constraints, and keys (primary/foreign).
Where does Hive's schema come from?
uses schema on read
The schema is applied at the time the data is read
If the underlying data doesn't conform to the schema, Hive will try its best to apply it, but imposing the schema can fail
A traditional DB (schema on write) will never have this issue!
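A minimal sketch of schema on read (file, directory, and table names are made up): point an external table at a text file where one row doesn't match the declared types; Hive returns NULL for the value it cannot coerce instead of rejecting the data.
$ cat /tmp/ages.txt
alice|31
bob|not_a_number
$ hdfs dfs -mkdir /data_ages && hdfs dfs -put /tmp/ages.txt /data_ages/
create external table ages(name string, age int)
row format delimited fields terminated by '|'
location '/data_ages';
select * from ages;
-- bob's age comes back as NULL: the schema is applied only at read time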
HiveQL v/s SQL
1. Schema on read v/s schema on write
2. HiveQL has minimal index support; an RDBMS allows indexes for read/write
3. Row-level updates are not allowed in Hive by default; we have to set config properties to enable them
4. Hive has a lot of built-in functions to access data (table-generating functions, covariance functions, etc.)
5. Only restricted subqueries are allowed in Hive, and joins must be equijoins
CLI ---> talks to the Hive Server, which consults the metastore for the Hive table information, parses queries, converts them to MR jobs, and submits them to Hadoop, where they are run and results are returned
Hive Server components
The Hive API is its gateway to the world -----> allows clients to connect and run queries ----> Driver + Compiler + Execution Engine is the heart of Hive (it takes a query, parses it, converts it to jobs, and sends them to Hadoop) --> Metastore contains the table info
When this entire Hive server runs in a single JVM process, it's called an embedded metastore.
Note that the Derby metastore DB is created in the directory where hive is started. If you chdir, you need to re-instantiate the DB in the new directory. This is fine for dev, but for prod you usually give a common location in the connection URL of the config file:
javax.jdo.option.ConnectionURL
jdbc:derby:;databaseName=/path/to/metastore_db;create=true
HiveServer2 is the latest server; Beeline is its CLI
Added features
- connect multiple users
- better authentication and authorization
The Hive CLI directly accesses the Hive metastore and driver; HiveServer2 passes through authorization first
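For reference, the authentication mode is chosen in hive-site.xml (the property name is real; the value shown is the default, not from these notes):
hive.server2.authentication
NONE
(supported values include NONE, LDAP, KERBEROS, PAM, CUSTOM)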
Hive Installation
* Download tar from apache site
* copy to ~/sw/
* add in .profile
export HIVE_HOME=/Users/bhajanpreetsingh/sw/hive-2.1.1
* add to $PATH $HIVE_HOME/bin
* add to $CLASSPATH $HIVE_HOME/lib/*
Update configuration
BhajanpreetsMBP:conf bhajanpreetsingh$ ls -ltr
total 520
-rw-r--r--@ 1 bhajanpreetsingh staff 2662 Nov 28 16:32 parquet-logging.properties
-rw-r--r--@ 1 bhajanpreetsingh staff 2060 Nov 28 16:32 ivysettings.xml
-rw-r--r--@ 1 bhajanpreetsingh staff 2925 Nov 28 16:32 hive-log4j2.properties.template
-rw-r--r--@ 1 bhajanpreetsingh staff 1596 Nov 28 16:32 beeline-log4j2.properties.template
-rw-r--r--@ 1 bhajanpreetsingh staff 2719 Nov 28 16:32 llap-cli-log4j2.properties.template
-rw-r--r--@ 1 bhajanpreetsingh staff 2274 Nov 28 16:32 hive-exec-log4j2.properties.template
-rw-r--r--@ 1 bhajanpreetsingh staff 4353 Nov 28 16:35 llap-daemon-log4j2.properties.template
-rw-r--r--@ 1 bhajanpreetsingh staff 2378 Nov 28 16:35 hive-env.sh.template
-rw-r--r--@ 1 bhajanpreetsingh staff 229198 Nov 29 14:46 hive-default.xml.template
hive-default.xml.template has the default configuration for Hive
$ cp hive-default.xml.template hive-site.xml
and then edit hive-site.xml to update the ConnectionURL:
javax.jdo.option.ConnectionURL
jdbc:derby:;databaseName=metastore_db;create=true
This property tells Hive which database to use for the metastore, here run as an embedded metastore.
JDBC connect string for a JDBC metastore.
To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
we also have
hive.metastore.warehouse.dir
/user/hive/warehouse
location of default database for the warehouse
The warehouse dir is where all our Hive data will live
Replace all occurrences:
:%s/${system:java.io.tmpdir}/\/tmp\/hive_io/g
This is where the Hive temporary data lives.
Also replace:
:%s/${system:user.name}/bhajanpreetsingh/g
i.e., the username, as above.
Check all HDFS processes are running:
BhajanpreetsMBP:conf bhajanpreetsingh$ jps
4276 JobHistoryServer
7765 SecondaryNameNode
8023 ResourceManager
10330 Jps
7579 NameNode
7661 DataNode
8110 NodeManager
Both DFS and YARN processes are running, as seen above.
Add tmp and warehouse dirs in HDFS:
hdfs dfs -mkdir /tmp
hdfs dfs -mkdir -p /user/hive/warehouse
and give them group write permissions
hdfs dfs -chmod g+w /tmp
hdfs dfs -chmod g+w /user/hive/warehouse
warehouse is where the Hive (non-metastore) data will live.
Now run the hive command.
Before running, you need to initialize the metastore database; otherwise you get the following exception:
Exception in thread "main" java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
By default, a metastore_db dir is created in each dir hive is run from (we can use a common location for the metastore by updating the config file).
Let's move the old metastore_db in the conf dir aside and re-initialize the DB by running the following commands:
BhajanpreetsMBP:conf bhajanpreetsingh$ pwd
/Users/bhajanpreetsingh/sw/hive-2.1.1/conf
BhajanpreetsMBP:conf bhajanpreetsingh$ mv metastore_db metastore_db.tmp
BhajanpreetsMBP:conf bhajanpreetsingh$ schematool -initSchema -dbType derby
To use a common location for metastore_db, we can update the ConnectionURL and put the full path in front of databaseName:
javax.jdo.option.ConnectionURL
jdbc:derby:;databaseName=/path/to/metastore_db;create=true
The Hive 2 server can be accessed using Beeline
HiveServer2 supports concurrent clients and better authentication and authorization
Beeline is the CLI for HiveServer2
Hive 1 accessed the metastore and driver directly; Hive 2 passes through the API
run beeline by calling
$ bin/beeline -u jdbc:hive2://
(I updated the javax.jdo.option.ConnectionURL to use common metastore_db location)
0: jdbc:hive2://> create database myexample1;
17/03/12 12:30:27 [HiveServer2-Background-Pool: Thread-41]: WARN metastore.ObjectStore: Failed to get database myexample1, returning NoSuchObjectException
OK
No rows affected (0.225 seconds)
0: jdbc:hive2://> use myexample1
0: jdbc:hive2://> create table customers(id bigint, name string, address string);
OK
No rows affected (0.309 seconds)
0: jdbc:hive2://> show tables;
OK
+------------+--+
| tab_name |
+------------+--+
| customers |
+------------+--+
1 row selected (0.085 seconds)
0: jdbc:hive2://> describe customers;
OK
+-----------+------------+----------+--+
| col_name | data_type | comment |
+-----------+------------+----------+--+
| id | bigint | |
| name | string | |
| address | string | |
+-----------+------------+----------+--+
3 rows selected (0.233 seconds)
0: jdbc:hive2://> insert into customers values(1111,"Jim","NJ");
This will run a MapReduce job:
17/03/12 12:55:23 [0bb81cdd-3ea1-49bc-9df2-f3b27d76e7c0 main]: ERROR hdfs.KeyProviderCache: Could not find uri with key [dfs.encryption.key.provider.uri] to create a keyProvider !!
17/03/12 12:55:24 [HiveServer2-Background-Pool: Thread-49]: WARN ql.Driver: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = bhajanpreetsingh_20170312125522_ffc0c677-17ea-4632-811b-02e61523b296
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
17/03/12 12:55:25 [HiveServer2-Background-Pool: Thread-49]: WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
Starting Job = job_1489337616191_0001, Tracking URL = http://10.246.215.84:8088/proxy/application_1489337616191_0001/
Kill Command = /Users/bhajanpreetsingh/sw/hadoop-2.7.3/bin/hadoop job -kill job_1489337616191_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
17/03/12 12:55:33 [HiveServer2-Background-Pool: Thread-49]: WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
2017-03-12 12:55:33,954 Stage-1 map = 0%, reduce = 0%
2017-03-12 12:55:39,309 Stage-1 map = 100%, reduce = 0%
Ended Job = job_1489337616191_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://localhost/user/hive/warehouse/myexample1.db/customers/.hive-staging_hive_2017-03-12_12-55-22_882_6007169443295942027-1/-ext-10000
Loading data to table myexample1.customers
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 HDFS Read: 4352 HDFS Write: 88 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
No rows affected (18.856 seconds)
0: jdbc:hive2://> select * from customers;
OK
+---------------+-----------------+--------------------+--+
| customers.id | customers.name | customers.address |
+---------------+-----------------+--------------------+--+
| 1111 | Jim | NJ |
+---------------+-----------------+--------------------+--+
1 row selected (0.122 seconds)
0: jdbc:hive2://> insert into customers values(2222,"Tim","NJ"),(3333,"Moke","NJ"),(4444,"Him","NJ"),(5555,"Bhajan","NJ");
0: jdbc:hive2://> select * from customers;
OK
+---------------+-----------------+--------------------+--+
| customers.id | customers.name | customers.address |
+---------------+-----------------+--------------------+--+
| 1111 | Jim | NJ |
| 2222 | Tim | NJ |
| 3333 | Moke | NJ |
| 4444 | Him | NJ |
| 5555 | Bhajan | NJ |
+---------------+-----------------+--------------------+--+
5 rows selected (0.089 seconds)
0: jdbc:hive2://>
Hive UDFs
These are the Hive built-in functions. There is also a library in Hive you can use to define your own custom functions: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
How to execute hive queries from terminal?
$ beeline -u jdbc:hive2:// -e "use myexample1; select * from customers;"
OR
execute queries from a file:
BhajanpreetsMBP:hadoop-2.7.3 bhajanpreetsingh$ echo "use myexample1; select * from customers;" >> /tmp/tmp.hql
BhajanpreetsMBP:hadoop-2.7.3 bhajanpreetsingh$ beeline -u jdbc:hive2:// -f /tmp/tmp.hql
How to do simple Joins in Hive?
0: jdbc:hive2://> use myexample1;
0: jdbc:hive2://> create table if not exists orders(id bigint, product_id string, customer_id bigint, quantity int, amount double);
BhajanpreetsMBP:hadoop-2.7.3 bhajanpreetsingh$ cat /tmp/tmp.hql
use myexample1;
insert into orders values(0001, "laptop", 1111,2,2500);
insert into orders values(0002, "phone", 1111,1,500);
insert into orders values(0003, "mouse", 2222,100,1800);
insert into orders values(0004, "light", 3333,1,2);
insert into orders values(0005, "ball", 4444,25,5);
select * from orders;
BhajanpreetsMBP:hadoop-2.7.3 bhajanpreetsingh$
BhajanpreetsMBP:hadoop-2.7.3 bhajanpreetsingh$
BhajanpreetsMBP:hadoop-2.7.3 bhajanpreetsingh$ beeline -u jdbc:hive2:// -f /tmp/tmp.hql
$ beeline -u jdbc:hive2:// -e "use myexample1;select * from customers join orders where customers.id=orders.customer_id"
+---------------+-----------------+--------------------+------------+--------------------+---------------------+------------------+----------------+--+
| customers.id | customers.name | customers.address | orders.id | orders.product_id | orders.customer_id | orders.quantity | orders.amount |
+---------------+-----------------+--------------------+------------+--------------------+---------------------+------------------+----------------+--+
| 1111 | Jim | NJ | 1 | laptop | 1111 | 2 | 2500.0 |
| 1111 | Jim | NJ | 2 | phone | 1111 | 1 | 500.0 |
| 2222 | Tim | NJ | 3 | mouse | 2222 | 100 | 1800.0 |
| 3333 | Moke | NJ | 4 | light | 3333 | 1 | 2.0 |
| 4444 | Him | NJ | 5 | ball | 4444 | 25 | 5.0 |
+---------------+-----------------+--------------------+------------+--------------------+---------------------+------------------+----------------+--+
note that Hive only supports equi-joins (equality-based join conditions)
Primitive datatypes in HIVE
- Boolean
- Numeric
- Integers
- Tinyint: 1 byte, -128 to 127 range (one bit used for sign and 7 for magnitude)
- Smallint: 2 bytes, -2^15 to 2^15 - 1
- Int: 4 bytes
- Bigint: 8 bytes
- Decimals
- Float: 4 bytes
- Double: 8 bytes
- Decimal: arbitrary precision, e.g., decimal(10,2) gives 10 digits of total precision with 2 digits after the decimal point
- String
- String: unbounded variable length
- Char: fixed length char string
- Varchar: bounded variable length (we give max size)
- Timestamp
- Integer: Unix timestamp in seconds
- Float: Unix timestamp in seconds with decimal precision
- String: any JDBC-compliant string will work, e.g., YYYY-MM-DD HH:MM:SS.fffffffff
- Date: values of the form "YYYY-MM-DD"
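A small sketch exercising several of these types (table and column names are made up, not from the notes above):
create table events(id bigint, price decimal(10,2), label varchar(50), ts timestamp, d date);
insert into events values(1, 12.34, 'first', '2017-03-12 12:00:00', '2017-03-12');
select d, ts, price from events;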
Exploring Hive warehouse data in HDFS
BhajanpreetsMBP:hadoop-2.7.3 bhajanpreetsingh$ hdfs dfs -cat /user/hive/warehouse/myexample*/orde*/000*_0
17/03/12 19:31:51 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
1laptop111122500.0
BhajanpreetsMBP:hadoop-2.7.3 bhajanpreetsingh$ hdfs dfs -cat /user/hive/warehouse/myexample*/cust*/000*_1
17/03/12 19:32:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2222TimNJ
3333MokeNJ
4444HimNJ
5555BhajanNJ
each insert created a file (fields are separated by Hive's default ^A delimiter, which doesn't render above)
each DB has its own dir (like myexample1.db above)
the default DB has no dir; its tables are placed directly in /user/hive/warehouse/
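To verify this layout, list the table directory (the path follows the db.table convention above):
$ hdfs dfs -ls /user/hive/warehouse/myexample1.db/customers/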
Managed and External Tables
For managed tables, Hive owns the files and directories holding the data
Deleting a managed table deletes both metadata and data
For external tables, other tools (Hadoop, Pig, HBase) may also edit the data files
Deleting an external table deletes only the metadata
Metadata is saved in the metastore for both managed and external tables
Creating external tables
copy csv data
$ cat /tmp/products.csv
ip1,iphone,800
tv2,visio,790
mb3,mac,2690
in beeline
create external table if not exists products(id string,name string,cost float)
comment "products table"
row format delimited fields terminated by ','
stored as textfile
location "/data_example";
select * from products;
+--------------+----------------+----------------+--+
| products.id | products.name | products.cost |
+--------------+----------------+----------------+--+
| ip1 | iphone | 800.0 |
| tv2 | visio | 790.0 |
| mb3 | mac | 2690.0 |
+--------------+----------------+----------------+--+
more on create table command in
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
How to check if a table is managed or external in Hive?
run in beeline: describe formatted <table_name>;
and check Table Type row
it will also show location of data
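For example (output heavily trimmed; exact layout varies by version):
0: jdbc:hive2://> describe formatted products;
| Table Type:  | EXTERNAL_TABLE                 |
| Location:    | hdfs://localhost/data_example  |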
How to create a table like another table?
create table table2 like table1;
How to rename a table?
alter table table1 rename to table10;
How to add columns?
alter table table1 add columns (col1 int comment "my comment");
How to swap/rename columns?
alter table table1 change column col1 col_1 int after col2;
(this will rename col1 to col_1 and move it after col2)
When we swap columns, only the metadata is updated. data remains in same place.
Temporary tables
they are session-level tables that get deleted when the session ends
names don't conflict with other users' temporary tables or with permanent tables (however, if the names are the same, you cannot access the permanent table until the temporary table goes away)
they do not support partitions or indexes
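A quick sketch (table name made up); the table disappears when the beeline session ends:
create temporary table tmp_nj_customers as select * from customers where address = 'NJ';
select count(*) from tmp_nj_customers;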
Loading data
0: jdbc:hive2://> create external table if not exists products(id string,name string,cost float)
. . . . . . . . > comment "products table"
. . . . . . . . > row format delimited fields terminated by ','
. . . . . . . . > stored as textfile
. . . . . . . . > location "/data_example";
OK
No rows affected (0.076 seconds)
0: jdbc:hive2://> select * from products
. . . . . . . . > ;
OK
+--------------+----------------+----------------+--+
| products.id | products.name | products.cost |
+--------------+----------------+----------------+--+
| ip1 | iphone | 800.0 |
| tv2 | visio | 790.0 |
| mb3 | mac | 2690.0 |
+--------------+----------------+----------------+--+
3 rows selected (0.101 seconds)
0: jdbc:hive2://> select * from managed_products;
OK
+----------------------+------------------------+------------------------+--+
| managed_products.id | managed_products.name | managed_products.cost |
+----------------------+------------------------+------------------------+--+
+----------------------+------------------------+------------------------+--+
No rows selected (0.08 seconds)
0: jdbc:hive2://> load data local inpath '/tmp/products.csv' into table managed_products;
'local' indicates the local FS
to load from HDFS, don't specify 'local'
data loaded this way is appended
a file loaded from HDFS is MOVED, not just copied! A file loaded from the local FS is copied!!
to overwrite instead of append: load data local inpath '/tmp/products.csv' overwrite into table managed_products;
Copy data from one table to another:
insert into table1
select col1, col2, col3 from table2;
to overwrite: insert overwrite table table1 select * from table2;
multi insert
0: jdbc:hive2://> select * from products
. . . . . . . . > ;
17/03/12 23:03:26 [02a195f5-8a48-4d43-b109-b8dd6b43d6a6 main]: ERROR hdfs.KeyProviderCache: Could not find uri with key [dfs.encryption.key.provider.uri] to create a keyProvider !!
OK
+--------------+----------------+----------------+--+
| products.id | products.name | products.cost |
+--------------+----------------+----------------+--+
| ip1 | iphone | 800.0 |
| tv2 | visio | 790.0 |
| mb3 | mac | 2690.0 |
+--------------+----------------+----------------+--+
3 rows selected (1.374 seconds)
0: jdbc:hive2://> from products
. . . . . . . . > insert overwrite table product_name
. . . . . . . . > select id,name
. . . . . . . . > insert overwrite table product_cost
. . . . . . . . > select id,cost;
0: jdbc:hive2://> select * from product_name;
OK
+------------------+--------------------+--+
| product_name.id | product_name.name |
+------------------+--------------------+--+
| ip1 | iphone |
| tv2 | visio |
| mb3 | mac |
+------------------+--------------------+--+
3 rows selected (0.089 seconds)
0: jdbc:hive2://> select * from product_cost;
OK
+------------------+--------------------+--+
| product_cost.id | product_cost.cost |
+------------------+--------------------+--+
| ip1 | 800.0 |
| tv2 | 790.0 |
| mb3 | 2690.0 |
+------------------+--------------------+--+
3 rows selected (0.085 seconds)
0: jdbc:hive2://>
What cannot be done easily from Hive
Row-level delete and update are not available by default
However, we can set special properties in hive-site.xml to make some tables ACID compliant (performance will suffer due to this); a sketch follows below
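A hedged sketch of enabling ACID (property names are real; in Hive 2, transactional tables must be bucketed ORC; the table name is made up):
set hive.support.concurrency=true;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
create table acid_customers(id bigint, name string)
clustered by (id) into 2 buckets
stored as orc tblproperties('transactional'='true');
update acid_customers set name='Jimmy' where id=1111;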
Adding Arrays
create table laptops(os array<string>, brand array<string>) row format delimited fields terminated by ',' collection items terminated by '#';
load data local inpath '/tmp/laptops.csv' into table laptops;
select * from laptops;
OK
+--------------+-----------------------+--+
| laptops.os | laptops.brand |
+--------------+-----------------------+--+
| ["Mac"] | ["Apple"] |
| ["Win"] | ["HP","Acer","Dell"] |
| ["Unbuntu"] | [" HP","Dell"] |
+--------------+-----------------------+--+
cat /tmp/laptops.csv
Mac,Apple
Win,HP#Acer#Dell
Unbuntu, HP#Dell
Adding Maps
0: jdbc:hive2://> create table desktops(model string,features map<string,boolean>) row format delimited fields terminated by ',' collection items terminated by '#' map keys terminated by ':';
OK
No rows affected (0.074 seconds)
0: jdbc:hive2://> load data local inpath '/tmp/desktops.csv' into table desktops;
Loading data to table myexample1.desktops
OK
No rows affected (0.153 seconds)
0: jdbc:hive2://> select * from desktops;
OK
+-----------------+----------------------------------+--+
| desktops.model | desktops.features |
+-----------------+----------------------------------+--+
| model1 | {"webcam":false,"mouse":true} |
| model2 | {"webcam":true,"earphone":true} |
+-----------------+----------------------------------+--+
2 rows selected (0.098 seconds)
$ cat /tmp/desktops.csv
model1,webcam:false#mouse:true
model2,webcam:true#earphone:true
Adding Structures
create table desktops(model string, features map<string,boolean>, information struct<...>) row format delimited fields terminated by ',' collection items terminated by '#' map keys terminated by ':';
(the struct's field list was lost from these notes; see the sketch below)
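Since the struct's field list was stripped above, here is a made-up but complete variant; note that struct fields are separated by the same collection items delimiter ('#'):
create table desktops2(model string, features map<string,boolean>, information struct<cpu:string,ram:int>)
row format delimited fields terminated by ','
collection items terminated by '#'
map keys terminated by ':';
-- a sample CSV row would look like: model3,webcam:true,Intel#16
select model, information.cpu, information.ram from desktops2;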
CBO (cost-based optimizer)
CBO exceptions/warnings can be turned off in hive-site.xml (hive.cbo.enable=false disables the CBO)
Built-in functions in Hive
UDF (user defined functions) like trim(), concat(), length(), round(), floor() - single row in, single row out
UDAF (user defined aggregate functions) like count(*), sum(), avg() - many rows in, single row out
UDTF (user defined table-generating functions) like explode(), posexplode() - single row in, many rows out
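Quick examples of the first two categories against the tables created earlier (UDTFs are demonstrated next):
select concat(name, ' - ', address), length(name) from customers;  -- UDF: one row in, one row out
select count(*), avg(amount) from orders;                          -- UDAF: many rows in, one row out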
explode() and posexplode()
0: jdbc:hive2://> select * from desktops;
OK
+-----------------+----------------------------------+--+
| desktops.model | desktops.features |
+-----------------+----------------------------------+--+
| model1 | {"webcam":false,"mouse":true} |
| model2 | {"webcam":true,"earphone":true} |
+-----------------+----------------------------------+--+
2 rows selected (0.098 seconds)
0: jdbc:hive2://> select * from laptops;
OK
+--------------+-----------------------+--+
| laptops.os | laptops.brand |
+--------------+-----------------------+--+
| ["Mac"] | ["Apple"] |
| ["Win"] | ["HP","Acer","Dell"] |
| ["Unbuntu"] | [" HP","Dell"] |
+--------------+-----------------------+--+
3 rows selected (0.069 seconds)
NOTE: WE CANNOT DO select os, explode(brand) from laptops (a UDTF can't be mixed with other select columns)
NOTE that we also cannot do select explode(brand) as brnd from laptops group by brnd
0: jdbc:hive2://> select explode(brand) from laptops;
OK
+--------+--+
| col |
+--------+--+
| Apple |
| HP |
| Acer |
| Dell |
| HP |
| Dell |
+--------+--+
0: jdbc:hive2://> select posexplode(brand) from laptops;
+------+--------+--+
| pos | val |
+------+--------+--+
| 0 | Apple |
| 0 | HP |
| 1 | Acer |
| 2 | Dell |
| 0 | HP |
| 1 | Dell |
+------+--------+--+
0: jdbc:hive2://> select posexplode(brand) as (index,brand) from laptops;
+--------+--------+--+
| index | brand |
+--------+--------+--+
| 0 | Apple |
| 0 | HP |
| 1 | Acer |
| 2 | Dell |
| 0 | HP |
| 1 | Dell |
+--------+--------+--+
6 rows selected (0.078 seconds)
0: jdbc:hive2://> select * from desktops;
OK
+-----------------+----------------------------------+--+
| desktops.model | desktops.features |
+-----------------+----------------------------------+--+
| model1 | {"webcam":false,"mouse":true} |
| model2 | {"webcam":true,"earphone":true} |
+-----------------+----------------------------------+--+
2 rows selected (0.073 seconds)
0: jdbc:hive2://> select explode(features) as (feature,present) from desktops
+-----------+----------+--+
| feature | present |
+-----------+----------+--+
| webcam | false |
| mouse | true |
| webcam | true |
| earphone | true |
+-----------+----------+--+
4 rows selected (0.093 seconds)
0: jdbc:hive2://>
Lateral View
It's a virtual table formed by the exploded view
We can join it back with the original table's columns to get useful queries
0: jdbc:hive2://> drop table laptops;
0: jdbc:hive2://> create table laptops ( os string, brand array<string> ) row format delimited fields terminated by ',' collection items terminated by '#';
0: jdbc:hive2://> load data local inpath '/tmp/laptops.csv' into table laptops;
0: jdbc:hive2://> describe laptops;
OK
+-----------+----------------+----------+--+
| col_name | data_type | comment |
+-----------+----------------+----------+--+
| os | string | |
| brand | array<string> | |
+-----------+----------------+----------+--+
0: jdbc:hive2://> select os,ibrand from laptops lateral view explode(brand) brandsTable as ibrand;
OK
+----------+---------+--+
| os | ibrand |
+----------+---------+--+
| Mac | Apple |
| Win | HP |
| Win | Acer |
| Win | Dell |
| Unbuntu | HP |
| Unbuntu | Dell |
+----------+---------+--+
*****brandsTable is the new virtual table that holds the exploded brand column*****
0: jdbc:hive2://> describe desktops;
OK
+-----------+----------------------+----------+--+
| col_name | data_type | comment |
+-----------+----------------------+----------+--+
| model | string | |
| features | map<string,boolean> | |
+-----------+----------------------+----------+--+
2 rows selected (0.076 seconds)
0: jdbc:hive2://> select model,theFeature,isPresent from desktops lateral view explode(features) ExplodedFeatureTable as theFeature,isPresent;
OK
+---------+-------------+------------+--+
| model | thefeature | ispresent |
+---------+-------------+------------+--+
| model1 | webcam | false |
| model1 | mouse | true |
| model2 | webcam | true |
| model2 | earphone | true |
+---------+-------------+------------+--+
let's count how many times each brand name appears in the laptops table
0: jdbc:hive2://> select ibrand,count(*) from laptops lateral view explode(brand) llap as ibrand group by ibrand;
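Expected result from the rows shown above (MR output omitted; row order not guaranteed). Note the stray leading space in " HP" from the third CSV line, which makes it a separate group from "HP":
+---------+------+--+
| ibrand  | _c1  |
+---------+------+--+
|  HP     | 1    |
| Acer    | 1    |
| Apple   | 1    |
| Dell    | 2    |
| HP      | 1    |
+---------+------+--+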
0: jdbc:hive2://> create table matrix( cells array<array<int>> );
0: jdbc:hive2://> insert into matrix
. . . . . . . . > select array( array(1,2,3), array(4,5,6), array(7,8,9) );
0: jdbc:hive2://> select * from matrix;
OK
+----------------------------+--+
| matrix.cells |
+----------------------------+--+
| [[1,2,3],[4,5,6],[7,8,9]] |
+----------------------------+--+
1 row selected (0.074 seconds)
0: jdbc:hive2://>
0: jdbc:hive2://> select * from matrix
. . . . . . . . > lateral view explode(cells) array1 as r
. . . . . . . . > lateral view explode(r) array2 as c;
OK
+----------------------------+-----------+-----------+--+
| matrix.cells | array1.r | array2.c |
+----------------------------+-----------+-----------+--+
| [[1,2,3],[4,5,6],[7,8,9]] | [1,2,3] | 1 |
| [[1,2,3],[4,5,6],[7,8,9]] | [1,2,3] | 2 |
| [[1,2,3],[4,5,6],[7,8,9]] | [1,2,3] | 3 |
| [[1,2,3],[4,5,6],[7,8,9]] | [4,5,6] | 4 |
| [[1,2,3],[4,5,6],[7,8,9]] | [4,5,6] | 5 |
| [[1,2,3],[4,5,6],[7,8,9]] | [4,5,6] | 6 |
| [[1,2,3],[4,5,6],[7,8,9]] | [7,8,9] | 7 |
| [[1,2,3],[4,5,6],[7,8,9]] | [7,8,9] | 8 |
| [[1,2,3],[4,5,6],[7,8,9]] | [7,8,9] | 9 |
+----------------------------+-----------+-----------+--+
9 rows selected (0.063 seconds)
0: jdbc:hive2://>
Note on external tables
create external table mytable(id int) row format delimited fields terminated by '|';
load data inpath '/tmp/abc.txt' into table mytable;
this will move the file out of /tmp/abc.txt (into the table's location)
now if I do
drop table mytable;
the table definition will be dropped, but the contents will not be dropped (it's an external table)
if I want to drop the contents too, I need to find where that data is stored. For this I run
show create table mytable;
and check the LOCATION in the output