Hive

What is Transactional Processing?
* Small data
* Recent data
* Updates existing data
* Real time
* Narrow scope - single data source

What is Analytical Processing?
* Mostly batch
* Mostly old data
* Mostly long running
* Wide scope - multiple data sources combined to give a big picture

-----------

RDBMS (mostly for transactional)
* Single machine with backup, structured data, usually one source
It's hard to do analytics on an RDBMS

Big Data (mostly for analytical using data warehouse)
* Distributed data, semi/unstructured, auto replication with fault tolerance, usually different sources and different formats

Not always...
* HBase is a special DB that can be configured for transactional processing even though it runs on big data
* QlikView can be used for analytical processing even though it runs on a single machine

---------

What's a Data Warehouse?
Technology that aggregates data from one or more sources for analytical processing
Used for long-running jobs; data is lagged, large, and multi-source; storage is read-optimized
E.g.: Vertica, Teradata, Oracle, IBM - all closed source
Apache Hive is an open source data warehouse

What is Hadoop?
A distributed computing framework for processing millions of records
Google uses the same model (MapReduce) for indexing and search results
Uses Map Reduce on multiple machines
Fault tolerant

Basic building blocks of Hadoop


                                       HIVE
---------------------------------------------------------------------------
HDFS <<<<<<>>>>>> MR <<<<<<>>>>>> YARN
file system ------ parallel processing f/w ------ coordinator, resource negotiator


Hive Metastore
Underlying data stored in HDFS
User sees data in the form of tables
The mapping between files in the FS and the tables the user sees is done by the metastore
Metastore keeps metadata - table schema, column names and types, serialization details, how files are converted
Files are mapped to tables and lines in a file become rows in the table
Metastore is just an RDB accessed via a JDBC driver
Dev environments use the built-in derby db metastore: typically one java process contains the metastore, so only one session is available in dev. Prod allows multiple sessions by using an external metastore such as MySQL
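For a production metastore backed by MySQL, the relevant hive-site.xml properties look roughly like this (hostname and credentials below are placeholders, not from this setup):

javax.jdo.option.ConnectionURL
jdbc:mysql://dbhost/metastore?createDatabaseIfNotExist=true
javax.jdo.option.ConnectionDriverName
com.mysql.jdbc.Driver
javax.jdo.option.ConnectionUserName
hiveuser
javax.jdo.option.ConnectionPassword
hivepassword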

RDBMS is ACID compliant; Hive is NOT ACID compliant by default (ACID keeps the order of operations of individual users predictable). Some tables can be made ACID, but that is not an efficient use of Hive.

Large data processed by Google!!
100 Petabytes = 100 x 10^15 bytes

Hive data can be read by many technologies (Hadoop, Pig, Spark etc), unlike a traditional DB, which is the sole gatekeeper of its data. That's why it's easy in a traditional DB to impose table schemas, constraints, and keys (primary/foreign).

Where does the Hive schema come from?
Hive uses schema on read
The schema is applied at the time the data is read
If the underlying data doesn't conform to the schema, Hive will try its best to apply it, but imposing the schema can fail
A traditional DB (schema on write) will never have this issue!
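A minimal sketch of what that failure mode looks like (table name and path below are made up): if a text file behind an int column contains a non-numeric value, Hive returns NULL for that cell at read time instead of rejecting the data at write time.

-- suppose /data_bad/rows.csv contains the line: abc,10
create external table schema_demo(id int, qty int)
row format delimited fields terminated by ','
location "/data_bad";

select * from schema_demo;   -- "abc" cannot be parsed as int, so id comes back as NULL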

HiveQL v/s SQL
1. Schema on read v/s schema on write
2. HiveQL has minimal index support. RDBMS allows indexes for reads/writes
3. Row-level update not allowed in Hive by default. We have to set config properties to enable this.
4. Hive has a lot of built-in functions to access data (like table generating functions, covariance functions etc)
5. Only restricted subqueries are allowed in Hive, and joins must be equi-joins (see the example below)
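For example (illustrative only, using the customers/orders tables created later in these notes), a join on an equality condition is fine, while a non-equality (theta) join condition is rejected by classic Hive:

-- allowed: equi-join
select * from customers join orders on (customers.id = orders.customer_id);

-- NOT allowed in classic Hive: non-equality join condition
-- select * from customers join orders on (customers.id < orders.customer_id);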


CLI ---> talks to the Hive Server, which consults the metastore for hive table information, parses queries, converts them to MR jobs and submits them to Hadoop, where they run and the results are returned


Hive Server components
Hive API is its gateway to the world -----> allows clients to connect and run queries ----> Driver + Compiler + Execution Engine is the heart of hive (it gets the query, parses it, converts it to jobs, sends them to Hadoop) --> Metastore contains the table info

When this entire hive server runs in a single JVM process, it's called an embedded metastore



Note that the derby db metastore is generated in the directory where hive is started. If you chdir you need to re-instantiate the db in the new dir. This is fine for dev, but for prod you usually give a common location in the connection url of the config file:
javax.jdo.option.ConnectionURL
jdbc:derby:;databaseName=/path/to/metastore_db;create=true

HiveServer2 is the latest server; Beeline is its CLI
Added features
- multiple concurrent users
- better authentication and authorization

Hive CLI directly accesses the hive metastore and driver. HiveServer2 passes requests through authorization first



Hive Installation

* Download tar from apache site
* copy to ~/sw/
* add in .profile
export HIVE_HOME=/Users/bhajanpreetsingh/sw/hive-2.1.1
* add to $PATH $HIVE_HOME/bin
* add to $CLASSPATH $HIVE_HOME/lib/*:

Update configuration

BhajanpreetsMBP:conf bhajanpreetsingh$ ls -ltr
total 520
-rw-r--r--@ 1 bhajanpreetsingh  staff    2662 Nov 28 16:32 parquet-logging.properties
-rw-r--r--@ 1 bhajanpreetsingh  staff    2060 Nov 28 16:32 ivysettings.xml
-rw-r--r--@ 1 bhajanpreetsingh  staff    2925 Nov 28 16:32 hive-log4j2.properties.template
-rw-r--r--@ 1 bhajanpreetsingh  staff    1596 Nov 28 16:32 beeline-log4j2.properties.template
-rw-r--r--@ 1 bhajanpreetsingh  staff    2719 Nov 28 16:32 llap-cli-log4j2.properties.template
-rw-r--r--@ 1 bhajanpreetsingh  staff    2274 Nov 28 16:32 hive-exec-log4j2.properties.template
-rw-r--r--@ 1 bhajanpreetsingh  staff    4353 Nov 28 16:35 llap-daemon-log4j2.properties.template
-rw-r--r--@ 1 bhajanpreetsingh  staff    2378 Nov 28 16:35 hive-env.sh.template
-rw-r--r--@ 1 bhajanpreetsingh  staff  229198 Nov 29 14:46 hive-default.xml.template

hive-default.xml.template has default configuration for hive



do cp hive-default.xml.template hive-site.xml
and then edit this file to update the ConnectionURL:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
  <description>
    JDBC connect string for a JDBC metastore.
    To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
    For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
  </description>
</property>
 
This property tells us what database we want to use for the metastore and runs it as an embedded metastore.
We also have:

<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse</value>
  <description>location of default database for the warehouse</description>
</property>

The warehouse dir tells us where all our hive data will live

Replace all occurrences (vim substitutions):
:%s/${system:java.io.tmpdir}/\/tmp\/hive_io/g
This is where the hive temporary data lives
Also replace:
:%s/${system:user.name}/bhajanpreetsingh/g
(the username, as above)

Check all HDFS processes are running
BhajanpreetsMBP:conf bhajanpreetsingh$ jps
4276 JobHistoryServer
7765 SecondaryNameNode
8023 ResourceManager
10330 Jps
7579 NameNode
7661 DataNode
8110 NodeManager
Both DFS and YARN processes are running as seen above


Add tmp and warehouse dirs in HDFS
hdfs dfs -mkdir /tmp
hdfs dfs -mkdir -p /user/hive/warehouse
and give them group write permissions
hdfs dfs -chmod g+w /tmp
hdfs dfs -chmod g+w /user/hive/warehouse
warehouse is where the hive (non-metastore) data will live

Now run the hive command
Before running you need to create an instance of the metastore database, otherwise you get the following exception:
Exception in thread "main" java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

By default a metastore_db dir is created in each dir hive is run from (we can use a common location for the metastore too by updating the config file)
Let's move the old metastore_db in the conf dir aside and re-initialize the db by running the following commands
BhajanpreetsMBP:conf bhajanpreetsingh$ pwd
/Users/bhajanpreetsingh/sw/hive-2.1.1/conf
BhajanpreetsMBP:conf bhajanpreetsingh$ mv metastore_db metastore_db.tmp

BhajanpreetsMBP:conf bhajanpreetsingh$ schematool -initSchema -dbType derby

To use a common location for metastore_db we can update the ConnectionURL and put the whole path in front of databaseName:
    javax.jdo.option.ConnectionURL
    jdbc:derby:;databaseName=/path/to/metastore_db;create=true

 



Hive 2 server can be accessed using beeline

Hiveserver2 supports concurrent clients, better authentication and authorization

Beeline is the CLI for hiveserver2

Hive1 was directly accessing the metastore and driver; hive2 passes through the API

run beeline by calling
$ bin/beeline -u jdbc:hive2://

(I updated the javax.jdo.option.ConnectionURL to use common metastore_db location)


0: jdbc:hive2://> create database myexample1;
17/03/12 12:30:27 [HiveServer2-Background-Pool: Thread-41]: WARN metastore.ObjectStore: Failed to get database myexample1, returning NoSuchObjectException
OK
No rows affected (0.225 seconds)

0: jdbc:hive2://> use myexample1;
0: jdbc:hive2://> create table customers(id bigint, name string, address string);
OK

No rows affected (0.309 seconds)
0: jdbc:hive2://> show tables;
OK
+------------+--+
|  tab_name  |
+------------+--+
| customers  |
+------------+--+
1 row selected (0.085 seconds)
0: jdbc:hive2://> describe customers;
OK
+-----------+------------+----------+--+
| col_name  | data_type  | comment  |
+-----------+------------+----------+--+
| id        | bigint     |          |
| name      | string     |          |
| address   | string     |          |
+-----------+------------+----------+--+
3 rows selected (0.233 seconds)

0: jdbc:hive2://> insert into customers values(1111,"Jim","NJ");

This will run the map reduce command

17/03/12 12:55:23 [0bb81cdd-3ea1-49bc-9df2-f3b27d76e7c0 main]: ERROR hdfs.KeyProviderCache: Could not find uri with key [dfs.encryption.key.provider.uri] to create a keyProvider !!
17/03/12 12:55:24 [HiveServer2-Background-Pool: Thread-49]: WARN ql.Driver: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = bhajanpreetsingh_20170312125522_ffc0c677-17ea-4632-811b-02e61523b296
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
17/03/12 12:55:25 [HiveServer2-Background-Pool: Thread-49]: WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
Starting Job = job_1489337616191_0001, Tracking URL = http://10.246.215.84:8088/proxy/application_1489337616191_0001/
Kill Command = /Users/bhajanpreetsingh/sw/hadoop-2.7.3/bin/hadoop job  -kill job_1489337616191_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
17/03/12 12:55:33 [HiveServer2-Background-Pool: Thread-49]: WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
2017-03-12 12:55:33,954 Stage-1 map = 0%,  reduce = 0%
2017-03-12 12:55:39,309 Stage-1 map = 100%,  reduce = 0%
Ended Job = job_1489337616191_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://localhost/user/hive/warehouse/myexample1.db/customers/.hive-staging_hive_2017-03-12_12-55-22_882_6007169443295942027-1/-ext-10000
Loading data to table myexample1.customers
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   HDFS Read: 4352 HDFS Write: 88 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
No rows affected (18.856 seconds)
0: jdbc:hive2://> select * from customers;
OK
+---------------+-----------------+--------------------+--+
| customers.id  | customers.name  | customers.address  |
+---------------+-----------------+--------------------+--+
| 1111          | Jim             | NJ                 |
+---------------+-----------------+--------------------+--+
1 row selected (0.122 seconds)

0: jdbc:hive2://> insert into customers values(2222,"Tim","NJ"),(3333,"Moke","NJ"),(4444,"Him","NJ"),(5555,"Bhajan","NJ");



0: jdbc:hive2://> select * from customers;
OK
+---------------+-----------------+--------------------+--+
| customers.id  | customers.name  | customers.address  |
+---------------+-----------------+--------------------+--+
| 1111          | Jim             | NJ                 |
| 2222          | Tim             | NJ                 |
| 3333          | Moke            | NJ                 |
| 4444          | Him             | NJ                 |
| 5555          | Bhajan          | NJ                 |
+---------------+-----------------+--------------------+--+
5 rows selected (0.089 seconds)
0: jdbc:hive2://>



Hive UDF

Hive ships with many built-in functions. There is also a library in hive you can use to define your own custom functions:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
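Once a custom function has been compiled into a jar, it can be registered from a session roughly like this (the jar path, class name, and function name below are made up for illustration):

add jar /tmp/my-udfs.jar;
create temporary function my_upper as 'com.example.hive.MyUpperUDF';
select my_upper(name) from customers;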


How to execute hive queries from terminal?

$ beeline -u jdbc:hive2:// -e "use myexample1; select * from customers;"

OR 

execute queries from a file
BhajanpreetsMBP:hadoop-2.7.3 bhajanpreetsingh$ echo "use myexample1; select * from customers;" >> /tmp/tmp.hql
BhajanpreetsMBP:hadoop-2.7.3 bhajanpreetsingh$ beeline -u jdbc:hive2:// -f /tmp/tmp.hql 


How to do simple Joins in Hive?

0: jdbc:hive2://> use myexample1;
0: jdbc:hive2://> create table if not exists orders(id bigint, product_id string, customer_id bigint, quantity int, amount double);

add data
BhajanpreetsMBP:hadoop-2.7.3 bhajanpreetsingh$ cat /tmp/tmp.hql 
use myexample1; 
insert into orders values(0001, "laptop", 1111,2,2500);
insert into orders values(0002, "phone", 1111,1,500);
insert into orders values(0003, "mouse", 2222,100,1800);
insert into orders values(0004, "light", 3333,1,2);
insert into orders values(0005, "ball", 4444,25,5);
select * from orders;
BhajanpreetsMBP:hadoop-2.7.3 bhajanpreetsingh$ 
BhajanpreetsMBP:hadoop-2.7.3 bhajanpreetsingh$ 
BhajanpreetsMBP:hadoop-2.7.3 bhajanpreetsingh$ beeline -u jdbc:hive2:// -f /tmp/tmp.hql

run join query

$ beeline -u jdbc:hive2:// -e "use myexample1;select * from customers join orders where customers.id=orders.customer_id"

+---------------+-----------------+--------------------+------------+--------------------+---------------------+------------------+----------------+--+
| customers.id  | customers.name  | customers.address  | orders.id  | orders.product_id  | orders.customer_id  | orders.quantity  | orders.amount  |
+---------------+-----------------+--------------------+------------+--------------------+---------------------+------------------+----------------+--+
| 1111          | Jim             | NJ                 | 1          | laptop             | 1111                | 2                | 2500.0         |
| 1111          | Jim             | NJ                 | 2          | phone              | 1111                | 1                | 500.0          |
| 2222          | Tim             | NJ                 | 3          | mouse              | 2222                | 100              | 1800.0         |
| 3333          | Moke            | NJ                 | 4          | light              | 3333                | 1                | 2.0            |
| 4444          | Him             | NJ                 | 5          | ball               | 4444                | 25               | 5.0            |
+---------------+-----------------+--------------------+------------+--------------------+---------------------+------------------+----------------+--+


note that hive only supports equi-joins


Primitive datatypes in HIVE

  1. Boolean
  2. Numeric
    1. Integers
      1. Tinyint: 1 byte, -128 to 127 (one bit used for sign and 7 for magnitude)
      2. Smallint: 2 bytes, -2^15 to 2^15 - 1
      3. Int: 4 bytes
      4. Bigint: 8 bytes
    2. Decimals
      1. Float: 4 bytes
      2. Double: 8 bytes
      3. Decimal: arbitrary precision, e.g., decimal(10,2) holds 10 digits in total, 2 of them after the decimal point
  3. String
    1. String: unbounded variable length
    2. Char: fixed length char string
    3. Varchar: bounded variable length (we give max size)
  4. Timestamp
    1. Integer: Unix timestamp in seconds
    2. Float: Unix timestamp in seconds with decimal precision
    3. String: any JDBC-compliant string will work, e.g., YYYY-MM-DD HH:MM:SS.fffffffff
    4. Date: values of the form "YYYY-MM-DD"
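A quick illustrative DDL exercising these types (the table name and columns are made up):

create table typedemo(
  flag boolean,          -- boolean
  tiny tinyint,          -- 1-byte integer
  big bigint,            -- 8-byte integer
  price decimal(10,2),   -- 10 digits total, 2 after the decimal point
  name varchar(50),      -- bounded variable-length string
  created timestamp,
  dob date
);

-- values can also be converted between types at query time
select cast('123' as int), cast(1.99 as decimal(10,2));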

Exploring Hive warehouse data in HDFS

BhajanpreetsMBP:hadoop-2.7.3 bhajanpreetsingh$ hdfs dfs -cat /user/hive/warehouse/myexample*/orde*/000*_0
17/03/12 19:31:51 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
1laptop111122500.0
BhajanpreetsMBP:hadoop-2.7.3 bhajanpreetsingh$ hdfs dfs -cat /user/hive/warehouse/myexample*/cust*/000*_1
17/03/12 19:32:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2222TimNJ
3333MokeNJ
4444HimNJ
5555BhajanNJ

each insert created a file (the fields look glued together because Hive's default field delimiter is the non-printing Ctrl-A character)
each db has its own dir (like myexample1 above)
default db has no dir; tables from the default db are placed directly in /user/hive/warehouse/

Managed and External Tables
In Managed tables, hive owns the files and directories with data
Deleting a table will delete both metadata and data

In External tables, Hadoop, Pig, HBase etc. may all edit the data files
Deleting the table will only delete the metadata

Metadata is saved in metastore in both Managed and External tables
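A minimal sketch of the difference in drop behavior (table names and the /ext_data location are made up):

-- managed: hive owns the files under the warehouse dir
create table managed_demo(id int);
-- external: hive only tracks metadata; files live at the given location
create external table external_demo(id int) location "/ext_data";

drop table managed_demo;    -- metadata AND data files are removed
drop table external_demo;   -- only metadata is removed; files in /ext_data remain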

Creating external tables
copy csv data
$ cat /tmp/products.csv
ip1,iphone,800
tv2,visio,790
mb3,mac,2690

in beeline
create external table if not exists products(id string,name string,cost float)
comment "products table"
row format delimited fields terminated by ','
stored as textfile
location "/data_example";

select * from products;
+--------------+----------------+----------------+--+
| products.id  | products.name  | products.cost  |
+--------------+----------------+----------------+--+
| ip1          | iphone         | 800.0          |
| tv2          | visio          | 790.0          |
| mb3          | mac            | 2690.0         |
+--------------+----------------+----------------+--+
more on create table command in
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL


How to check if a table is managed or external table in hive?

run in beeline: describe formatted <table_name>;
and check the Table Type row

it will also show the location of the data
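For example, for the external products table created below, the relevant rows of the output look roughly like this (exact formatting varies):

| Table Type:    | EXTERNAL_TABLE                 |
| Location:      | hdfs://localhost/data_example  |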

How to create a table like another table?

create table table2 like table1;

How to rename a table?

alter table table1 rename to table10;

How to add columns?

alter table table1 add columns (col1 int comment "my comment");

How to swap/rename columns?

alter table table1 change column col1 col_1 int after col2;
(this will rename col1 to col_1 and move it after col2)

When we swap columns, only the metadata is updated; the data remains in the same place.


Temporary tables

they are local, session-level tables that get deleted when the session ends
their names don't conflict with other users' temporary tables or with permanent tables (however, if a temporary table shares a name with a permanent table, you cannot access the permanent one until the temporary table goes away)
they do not support partitions or indexes
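A small illustrative session (the temporary table name is made up):

create temporary table tmp_customers as select * from customers;
select count(*) from tmp_customers;   -- visible only in this session
-- when the session closes, tmp_customers disappears automatically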


Loading data 

0: jdbc:hive2://> create external table if not exists products(id string,name string,cost float)
. . . . . . . . > comment "products table"
. . . . . . . . > row format delimited fields terminated by ','
. . . . . . . . > stored as textfile
. . . . . . . . > location "/data_example";
OK
No rows affected (0.076 seconds)
0: jdbc:hive2://> select * from products
. . . . . . . . > ;
OK
+--------------+----------------+----------------+--+
| products.id  | products.name  | products.cost  |
+--------------+----------------+----------------+--+
| ip1          | iphone         | 800.0          |
| tv2          | visio          | 790.0          |
| mb3          | mac            | 2690.0         |
+--------------+----------------+----------------+--+

3 rows selected (0.101 seconds)

0: jdbc:hive2://> select * from managed_products;
OK
+----------------------+------------------------+------------------------+--+
| managed_products.id  | managed_products.name  | managed_products.cost  |
+----------------------+------------------------+------------------------+--+
+----------------------+------------------------+------------------------+--+
No rows selected (0.08 seconds)


0: jdbc:hive2://> load data local inpath '/tmp/products.csv' into table managed_products;

local indicates the local FS

to copy from HDFS, don't specify local above

Data is appended in this way

When loading from HDFS the file is MOVED and not just copied! From the local FS, the file is copied!!

to overwrite the data instead, call: load data local inpath '/tmp/products.csv' overwrite into table managed_products;


Copy data from one table to another
insert into table table1
select col1, col2, col3 from table2;

to overwrite, call: insert overwrite table table1 select * from table2;


multi insert


0: jdbc:hive2://> select * from products
. . . . . . . . > ;
17/03/12 23:03:26 [02a195f5-8a48-4d43-b109-b8dd6b43d6a6 main]: ERROR hdfs.KeyProviderCache: Could not find uri with key [dfs.encryption.key.provider.uri] to create a keyProvider !!
OK
+--------------+----------------+----------------+--+
| products.id  | products.name  | products.cost  |
+--------------+----------------+----------------+--+
| ip1          | iphone         | 800.0          |
| tv2          | visio          | 790.0          |
| mb3          | mac            | 2690.0         |
+--------------+----------------+----------------+--+
3 rows selected (1.374 seconds)
0: jdbc:hive2://> from products
. . . . . . . . > insert overwrite table product_name
. . . . . . . . > select id,name
. . . . . . . . > insert overwrite table product_cost
. . . . . . . . > select id,cost;

0: jdbc:hive2://> select * from product_name;
OK
+------------------+--------------------+--+
| product_name.id  | product_name.name  |
+------------------+--------------------+--+
| ip1              | iphone             |
| tv2              | visio              |
| mb3              | mac                |
+------------------+--------------------+--+
3 rows selected (0.089 seconds)
0: jdbc:hive2://> select * from product_cost;
OK
+------------------+--------------------+--+
| product_cost.id  | product_cost.cost  |
+------------------+--------------------+--+
| ip1              | 800.0              |
| tv2              | 790.0              |
| mb3              | 2690.0             |
+------------------+--------------------+--+
3 rows selected (0.085 seconds)
0: jdbc:hive2://> 






What cannot be done easily from hive

Row level delete and update not available by default

However we can set special properties in hive-site.xml to make some tables ACID compliant (performance will suffer due to this)
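A hedged sketch of what that involves (the bucket count and table name below are arbitrary; the set commands mirror the corresponding hive-site.xml properties):

-- properties typically required for ACID tables
set hive.support.concurrency=true;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
set hive.compactor.initiator.on=true;
set hive.compactor.worker.threads=1;

-- ACID tables must be bucketed and stored as ORC, with the transactional property
create table customers_acid(id bigint, name string, address string)
clustered by (id) into 4 buckets
stored as orc
tblproperties('transactional'='true');

-- now row-level update/delete work
update customers_acid set address = 'NY' where id = 1111;
delete from customers_acid where id = 2222;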


Adding Arrays

create table laptops(OS array<string>, brand array<string>) row format delimited fields terminated by ',' collection items terminated by '#';


load data local inpath '/tmp/laptops.csv' into table laptops;



select * from laptops;

OK

+--------------+-----------------------+--+
|  laptops.os  |     laptops.brand     |
+--------------+-----------------------+--+
| ["Mac"]      | ["Apple"]             |
| ["Win"]      | ["HP","Acer","Dell"]  |
| ["Unbuntu"]  | [" HP","Dell"]        |
+--------------+-----------------------+--+

cat /tmp/laptops.csv

Mac,Apple
Win,HP#Acer#Dell
Unbuntu, HP#Dell


Adding Maps


0: jdbc:hive2://> create table desktops(model string,features map<string,boolean>) row format delimited fields terminated by ',' collection items terminated by '#' map keys terminated by ':';

OK

No rows affected (0.074 seconds)

0: jdbc:hive2://> load data local inpath '/tmp/desktops.csv' into table desktops;

Loading data to table myexample1.desktops
OK
No rows affected (0.153 seconds)
0: jdbc:hive2://> select * from desktops;
OK
+-----------------+----------------------------------+--+
| desktops.model  |        desktops.features         |
+-----------------+----------------------------------+--+
| model1          | {"webcam":false,"mouse":true}    |
| model2          | {"webcam":true,"earphone":true}  |
+-----------------+----------------------------------+--+
2 rows selected (0.098 seconds)



$ cat /tmp/desktops.csv
model1,webcam:false#mouse:true
model2,webcam:true#earphone:true






Adding Structures


create table desktops(model string, features map<string,boolean>, information struct<...>) row format delimited fields terminated by ',' collection items terminated by '#' map keys terminated by ':';
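The struct's field list was lost above, so here is a hedged, self-contained variant with made-up fields (ram, disk). In a delimited text file, struct fields are separated by the collection items delimiter:

create table desktops2(
  model string,
  features map<string,boolean>,
  information struct<ram:int, disk:int>
) row format delimited
  fields terminated by ','
  collection items terminated by '#'
  map keys terminated by ':';

-- /tmp/desktops2.csv would look like:
--   model1,webcam:false#mouse:true,8#512
load data local inpath '/tmp/desktops2.csv' into table desktops2;

-- struct fields are accessed with dot notation
select model, information.ram, information.disk from desktops2;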



CBO (Cost-Based Optimizer)

The optimizer (and the exceptions/warnings it raises) can be turned off in hive-site.xml via hive.cbo.enable

Built in functions in hive


UDF (user defined functions) like trim(), concat(), length(), round(), floor() - single in, single out
UDAF (user defined aggregate functions) like count(*), sum(), avg() - multi in, single out
UDTF (user defined table-generating functions) like explode(), posexplode() - single in, multi out
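One quick example of each category, run against the tables created earlier (illustrative):

-- UDF: one value in, one value out
select concat(name, ' lives in ', address) from customers;
-- UDAF: many rows in, one value out
select avg(amount) from orders;
-- UDTF: one row in, many rows out
select explode(brand) from laptops;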

explode() and posexplode()
0: jdbc:hive2://> select * from desktops;
OK
+-----------------+----------------------------------+--+
| desktops.model  |        desktops.features         |
+-----------------+----------------------------------+--+
| model1          | {"webcam":false,"mouse":true}    |
| model2          | {"webcam":true,"earphone":true}  |
+-----------------+----------------------------------+--+
2 rows selected (0.098 seconds)
0: jdbc:hive2://> select * from laptops;
OK
+--------------+-----------------------+--+
|  laptops.os  |     laptops.brand     |
+--------------+-----------------------+--+
| ["Mac"]      | ["Apple"]             |
| ["Win"]      | ["HP","Acer","Dell"]  |
| ["Unbuntu"]  | [" HP","Dell"]        |
+--------------+-----------------------+--+
3 rows selected (0.069 seconds)
NOTE: WE CANNOT mix a UDTF with other columns, e.g. select os, explode(brand) from laptops fails (use a lateral view for that, shown below)

NOTE that we also cannot group by the exploded alias: select explode(brand) as brnd from laptops group by brnd fails
0: jdbc:hive2://> select explode(brand) from laptops;
OK
+--------+--+
|  col   |
+--------+--+
| Apple  |
| HP     |
| Acer   |
| Dell   |
|  HP    |
| Dell   |
+--------+--+
0: jdbc:hive2://> select posexplode(brand) from laptops;
+------+--------+--+
| pos  |  val   |
+------+--------+--+
| 0    | Apple  |
| 0    | HP     |
| 1    | Acer   |
| 2    | Dell   |
| 0    |  HP    |
| 1    | Dell   |
+------+--------+--+
0: jdbc:hive2://> select posexplode(brand) as (index,brand) from laptops;
+--------+--------+--+
| index  | brand  |
+--------+--------+--+
| 0      | Apple  |
| 0      | HP     |
| 1      | Acer   |
| 2      | Dell   |
| 0      |  HP    |
| 1      | Dell   |
+--------+--------+--+
6 rows selected (0.078 seconds)

0: jdbc:hive2://> select * from desktops;
OK
+-----------------+----------------------------------+--+
| desktops.model  |        desktops.features         |
+-----------------+----------------------------------+--+
| model1          | {"webcam":false,"mouse":true}    |
| model2          | {"webcam":true,"earphone":true}  |
+-----------------+----------------------------------+--+
2 rows selected (0.073 seconds)


0: jdbc:hive2://> select explode(features) as (feature,present) from desktops;
+-----------+----------+--+
|  feature  | present  |
+-----------+----------+--+
| webcam    | false    |
| mouse     | true     |
| webcam    | true     |
| earphone  | true     |
+-----------+----------+--+
4 rows selected (0.093 seconds)
0: jdbc:hive2://> 

Lateral View

It's a virtual table formed by the exploded view
We can join it back with the original table to build useful queries



0: jdbc:hive2://> drop table laptops;
0: jdbc:hive2://> create table laptops ( os string, brand array<string> ) row format delimited fields terminated by ',' collection items terminated by '#';

0: jdbc:hive2://> load data local inpath '/tmp/laptops.csv' into table laptops; 



0: jdbc:hive2://> describe laptops;

OK
+-----------+----------------+----------+--+
| col_name  |   data_type    | comment  |
+-----------+----------------+----------+--+
| os        | string         |          |
| brand     | array<string>  |          |
+-----------+----------------+----------+--+

0: jdbc:hive2://> select os,ibrand from laptops lateral view explode(brand) brandsTable as ibrand;

OK
+----------+---------+--+
|    os    | ibrand  |
+----------+---------+--+
| Mac      | Apple   |
| Win      | HP      |
| Win      | Acer    |
| Win      | Dell    |
| Unbuntu  |  HP     |
| Unbuntu  | Dell    |
+----------+---------+--+
*****brandsTable is the new virtual table whose exploded column is aliased as ibrand*****
0: jdbc:hive2://> describe desktops;
OK
+-----------+----------------------+----------+--+
| col_name  |      data_type       | comment  |
+-----------+----------------------+----------+--+
| model     | string               |          |
| features  | map<string,boolean>  |          |
+-----------+----------------------+----------+--+
2 rows selected (0.076 seconds)
0: jdbc:hive2://> select model,theFeature,isPresent from desktops lateral view explode(features) ExplodedFeatureTable as theFeature,isPresent;
OK
+---------+-------------+------------+--+
|  model  | thefeature  | ispresent  |
+---------+-------------+------------+--+
| model1  | webcam      | false      |
| model1  | mouse       | true       |
| model2  | webcam      | true       |
| model2  | earphone    | true       |
+---------+-------------+------------+--+

let's count how many times each brand name appears in the laptops table
0: jdbc:hive2://> select ibrand,count(*) from laptops lateral view explode(brand) llap as ibrand group by ibrand;
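The output wasn't captured here, but from the data above it should come back roughly as follows (note that " HP" with its leading space counts separately from "HP"; row order may differ):

+---------+------+--+
| ibrand  | _c1  |
+---------+------+--+
|  HP     | 1    |
| Acer    | 1    |
| Apple   | 1    |
| Dell    | 2    |
| HP      | 1    |
+---------+------+--+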


0: jdbc:hive2://> create table matrix( cells array<array<int>> );

0: jdbc:hive2://> insert into matrix
. . . . . . . . > select array( array(1,2,3), array(4,5,6), array(7,8,9) );
0: jdbc:hive2://> select * from matrix;
OK
+----------------------------+--+
|        matrix.cells        |
+----------------------------+--+
| [[1,2,3],[4,5,6],[7,8,9]]  |
+----------------------------+--+
1 row selected (0.074 seconds)
0: jdbc:hive2://> 

0: jdbc:hive2://> select * from matrix
. . . . . . . . > lateral view explode(cells) array1 as r
. . . . . . . . > lateral view explode(r) array2 as c;
OK
+----------------------------+-----------+-----------+--+
|        matrix.cells        | array1.r  | array2.c  |
+----------------------------+-----------+-----------+--+
| [[1,2,3],[4,5,6],[7,8,9]]  | [1,2,3]   | 1         |
| [[1,2,3],[4,5,6],[7,8,9]]  | [1,2,3]   | 2         |
| [[1,2,3],[4,5,6],[7,8,9]]  | [1,2,3]   | 3         |
| [[1,2,3],[4,5,6],[7,8,9]]  | [4,5,6]   | 4         |
| [[1,2,3],[4,5,6],[7,8,9]]  | [4,5,6]   | 5         |
| [[1,2,3],[4,5,6],[7,8,9]]  | [4,5,6]   | 6         |
| [[1,2,3],[4,5,6],[7,8,9]]  | [7,8,9]   | 7         |
| [[1,2,3],[4,5,6],[7,8,9]]  | [7,8,9]   | 8         |
| [[1,2,3],[4,5,6],[7,8,9]]  | [7,8,9]   | 9         |
+----------------------------+-----------+-----------+--+
9 rows selected (0.063 seconds)
0: jdbc:hive2://> 


Note on external tables


create external table mytable(id int) row format delimited fields terminated by '|';
load data inpath '/tmp/abc.txt' into table mytable;

this will MOVE the file /tmp/abc.txt (it disappears from its original HDFS location)

now if I do,
drop table mytable;

then the table definition will be dropped but the contents will not be dropped

if I want to drop the contents too I need to find where the data is stored. For this, run the command:
show create table mytable;
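The output includes a LOCATION clause pointing at the data, roughly like this (the exact path will differ per setup); the files can then be removed by hand from that location:

CREATE EXTERNAL TABLE `mytable`(`id` int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
...
LOCATION 'hdfs://localhost/user/hive/warehouse/mytable'
...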


(Time spent for writing - 1:30 min) In simple words, FitNesse is a testing framework that can be used to develop automated test cases. I had searched many posts but could not find a single tutorial that could help me get started with FitNesse. So writing it now... BTW I am a java developer, so this tutorial is in pure java! I use windows XP and Java version is 1.5.0 To get started first download FitNesse from http://fitnesse.org/ I didnt like its 2 minute example, because that just showed what it does, doesnt really help me to start with development. They should have a real getting started tutorial there, I think. I had downloaded fitnesse20070619.zip from downloads section After downloading, install it by unzipping the all the files to a some location like c:\fitnesse c:\fitnesse will then act as the root folder for you and all the classpaths will be relative to root folder By default FitNesse will run on port 80. If you want another port you can edit the run.bat file in the root fol...