Hive
What is Transactional Processing?
* Small data
* recent data
* updates to data
* real time
* narrow scope - single data source
What is Analytical Processing?
* Mostly batch
* Mostly old data
* Mostly long running
* wide scope - multiple data sources to give a big picture
-----------
RDBMS (mostly for transactional)
* Single m/c with backup, structured data, usually one source
It's hard to do analytics on an RDBMS
Big Data (mostly for analytical using data warehouse)
* distributed data, semi/unstructured, auto replication with fault tolerance, usually different source and different format
Not always...
* HBase is a special DB that can be configured for transactional processing even though it runs on big data
* QlikView can be used for analytical processing even though it runs on a single m/c
---------
Whats a Data Warehouse?
Technology that aggregates data from one or more sources for analytical processing
used for long running jobs; data is lagged, large, and multi-source; storage is read optimized
Eg: Vertica, Teradata, Oracle, IBM - all closed source
Apache Hive is an open source data warehouse
What is Hadoop?
A distributed computing framework for processing millions of records
Google uses the same approach for indexing and search results
Uses MapReduce on multiple machines
Fault tolerant
Basic building blocks of Hadoop
HIVE
---------------------------------------------------------------------------
HDFS               |  MR                               |  YARN
file system        |  parallel processing framework    |  coordinator, resource negotiator
Hive Metastore
Underlying data stored in HDFS
User sees data in the form of tables
The mapping between files in the FS and the tables the user sees is done by the metastore
Metastore keeps metadata - table schema, column names and types, serialization details, how files are converted
Files are mapped to tables, and lines in a file become rows in the table
The metastore is just an RDB with a JDBC driver
Dev environments use the built-in Derby DB metastore, typically one Java process with an embedded metastore, so only one session is available in dev. Prod allows multiple sessions, with an external metastore such as MySQL.
An RDBMS is ACID compliant; Hive is not ACID compliant by default (ACID keeps the order of each user's operations predictable). Some tables can be made ACID, but it's not very efficient.
Large data processed by Google!!
100 petabytes = 100 x 10^15 bytes
Hive data can be read by many technologies (Hadoop, Pig, Spark, etc.), unlike a traditional DB, which is the sole gatekeeper of its data. That's why it's easy in a traditional DB to impose table schema, constraints, and keys (primary/foreign).
Where does Hive's schema come from?
uses schema on read
The schema is applied at the time the data is read
If the underlying data doesn't conform to the schema, Hive will try its best to apply it, but imposing the schema can fail
A traditional DB (schema on write) will never have this issue!
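A minimal sketch of schema on read (file, directory, and table names are made up): point an external table at a text file where one row doesn't match the declared types; Hive returns NULL for the value it cannot coerce instead of rejecting the data.
$ cat /tmp/ages.txt
alice|31
bob|not_a_number
$ hdfs dfs -mkdir /data_ages && hdfs dfs -put /tmp/ages.txt /data_ages/
create external table ages(name string, age int)
row format delimited fields terminated by '|'
location '/data_ages';
select * from ages;
-- bob's age comes back as NULL: the schema is applied only at read time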
HiveQL v/s SQL
1. Schema on read v/s schema on write
2. HiveQL has minimal index support; an RDBMS allows indexes for read/write
3. Row-level updates are not allowed in Hive by default; we have to set config properties to enable them
4. Hive has a lot of built-in functions to access data (table-generating functions, covariance functions, etc.)
5. Only restricted subqueries are allowed in Hive, and joins must be equijoins
CLI ---> talks to the Hive Server, which consults the metastore for the Hive table information, parses queries, converts them to MR jobs, and submits them to Hadoop, where they are run and results are returned
Hive Server components
The Hive API is its gateway to the world -----> allows clients to connect and run queries ----> Driver + Compiler + Execution Engine is the heart of Hive (it takes a query, parses it, converts it to jobs, and sends them to Hadoop) --> Metastore contains the table info
When this entire Hive server runs in a single JVM process, it's called an embedded metastore.
Note that the Derby metastore DB is created in the directory where hive is started. If you chdir, you need to re-instantiate the DB in the new directory. This is fine for dev, but for prod you usually give a common location in the connection URL of the config file:
javax.jdo.option.ConnectionURL
jdbc:derby:;databaseName=/path/to/metastore_db;create=true
HiveServer2 is the latest server; Beeline is its CLI
Added features
- connect multiple users
- better authentication and authorization
The Hive CLI directly accesses the Hive metastore and driver; HiveServer2 passes through authorization first
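For reference, the authentication mode is chosen in hive-site.xml (the property name is real; the value shown is the default, not from these notes):
hive.server2.authentication
NONE
(supported values include NONE, LDAP, KERBEROS, PAM, CUSTOM)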
Hive Installation
* Download tar from apache site
* copy to ~/sw/
* add in .profile
export HIVE_HOME=/Users/bhajanpreetsingh/sw/hive-2.1.1
* add to $PATH $HIVE_HOME/bin
* add to $CLASSPATH $HIVE_HOME/lib/*
Update configuration
BhajanpreetsMBP:conf bhajanpreetsingh$ ls -ltr
total 520
-rw-r--r--@ 1 bhajanpreetsingh staff 2662 Nov 28 16:32 parquet-logging.properties
-rw-r--r--@ 1 bhajanpreetsingh staff 2060 Nov 28 16:32 ivysettings.xml
-rw-r--r--@ 1 bhajanpreetsingh staff 2925 Nov 28 16:32 hive-log4j2.properties.template
-rw-r--r--@ 1 bhajanpreetsingh staff 1596 Nov 28 16:32 beeline-log4j2.properties.template
-rw-r--r--@ 1 bhajanpreetsingh staff 2719 Nov 28 16:32 llap-cli-log4j2.properties.template
-rw-r--r--@ 1 bhajanpreetsingh staff 2274 Nov 28 16:32 hive-exec-log4j2.properties.template
-rw-r--r--@ 1 bhajanpreetsingh staff 4353 Nov 28 16:35 llap-daemon-log4j2.properties.template
-rw-r--r--@ 1 bhajanpreetsingh staff 2378 Nov 28 16:35 hive-env.sh.template
-rw-r--r--@ 1 bhajanpreetsingh staff 229198 Nov 29 14:46 hive-default.xml.template
hive-default.xml.template has the default configuration for Hive
$ cp hive-default.xml.template hive-site.xml
and then edit hive-site.xml to update the ConnectionURL:
javax.jdo.option.ConnectionURL
jdbc:derby:;databaseName=metastore_db;create=true
This property tells Hive which database to use for the metastore, here run as an embedded metastore.
JDBC connect string for a JDBC metastore.
To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
we also have
hive.metastore.warehouse.dir
/user/hive/warehouse
location of default database for the warehouse
The warehouse dir is where all our Hive data will live
Replace all occurrences:
:%s/${system:java.io.tmpdir}/\/tmp\/hive_io/g
This is where the Hive temporary data lives.
Also replace:
:%s/${system:user.name}/bhajanpreetsingh/g
i.e., the username, as above.
Check all HDFS processes are running:
BhajanpreetsMBP:conf bhajanpreetsingh$ jps
4276 JobHistoryServer
7765 SecondaryNameNode
8023 ResourceManager
10330 Jps
7579 NameNode
7661 DataNode
8110 NodeManager
Both DFS and YARN processes are running, as seen above.
Add tmp and warehouse dirs in HDFS:
hdfs dfs -mkdir /tmp
hdfs dfs -mkdir -p /user/hive/warehouse
and give them group write permissions
hdfs dfs -chmod g+w /tmp
hdfs dfs -chmod g+w /user/hive/warehouse
warehouse is where the Hive (non-metastore) data will live.
Now run the hive command.
Before running, you need to initialize the metastore database; otherwise you get the following exception:
Exception in thread "main" java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
By default, a metastore_db dir is created in each dir hive is run from (we can use a common location for the metastore by updating the config file).
Let's move the old metastore_db in the conf dir aside and re-initialize the DB by running the following commands:
BhajanpreetsMBP:conf bhajanpreetsingh$ pwd
/Users/bhajanpreetsingh/sw/hive-2.1.1/conf
BhajanpreetsMBP:conf bhajanpreetsingh$ mv metastore_db metastore_db.tmp
BhajanpreetsMBP:conf bhajanpreetsingh$ schematool -initSchema -dbType derby
To use a common location for metastore_db, we can update the ConnectionURL and put the full path in front of databaseName:
javax.jdo.option.ConnectionURL
jdbc:derby:;databaseName=/path/to/metastore_db;create=true
The Hive 2 server can be accessed using Beeline
HiveServer2 supports concurrent clients and better authentication and authorization
Beeline is the CLI for HiveServer2
Hive 1 accessed the metastore and driver directly; Hive 2 passes through the API
run beeline by calling
$ bin/beeline -u jdbc:hive2://
(I updated the javax.jdo.option.ConnectionURL to use common metastore_db location)
0: jdbc:hive2://> create database myexample1;
17/03/12 12:30:27 [HiveServer2-Background-Pool: Thread-41]: WARN metastore.ObjectStore: Failed to get database myexample1, returning NoSuchObjectException
OK
No rows affected (0.225 seconds)
0: jdbc:hive2://> use myexample1
0: jdbc:hive2://> create table customers(id bigint, name string, address string);
OK
No rows affected (0.309 seconds)
0: jdbc:hive2://> show tables;
OK
+------------+--+
| tab_name |
+------------+--+
| customers |
+------------+--+
1 row selected (0.085 seconds)
0: jdbc:hive2://> describe customers;
OK
+-----------+------------+----------+--+
| col_name | data_type | comment |
+-----------+------------+----------+--+
| id | bigint | |
| name | string | |
| address | string | |
+-----------+------------+----------+--+
3 rows selected (0.233 seconds)
0: jdbc:hive2://> insert into customers values(1111,"Jim","NJ");
This will run a MapReduce job:
17/03/12 12:55:23 [0bb81cdd-3ea1-49bc-9df2-f3b27d76e7c0 main]: ERROR hdfs.KeyProviderCache: Could not find uri with key [dfs.encryption.key.provider.uri] to create a keyProvider !!
17/03/12 12:55:24 [HiveServer2-Background-Pool: Thread-49]: WARN ql.Driver: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = bhajanpreetsingh_20170312125522_ffc0c677-17ea-4632-811b-02e61523b296
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
17/03/12 12:55:25 [HiveServer2-Background-Pool: Thread-49]: WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
Starting Job = job_1489337616191_0001, Tracking URL = http://10.246.215.84:8088/proxy/application_1489337616191_0001/
Kill Command = /Users/bhajanpreetsingh/sw/hadoop-2.7.3/bin/hadoop job -kill job_1489337616191_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
17/03/12 12:55:33 [HiveServer2-Background-Pool: Thread-49]: WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
2017-03-12 12:55:33,954 Stage-1 map = 0%, reduce = 0%
2017-03-12 12:55:39,309 Stage-1 map = 100%, reduce = 0%
Ended Job = job_1489337616191_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://localhost/user/hive/warehouse/myexample1.db/customers/.hive-staging_hive_2017-03-12_12-55-22_882_6007169443295942027-1/-ext-10000
Loading data to table myexample1.customers
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 HDFS Read: 4352 HDFS Write: 88 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
No rows affected (18.856 seconds)
0: jdbc:hive2://> select * from customers;
OK
+---------------+-----------------+--------------------+--+
| customers.id | customers.name | customers.address |
+---------------+-----------------+--------------------+--+
| 1111 | Jim | NJ |
+---------------+-----------------+--------------------+--+
1 row selected (0.122 seconds)
0: jdbc:hive2://> insert into customers values(2222,"Tim","NJ"),(3333,"Moke","NJ"),(4444,"Him","NJ"),(5555,"Bhajan","NJ");
0: jdbc:hive2://> select * from customers;
OK
+---------------+-----------------+--------------------+--+
| customers.id | customers.name | customers.address |
+---------------+-----------------+--------------------+--+
| 1111 | Jim | NJ |
| 2222 | Tim | NJ |
| 3333 | Moke | NJ |
| 4444 | Him | NJ |
| 5555 | Bhajan | NJ |
+---------------+-----------------+--------------------+--+
5 rows selected (0.089 seconds)
0: jdbc:hive2://>
Hive UDFs
These are the Hive built-in functions. There is also a library in Hive you can use to define your own custom functions: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
How to execute hive queries from terminal?
$ beeline -u jdbc:hive2:// -e "use myexample1; select * from customers;"
OR
execute queries from a file:
BhajanpreetsMBP:hadoop-2.7.3 bhajanpreetsingh$ echo "use myexample1; select * from customers;" >> /tmp/tmp.hql
BhajanpreetsMBP:hadoop-2.7.3 bhajanpreetsingh$ beeline -u jdbc:hive2:// -f /tmp/tmp.hql
How to do simple Joins in Hive?
0: jdbc:hive2://> use myexample1;
0: jdbc:hive2://> create table if not exists orders(id bigint, product_id string, customer_id bigint, quantity int, amount double);
BhajanpreetsMBP:hadoop-2.7.3 bhajanpreetsingh$ cat /tmp/tmp.hql
use myexample1;
insert into orders values(0001, "laptop", 1111,2,2500);
insert into orders values(0002, "phone", 1111,1,500);
insert into orders values(0003, "mouse", 2222,100,1800);
insert into orders values(0004, "light", 3333,1,2);
insert into orders values(0005, "ball", 4444,25,5);
select * from orders;
BhajanpreetsMBP:hadoop-2.7.3 bhajanpreetsingh$
BhajanpreetsMBP:hadoop-2.7.3 bhajanpreetsingh$
BhajanpreetsMBP:hadoop-2.7.3 bhajanpreetsingh$ beeline -u jdbc:hive2:// -f /tmp/tmp.hql
$ beeline -u jdbc:hive2:// -e "use myexample1;select * from customers join orders where customers.id=orders.customer_id"
+---------------+-----------------+--------------------+------------+--------------------+---------------------+------------------+----------------+--+
| customers.id | customers.name | customers.address | orders.id | orders.product_id | orders.customer_id | orders.quantity | orders.amount |
+---------------+-----------------+--------------------+------------+--------------------+---------------------+------------------+----------------+--+
| 1111 | Jim | NJ | 1 | laptop | 1111 | 2 | 2500.0 |
| 1111 | Jim | NJ | 2 | phone | 1111 | 1 | 500.0 |
| 2222 | Tim | NJ | 3 | mouse | 2222 | 100 | 1800.0 |
| 3333 | Moke | NJ | 4 | light | 3333 | 1 | 2.0 |
| 4444 | Him | NJ | 5 | ball | 4444 | 25 | 5.0 |
+---------------+-----------------+--------------------+------------+--------------------+---------------------+------------------+----------------+--+
note that Hive only supports equi-joins (equality-based join conditions)
Primitive datatypes in HIVE
- Boolean
- Numeric
- Integers
- Tinyint: 1 byte, -128 to 127 range (one bit used for sign and 7 for magnitude)
- Smallint: 2 bytes, -2^15 to 2^15 - 1
- Int: 4 bytes
- Bigint: 8 bytes
- Decimals
- Float: 4 bytes
- Double: 8 bytes
- Decimal: arbitrary precision, e.g., decimal(10,2) gives 10 digits of total precision with 2 digits after the decimal point
- String
- String: unbounded variable length
- Char: fixed length char string
- Varchar: bounded variable length (we give max size)
- Timestamp
- Integer: Unix timestamp in seconds
- Float: Unix timestamp in seconds with decimal precision
- String: any JDBC-compliant string will work, e.g., YYYY-MM-DD HH:MM:SS.fffffffff
- Date: values of the form "YYYY-MM-DD"
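A small sketch exercising several of these types (table and column names are made up, not from the notes above):
create table events(id bigint, price decimal(10,2), label varchar(50), ts timestamp, d date);
insert into events values(1, 12.34, 'first', '2017-03-12 12:00:00', '2017-03-12');
select d, ts, price from events;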
Exploring Hive warehouse data in HDFS
BhajanpreetsMBP:hadoop-2.7.3 bhajanpreetsingh$ hdfs dfs -cat /user/hive/warehouse/myexample*/orde*/000*_0
17/03/12 19:31:51 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
1laptop111122500.0
BhajanpreetsMBP:hadoop-2.7.3 bhajanpreetsingh$ hdfs dfs -cat /user/hive/warehouse/myexample*/cust*/000*_1
17/03/12 19:32:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2222TimNJ
3333MokeNJ
4444HimNJ
5555BhajanNJ
each insert created a file (fields are separated by Hive's default ^A delimiter, which doesn't render above)
each DB has its own dir (like myexample1.db above)
the default DB has no dir; its tables are placed directly in /user/hive/warehouse/
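To verify this layout, list the table directory (the path follows the db.table convention above):
$ hdfs dfs -ls /user/hive/warehouse/myexample1.db/customers/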
Managed and External Tables
For managed tables, Hive owns the files and directories holding the data
Deleting a managed table deletes both metadata and data
For external tables, other tools (Hadoop, Pig, HBase) may also edit the data files
Deleting an external table deletes only the metadata
Metadata is saved in the metastore for both managed and external tables
Creating external tables
copy csv data
$ cat /tmp/products.csv
ip1,iphone,800
tv2,visio,790
mb3,mac,2690
in beeline
create external table if not exists products(id string,name string,cost float)
comment "products table"
row format delimited fields terminated by ','
stored as textfile
location "/data_example";
select * from products;
+--------------+----------------+----------------+--+
| products.id | products.name | products.cost |
+--------------+----------------+----------------+--+
| ip1 | iphone | 800.0 |
| tv2 | visio | 790.0 |
| mb3 | mac | 2690.0 |
+--------------+----------------+----------------+--+
more on create table command in
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
How to check if a table is managed or external in Hive?
run in beeline: describe formatted <table_name>;
and check Table Type row
it will also show location of data
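For example (output heavily trimmed; exact layout varies by version):
0: jdbc:hive2://> describe formatted products;
| Table Type:  | EXTERNAL_TABLE                 |
| Location:    | hdfs://localhost/data_example  |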
How to create a table like another table?
create table table2 like table1;
How to rename a table?
alter table table1 rename to table10;
How to add columns?
alter table table1 add columns (col1 int comment "my comment");
How to swap/rename columns?
alter table table1 change column col1 col_1 int after col2;
(this will rename col1 to col_1 and move it after col2)
When we swap columns, only the metadata is updated. data remains in same place.
Temporary tables
they are session-level tables that get deleted when the session ends
names don't conflict with other users' temporary tables or with permanent tables (however, if the names are the same, you cannot access the permanent table until the temporary table goes away)
they do not support partitions or indexes
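A quick sketch (table name made up); the table disappears when the beeline session ends:
create temporary table tmp_nj_customers as select * from customers where address = 'NJ';
select count(*) from tmp_nj_customers;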
Loading data
0: jdbc:hive2://> create external table if not exists products(id string,name string,cost float)
. . . . . . . . > comment "products table"
. . . . . . . . > row format delimited fields terminated by ','
. . . . . . . . > stored as textfile
. . . . . . . . > location "/data_example";
OK
No rows affected (0.076 seconds)
0: jdbc:hive2://> select * from products
. . . . . . . . > ;
OK
+--------------+----------------+----------------+--+
| products.id | products.name | products.cost |
+--------------+----------------+----------------+--+
| ip1 | iphone | 800.0 |
| tv2 | visio | 790.0 |
| mb3 | mac | 2690.0 |
+--------------+----------------+----------------+--+
3 rows selected (0.101 seconds)
0: jdbc:hive2://> select * from managed_products;
OK
+----------------------+------------------------+------------------------+--+
| managed_products.id | managed_products.name | managed_products.cost |
+----------------------+------------------------+------------------------+--+
+----------------------+------------------------+------------------------+--+
No rows selected (0.08 seconds)
0: jdbc:hive2://> load data local inpath '/tmp/products.csv' into table managed_products;
'local' indicates the local FS
to load from HDFS, don't specify 'local'
data loaded this way is appended
a file loaded from HDFS is MOVED, not just copied! A file loaded from the local FS is copied!!
to overwrite instead of append: load data local inpath '/tmp/products.csv' overwrite into table managed_products;
Copy data from one table to another:
insert into table1
select col1, col2, col3 from table2;
to overwrite: insert overwrite table table1 select * from table2;
multi insert
0: jdbc:hive2://> select * from products
. . . . . . . . > ;
17/03/12 23:03:26 [02a195f5-8a48-4d43-b109-b8dd6b43d6a6 main]: ERROR hdfs.KeyProviderCache: Could not find uri with key [dfs.encryption.key.provider.uri] to create a keyProvider !!
OK
+--------------+----------------+----------------+--+
| products.id | products.name | products.cost |
+--------------+----------------+----------------+--+
| ip1 | iphone | 800.0 |
| tv2 | visio | 790.0 |
| mb3 | mac | 2690.0 |
+--------------+----------------+----------------+--+
3 rows selected (1.374 seconds)
0: jdbc:hive2://> from products
. . . . . . . . > insert overwrite table product_name
. . . . . . . . > select id,name
. . . . . . . . > insert overwrite table product_cost
. . . . . . . . > select id,cost;
0: jdbc:hive2://> select * from product_name;
OK
+------------------+--------------------+--+
| product_name.id | product_name.name |
+------------------+--------------------+--+
| ip1 | iphone |
| tv2 | visio |
| mb3 | mac |
+------------------+--------------------+--+
3 rows selected (0.089 seconds)
0: jdbc:hive2://> select * from product_cost;
OK
+------------------+--------------------+--+
| product_cost.id | product_cost.cost |
+------------------+--------------------+--+
| ip1 | 800.0 |
| tv2 | 790.0 |
| mb3 | 2690.0 |
+------------------+--------------------+--+
3 rows selected (0.085 seconds)
0: jdbc:hive2://>
What cannot be done easily from Hive
Row-level delete and update are not available by default
However, we can set special properties in hive-site.xml to make some tables ACID compliant (performance will suffer due to this); a sketch follows below
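A hedged sketch of enabling ACID (property names are real; in Hive 2, transactional tables must be bucketed ORC; the table name is made up):
set hive.support.concurrency=true;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
create table acid_customers(id bigint, name string)
clustered by (id) into 2 buckets
stored as orc tblproperties('transactional'='true');
update acid_customers set name='Jimmy' where id=1111;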
Adding Arrays
create table laptops(os array<string>, brand array<string>) row format delimited fields terminated by ',' collection items terminated by '#';
load data local inpath '/tmp/laptops.csv' into table laptops;
select * from laptops;
OK
+--------------+-----------------------+--+
| laptops.os | laptops.brand |
+--------------+-----------------------+--+
| ["Mac"] | ["Apple"] |
| ["Win"] | ["HP","Acer","Dell"] |
| ["Unbuntu"] | [" HP","Dell"] |
+--------------+-----------------------+--+
cat /tmp/laptops.csv
Mac,Apple
Win,HP#Acer#Dell
Unbuntu, HP#Dell
Adding Maps
0: jdbc:hive2://> create table desktops(model string,features map<string,boolean>) row format delimited fields terminated by ',' collection items terminated by '#' map keys terminated by ':';
OK
No rows affected (0.074 seconds)
0: jdbc:hive2://> load data local inpath '/tmp/desktops.csv' into table desktops;
Loading data to table myexample1.desktops
OK
No rows affected (0.153 seconds)
0: jdbc:hive2://> select * from desktops;
OK
+-----------------+----------------------------------+--+
| desktops.model | desktops.features |
+-----------------+----------------------------------+--+
| model1 | {"webcam":false,"mouse":true} |
| model2 | {"webcam":true,"earphone":true} |
+-----------------+----------------------------------+--+
2 rows selected (0.098 seconds)
$ cat /tmp/desktops.csv
model1,webcam:false#mouse:true
model2,webcam:true#earphone:true
Adding Structures
create table desktops(model string, features map<string,boolean>, information struct<...>) row format delimited fields terminated by ',' collection items terminated by '#' map keys terminated by ':';
(the struct's field list was lost from these notes; see the sketch below)
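Since the struct's field list was stripped above, here is a made-up but complete variant; note that struct fields are separated by the same collection items delimiter ('#'):
create table desktops2(model string, features map<string,boolean>, information struct<cpu:string,ram:int>)
row format delimited fields terminated by ','
collection items terminated by '#'
map keys terminated by ':';
-- a sample CSV row would look like: model3,webcam:true,Intel#16
select model, information.cpu, information.ram from desktops2;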
CBO (cost-based optimizer)
CBO exceptions/warnings can be turned off in hive-site.xml (hive.cbo.enable=false disables the CBO)
Built-in functions in Hive
UDF (user defined functions) like trim(), concat(), length(), round(), floor() - single row in, single row out
UDAF (user defined aggregate functions) like count(*), sum(), avg() - many rows in, single row out
UDTF (user defined table-generating functions) like explode(), posexplode() - single row in, many rows out
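Quick examples of the first two categories against the tables created earlier (UDTFs are demonstrated next):
select concat(name, ' - ', address), length(name) from customers;  -- UDF: one row in, one row out
select count(*), avg(amount) from orders;                          -- UDAF: many rows in, one row out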
explode() and posexplode()
0: jdbc:hive2://> select * from desktops;
OK
+-----------------+----------------------------------+--+
| desktops.model | desktops.features |
+-----------------+----------------------------------+--+
| model1 | {"webcam":false,"mouse":true} |
| model2 | {"webcam":true,"earphone":true} |
+-----------------+----------------------------------+--+
2 rows selected (0.098 seconds)
0: jdbc:hive2://> select * from laptops;
OK
+--------------+-----------------------+--+
| laptops.os | laptops.brand |
+--------------+-----------------------+--+
| ["Mac"] | ["Apple"] |
| ["Win"] | ["HP","Acer","Dell"] |
| ["Unbuntu"] | [" HP","Dell"] |
+--------------+-----------------------+--+
3 rows selected (0.069 seconds)
NOTE: WE CANNOT DO select os, explode(brand) from laptops (a UDTF can't be mixed with other select columns)
NOTE that we also cannot do select explode(brand) as brnd from laptops group by brnd
0: jdbc:hive2://> select explode(brand) from laptops;
OK
+--------+--+
| col |
+--------+--+
| Apple |
| HP |
| Acer |
| Dell |
| HP |
| Dell |
+--------+--+
0: jdbc:hive2://> select posexplode(brand) from laptops;
+------+--------+--+
| pos | val |
+------+--------+--+
| 0 | Apple |
| 0 | HP |
| 1 | Acer |
| 2 | Dell |
| 0 | HP |
| 1 | Dell |
+------+--------+--+
0: jdbc:hive2://> select posexplode(brand) as (index,brand) from laptops;
+--------+--------+--+
| index | brand |
+--------+--------+--+
| 0 | Apple |
| 0 | HP |
| 1 | Acer |
| 2 | Dell |
| 0 | HP |
| 1 | Dell |
+--------+--------+--+
6 rows selected (0.078 seconds)
0: jdbc:hive2://> select * from desktops;
OK
+-----------------+----------------------------------+--+
| desktops.model | desktops.features |
+-----------------+----------------------------------+--+
| model1 | {"webcam":false,"mouse":true} |
| model2 | {"webcam":true,"earphone":true} |
+-----------------+----------------------------------+--+
2 rows selected (0.073 seconds)
0: jdbc:hive2://> select explode(features) as (feature,present) from desktops
+-----------+----------+--+
| feature | present |
+-----------+----------+--+
| webcam | false |
| mouse | true |
| webcam | true |
| earphone | true |
+-----------+----------+--+
4 rows selected (0.093 seconds)
0: jdbc:hive2://>
Lateral View
It's a virtual table formed by the exploded view
We can join it back with the original table's columns to get useful queries
0: jdbc:hive2://> drop table laptops;
0: jdbc:hive2://> create table laptops ( os string, brand array<string> ) row format delimited fields terminated by ',' collection items terminated by '#';
0: jdbc:hive2://> load data local inpath '/tmp/laptops.csv' into table laptops;
0: jdbc:hive2://> describe laptops;
OK
+-----------+----------------+----------+--+
| col_name | data_type | comment |
+-----------+----------------+----------+--+
| os | string | |
| brand | array<string> | |
+-----------+----------------+----------+--+
0: jdbc:hive2://> select os,ibrand from laptops lateral view explode(brand) brandsTable as ibrand;
OK
+----------+---------+--+
| os | ibrand |
+----------+---------+--+
| Mac | Apple |
| Win | HP |
| Win | Acer |
| Win | Dell |
| Unbuntu | HP |
| Unbuntu | Dell |
+----------+---------+--+
*****brandsTable is the new virtual table that holds the exploded brand column*****
0: jdbc:hive2://> describe desktops;
OK
+-----------+----------------------+----------+--+
| col_name | data_type | comment |
+-----------+----------------------+----------+--+
| model | string | |
| features | map<string,boolean> | |
+-----------+----------------------+----------+--+
2 rows selected (0.076 seconds)
0: jdbc:hive2://> select model,theFeature,isPresent from desktops lateral view explode(features) ExplodedFeatureTable as theFeature,isPresent;
OK
+---------+-------------+------------+--+
| model | thefeature | ispresent |
+---------+-------------+------------+--+
| model1 | webcam | false |
| model1 | mouse | true |
| model2 | webcam | true |
| model2 | earphone | true |
+---------+-------------+------------+--+
let's count how many times each brand name appears in the laptops table
0: jdbc:hive2://> select ibrand,count(*) from laptops lateral view explode(brand) llap as ibrand group by ibrand;
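Expected result from the rows shown above (MR output omitted; row order not guaranteed). Note the stray leading space in " HP" from the third CSV line, which makes it a separate group from "HP":
+---------+------+--+
| ibrand  | _c1  |
+---------+------+--+
|  HP     | 1    |
| Acer    | 1    |
| Apple   | 1    |
| Dell    | 2    |
| HP      | 1    |
+---------+------+--+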
0: jdbc:hive2://> create table matrix( cells array<array<int>> );
0: jdbc:hive2://> insert into matrix
. . . . . . . . > select array( array(1,2,3), array(4,5,6), array(7,8,9) );
0: jdbc:hive2://> select * from matrix;
OK
+----------------------------+--+
| matrix.cells |
+----------------------------+--+
| [[1,2,3],[4,5,6],[7,8,9]] |
+----------------------------+--+
1 row selected (0.074 seconds)
0: jdbc:hive2://>
0: jdbc:hive2://> select * from matrix
. . . . . . . . > lateral view explode(cells) array1 as r
. . . . . . . . > lateral view explode(r) array2 as c;
OK
+----------------------------+-----------+-----------+--+
| matrix.cells | array1.r | array2.c |
+----------------------------+-----------+-----------+--+
| [[1,2,3],[4,5,6],[7,8,9]] | [1,2,3] | 1 |
| [[1,2,3],[4,5,6],[7,8,9]] | [1,2,3] | 2 |
| [[1,2,3],[4,5,6],[7,8,9]] | [1,2,3] | 3 |
| [[1,2,3],[4,5,6],[7,8,9]] | [4,5,6] | 4 |
| [[1,2,3],[4,5,6],[7,8,9]] | [4,5,6] | 5 |
| [[1,2,3],[4,5,6],[7,8,9]] | [4,5,6] | 6 |
| [[1,2,3],[4,5,6],[7,8,9]] | [7,8,9] | 7 |
| [[1,2,3],[4,5,6],[7,8,9]] | [7,8,9] | 8 |
| [[1,2,3],[4,5,6],[7,8,9]] | [7,8,9] | 9 |
+----------------------------+-----------+-----------+--+
9 rows selected (0.063 seconds)
0: jdbc:hive2://>
Note on external tables
create external table mytable(id int) row format delimited fields terminated by '|';
load data inpath '/tmp/abc.txt' into table mytable;
this will move the file out of /tmp/abc.txt (into the table's location)
now if I do
drop table mytable;
the table definition will be dropped, but the contents will not be dropped (it's an external table)
if I want to drop the contents too, I need to find where that data is stored. For this I run
show create table mytable;
and check the LOCATION in the output