What is Hive??
Lets you write standard SQL queries that are executed across an entire cluster. This means that even though it "looks" like SQL, different operations are performed underneath.
For example, using Hive on top of HDFS with MapReduce: Hive translates the SQL query into mappers and reducers, figures out how to tie them together, and executes them across the cluster.
If you already know sql, it can be quite handy.
However, it's not exactly SQL; it's HiveQL.
Why Hive??
Interactive in nature -> type SQL queries just like in a normal database.
Since it operates on top of a cluster, it is very scalable... able to work with "big data" on a cluster.
Makes OLAP (Online Analytical Processing) queries much easier than, for example, writing Java code with MapReduce.
Highly optimized and highly extensible (user defined functions, JDBC/ODBC driver)
Why not Hive??
High latency - not for OLTP... data is de-normalized and the SQL dialect is limited.
No transactions are available. No record-level updates, inserts, deletes, etc. (basically not a relational database).
HiveQL
Pretty much similar to MySQL, with some extensions.
For example, "views" let you store the results of a complex query so that subsequent queries can use it like a table.
Allows you to specify how data is stored and partitioned.
Example
CREATE VIEW topMovieIDs AS
SELECT movieID, count(movieID) AS ratingCount
FROM ratings
GROUP BY movieID
ORDER BY ratingCount DESC;

SELECT m.title, t.ratingCount
FROM topMovieIDs t JOIN movie_names m ON t.movieID = m.movieID;
After running the example above, the view has been created.
Thus if you execute it again, an error will occur since the view already exists.
You can change the code to the following:
CREATE VIEW IF NOT EXISTS topMovieIDs AS
...
Or simply delete the view thereafter with
DROP VIEW topMovieIDs;
Metastore in Hive
Hive maintains a 'metastore' that imparts a structure you define on the unstructured data stored in HDFS etc.
Importing data with HiveQL might look something like this:
CREATE TABLE ratings (
userID INT,
movieID INT,
rating INT,
time INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH '${env:HOME}/ml-100k/u.data'
OVERWRITE INTO TABLE ratings;
Thus with the above code, you create a 'schema on read'.
- LOAD DATA: moves data from the distributed filesystem into Hive.
- LOAD DATA LOCAL: copies data from your local filesystem into Hive.
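The difference can be sketched side by side (the bare HDFS path here is hypothetical):

```sql
-- LOAD DATA: moves the file from its current HDFS location into
-- Hive's warehouse directory (the original HDFS copy is gone afterwards)
LOAD DATA INPATH '/data/ml-100k/u.data'
OVERWRITE INTO TABLE ratings;

-- LOAD DATA LOCAL: copies the file from the local filesystem,
-- leaving the local original in place
LOAD DATA LOCAL INPATH '${env:HOME}/ml-100k/u.data'
OVERWRITE INTO TABLE ratings;
```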
Managed vs. External tables
Managed tables are like the LOAD DATA examples above, where Hive actually takes ownership of the data along with the table.
With external tables, Hive doesn't take ownership of the data. Thus with the DROP command, Hive doesn't delete the data; it only deletes the metadata associated with it.
CREATE EXTERNAL TABLE IF NOT EXISTS ratings (
userID INT,
movieID INT,
rating INT,
time INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/ml-100k';
(Note that LOCATION points at a directory, not a single file.)
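To see the difference in practice, dropping the external table leaves the files untouched (a sketch):

```sql
DROP TABLE ratings;
-- Only the metastore entry is removed; the files under /data/ml-100k
-- stay on HDFS and can be re-registered with another CREATE EXTERNAL TABLE.
```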
Partitioning
In Hive, you can also store data in partitioned subdirectories, which can lead to big optimizations when queries make appropriate use of the partition columns.
CREATE TABLE customers (
name STRING,
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
PARTITIONED BY (country STRING);
and the directories on disk would look like:
- .../customers/country=CA/
- .../customers/country=GB/
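A query that filters on the partition column only has to scan the matching subdirectory (partition pruning):

```sql
-- Reads only .../customers/country=CA/, not the whole table
SELECT name, address.city
FROM customers
WHERE country = 'CA';
```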
Different ways to use Hive
- Interactive via hive> prompt (CLI)
- saved query files
- hive -f /somepath/to/query/queries.hql
- Through Ambari/Hue
- Through JDBC/ODBC server
- Through Thrift service (not suitable for OLTP)
- Via Oozie