Codied To Clipboard !
Home > Notes > HIVE
1.) What are Hive Partitions? Hive table partition is a way to split a large table into smaller logical tables based on one or more partition keys. These smaller logical tables are not visible to users and users still access the data from just one table. Partition eliminates creating smaller tables, accessing, and managing them separately. When you load the data into the partition table, Hive internally splits the records based on the partition key and stores each partition data into a sub-directory of tables directory on HDFS. The name of the directory would be partition key and it’s value. Create Hive Partition Table To create a Hive table with partitions, you need to use PARTITIONED BY clause along with the column you wanted to partition and its type. Let’s create a table and Load the CSV file. CREATE TABLE zipcodes( RecordNumber int, Country string, City string, Zipcode int) PARTITIONED BY(state string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; LOAD DATA INPATH '/user/itv199411/zipcodes20.csv' INTO TABLE zipcodes; Show All Partitions on Hive Table After loading the data into the Hive partition table, you can use SHOW PARTITIONS command to see all partitions that are present. SHOW PARTITIONS zipcodes; jdbc:hive2://127.0.0.1:10000> SHOW PARTITIONS zipcodes; +------------+ | partition | +------------+ | state=AL | | state=AZ | | state=FL | | state=NC | | state=PR | | state=TX | +------------+ 6 rows selected (0.074 seconds) Add New Partition to the Hive Table A new partition can be added to the table using the ALERT TABLE statement, you can also specify the location where you wanted to store partition data on HDFS. jdbc:hive2://127.0.0.1:10000>ALTER TABLE zipcodes ADD PARTITION (state='CA') LOCATION '/user/data/zipcodes_ca'; SHOW PARTITIONS return the added partition. jdbc:hive2://127.0.0.1:10000> SHOW PARTITIONS zipcodes; +------------+ | partition | +------------+ | state=AL | | state=AZ | | state=CA | | state=FL | | state=NC | | state=PR | | state=TX | +------------+ 7 rows selected (0.081 seconds) Rename or Update Hive Partition Using ALTER TABLE, you can also rename or update the specific partition. ALTER TABLE zipcodes PARTITION (state='AL') RENAME TO PARTITION (state='NY'); Drop Hive Partition Dropping a partition can also be performed using ALTER TABLE tablename DROP ALTER TABLE zipcodes DROP IF EXISTS PARTITION (state='AL'); 2.) Hive Bucketing a.k.a (Clustering) is a technique to split the data into more manageable files, (By specifying the number of buckets to create). The value of the bucketing column will be hashed by a user-defined number into buckets. Bucketing can be created on just one column, you can also create bucketing on a partitioned table to further split the data which further improves the query performance of the partitioned table. Each bucket is stored as a file within the table’s directory or the partitions directories. Note that partition creates a directory and you can have a partition on one or more columns; From our example, we already have a partition on state which leads to around 50 subdirectories on a table directory, and creating a bucketing 10 on zipcode column creates 10 files for each partitioned subdirectory. Hive Bucketing Example Hive Bucketing Example In the below example, we are creating a bucketing on zipcode column on top of partitioned by state. CREATE TABLE zipcodes( RecordNumber int, Country string, City string, Zipcode int) PARTITIONED BY(state string) CLUSTERED BY Zipcode INTO 10 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';