BUCKETING in HIVE

Let us first understand what bucketing in Hive is and why we need it.

Partitioning in Hive splits the data into multiple directories so that we can filter it effectively. But partitioning works well only when there is a limited number of partitions and they are of comparatively equal size, which may not be possible in all scenarios.

To overcome the problem of over-partitioning, Hive provides another concept called Bucketing, a technique of decomposing the data into more manageable, roughly equal parts.

A Hive partition divides the table into a number of partitions, and these partitions can be further subdivided into more manageable parts known as Buckets or Clusters.

The Bucketing concept is based on the following:

  • A hash function, which depends on the type of the bucketing column; the bucket is determined by hash(bucketing column) mod (number of buckets)
  • Records with the same value in the bucketing column will always be saved in the same bucket
  • The CLUSTERED BY clause is used to divide the table into BUCKETS
  • Physically, each bucket is just a file in the table directory, and bucket numbering (as used by TABLESAMPLE) is 1-based
  • BUCKETING can be done with or without partitioning
  • BUCKETING will create almost equally distributed data files

When we write data into a bucketed table in Hive, it places the data in distinct buckets as files. Hive applies a hash function to the bucketing column and takes the result modulo the number of buckets N; based on that result, each record is placed into one of the N bucket files.
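As a rough illustration, you can preview which bucket a given value would land in using Hive's built-in hash() and pmod() functions (a sketch; the part number value is made up, and 8 is the bucket count used in the DDL below):

-- Preview which bucket a value would land in.
-- 'PN-1001' is a made-up part number; 8 is the bucket count.
-- pmod() gives the 0-based file index (files 000000_0 .. 000007_0);
-- note that TABLESAMPLE counts buckets starting from 1.
SELECT pmod(hash('PN-1001'), 8) AS bucket_index;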

For example, consider the FO_RECORD table below, which stores purchase-order records with columns such as PO_NUMBER, PARTNUMBER and SITE_CODE:

CREATE TABLE IF NOT EXISTS GKDB.FO_RECORD
(
PO_NUMBER STRING,
PO_YEAR STRING,
PO_ITEM STRING,
FO_OUT_QTY DECIMAL,
FO_IN_QTY DECIMAL,
FMS_KEY STRING,
FO_SHIP_ID STRING,
FOI_NUMBER STRING,
PARTNUMBER STRING,
OPER_FROM STRING,
OPER_TO_REPNO STRING,
ADN_NO STRING,
CONTRACT_CODE STRING,
PACK_ID DECIMAL,
FO_OUT_DATE TIMESTAMP,
REQUEST_CODE STRING
 ) 
 PARTITIONED BY (SITE_CODE STRING) 
 CLUSTERED BY (PARTNUMBER) SORTED BY (PARTNUMBER) INTO 8 BUCKETS
 STORED AS ORC TBLPROPERTIES ("orc.compress"="ZLIB");

In the above table, the SITE_CODE column is used as the top-level partition. Using PARTNUMBER as a second-level partition would lead to too many small partitions, which is an overhead and bad for performance.

So here the FO_RECORD table is partitioned by SITE_CODE and bucketed by PARTNUMBER. The values of this column are hashed into a user-defined number of buckets (8 in the DDL above). Records with the same PARTNUMBER will always be stored in the same bucket.

Set the below properties in order to enforce bucketing and allow dynamic partitions (on Hive 2.0 and later, bucketing is always enforced and hive.enforce.bucketing is no longer needed):

set hive.enforce.bucketing = true;
set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;
set hive.exec.max.dynamic.partitions.pernode = 20000;
set hive.exec.max.dynamic.partitions = 20000;
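With those properties set, loading the bucketed, partitioned table is a plain INSERT ... SELECT, and Hive handles the distribution into partitions and buckets. A minimal sketch, assuming a staging table GKDB.FO_RECORD_STG with the same columns (the staging table name is hypothetical):

-- The dynamic partition column (SITE_CODE) must come last in the SELECT.
INSERT OVERWRITE TABLE GKDB.FO_RECORD PARTITION (SITE_CODE)
SELECT
  PO_NUMBER, PO_YEAR, PO_ITEM, FO_OUT_QTY, FO_IN_QTY, FMS_KEY,
  FO_SHIP_ID, FOI_NUMBER, PARTNUMBER, OPER_FROM, OPER_TO_REPNO,
  ADN_NO, CONTRACT_CODE, PACK_ID, FO_OUT_DATE, REQUEST_CODE,
  SITE_CODE
FROM GKDB.FO_RECORD_STG;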

Advantages of Bucketing:

  1. The number of buckets is fixed, so it does not fluctuate with the data
  2. Enables more efficient queries, especially joins on the bucketing column
  3. hash(column) mod (number of buckets) distributes the data evenly
  4. Makes sampling more efficient

When we want to test a table that has a huge amount of data, or to draw out some patterns, or to compute some aggregations where accuracy is not our top priority, we need to sample, i.e., run the query on a smaller, evenly distributed subset of the entire table's data. There are two types of sampling:

1. Bucket Sampling:

   SELECT * FROM FO_RECORD TABLESAMPLE (BUCKET 1 OUT OF 4 ON PARTNUMBER);

   This selects data from a fixed subset of the buckets in each partition: with 8 buckets and OUT OF 4, Hive reads buckets 1 and 5 (8/4 = 2 buckets per group). A normal SELECT cannot pull data in such a distributed and even manner. Note that the ON clause names the bucketing column, which lets Hive prune down to the matching bucket files.

2. Block Sampling:

   SELECT * FROM FO_RECORD TABLESAMPLE (20 PERCENT);
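   Block sampling also accepts an absolute input size or a row count when a percentage is not convenient; both are standard TABLESAMPLE forms:

   -- Sample roughly 100 MB of input data
   SELECT * FROM FO_RECORD TABLESAMPLE (100M);

   -- Take 10 rows from each input split
   SELECT * FROM FO_RECORD TABLESAMPLE (10 ROWS);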

Limitations of Bucketing:

  • You have to handle loading the data into the buckets yourself
  • Bucketing doesn't always ensure that the data is populated properly

How to decide how many buckets

To decide how many buckets to use, consider the amount of data in one bucket: (total size of data / number of buckets) should be smaller than the size of your memory.

This consideration should be applied strictly to the bigger table, as it is the one most affected; the matching configuration is later applied to the smaller table as a must.

As a rule of thumb, each bucket should contain between one and three HDFS blocks, probably somewhere near two. So if your block size is 256MB, it is reasonable to have ~512MB of data in each bucket of the bigger table, and choosing the count becomes a simple division.
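For a quick illustration with a made-up table size: a 1TB table at ~512MB per bucket works out to 1TB / 512MB = 2048 buckets.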

When you join with another table that is bucketed on the same column, and its number of buckets is the same as, or a multiple of, this table's bucket count, the join can happen in memory as a map-side (bucket map) join, which is much faster.
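A minimal sketch of enabling such a join, assuming a second table GKDB.FO_RECORD_DIM that is also bucketed and sorted on PARTNUMBER into 8 buckets (the dimension table and its PART_DESC column are hypothetical; the set properties are standard Hive settings):

set hive.optimize.bucketmapjoin = true;
-- the next two enable the sort-merge-bucket (SMB) variant when both tables are sorted
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.auto.convert.sortmerge.join = true;

SELECT a.PARTNUMBER, a.FO_OUT_QTY, b.PART_DESC
FROM GKDB.FO_RECORD a
JOIN GKDB.FO_RECORD_DIM b
  ON a.PARTNUMBER = b.PARTNUMBER;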

Happy Hiving!!
