This document describes the privileges needed and the instructions for adding STATS to a Hive table.
The stats for a Hive table are based on four properties:
* numRows
* numFiles
* rawDataSize
* totalSize
To set these properties manually, you can write a Hive statement such as:
ALTER TABLE <tblname> SET TBLPROPERTIES (‘numRows’ = ‘xxx’, ‘numFiles’ = ‘xxx’, ‘rawDataSize’ = ‘xxxx’, ‘totalSize’ = ‘xxxx’)
To complete this work, you’ll need to calculate these values either on the table itself or the underlying storage location. To do that, you’ll need permission to access storage. In total you’ll need:
- The location of the S3 bucket containing the table’s files
- ListBucket and ReadOnly permissions for that bucket
- ALTER privilege on the Hive table
These stats help Hive select the optimal join operation. The goal is to run efficiently without exhausting memory. These values are therefore sensitive to the magnitude of the table size, but not to incremental changes. This means you can estimate some values and conserve the expense of an exact sizing. Okera recommends you calculate these values as follows
numRows: Use SELECT count(*) on <table>
numFiles: Count the number of partitions/files via the AWS CLI, but use the table’s partition count to determine the best method. In Hive, use SHOW PARTITIONS; to get the total count. If it is not very large, use:
aws s3 ls <bucket/path>/ --recursive --summarize | wc -l
to count the files (the preferred option). If you have a very large number of partitions, however, it’s possible an aws command will time out before finishing. In this case, you can sample a few partitions with:
aws s3 ls <bucket/partition k=v/ --summarize
And multiple their average file count by the total partition count. This approach depends in part on a reasonable range of differences in file count among all partitions.
The rawDataSize and totalSize properties can be reasonable estimates. For the purpose of sizing you can apply the same value to both. Just use the total size of one partition multiplied by the total partition count. Alternatively, you can sample a few partitions and multiply by their average size.
Comments
0 comments
Please sign in to leave a comment.