Question
Is there a limit on # of partitions for an object based on current catalog designs/implementations?
Answer
While there is no enforced limit on the number of partitions you can create in a dataset, it is typically recommended to keep the number of partitions to a minimum. If the number of partitions needed for a specific dataset is expected to be high due to data loading patterns or other factors, it is recommended that each partition be of significant size (on the order of gigabytes), because of the overhead of opening many tiny files for a given query.
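For example, here is a minimal PySpark sketch (the paths and the event_date column are hypothetical, and the exact approach will depend on your engine and data volume) showing one way to consolidate many small files into fewer, larger files per partition before writing:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-partitions").getOrCreate()

# Hypothetical source data that already carries a day-level partition column.
df = spark.read.parquet("/data/events_raw")

# Group rows by their partition value so each partition is written as a small
# number of larger files instead of many tiny ones.
compacted = df.repartition("event_date")

(compacted.write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("/data/events_compacted"))
```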
As a best practice, we suggest that the finest partition granularity considered for a dataset partitioned by date be "days", which keeps the number of partitions at a reasonable level. Partitioning by hour or finer results in a large number of partitions and will quickly lead to poor performance over time.
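As a sketch of that practice, the following PySpark example (the event_ts column and paths are hypothetical) derives a day-level partition column from a timestamp instead of partitioning by the raw hourly value:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("day-partitioning").getOrCreate()

df = spark.read.parquet("/data/events_raw")

# Truncate the event timestamp to a calendar date so each day produces one
# partition value instead of 24 (or more) hourly partitions.
df = df.withColumn("event_date", F.to_date(F.col("event_ts")))

(df.write
    .mode("append")
    .partitionBy("event_date")
    .parquet("/data/events_by_day"))
```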
In our experience, we tend to see performance degradation on datasets with over 20,000 total partitions. As with most performance characteristics in a cluster environment, this threshold may vary based on your environment and node sizing.
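If you want to keep an eye on that threshold, a quick check like the following (the table name events_by_day is hypothetical, and assumes a partitioned catalog table) reports the current partition count:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-count").getOrCreate()

# Count the partitions registered for the table in the catalog.
partition_count = spark.sql("SHOW PARTITIONS events_by_day").count()
print(f"events_by_day has {partition_count} partitions")
```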