Question
- Will using `recover partitions` once per day for ~100 datasets (and expanding) work better or worse than `add partition`?
- What does `recover partitions` actually do behind the scenes?
Recommendation
- Okera has optimized the Recover Partitions operation, so it should be fine for the data size mentioned above. The Add Partition operation will always be faster than the Recover Partitions operation, because Recover Partitions has to scan the existing dataset to discover new partitions. Functionally, though, it is safer to rely on Recover Partitions: if a bug in your script inadvertently skips a partition, Recover Partitions will still pick it up.
- Behind the scenes, Recover Partitions compares the physical paths with the partition metadata in the Hive Metastore (HMS) and adds only the missing partitions. You can optimize this by calling `add partition` directly if you already know which physical paths are new without having to inspect them. If you are inspecting the physical paths to determine what data is new, the suggestion is to use `recover partitions` to keep things simple. If the end-to-end wall clock time of the 100 `recover partitions` calls is too long, parallelize those calls as much as possible.
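
The two ideas above (adding only what is missing, and parallelizing the per-dataset calls) can be sketched roughly as follows. This is a minimal illustration, not Okera's implementation: `missing_partitions` only models the comparison conceptually, `execute` is a hypothetical callable standing in for whatever client you use to submit DDL, the table names are made up, and the `ALTER TABLE ... RECOVER PARTITIONS` statement form is assumed and should be checked against the documentation for your version.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List, Optional, Set


def missing_partitions(on_storage: Iterable[str], registered: Set[str]) -> List[str]:
    """Conceptual model only: partitions found on storage but not yet registered."""
    return sorted(set(on_storage) - registered)


def recover_statement(table: str) -> str:
    # Assumed statement form; verify the exact syntax for your deployment.
    return f"ALTER TABLE {table} RECOVER PARTITIONS"


def recover_all(
    tables: Iterable[str],
    execute: Callable[[str], object],  # hypothetical: submits one DDL statement
    max_workers: int = 8,
) -> Dict[str, Optional[BaseException]]:
    """Run one recover call per table in parallel.

    Returns a mapping of table name to the exception raised for that table,
    or None if the call succeeded, so one failure does not hide the others.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            table: pool.submit(execute, recover_statement(table))
            for table in tables
        }
    # Leaving the "with" block waits for every submitted call to finish.
    return {table: future.exception() for table, future in futures.items()}
```

For example, `recover_all(["sales.orders", "sales.returns"], execute=my_client_call)` would issue both statements concurrently, where `my_client_call` is whatever function you already use to run a single statement. Threads are a reasonable fit here because each call mostly waits on the server. Tune `max_workers` to what your cluster tolerates.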