The ODAS planner determines how to convert a request into some number of tasks, where each task is assigned one or more splits of the data that must be scanned. The data itself resides in files of no predetermined size. For any given job, it may be necessary to combine several small files into one task, or split a large file into several tasks, or both. Splitting a file is possible only if the file format allows splitting at the record or storage block level.
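The combine-and-split behavior described above can be sketched as follows. This is a minimal illustration, not the ODAS implementation: the `Split` type, the `plan_tasks` function, the `(path, size, splittable)` input tuples, and the `target_size` parameter are all hypothetical names chosen for the example, and the greedy packing order is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Split:
    """A contiguous byte range of one file (hypothetical type)."""
    path: str
    offset: int
    length: int


def plan_tasks(files, target_size):
    """Group files into tasks of roughly target_size bytes.

    files: iterable of (path, size_in_bytes, splittable) tuples.
    Returns a list of tasks; each task is a list of Split objects.
    """
    tasks = []
    current, current_size = [], 0
    for path, size, splittable in files:
        if splittable and size > target_size:
            # Large splittable file: cut it into target-sized chunks,
            # one chunk per task.
            offset = 0
            while offset < size:
                length = min(target_size, size - offset)
                tasks.append([Split(path, offset, length)])
                offset += length
            continue
        # Small or unsplittable file: pack it whole into the current task,
        # starting a new task when the current one would overflow.
        if current and current_size + size > target_size:
            tasks.append(current)
            current, current_size = [], 0
        current.append(Split(path, 0, size))
        current_size += size
    if current:
        tasks.append(current)
    return tasks
```

In this sketch an unsplittable file larger than `target_size` still ends up as a single oversized task, since it cannot be divided.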
For good performance, it is most important that tasks are equally sized. Fitting that process to a wide range of inputs (thousands of large files, perhaps, or millions of small files) while producing a task list in good time (a few seconds for large datasets) requires a general approach.
ODAS models the total cost of each task as its startup overhead (assumed to be fixed) plus the cost of processing one or more splits. The ideal task size is limited by whether the input can be split: if it cannot, the task size must be the file size. The ideal size is also bounded by the maximum number of tasks allowed for a partition. This second limit prevents combining data across partitions into one task, which would compromise the assumption of a fixed cost for task overhead.
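As a rough illustration of this cost model and its two constraints, consider the sketch below. The function names, the linear `cost_per_byte` term, and the way the per-partition limit is turned into a minimum task size are assumptions made for the example; the actual ODAS formulas are not given here.

```python
import math


def task_cost(split_sizes, startup_overhead, cost_per_byte):
    """Fixed startup overhead plus a (assumed linear) cost per byte scanned."""
    return startup_overhead + cost_per_byte * sum(split_sizes)


def effective_task_size(ideal_size, file_size, splittable,
                        partition_size, max_tasks_per_partition):
    """Adjust an ideal task size for the two limits described above."""
    if not splittable:
        # An unsplittable file must be processed whole by one task.
        return file_size
    # With at most max_tasks_per_partition tasks, each task must cover
    # at least this many bytes of the partition.
    floor = math.ceil(partition_size / max_tasks_per_partition)
    return max(ideal_size, floor)
```

For example, with a 1000-byte partition and at most 4 tasks, an ideal size of 64 bytes would be raised to 250 bytes under these assumptions.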
If the number of computed tasks exceeds maxTasks, tasks are recombined until the total is reduced to that value.
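One possible way to recombine tasks is to merge the two smallest repeatedly until the count fits. The text above does not say which tasks ODAS merges, so the smallest-first strategy and the `recombine` function here are assumptions for illustration only. For brevity the sketch tracks task sizes alone and ignores partition boundaries.

```python
import heapq


def recombine(task_sizes, max_tasks):
    """Merge the two smallest tasks until at most max_tasks remain.

    task_sizes: iterable of task sizes in bytes.
    Returns the resulting task sizes in ascending order.
    """
    if max_tasks < 1:
        raise ValueError("max_tasks must be at least 1")
    heap = list(task_sizes)
    heapq.heapify(heap)
    while len(heap) > max_tasks:
        smallest = heapq.heappop(heap)
        next_smallest = heapq.heappop(heap)
        # The merged task replaces the two it was built from.
        heapq.heappush(heap, smallest + next_smallest)
    return sorted(heap)
```

Merging the smallest tasks first tends to keep the remaining tasks closer in size than merging arbitrary neighbors would, which matches the goal of equally sized tasks.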