- How do tasks correlate to sessions in the context of Okera?
- Can one client running one query, have multiple connections?
Since queries run distributed, a client running a single query can have multiple connections. Okera imposes a limit of 255 concurrent sessions. However, when a query runs, not all sessions will be concurrent.
- For example:
- The ODAS cluster has 3 workers
- The EMR cluster has 15 nodes
- The client is Hive
- The user has a table with 10 partitions each containing ~10 MB
- The query selects from 7 partitions
- The planner creates 7 tasks.
- Note the number of tasks is dependent on the size of the data. Large partitions would generate more tasks.
- 7 of the EMR nodes would each open a session to one of the 3 ODAS workers.
The following are then true:
- The planner determines the number of tasks based on criteria such as the amount of data, the complexity of the compute, ....The planner log reports the number of tasks generated as num_results_returned..
- A session is an active connection from the client to an ODAS worker. The relationship is exactly 1:1. Each time a client connects, it picks an worker at random, so the number of sessions (active connections) per worker is expected to be roughly even.
- Non-distributed clients like the REST API will generate 1 session. Distributed clients like spark will generate sessions based on its size (~number of client cores), which is why Okera recommends the 10:1 client:ODAS core ratio.
- From a sizing point of view, the critical metric is the max sessions per server, which is the same as the max connections at any time across clients.