Running queries on Parquet data from a Spark EMR cluster produces timeout errors:
Caused by: java.net.SocketTimeoutException: Read timed out
The data producers changed the schema of the table. The partitions for the old files with the now-incorrect schemas are still there. The SHOW CREATE TABLE and SHOW PARTITIONS commands, as well as a scan on the table from the Okera API, time out.
Okera skips files that are invalid (regardless of the reason), on the rationale that very large datasets often contain files that may be corrupt. In this case, the number of corrupt files is large, so it takes a long time to find valid data.
The underlying issue is that Parquet allows schemas to be resolved either by ordinal or by name. For example, if the schema in the file is:
A valid HMS schema when resolving by ordinal, which matches by index (not by field name), is:
Similarly, a valid HMS schema when resolving by name is:
Note that the column order is different. Both methods have advantages and disadvantages: resolving by name lets the user reorder columns, while resolving by ordinal lets the user rename columns.
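The difference between the two resolution modes can be illustrated with a minimal Python sketch. This is not Okera or Parquet code; the function names and the sample schemas (id, price, record_id, amount) are hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Tuple

Column = Tuple[str, str]  # (column name, column type)


def resolve_by_ordinal(
    table_schema: List[Column], file_schema: List[Column]
) -> List[Tuple[Column, Optional[Column]]]:
    """Pair each table column with the file column at the same index."""
    pairs = []
    for index, table_col in enumerate(table_schema):
        file_col = file_schema[index] if index < len(file_schema) else None
        pairs.append((table_col, file_col))
    return pairs


def resolve_by_name(
    table_schema: List[Column], file_schema: List[Column]
) -> List[Tuple[Column, Optional[Column]]]:
    """Pair each table column with the file column of the same name."""
    file_cols = {name: (name, col_type) for name, col_type in file_schema}
    return [(table_col, file_cols.get(table_col[0])) for table_col in table_schema]


# Hypothetical schemas, for illustration only.
file_schema = [("id", "bigint"), ("price", "double")]
renamed_table = [("record_id", "bigint"), ("amount", "double")]  # columns renamed
reordered_table = [("price", "double"), ("id", "bigint")]        # columns reordered

# By ordinal, renamed columns still line up with the file by position.
ordinal_pairs = resolve_by_ordinal(renamed_table, file_schema)

# By name, reordered columns still find their matching file column.
name_pairs = resolve_by_name(reordered_table, file_schema)

# By name, renamed columns find no match in the file at all.
unmatched_pairs = resolve_by_name(renamed_table, file_schema)
```

In this sketch, renaming is safe only under ordinal resolution and reordering is safe only under name resolution, which is the trade-off described above.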
Because Okera defaults to resolving by ordinal, the corrupt files fail. Although the columns have the same name, their types are incompatible. In this example, the type in the catalog is string, but in the data file it is double. Okera does not allow this mismatch.
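The kind of mismatch described above can be sketched as a position-by-position type comparison. This is only an illustration under stated assumptions: the function name and sample schemas are hypothetical, and strict type equality stands in for Okera's actual compatibility rules, which are not described here.

```python
from typing import List, Tuple

Column = Tuple[str, str]  # (column name, column type)


def find_type_mismatches(
    catalog_schema: List[Column], file_schema: List[Column]
) -> List[Tuple[int, str, str, str]]:
    """Compare columns by position and report those whose types differ.

    Returns (index, catalog column name, catalog type, file type) tuples.
    Assumption: types must be exactly equal to be considered compatible.
    """
    mismatches = []
    for index, (catalog_col, file_col) in enumerate(zip(catalog_schema, file_schema)):
        catalog_name, catalog_type = catalog_col
        _, file_type = file_col
        if catalog_type != file_type:
            mismatches.append((index, catalog_name, catalog_type, file_type))
    return mismatches


# Hypothetical schemas: same column names, but the catalog declares a string
# where the data file actually stores a double.
catalog_schema = [("id", "bigint"), ("value", "string")]
file_schema = [("id", "bigint"), ("value", "double")]

mismatches = find_type_mismatches(catalog_schema, file_schema)
for index, name, catalog_type, file_type in mismatches:
    print(f"column {index} ({name}): catalog={catalog_type}, file={file_type}")
```

A file that yields any mismatch in such a check corresponds to a file that would be treated as invalid and skipped during the scan.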
The remedy is to confirm whether the actual data type is string or double. If it is a double, the CREATE TABLE DDL statement is wrong, and the user should drop and recreate the table with the correct types.