It appears that Okera changes the value of a timestamp field based on where the compute is located. When a user retrieves data through Okera, the value is offset by the local timezone compared to reading the same file directly from S3. How are timestamps determined?
Timezone conversion occurs in the various compute engines/clients. To avoid this behavior, use the same timezone for clients, EMR clusters, and ODAS clusters. A fairly accurate description of what we believe is happening can be found at https://boristyukin.com/watch-out-for-timezones-with-sqoop-hive-impala-and-spark-2/.
- The article explains that Hive always assumes timestamps in Parquet files are stored in UTC and converts them to the local system time (the cluster host's timezone) when it outputs results.
- Specifically, this means that:
  - Data is assumed to be in UTC and is converted to the local timezone on the way out.
  - If a user runs the same Hive query in different timezones, they will see different values.
  - Hive's behavior is specific to certain file formats (Parquet, but not all others). For other file formats, the behavior matches what Okera does today: the same value is returned regardless of the client timezone.
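The conversion described above can be illustrated with a minimal Python sketch. The timestamp value and timezone names below are hypothetical; the point is that one stored UTC instant renders as a different wall-clock value in each host timezone, which is why identical queries on clusters in different timezones show different results.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# A timestamp as stored in a Parquet file; Hive assumes it is UTC.
stored = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)

# On output, the engine converts to the cluster host's local timezone,
# so the same stored value yields different wall-clock readings.
for tz in ("UTC", "America/New_York", "Asia/Tokyo"):
    local = stored.astimezone(ZoneInfo(tz))
    print(tz, local.strftime("%Y-%m-%d %H:%M:%S"))
# UTC              -> 2024-01-01 12:00:00
# America/New_York -> 2024-01-01 07:00:00
# Asia/Tokyo       -> 2024-01-01 21:00:00
```

If every cluster involved runs with the same timezone (ideally UTC), all three readings collapse to the same value, which is the behavior the recommendation below is aiming for.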
Okera recommends using UTC as the timezone for both EMR and ODAS clusters if conversion is not desired.
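As a sketch of how that recommendation might be applied, the fragment below pins the system timezone via the standard `TZ` environment variable and, for Spark-based jobs, the real `spark.sql.session.timeZone` setting. Whether these take effect for a given engine depends on how the cluster processes are launched, so treat this as an illustration rather than a complete procedure.

```shell
# Pin the system/JVM timezone to UTC for processes started in this shell.
export TZ=UTC

# For Spark jobs, the session timezone can also be pinned explicitly:
# spark-submit --conf spark.sql.session.timeZone=UTC ...
```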