Databricks Delta offers some compelling feature enhancements for building data pipelines on Apache Spark. Perhaps the most compelling is support for ACID transactions, which Databricks describes as "serializable isolation levels [to] ensure that readers never see inconsistent data." The Delta engine is available with Databricks Runtime 4.1 or later.
Under the covers, Databricks supports this enhancement with a storage format called Delta. The user explicitly creates tables that name this format. The usual artifacts for journal-based transaction logging are written on top of a base table, which is maintained in Parquet format. The Delta format supports partitioning and other conventions for subdividing data in storage.
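For illustration, creating Delta-managed storage from a Databricks notebook might look like the sketch below. It assumes a cluster on Runtime 4.1 or later where a SparkSession is already bound to the name spark, as Databricks notebooks provide; the schema, table name, and S3 path are hypothetical.

    # Sketch: writing a DataFrame in Delta format and naming a table over it.
    # Assumes a Databricks notebook (Runtime 4.1+) where `spark` is provided;
    # the schema, table name, and S3 path below are illustrative only.
    from pyspark.sql import Row

    events = spark.createDataFrame([
        Row(event_id=1, event_type="click"),
        Row(event_id=2, event_type="view"),
    ])

    # format("delta") produces the Parquet base files plus the transaction
    # journal (_delta_log) described above; .partitionBy(...) could be added
    # to use the usual partitioning conventions.
    events.write.format("delta").save("s3://example-bucket/delta/events")

    # Naming the table explicitly lets SQL clients address it by name.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS events
        USING DELTA
        LOCATION 's3://example-bucket/delta/events'
    """)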
The journaling (or transaction logging) components are kept in a proprietary format that is played or replayed, as necessary, to update the base table. Any data read naively by a third-party tool could therefore be stale, only partially written, or not yet present. It is only safe, then, for a non-Databricks client to read Delta-controlled data when the table is known to be static and quiesced. That is:
a) There are no jobs actively writing to the table.
b) All existing Delta storage has been compacted into the base table (Parquet format).
c) There are no stale snapshots in use by Databricks Delta.
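As one illustration of what quiescing might involve, the sketch below runs the Databricks Delta maintenance commands OPTIMIZE and VACUUM and then watches the transaction log for new commits. The table path, retention window, and the naive log check are assumptions for illustration, not a prescribed procedure.

    # Sketch: quiescing a Delta table before a non-Databricks client reads it.
    # Assumes a Databricks notebook where `spark` and `dbutils` are provided;
    # the path, retention window, and wait interval are illustrative only.
    import time

    delta_path = "s3://example-bucket/delta/events"

    # (b) Compact small files into larger Parquet files in the base table.
    spark.sql("OPTIMIZE delta.`{}`".format(delta_path))

    # (c) Remove files that belong only to stale snapshots. The default
    # retention window is 7 days (168 hours).
    spark.sql("VACUUM delta.`{}` RETAIN 168 HOURS".format(delta_path))

    # (a) As a naive check that no job is still writing, confirm the
    # transaction log is not gaining new commit files.
    before = len(dbutils.fs.ls(delta_path + "/_delta_log"))
    time.sleep(60)
    after = len(dbutils.fs.ls(delta_path + "/_delta_log"))
    if before != after:
        print("Table is still receiving commits; not safe to read externally.")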
The ODAS client libraries are coded to detect and bypass Delta storage artifacts. By default, the ODAS Planner will simply remove itself from the read path, deferring any scan work on Delta-formatted storage back to the compute client.
ODAS can, however, read the Parquet base files under a Delta location if that location has been registered through an EXTERNAL TABLE definition within ODAS.
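A sketch of such a registration is shown below. It uses the pyokera client and a Hive-style EXTERNAL TABLE statement; the client interface shown, the planner endpoint, database, schema, and S3 path are all assumptions for illustration, and the DDL actually accepted by ODAS should be taken from the Okera documentation.

    # Sketch: registering the Parquet base files under a Delta location as an
    # external table in ODAS. The pyokera calls, planner host/port, database,
    # schema, and S3 path are assumptions, not a verified recipe.
    from okera import context

    ctx = context()
    with ctx.connect(host="odas-planner.example.com", port=12050) as conn:
        conn.execute_ddl("CREATE DATABASE IF NOT EXISTS demo")
        conn.execute_ddl("""
            CREATE EXTERNAL TABLE IF NOT EXISTS demo.events_parquet (
                event_id BIGINT,
                event_type STRING
            )
            STORED AS PARQUET
            LOCATION 's3://example-bucket/delta/events'
        """)

Because the ODAS client libraries skip Delta's journaling artifacts, a scan of such a table reads only the Parquet base files, which is why it is only safe once the location is static and quiesced as described above.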
Details of native support for Delta files in Okera can be found here.