DevOps Zone is brought to you in partnership with:

Johannes has posted 2 posts at DZone. View Full User Profile

Sirix - Beyond Versioning of Persistent Trees

04.28.2014
| 773 views |
  • submit to reddit

Introduction

Sirix is a versioning storage system capable of storing and querying snapshots of temporal trees effectively and efficiently. Transactional semantics are preserved.

Versioning Architecture

Sirix supports a wide area of applications through a number of well known versioning strategies as for instance full-, differential-, incremental- and a novel sliding window-versioning algorithm. Instead of doing in-place updates and thus update dirty pages with a new version, the architecture adheres to the copy-on-write (COW)-semantics which is perfectly suited for flash disks (SSDs). Writes are sequentially batched during commits, whereas reads may be random due to the copy-on-write principle. Versioning is achieved through reading and writing of a predefined set of pages in order to restore a specific page in-memory. A page stores a predefined set of records (currently per default 512) and does not have to adhere to any block boundaries on disk. Dependent on the concrete versioning approach more or less records are written to / read from persistent storage. Versioning is takes place on a per data-page level. In the event of the full-versioning approach all records of a dirty page are written to disk once a write-transaction commits. This is in accordance to other COW storage- and database-systems. The incremental versioning instead writes only the currently modified records to disk, thus to restore a page in-memory a predefined number of pages must be read. In order to fasttrack a page in memory intermittent full snapshots of a page are written to disk after a threshold value of revisions have been commited. Data pages are linked, thus a pointer to the previous version of the page is stored (despite obiously for a first version of a page). The differential versioning approach in accordance to well known backup systems writes changes since the last full-snapshot of a page to disk during a commit. Thus, for reconstruction of the page in memory, the last full-snapshot as well as the current page of the revision to reconstruct has to be read. A novel sliding-snapshot approach avoids write-peaks which usually occur if full intermittant snapshots of a page have to be written to disk and trails reading and writing of in-memory pages over a predefined set of pages on disk. Thus, for instance only changes which occur during the revision to commit (all changes done within the current transaction) plus changes which are older than a predefined threshold of on-disk pages to read have to be persisted. In order to reconstruct a page in-memory a set of pages (a sliding window) has to be read from disk. The pages of the sliding window are read from newest to oldest, thus, once the page is full we are able to skip reading of pages even if the sliding window is larger. In any case we store the result of the modification event, the record rather than the event itself.

Tree Encoding

The tree-structure is formed by a nodeKey/firstChildKey/leftSiblingKey/rightSiblingKey/parentKey encoding. Thus, no secondary index-structure as for instance a B+-tree is needed and fast pointer dereferencing takes places. However this encoding doesn't allow quick determination of document order which is essential for XQuery and we added a hierarchical labeling schema (compressed DeweyIDs). Despite, optionally the number of descendants of each node, a hash and the number of children is stored. Furthermore the movement of nodes as well as whole subtrees (which is essential for implementing copy & paste in an editor for instance) in contrast to many other XML database systems is supported. All of these "attributes" can be queried by XQuery-functions to support a wide are of data mining and visual analytics tasks.

Transactional semantics

Currently one write-transaction and N-read transactions per resource are supported. A resource stores application specific data, currently the above mentioned XDM-nodes. A database / collection contains several resources. We also aim to provide sessions/transactions which encapsulate these lower level transactions and provide higher level transactions to span multiple resources. Each commit generates a new snapshot / revision of the resource and a timestamp with the current milleseconds from 1970 is stored within a RootPage. The first revision is revision 1 and every commit adds 1 to the revision count. A resource in Sirix is a tree of pages and each revision/snapshot can be reconstructed. A page is reconstructed in logarithmic time (tree-height whereas a tree of indirect pages as in ZFS is most likely in-memory plus a very small constant to reconstruct the data-pages through dereferencing of previous versions depending on the versioning algorithm). Furthermore we are able to restore and work with a dated revision and commit the changes afterwards, resulting in a new snapshot / revision containing the old revision and subsequent modifications of our currently commited transaction. The UberPage is the main entry point to a resource and is always written as the last page during a commit. Thus, even if the system crashes during a commit the last UberPage is read and consistency of the data is guaranteed. As Sirix is heavily inspired by copy-on-write file systems and particularly by ZFS we aim to store a merkle tree, that is hashes in parent-pointers to child-pages, which are going to be checked once a page is read.

Temporal XPath/XQuery extensions

Despite, much work has been put into higher level APIs. XPath has been extended to include temporal-axis as for instance first::, last::, past::, past-or-self::, future::, future-or-self::, previous::, previous-or-self, next::, next-or-self::, all-time::, revision:: in order to navigate not only in space but also in time. A path summary of all paths in a resource during a specific snapshot is always kept up-to-date in order to support index-structures. The path summary construction adheres to the versioning and transactional snapshot semantics and is done during a write-transaction commit.

Many XQuery functions are available to build and query advanced versioned, user defined, typed index-structures as for instance path-indexes, element-indexes and so called CAS (content-and-structure) indexes . Currently, however the XQuery processor Brackit does not rewrite queries to take indexes into account. However, we rewrite descendant-axis to child-axis if appropriate in a lot of cases (taking the path-summary into account). All index-structures are always kept up-to-date and depend on transactional semantics (that is the write-transaction must be commited or rolled back).

In addition to temporal XQuery functions to open/serialize a specific revision we provide a diff-algorithm which allows the versioned import of several versions of XML-documents (without the need for unqiue node-IDs). This, however is a low level Java-API and might be exposed within an XQuery function in the future. The algorithm tries to detect a minimal edit-script to modify a first version of a tree into a second version (tree-to-tree correction problem). Once stored in Sirix we are able to use a fast ID-based diff-algorithm (optionally even taking hashes, which are optionally generated for each node during insertion/update into account) to detect the differences between any two revisions of a resource. Once again to expose our Java-API for users of XQuery a diff-function is planned as well as a detailed specification of a delta-format.

Future Work

We are by no means restricted to support XML, instead as Brackit also seamlessly integrates JSON into the XQuery Data Model we may also store JSON-documents and just need to add another transaction layer on top besides XML and have to add Array- and Object-Records. We are not even restricted to the storage of tree-structures and may store any other binary data or graphs as well.

Currently a RESTful API is refactored and we aim to distribute Sirix (for instance replicate the transaction-log).

Published at DZone with permission of its author, Johannes Lichtenberger.

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)