The Realm Files - Vol 2 - Physical Structure Overview

In this second installment of the Realm Files, we move into the physical structure of a Realm database and discuss how it is conceptually laid out on disk.

At the physical level, a Realm database is organized as a hierarchy of arrays arranged in a B-tree-like structure. At the top of this hierarchy is the Group, the top-level node that serves as the root of the database. The Group contains references to Tables, each representing a class within the Realm schema. Each Table maintains a Cluster Tree, Realm’s implementation of a B+ tree, which organizes object data into Clusters at the leaf level. These Clusters store the actual object data for the database records, making them the end points of the structure and the primary source of evidentiary content.

The challenge, forensically speaking, is linking the Clusters back to their corresponding Tables (Classes) and Columns (Properties) within those tables. To accomplish this, we must traverse the hierarchy beginning at the top-level node, then follow each reference through the series of nested arrays of Tables and Cluster Trees until we reach the Clusters that store the individual object records.
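
The traversal described above can be sketched conceptually. This is a toy model rather than a parser: the Node class, the "kind" labels, the table names, and every ref in TOY_NODES are hypothetical stand-ins invented for illustration, not values read from a real Realm file. The point is only to show how the owning Table can be carried along while descending, so that each Cluster is reported together with the Table it belongs to.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple


@dataclass
class Node:
    """Hypothetical in-memory stand-in for one array in the file."""
    kind: str                  # "group", "table", "cluster_tree", or "cluster"
    name: str = ""             # only meaningful for tables in this toy model
    children: List[int] = field(default_factory=list)  # refs of child arrays


def walk_clusters(nodes: Dict[int, Node], top_ref: int) -> Iterator[Tuple[str, int, int]]:
    """Yield (table_name, cluster_ref, depth) for each cluster reachable from top_ref."""
    stack = [(top_ref, "", 0)]
    while stack:
        ref, table, depth = stack.pop()
        node = nodes[ref]
        if node.kind == "table":
            table = node.name  # remember which table we are inside
        if node.kind == "cluster":
            yield table, ref, depth
            continue
        # Push children in reverse so they are visited in stored order.
        for child in reversed(node.children):
            stack.append((child, table, depth + 1))


# Invented layout: one Group, two Tables, each with its own Cluster Tree.
TOY_NODES = {
    8: Node("group", children=[100, 200]),
    100: Node("table", "class_Message", children=[110]),
    110: Node("cluster_tree", children=[120, 130]),
    120: Node("cluster"),
    130: Node("cluster"),
    200: Node("table", "class_User", children=[210]),
    210: Node("cluster_tree", children=[220]),
    220: Node("cluster"),
}

for table, ref, depth in walk_clusters(TOY_NODES, top_ref=8):
    print(f"{table}: cluster at ref {ref}, depth {depth}")
```

In a real file the children would come from decoding each array, which is outside the scope of this post; only the bookkeeping idea carries over.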

While I won’t be covering how to navigate and parse arrays at the physical level in this post, understanding the conceptual layout will provide the foundation for the more technical topics I’ll explore in future installments of this series.

The diagram below provides a high-level view of a node within a Realm database. This is a simplified representation, and real-world nodes become increasingly complex as more tables and objects are added to the database.

Copy-on-Write Architecture

What further complicates matters is that Realm maintains two distinct nodes to support its copy-on-write architecture, which ensures that each database commit is both atomic and crash-safe. When a write transaction occurs, Realm does not modify the existing node in place. Instead, it allocates new space in the file and writes updated arrays, tables, and clusters into a new node structure. Once all changes are written, the database updates the inactive Top Reference in the file header to point to the new node, then flips the active flag to mark it as the current version. The previously active node remains intact and becomes the inactive snapshot. This alternating process guarantees that one complete, uncorrupted Group is always available, even if a crash or power loss occurs during a write operation.
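
To make the two Top References and the active flag more concrete, here is a minimal sketch of reading them from the file header. The layout is an assumption based on my reading of the open-source realm-core project and is not confirmed in this post: a 24-byte little-endian header holding two 64-bit top refs, the 4-byte mnemonic "T-DB", two file-format bytes, one reserved byte, and a flags byte whose lowest bit selects the active slot. The names RealmHeader and parse_header are my own, and the demo bytes are synthetic.

```python
import struct
from typing import NamedTuple, Tuple


class RealmHeader(NamedTuple):
    top_refs: Tuple[int, int]  # the two Top Reference slots
    active_slot: int           # 0 or 1, taken from the flags byte
    active_ref: int
    inactive_ref: int


def parse_header(data: bytes) -> RealmHeader:
    """Parse the first 24 bytes of a file, under the assumed header layout."""
    if len(data) < 24:
        raise ValueError("need at least 24 bytes")
    ref0, ref1, mnemonic, _fmt0, _fmt1, _reserved, flags = struct.unpack_from(
        "<QQ4sBBBB", data, 0
    )
    if mnemonic != b"T-DB":
        raise ValueError(f"unexpected mnemonic {mnemonic!r}")
    slot = flags & 1           # assumption: bit 0 selects the active top ref
    refs = (ref0, ref1)
    return RealmHeader(refs, slot, refs[slot], refs[1 - slot])


# Synthetic header with made-up refs and format bytes; slot 1 is marked active.
demo = struct.pack("<QQ4sBBBB", 4096, 8192, b"T-DB", 24, 24, 0, 1)
hdr = parse_header(demo)
print("active top ref:", hdr.active_ref)
print("inactive top ref:", hdr.inactive_ref)
```

If the assumed layout holds, both refs are worth following during an examination, since each one roots a complete snapshot.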

During an examination, we would want to examine both nodes, as each represents a complete and internally consistent snapshot of the database at a specific point in time. The active node reflects the most recent committed state, while the inactive node preserves the previous version of the database before the last write transaction. By parsing both, an examiner can recover deleted or modified records and reconstruct changes between commits.

Using the iOS Replika app as an example, if we walk the arrays that make up the two nodes, we can identify arrays that are no longer part of the active node. A total of 84 arrays in the inactive node were not in the active node. These arrays represent structures that changed between commits.
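
Once the array offsets reachable from each node have been collected, the comparison itself is a simple set difference. The sketch below assumes that collection step has already been done. The function name diff_nodes is hypothetical, and apart from 99968 and 125832 (the two Cluster offsets discussed in this post) the offsets are invented for illustration.

```python
def diff_nodes(active_offsets, inactive_offsets):
    """Compare the array offsets reachable from each node."""
    active, inactive = set(active_offsets), set(inactive_offsets)
    return {
        "only_inactive": sorted(inactive - active),  # superseded by the last commit
        "only_active": sorted(active - inactive),    # written by the last commit
        "shared": sorted(active & inactive),         # untouched by the last commit
    }


# Invented example: two arrays are shared, two were rewritten.
inactive_arrays = [1024, 2048, 99968, 110000]
active_arrays = [1024, 2048, 125832, 130000]

result = diff_nodes(active_arrays, inactive_arrays)
print(len(result["only_inactive"]), "arrays only in the inactive node:")
print(result["only_inactive"])
```

Arrays that appear in both sets were not touched by the last commit, which is why the two nodes can share most of the file.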


If we go to offset 99968 in the inactive node, we find a data array (Cluster) that sits 5 levels deep from the root node. It holds a 976-byte blob consisting of a concatenated string table containing 88 strings.


If we walk the active node and identify the new version of the data array (Cluster), now at offset 125832, we find a 1952-byte blob containing the updated version of the string table, which has 176 strings. This means that part of the last transaction was adding strings to the string table.
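
The comparison between the two versions is simple arithmetic, shown here only to make the size of the change explicit. The figures come straight from the two Clusters above; the variable names are mine.

```python
old_bytes, old_strings = 976, 88      # Cluster at offset 99968
new_bytes, new_strings = 1952, 176    # Cluster at offset 125832

added_strings = new_strings - old_strings
added_bytes = new_bytes - old_bytes

print(f"strings added: {added_strings}")
print(f"bytes added:   {added_bytes}")
print(f"avg bytes per string (old): {old_bytes / old_strings:.2f}")
print(f"avg bytes per string (new): {new_bytes / new_strings:.2f}")
```

Both the byte count and the string count doubled, so the average entry size stayed the same across the two versions.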

This is just a basic example of what’s possible when parsing a Realm database at the physical level. By understanding how Realm uses its copy-on-write architecture, and combining that knowledge with advanced analysis, we can identify the specific changes made to the database during the last transaction.

Unallocated Regions  

Now we move on to the unallocated regions of the file, which are areas that exist outside of the active and inactive nodes. After a new commit is finalized and the active Top Reference is switched, Realm performs a cleanup process to manage the unused space left behind by older nodes. Because each commit writes modified arrays to new locations rather than overwriting existing ones, portions of the file that belong to the inactive node may eventually become obsolete. Realm marks these regions as free space and reuses them for future allocations during subsequent write transactions. This reclamation occurs gradually, allowing older snapshots to remain intact until they are no longer referenced by active transactions. 

From a forensic perspective, this behavior is significant because remnants of outdated or deleted objects can persist in unallocated regions of the file long after they have been removed from the active node. This means that old arrays containing data relevant to an investigation may still be recoverable.

Using the iOS Replika app as an example, if we walk all the arrays that make up the two nodes and identify where each array physically starts and ends, any bytes not occupied by those arrays are considered unallocated regions of the file. At offset 52616 is a 480-byte unoccupied region that contains two arrays (each signified by an AAAA marker).
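
The gap-finding step can be sketched as an interval sweep. This assumes the (start, end) extents of every array in both nodes have already been collected. The function names, the data_start default of 24 (an assumed header size), and the extents in the example are hypothetical, chosen so that the result lines up with the 480-byte region at offset 52616. The marker search is a naive byte scan and can produce false positives.

```python
from typing import Iterable, List, Tuple


def unallocated_regions(
    extents: Iterable[Tuple[int, int]], file_size: int, data_start: int = 24
) -> List[Tuple[int, int]]:
    """Return (offset, length) for byte ranges not covered by any extent.

    extents are (start, end) pairs with end exclusive; data_start skips the
    file header (24 bytes is an assumption, adjust as needed).
    """
    gaps = []
    cursor = data_start
    for start, end in sorted(extents):
        if start > cursor:
            gaps.append((cursor, start - cursor))
        cursor = max(cursor, end)  # tolerate overlapping or nested extents
    if cursor < file_size:
        gaps.append((cursor, file_size - cursor))
    return gaps


def find_markers(region: bytes, marker: bytes = b"AAAA") -> List[int]:
    """Return every position in region where marker occurs (naive scan)."""
    hits = []
    pos = region.find(marker)
    while pos != -1:
        hits.append(pos)
        pos = region.find(marker, pos + 1)
    return hits


# Invented extents that leave a single gap at offset 52616.
gaps = unallocated_regions([(24, 52616), (53096, 60000)], file_size=60000)
print(gaps)

# Synthetic 480-byte region holding two marker occurrences.
region = b"AAAA" + bytes(236) + b"AAAA" + bytes(236)
print(find_markers(region))
```

Each hit is only a candidate array start and would still need to be validated by parsing the array that follows it.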

Parsing the first array, we get a Data array that contains a 225-byte blob.


This happens to be a string table containing 9 values that appear to be IDs:
  • 5fd0e3b1e5e78b00079b7b5e
  • 5fd0e3b1e5e78b00079b7b8b
  • 5fe363a1c32a7b000701fd84
  • 60128fb8e0704a00068aa367
  • 612ccb95e0704a00072b79f7
  • 61851af90c81a60007fa391a
  • 61bc94e00c81a6000779b2bb
  • 61bc94e00c81a6000779b3cf
  • 61bc94e50c81a6000779bfb0
Parsing the second array, we get another Data array that also contains a 225-byte blob.
This is another string table containing 9 values that appear to be IDs:
  • 5fd0e3b1e5e78b00079b7cb1
  • 5fd0e3b1e5e78b00079b7cb4
  • 5fe363a1c32a7b000701fe11
  • 60128fc7e0704a00068aa416
  • 612ccb95e0704a00072b79f8
  • 6155ad710c81a60007334e15
  • 61e6bb7a7045d800066fbbc1
  • 61bc94e00c81a6000779b3e4
  • 61e6bb7a7045d800066fbbad
This region of the file is often overlooked, either because the concept of unallocated space in a Realm database is not widely known or simply because our tools lack this type of parsing ability.
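
As a sanity check on the two 225-byte blobs above: each listed ID is 24 characters long, and nine of them with one terminator byte each comes to 9 × 25 = 225 bytes, which matches the reported blob size. That is consistent with, but does not prove, a concatenation of NUL-terminated strings. The sketch below rebuilds a blob from the first list under that assumption and splits it again; split_string_blob is a hypothetical helper, not part of any Realm tooling.

```python
FIRST_IDS = [
    "5fd0e3b1e5e78b00079b7b5e",
    "5fd0e3b1e5e78b00079b7b8b",
    "5fe363a1c32a7b000701fd84",
    "60128fb8e0704a00068aa367",
    "612ccb95e0704a00072b79f7",
    "61851af90c81a60007fa391a",
    "61bc94e00c81a6000779b2bb",
    "61bc94e00c81a6000779b3cf",
    "61bc94e50c81a6000779bfb0",
]


def split_string_blob(blob: bytes) -> list:
    """Split a blob of NUL-terminated strings (assumed encoding) into str values."""
    parts = blob.split(b"\x00")
    if parts and parts[-1] == b"":
        parts.pop()  # drop the empty piece after the final terminator
    return [p.decode("utf-8", errors="replace") for p in parts]


# Rebuild the blob under the NUL-terminated assumption and check its size.
blob = b"".join(s.encode("ascii") + b"\x00" for s in FIRST_IDS)
print(len(blob))  # 225

strings = split_string_blob(blob)
print(len(strings), "strings recovered")
```

Running the same split over a blob carved from an unallocated region would be the natural next step, keeping in mind that the real encoding may differ from this assumption.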

Conclusion

When we bring all of this together, it becomes clear that Realm’s physical structure and copy-on-write architecture provide opportunities to look beyond the active data. Understanding these concepts allows us to not only parse the active node, which represents the data visible in Realm Studio, but also gain historical context from the inactive node and the unallocated regions of the file.

Stay tuned for the next installment, where we’ll begin the journey of decoding the arrays!








