Archiving data in DB

Two common possibilities:

  • Storing the archive data in the same table as the current data using a compound primary key (entity ID and version ID).
    • You waste the power of compound primary keys (for example, for index organized tables) because you are forced to always include the version ID as the second key.
    • The approach will result in very large tables of which only a fraction of the rows represent current data.
    • Joins are more complicated and it’s easy to make mistakes, if you forget to filter out archive data.
    • Archive data usually doesn’t need as much indexing as production data, but storing the data in the same table applies the index to archive data as well. Your indexes will grow large and will contain a lot of similar data from the archive, thus additionally slowing down performance.
    • While it might be possible to let the DBMS automatically create archive rows using ON UPDATE and ON DELETE triggers, it’s probably complicated (I’ve never seen that anywhere). Otherwise you are forced to implement the archive operation in your application code, thus slowing down performance once more, making it impossible to directly modify the database or to use more than one client application (unless you invest a lot of time and money to keep them in sync).
  • Using a dedicated archive table for each production table and create a copy of the production data before each update/delete (easy to implement with ON UPDATE and ON DELETE triggers). The archive table has the same structure as the production table plus a column for the archive id (auto-increment).
    • This option enables you to completely implement historization in your DBMS using triggers. You can enable/disable historization for single tables at any time without modifying your client applications.
    • You can use as many client applications as you like or even manually modify rows without losing the archive functionality.
    • You can use different indexes for archive and production data, depending on the actual needs of your applications.
    • If the schema changes, you can also update your archive table using standard SQL statements (the same that work for the production table). This is a minor drawback, because you must not forget to do so. It’s still better than having a huge archive of data that can not be easily compared.
    • The read performance of your production tables is not reduced. Updates and deletes might take a bit longer, but still much faster than “manually” creating archive copies. It should be comparable to the performance of the built-in historization feature of your favorite database.
Advertisements

%d bloggers like this: