Building a Distributed Audit Log with YugabyteDB
Introduction
In this article, we are going to see the best way to build an audit log using YugabyteDB.
What’s even more interesting is that the solution is almost identical on PostgreSQL. The only difference lies in how we define the audit log table PRIMARY KEY; everything else is exactly the same, demonstrating how easily you can migrate from PostgreSQL to YugabyteDB.
Why use an audit log
An audit log or audit trail is a data structure that captures all the modifications that occur in a given system.
For instance, the Redo Log in a database system is very similar to an audit log because it captures the logical or physical modifications that occurred while executing a given transaction.
However, while you could use a tool like Debezium to parse the Redo Log, the solution is quite complex and doesn’t allow you to capture additional information about the context in which a given modification has occurred.
For this reason, we are going to build our audit log using database triggers.
How to build an audit log using YugabyteDB
The best way to build a generic audit log in YugabyteDB is to define a single audit_log table that captures all the changes happening across all our database tables.
In our case, we have two database tables, author and book, that we want to audit. So, for this purpose, we created the audit_log table to capture any modification happening in either of those two tables.
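The author and book tables themselves are not defined in this article. A plausible schema, inferred from the entity mappings and the audit_log JSON snapshots shown later, might look like this:

-- Hypothetical schema for the two audited tables, inferred from the
-- entity mappings and the audit_log JSON snapshots shown later in this article.
CREATE TABLE IF NOT EXISTS author (
    id bigint NOT NULL,
    first_name varchar(255),
    last_name varchar(255),
    country varchar(255),
    tax_treaty_claiming boolean,
    PRIMARY KEY (id)
);

CREATE TABLE IF NOT EXISTS book (
    id bigint NOT NULL,
    title varchar(255),
    author_id bigint,
    price_in_cents int,
    publisher varchar(255),
    PRIMARY KEY (id),
    FOREIGN KEY (author_id) REFERENCES author (id)
);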
The audit_log table is created using the following DDL statement:
CREATE TABLE IF NOT EXISTS audit_log (
    table_name varchar(255) NOT NULL,
    row_id bigint NOT NULL,
    old_row_data jsonb,
    new_row_data jsonb,
    dml_type dml_type NOT NULL,
    dml_timestamp timestamp NOT NULL,
    dml_created_by varchar(255) NOT NULL,
    trx_timestamp timestamp NOT NULL,
    PRIMARY KEY ((table_name, row_id) HASH, dml_type, dml_timestamp)
)
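Note that the dml_type column uses a custom enum type that is not defined by the statement above. A minimal definition, assuming it only needs to cover the three DML operations captured by our triggers, looks like this:

-- Enum type used by the audit_log.dml_type column; the labels must match
-- the literal values inserted by the trigger function below.
CREATE TYPE dml_type AS ENUM ('INSERT', 'UPDATE', 'DELETE')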
The audit_log table columns are going to store the following information:

- table_name – stores the name of the table for the record that has just been inserted, updated, or deleted.
- row_id – stores the Primary Key value of the table record that has just been inserted, updated, or deleted.
- old_row_data – stores the JSON representation of the table row state prior to applying the recorded DML operation.
- new_row_data – stores the JSON representation of the table row state after applying the recorded DML operation.
- dml_type – stores the associated DML operation type (e.g., INSERT, UPDATE, DELETE).
- dml_timestamp – stores the timestamp of the DML statement that we have captured.
- dml_created_by – stores the application user that was logged in when the captured DML statement was executed.
- trx_timestamp – stores the timestamp of the transaction in which the captured DML statement was enlisted.
The reason why the first element of the PRIMARY KEY is created as (table_name, row_id) HASH is that YugabyteDB auto-sharding is done by default on the first column of the PRIMARY KEY unless HASH, ASC, or DESC is specified explicitly.

By hashing the table_name and row_id, we ensure that the audit_log records are going to be evenly distributed across all shards. Otherwise, we’d risk generating hotspots on the tablets associated with the tables that generate more audit log entries.

On the other hand, on PostgreSQL, the PRIMARY KEY would be created like this:

PRIMARY KEY (table_name, row_id, dml_type, dml_timestamp)

That’s basically the only difference between YugabyteDB and PostgreSQL when implementing this audit log solution.
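For completeness, the equivalent PostgreSQL audit_log table would therefore be created along these lines, simply swapping the PRIMARY KEY clause:

-- Same table definition as on YugabyteDB, only the PRIMARY KEY clause changes
CREATE TABLE IF NOT EXISTS audit_log (
    table_name varchar(255) NOT NULL,
    row_id bigint NOT NULL,
    old_row_data jsonb,
    new_row_data jsonb,
    dml_type dml_type NOT NULL,
    dml_timestamp timestamp NOT NULL,
    dml_created_by varchar(255) NOT NULL,
    trx_timestamp timestamp NOT NULL,
    PRIMARY KEY (table_name, row_id, dml_type, dml_timestamp)
)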
In order to capture the modifications that occur on the author and book tables, we have to define the following database triggers:
CREATE TRIGGER author_audit_trigger
AFTER INSERT OR UPDATE OR DELETE ON author
FOR EACH ROW EXECUTE FUNCTION audit_log_trigger_function()

CREATE TRIGGER book_audit_trigger
AFTER INSERT OR UPDATE OR DELETE ON book
FOR EACH ROW EXECUTE FUNCTION audit_log_trigger_function()
So, every time we execute an INSERT, UPDATE, or DELETE statement on an author or book table row, the audit_log_trigger_function database function is going to be executed.

The audit_log_trigger_function is a custom database function that we created like this:
CREATE OR REPLACE FUNCTION audit_log_trigger_function()
RETURNS trigger AS $body$
BEGIN
    if (TG_OP = 'INSERT') then
        INSERT INTO audit_log (
            table_name,
            row_id,
            old_row_data,
            new_row_data,
            dml_type,
            dml_timestamp,
            dml_created_by,
            trx_timestamp
        )
        VALUES (
            TG_TABLE_NAME,
            NEW.id,
            null,
            to_jsonb(NEW),
            'INSERT',
            statement_timestamp(),
            current_setting('var.logged_user'),
            transaction_timestamp()
        );
        RETURN NEW;
    elsif (TG_OP = 'UPDATE') then
        INSERT INTO audit_log (
            table_name,
            row_id,
            old_row_data,
            new_row_data,
            dml_type,
            dml_timestamp,
            dml_created_by,
            trx_timestamp
        )
        VALUES (
            TG_TABLE_NAME,
            NEW.id,
            to_jsonb(OLD),
            to_jsonb(NEW),
            'UPDATE',
            statement_timestamp(),
            current_setting('var.logged_user'),
            transaction_timestamp()
        );
        RETURN NEW;
    elsif (TG_OP = 'DELETE') then
        INSERT INTO audit_log (
            table_name,
            row_id,
            old_row_data,
            new_row_data,
            dml_type,
            dml_timestamp,
            dml_created_by,
            trx_timestamp
        )
        VALUES (
            TG_TABLE_NAME,
            OLD.id,
            to_jsonb(OLD),
            null,
            'DELETE',
            statement_timestamp(),
            current_setting('var.logged_user'),
            transaction_timestamp()
        );
        RETURN OLD;
    END IF;
END;
$body$ LANGUAGE plpgsql
No matter what DML operation was executed, we are always inserting a record in the audit_log table. The difference is how we capture the old and the new record state because the newly inserted record does not have an old state and the deleted row has no new state.

The var.logged_user is a session variable we are going to set in our application with the name of the currently logged application user. This can be done like this:
private void setCurrentLoggedUser(EntityManager entityManager) {
    Session session = entityManager.unwrap(Session.class);

    Dialect dialect = session.getSessionFactory()
        .unwrap(SessionFactoryImplementor.class)
        .getJdbcServices()
        .getDialect();

    String loggedUser = ReflectionUtils.invokeMethod(
        dialect,
        "escapeLiteral",
        LoggedUser.get()
    );

    session.doWork(connection -> {
        update(
            connection,
            String.format(
                "SET LOCAL var.logged_user = '%s'",
                loggedUser
            )
        );
    });
}
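The LoggedUser utility referenced above is not listed in this article. A minimal ThreadLocal-based sketch is shown below; the actual implementation may differ, for instance by populating the user from a web filter or a security context:

// Hypothetical ThreadLocal-based holder for the currently logged application user.
public class LoggedUser {

    private static final ThreadLocal<String> USER_HOLDER = new ThreadLocal<>();

    public static void logIn(String user) {
        USER_HOLDER.set(user);
    }

    public static void logOut() {
        USER_HOLDER.remove();
    }

    public static String get() {
        return USER_HOLDER.get();
    }
}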
That’s it!
Testing Time
To test how the audit logging works, we are going to use JPA and Hibernate, as this is the most common Java data access technology.
First, let’s create the Author entity that maps the author database table:
@Entity(name = "Author")
@Table(name = "author")
@DynamicUpdate
public class Author {

    @Id
    private Long id;

    @Column(name = "first_name")
    private String firstName;

    @Column(name = "last_name")
    private String lastName;

    private String country;

    @Column(name = "tax_treaty_claiming")
    private boolean taxTreatyClaiming;

    //Getters and setters omitted for brevity
}
By default, Hibernate uses a pre-defined UPDATE statement that includes all the columns, even the ones that haven’t been modified. However, this default UPDATE is not suitable when using YugabyteDB as it will affect write scalability.
Unlike PostgreSQL, which takes a row-level lock whenever a record is changed, YugabyteDB uses a column-level locking approach, and for this reason, it’s very important that the UPDATE statement includes only the columns that were modified by the current Persistence Context, as opposed to all columns.
By providing the @DynamicUpdate annotation, we instruct Hibernate to generate the UPDATE statement dynamically, including only the columns associated with the entity properties that have changed. For more details about this topic, check out this article.
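To illustrate, for the price change executed later in this article, the dynamically generated UPDATE only touches the modified column. The statements below are approximations with literal values; Hibernate actually uses bind parameters:

-- With @DynamicUpdate, only the changed column is included
UPDATE book
SET price_in_cents = 4499
WHERE id = 1

-- Without @DynamicUpdate, every mapped column would be rewritten
UPDATE book
SET title = 'High-Performance Java Persistence 1st edition',
    author_id = 1,
    price_in_cents = 4499,
    publisher = 'Amazon'
WHERE id = 1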
And the book table is mapped as follows:
@Entity(name = "Book")
@Table(name = "book")
@DynamicUpdate
public class Book {

    @Id
    private Long id;

    private String title;

    @ManyToOne(fetch = FetchType.LAZY)
    private Author author;

    @Column(name = "price_in_cents")
    private int priceInCents;

    private String publisher;

    //Getters and setters omitted for brevity
}
With the entity mappings in place, we can add the following Author and Book entities:
setCurrentLoggedUser(entityManager);

Author author = new Author()
    .setId(1L)
    .setFirstName("Vlad")
    .setLastName("Mihalcea")
    .setCountry("România");

entityManager.persist(author);

entityManager.persist(
    new Book()
        .setId(1L)
        .setTitle("High-Performance Java Persistence 1st edition")
        .setPublisher("Amazon")
        .setPriceInCents(3990)
        .setAuthor(author)
);
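When the Persistence Context is flushed, Hibernate issues two INSERT statements roughly like the ones below (literal values shown instead of bind parameters, and the column order is illustrative), each of which fires the corresponding AFTER INSERT trigger:

-- Approximation of the statements Hibernate flushes for the two persist calls
INSERT INTO author (first_name, last_name, country, tax_treaty_claiming, id)
VALUES ('Vlad', 'Mihalcea', 'România', false, 1)

INSERT INTO book (author_id, price_in_cents, publisher, title, id)
VALUES (1, 3990, 'Amazon', 'High-Performance Java Persistence 1st edition', 1)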
And when checking the audit_log table, we are going to see that two records have just been added:
| table_name | row_id | old_row_data | new_row_data | dml_type | dml_timestamp | dml_created_by | trx_timestamp |
|------------|--------|--------------|--------------|----------|---------------|----------------|---------------|
| author | 1 | | {"id": 1, "country": "România", "last_name": "Mihalcea", "first_name": "Vlad", "tax_treaty_claiming": false} | INSERT | 2022-10-20 19:15:37.313888 | Vlad Mihalcea | 2022-10-20 19:15:37.252935 |
| book | 1 | | {"id": 1, "title": "High-Performance Java Persistence 1st edition", "author_id": 1, "publisher": "Amazon", "price_in_cents": 3990} | INSERT | 2022-10-20 19:15:37.326810 | Vlad Mihalcea | 2022-10-20 19:15:37.252935 |
Notice that the old_row_data column is NULL because these audit log records are associated with INSERT statements.
When updating the two entities:
entityManager.find(Author.class, 1L)
    .setTaxTreatyClaiming(true);

entityManager.find(Book.class, 1L)
    .setPriceInCents(4499);
We see that two new audit log records are going to be added:
| table_name | row_id | old_row_data | new_row_data | dml_type | dml_timestamp | dml_created_by | trx_timestamp |
|------------|--------|--------------|--------------|----------|---------------|----------------|---------------|
| author | 1 | | {"id": 1, "country": "România", "last_name": "Mihalcea", "first_name": "Vlad", "tax_treaty_claiming": false} | INSERT | 2022-10-20 19:15:37.313888 | Vlad Mihalcea | 2022-10-20 19:15:37.252935 |
| book | 1 | | {"id": 1, "title": "High-Performance Java Persistence 1st edition", "author_id": 1, "publisher": "Amazon", "price_in_cents": 3990} | INSERT | 2022-10-20 19:15:37.326810 | Vlad Mihalcea | 2022-10-20 19:15:37.252935 |
| author | 1 | {"id": 1, "country": "România", "last_name": "Mihalcea", "first_name": "Vlad", "tax_treaty_claiming": false} | {"id": 1, "country": "România", "last_name": "Mihalcea", "first_name": "Vlad", "tax_treaty_claiming": true} | UPDATE | 2022-10-20 19:52:51.688286 | Vlad Mihalcea | 2022-10-20 19:52:51.656483 |
| book | 1 | {"id": 1, "title": "High-Performance Java Persistence 1st edition", "author_id": 1, "publisher": "Amazon", "price_in_cents": 3990} | {"id": 1, "title": "High-Performance Java Persistence 1st edition", "author_id": 1, "publisher": "Amazon", "price_in_cents": 4499} | UPDATE | 2022-10-20 19:52:51.696792 | Vlad Mihalcea | 2022-10-20 19:52:51.656483 |
Notice that for UPDATE statements, the old_row_data JSON column captures the previous table record state while the new_row_data JSON column stores the state after the UPDATE was executed. This makes it easy to see how data has changed over time.
And, if we remove the Book entity:
entityManager.remove(
    entityManager.getReference(Book.class, 1L)
);
We can see that a new record will be added to the audit_log table:
| table_name | row_id | old_row_data | new_row_data | dml_type | dml_timestamp | dml_created_by | trx_timestamp |
|------------|--------|--------------|--------------|----------|---------------|----------------|---------------|
| author | 1 | | {"id": 1, "country": "România", "last_name": "Mihalcea", "first_name": "Vlad", "tax_treaty_claiming": false} | INSERT | 2022-10-20 19:15:37.313888 | Vlad Mihalcea | 2022-10-20 19:15:37.252935 |
| book | 1 | | {"id": 1, "title": "High-Performance Java Persistence 1st edition", "author_id": 1, "publisher": "Amazon", "price_in_cents": 3990} | INSERT | 2022-10-20 19:15:37.326810 | Vlad Mihalcea | 2022-10-20 19:15:37.252935 |
| author | 1 | {"id": 1, "country": "România", "last_name": "Mihalcea", "first_name": "Vlad", "tax_treaty_claiming": false} | {"id": 1, "country": "România", "last_name": "Mihalcea", "first_name": "Vlad", "tax_treaty_claiming": true} | UPDATE | 2022-10-20 19:52:51.688286 | Vlad Mihalcea | 2022-10-20 19:52:51.656483 |
| book | 1 | {"id": 1, "title": "High-Performance Java Persistence 1st edition", "author_id": 1, "publisher": "Amazon", "price_in_cents": 3990} | {"id": 1, "title": "High-Performance Java Persistence 1st edition", "author_id": 1, "publisher": "Amazon", "price_in_cents": 4499} | UPDATE | 2022-10-20 19:52:51.696792 | Vlad Mihalcea | 2022-10-20 19:52:51.656483 |
| book | 1 | {"id": 1, "title": "High-Performance Java Persistence 1st edition", "author_id": 1, "publisher": "Amazon", "price_in_cents": 4499} | | DELETE | 2022-10-20 19:56:35.847542 | Vlad Mihalcea | 2022-10-20 19:56:35.833682 |
For DELETE statements, the audit log record contains only the old_row_data column, which captures the record state prior to being deleted.
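Because (table_name, row_id) forms the hashed part of the PRIMARY KEY, fetching the entire change history of a single record can be resolved by a single tablet. For example, a query along these lines (the column list is illustrative) returns every audited change of the book row with the identifier 1:

-- Point lookup on the hashed (table_name, row_id) key prefix
SELECT dml_type, dml_timestamp, dml_created_by, old_row_data, new_row_data
FROM audit_log
WHERE table_name = 'book' AND row_id = 1
ORDER BY dml_timestamp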
And just because the table row data is stored in JSON columns doesn’t mean we cannot transform it back to its relational table structure.
For instance, when executing the following SQL query:
SELECT
    row_id AS id,
    cast(new_row_data ->> 'price_in_cents' AS int) AS price_in_cents,
    new_row_data ->> 'publisher' AS publisher,
    new_row_data ->> 'title' AS title,
    new_row_data ->> 'author_id' AS author_id,
    dml_timestamp AS version_timestamp
FROM audit_log
WHERE table_name = 'book'
ORDER BY dml_timestamp
We will get all the book table changes:
| id | price_in_cents | publisher | title | author_id | version_timestamp |
|----|----------------|-----------|-------|-----------|-------------------|
| 1 | 3990 | Amazon | High-Performance Java Persistence 1st edition | 1 | 2022-10-20 19:15:37.326810 |
| 1 | 4499 | Amazon | High-Performance Java Persistence 1st edition | 1 | 2022-10-20 19:52:51.696792 |
| 1 | | | | | 2022-10-20 19:56:35.847542 |
Awesome, right?
Conclusion
As you can see, adding an audit log in YugabyteDB is just as easy as doing it in PostgreSQL. Not only does YugabyteDB support JSONB columns, but database triggers and functions also work in exactly the same way.
Using this audit log feature will allow you to track modifications across all database tables that need to be audited.
This research was funded by Yugabyte and conducted in accordance with the blog ethics policy.
While the article was written independently and reflects entirely my opinions and conclusions, the amount of work involved in making this article happen was compensated by Yugabyte.
