Database concurrency conflicts in
the real world.
- Maurice de Beijer -
A lot of articles have been written about database
concurrency conflict detection and the various ways of handling them.
Unfortunately most of these articles, and accompanying solutions, have one
major flaw in focusing on the technical issues and database implementation
instead of real world data and how this is used. In this article I will try to
show the difference between focusing on the database implementation and on the
real world data. I will show some possible approaches on how to solve these concurrency
issues.
What is a database concurrency conflict
Let’s start with a quick recap of what database concurrency conflicts
are and why they need to be solved in the first place.
Most database applications in this world are multi-user
applications. This means that, at any given point in time, you can expect
multiple persons and/or processes reading from and writing to a database. Given
that multiple persons/processes are updating data in a database it is only a
matter of time before two separate persons/processes are trying to update the
same piece of data. Because a typical update cycle consists of read into memory,
update in memory and write back to the database we will run into occasions
where user 1 and 2 will both read the data into memory, user 1 will update the
data and write those changes back to the database before user 2 does the same.
Now we have a concurrency control conflict because user 1 read the data before
user 2 wrote it back to the database. Why is this? Well simply because if user
1 writes his data back into the database he will overwrite the changes made by
user 2 causing them to be lost.
This kind of database concurrency issue can occur both with
humans or automated processes or a combination of the two. A concurrency issue is
more likely to occur with human users as the read/update/write cycle is likely
to take much longer. However that said the same can occur between automated processes
and is harder to solve there because in the case of a human you can ask what
the user wants and respond to that while a process needs to have all actions
fully automated.
The current state
Let’s first take a look at what is generally said and done about
solving database concurrency issues. Typically solving the problem is divided
into two basic approaches:
- Pessimistic concurrency control.
- Optimistic concurrency control.
Below I will give a brief description of the two and the
different options in handling them. This is just intended to clarify the
problem so is in no way a complete coverage of concurrency handling and all of
its aspects.
Pessimistic concurrency control
In the case of pessimistic concurrency control a collision
is prevented because the user/process must take some action before being
allowed to change the data. This action can be a number of things but typically
involves locking the data in the database thereby preventing another user from
holding the same lock.
Advantages:
- Simple to implement.
Because database servers support and enforce locking mechanisms it is very
easy to implement. Because a user needs to place the lock before making
any changes he or she is informed before a change is made that there is a
conflict.
- Very secure.
Because the locking is implemented by the database server it is very
reliable and you as a developer can be assured that nothing will be able
to ignore it and change the data anyway.
Disadvantages:
- Not very scalable.
Locking data in a database require an open connection to a database. This
means that every user must have at least one open connection to the
database which means more resources and licenses. When using older
database servers locking might also prevent other users from reading the
data.
- Prone to deadlocks.
Suppose two users both want to change two pieces of data A and B. User 1
first locks A and user 2 first locks B. Now both want to lock the second
part of the data but cannot because it is already locked by another user.
Now we have a problem if both decide to wait until the data is available.
- Locks can last a long time.
If a user starts changing some data, he or she has a lock on the data
until it is saved. If the user is distracted somehow or goes to a meeting
without saving his changes the data remains locked and no one else can
make any changes until the first changes have been committed.
Pessimistic locking can also be done using soft locks
instead of real database locks. This means that a field is updated to indicate
that a user is busy with the data and has it “locked”. This prevents the
scalability problem as the connection doesn’t need to be kept open. However
this approach has several drawbacks because it isn’t enforced by the database
and another piece of code can decide to ignore the lock. Additionally the lock
must be manually removed and if this isn’t done the data remains locked for
ever.
Even though pessimistic locking has it’s place, it has a
number of drawbacks and doesn’t combine very well with the disconnected nature
of data in a .NET application.
Optimistic concurrency control
When using optimistic concurrency control the user doesn’t
lock the data to prevent concurrency issues but detects and resolves them when
writing to the database. Typically several different approaches are used with
optimistic concurrency control:
- No checking and last one in wins.
Not really a concurrency control as the problem is ignored and the last
user to update the data overwrites the changes by made by the first user. The
SQL update command has only the primary key as the filter in the where
clause. This kind of concurrency control is only suitable for single user
applications.
- Comparing the fields the user changed.
In this case the update is done with a check if any of the data the user has
changed is still the same as it was when the user read the data. The SQL
update command has the primary key and changed fields with their old
values and the where clause.
- Comparing all fields.
In this case the update is done with a check on all fields instead of just
the ones being changed. Even though this seems like excessive it is actually
better that just checking the changed fields because of the nature of the
standard ADO.NET classes. The problem with the ADO.NET data
classes is that they will execute an update command with all fields, not
just the ones that were changed. This means that even though no one else
has updated the fields you are changing someone might have changed
another column. In this case the SQL update command has a where clause
containing all fields in the table.
- Comparing the row version.
In this case the data has an extra row version, also known as a timestamp
field which is changed very time the data is updated. In this case
filtering is much simpler and faster than when comparing all fields as we
only need to filter on the primary key and the row version in the SQL
update. However care should be taken because these two do not result in
the same behavior. If the first user sends an update that updates all data
with its old version the data’s row version will still be updated. This
means that another user, who has made a change, will have a conflict when
checking the row version while he or she will not have a conflict when
comparing all fields.
Back to the real world
So all of the above has been described in much more detail
in numerous articles but these articles have one common flaw. This flaw is that
they view and discuss the whole problem from a database perspective instead of
a user’s perspective. What do I mean with that you might be wondering? Well
every solution always focuses on the concurrency conflict and detection of a single
whole row in a table. Often an update will span multiple rows, tables and sometimes
even databases are involved. However when it comes to the way users see the data
it often doesn’t map to the rows being updated.
The following examples are taken from an accounting package
but the same applies to most, if not all, larger applications out there.
How does a user see data:
- Simple reference data.
This kind of data usually maps to a single row. An example would be a list
of VAT codes with the appropriate data.
- Complex data.
This typically consists of data spanning more than one row or table.
Examples might be an invoice or a purchase order.
- Regular application data.
This kind of data is often stored in a single row in the database but the
user actually considers it as a set of related but semi distinct sets of
data with different owners. An example of this would be article data.
While all data about an article might be stored in a single row the user
might divide the data based on who “owns” that part of the data.
- Basic article data might consist of an SKU and a
description used by everyone.
- Sales information. The row might include a sales price,
VAT information used when invoicing and the number of units sold but not
shipped. All this data is owned and updated by the sales department and
read-only for others.
- Purchasing information. The row might include a preferred
supplier, the maximum purchase price and the number on order but not yet
received. All this data is owned and maintained by the purchasing
department.
- Inventory. The row might include the number of items on
stock. This is owned by the warehousing department who will also update
the number of items to be shipped of received when an order is shipped or
received.
Simple reference data
The standard way of handling data concurrency maps very well
to the case of the simple reference data because a set of data the user sees
maps exactly to a row in the database. This means that this base is well
covered and needs very little extra attention. Unfortunately this is also the
kind of data that doesn’t change very often so the chance of a concurrency
issue occurring is very low.
Complex data
The case of the complex data is not handled very well at all.
Consider the case where two users both open the same order, each change a
different order line and save the order. Technically speaking there is no
conflict as each change a different row in the database. However a user thinks
about a single order and not about different rows in a database, so as far as
he or she is concerned this is a multi user conflict with two people changing
the same piece of data. To some extend this problem can be solved procedurally
in an organization and some people might justly argue that the chance of the
above scenario happening is very small. While that may or may not be true sooner
or later the order will be transferred from the sales department to another
department, maybe for approval or shipping. At this point two very different
parts of the organization are involved, one updating the detail lines and
another taking ownership of the order by updating the header record. Again
technically speaking no conflict as two different tables are updated by
different users but in the real world this is a very real multi user conflict.
To solve this we need to consider the separate rows in
the database as one whole instead of individual items. One possible way of
doing this is changing the way we store an order in the database and doing so
in a single row. It is possible to do this using an XML field in SQL Server 2005
and storing all details as XML elements. While possible, and solving the
concurrency issue, this approach also has a number of drawbacks, most notably
the lack of relational integrity checks between the order details in the XML
and the rest of the database which would need to be done in code and cannot be
done by the database itself.
Another approach for solving the complex data concurrency
issues would be to check a row version of the order header row when updating
the detail rows. This would basically work as follows:
- Start a database transaction.
- Check the row version of the order header row.
- Update the order header row. This must be done, even if
there is no actual change to the order header, to update the row version.
- Update, delete or insert all order detail rows.
- Commit the transaction.
The problem here is that the database cannot really help us
to enforce this is done and there isn’t another process that just updates an
order detail row without updating the order header row version as well. To
solve this problem the order header row version can be updated using a trigger
defined on the order detail table. This way you can always be sure that the header
row version is updated but it does mean that the in memory order header row
must be refreshed if any of the detail rows have been updated. Again this is
not an ideal situation but at the very least very consistent and testable. You
can test this by changing the same order twice, the second time directly after
the first change was saved. If the second change can be saved than the row
version in memory is the same as the row version in the database.
Regular application data
This case isn’t handled very well either. The main problem is
that both the database and the .NET class for working with data are based
around a complete row. Both the tools inside Visual Studio and classes like the SqlCommandBuilder are based around a SQL select command and will generate an
SQL update command for every field in the SQL select. This is fine for simple
reference data as a user typically updates the whole row but in this case that
doesn’t make much sense.
Let’s take a look at a simple example to see the
shortcomings. We store debtor information in a database. This information
contains the address, the current balance and the credit limit for a debtor. The
balance is automatically updated every time a debtor pays a bill or places a
new order and cannot be updated directly. One person in the customer relation
department corrects the address of the client while another, in the finance
department updates the debtor’s credit limit. Technically this is a concurrency
conflict as both users are trying to update the same row at the same time.
However the customer relation department is owner of the address data while the
finance department is owner of the credit limit so procedurally there is no
conflict. Trying to solve the issue by giving each user only a subset of the
data isn’t going to work as the finance department needs to know where a debtor
lives in order to make an assessment of their credit worthiness. They are not
going to update the address so read only access is good enough. The same is
true the other way round, the customer relation department needs to be aware of
the debtor’s credit limit but is never going to update it.
Even if both changes where made by people that have permission
to update all debtor information it can hardly be called a concurrency
conflict. Clearly a SQL update statement that simply includes all fields from a
SQL select isn’t going to cut it here. One way of solving the problem is by
updating each field as a separate entity. While this would avoid and solve the
problem described above it is an oversimplification of the problem as certain
fields are logically grouped together like the complete address. What we need
instead is the concept of read only fields and to group the remaining fields
into separate SQL updates. To do this we need a CommandBuilder like class that
takes a data definition as input and produces a collection of SQL update
command objects to update the data. In the update command collection there
would be one command for the address and a second for the credit info,
presuming the current user has the rights to update both. The code then goes
through the collection and populates each command with parameters containing
old and new values for each field being updated. If all the old and new values
for a specific command match there is clearly no update in that part of the
data and the command doesn’t need to execute. Once all update commands have
been loaded they can be executed. In this case we do not want to wrap these as
part of a single transaction. If all update commands execute successfully there
was no conflict and the user can go on with their job. If a conflict occurs it
is limited to a specific subset of the data and the other parts of the data can
be updated without any delay. Because of the group wise nature of the updates
using a SQL row version is impossible. Instead we need to include the complete
list of fields being updated together with the primary key in the where clause
of each command.
An example
The following code is a simple example of how an update
could work. In this example I use the column ExtendedProperties to store the
field grouping information. Columns with the same group name are bundled
together into a single update command.
''' <summary>
''' Test application to check database updates.
''' </summary>
''' <remarks></remarks>
Sub Main()
Dim dta As New pubsDataSetTableAdapters.titlesTableAdapter
Dim table As
pubsDataSet.titlesDataTable
Dim cb As New CommandBuilder
' Load the data
table = dta.GetData()
' Configure the table
cb.ConfigureDataTable(table)
' Make a change to the data
table.Item(0).price *= 1.1
'table.Item(0).title = "New title"
' Update the database
cb.UpdateTable(table)
Console.WriteLine( _
"Press any key to terminate the
application.")
Console.ReadKey()
End Sub
Listing 1: The Main() function.
The main program uses a typed table adapter to load the
titles table from the Pubs database.
''' <summary>
''' Configure the columns into update groups.
''' </summary>
''' <param
name="table">The table with
columns.</param>
''' <remarks></remarks>
Public Sub
ConfigureDataTable( _
ByVal table As
pubsDataSet.titlesDataTable)
' Basic data about the book
table.title_idColumn.ExtendedProperties("UpdateGroup")
= "Book"
table.titleColumn.ExtendedProperties("UpdateGroup")
= "Book"
table.typeColumn.ExtendedProperties("UpdateGroup")
= "Book"
table.notesColumn.ExtendedProperties("UpdateGroup")
= "Book"
table.pubdateColumn.ExtendedProperties("UpdateGroup")
= "Book"
' Finacial data about the book
table.pub_idColumn.ExtendedProperties("UpdateGroup")
_
= "Financial"
table.priceColumn.ExtendedProperties("UpdateGroup")
_
= "Financial"
table.advanceColumn.ExtendedProperties("UpdateGroup")
_
= "Financial"
table.royaltyColumn.ExtendedProperties("UpdateGroup")
_
= "Financial"
' Sales information about the book
table.ytd_salesColumn.ExtendedProperties("UpdateGroup")
_
= "Sales"
End Sub
Listing 2: The ConfigureDataTable() function.
In the ConfigureDataTable() function each column receives an
UpdateGroup property. This is used during the update to determine which fields
are grouped together. This information isn’t determined at update time because
the same table might be used for multiple purposes in different business object
with different field groupings. Additionally some fields might be read-only,
something not taken into account in this sample code.
''' <summary>
''' Sends all updates to the database
''' </summary>
''' <param
name="table">The table with
changes,</param>
''' <returns></returns>
''' <remarks>Just demo code.
''' Cannot execute as there is no connection and
Insert/Delete
''' is not implemented.
''' </remarks>
Public Function
UpdateTable(ByVal table As DataTable) As Boolean
Dim updateCommands As
List(Of SqlCommand)
' Get a list of update commands to execute
updateCommands = GetUpdateCommands(table)
For Each row As DataRow In _
table.GetChanges(DataRowState.Modified).Rows
Select Case
row.RowState
Case DataRowState.Added
' New row, do an database insert
Case DataRowState.Deleted
' Deleted row, do a database delete
Case DataRowState.Modified
' Changed row, do the required database updates
For Each cmd As SqlCommand In
updateCommands
Dim hasChanges As
Boolean = False
For Each param As SqlParameter In _
cmd.Parameters
' Populate all parameters
Dim fieldName As
String
fieldName = _
param.ParameterName.Substring(3)
If param.ParameterName.StartsWith("old") _
Then
param.Value = row(fieldName, _
DataRowVersion.Original)
Else
param.Value = row(fieldName, _
DataRowVersion.Current)
End If
' Check if this field is changed
hasChanges = hasChanges _
OrElse Not
row(fieldName,
DataRowVersion.Original).Equals( _
row(fieldName, DataRowVersion.Current))
Next
If hasChanges Then
Console.ForegroundColor = _
ConsoleColor.Yellow
Console.WriteLine("Executing
command:")
'cmd.ExecuteScalar()
Else
Console.ForegroundColor = ConsoleColor.Red
Console.WriteLine("Skiping command:")
End If
Console.WriteLine(cmd.CommandText)
Console.WriteLine()
Console.ResetColor()
Next
End Select
Next
End Function
Listing 3: The UpdateTable() function.
The UpdateTable() function first retrieves a collection of
update commands. Note that for this example I have skipped the insert and
delete commands as these are handled in the same way as normal.
''' <summary>
''' Build a collection of update commands for the table.
''' </summary>
''' <param
name="table">The table that needs
to be updated.</param>
''' <returns>A collection of SQLCommands for the update.</returns>
''' <remarks></remarks>
Private Function
GetUpdateCommands(ByVal table As DataTable) _
As List(Of
SqlClient.SqlCommand)
Dim groups As
IDictionary(Of String,
List(Of DataColumn))
Dim cmds As List(Of SqlClient.SqlCommand)
cmds = New List(Of
SqlClient.SqlCommand)
Console.WriteLine("Building update
commands.")
Console.WriteLine()
' Split all columns into groups based upon the
' UpdateGroup extended property.
groups = SplitColumnIntoGroups(table)
For Each group As List(Of
DataColumn) In groups.Values
Dim cmd As
SqlCommand
cmd = CreateUpdateCommand(table, group)
cmds.Add(cmd)
Console.WriteLine("Update command {0}:",
cmds.Count)
Console.WriteLine(cmd.CommandText)
Console.WriteLine()
Next
Return cmds
End Function
Listing 4: The GetUpdateCommands() function.
The GetUpdateCommands() function loops through all the field
groups and created a different SQL update command for each one. All separate
commands are bundled together into a collection and returned.
''' <summary>
''' Split all columns into groups based upon the
UpdateGroup
''' extended property.
''' </summary>
''' <param
name="table">The table with
columns to split.</param>
''' <returns>A dictionary with the groups of columns.</returns>
''' <remarks></remarks>
Private Function
SplitColumnIntoGroups(ByVal table As DataTable) _
As Dictionary(Of String, List(Of DataColumn))
Dim groups As New Dictionary(Of String, List(Of
DataColumn))
For Each col As Data.DataColumn In
table.Columns
Dim updateGroup As
String
If col.ExtendedProperties.Contains("UpdateGroup") Then
updateGroup = _
col.ExtendedProperties("UpdateGroup").ToString()
Else
updateGroup = ""
End If
If Not
groups.ContainsKey(updateGroup) Then
groups.Add(updateGroup, New List(Of DataColumn))
End If
groups(updateGroup).Add(col)
Next
Return groups
End Function
Listing 5: The SplitColumnIntoGroups() function.
The SplitColumnIntoGroups() function takes all the columns
in a table and splits them into separate update groups. This would be a good
point to exclude read-only columns and possibly the primary key columns as
these are typically not updatable.
''' <summary>
''' Create a SqlCommand to update the field group.
''' </summary>
''' <param
name="table">The table being
updated.</param>
''' <param
name="group">The field group.</param>
''' <returns>The SqlCommand to update the table.</returns>
''' <remarks></remarks>
Private Function
CreateUpdateCommand(ByVal table As DataTable, _
ByVal group As
IEnumerable(Of DataColumn)) As SqlCommand
' Build an update command for the group of columns
Dim cmd As New Data.SqlClient.SqlCommand
Dim sqlSet As New System.Text.StringBuilder()
Dim sqlWhere As New System.Text.StringBuilder()
For Each col As DataColumn In
table.PrimaryKey
If sqlWhere.Length > 0 Then
sqlWhere.Append(" and ")
End If
sqlWhere.Append("([")
sqlWhere.Append(col.ColumnName)
sqlWhere.Append("] = @org")
sqlWhere.Append(col.ColumnName)
sqlWhere.Append(" or [")
sqlWhere.Append(col.ColumnName)
sqlWhere.Append("] = @new")
sqlWhere.Append(col.ColumnName)
sqlWhere.Append(")")
cmd.Parameters.AddWithValue("old" +
_
col.ColumnName, col.DataType)
cmd.Parameters.AddWithValue("new" +
_
col.ColumnName, col.DataType)
Next
For Each col As DataColumn In
group
If sqlSet.Length > 0 Then
sqlSet.Append(", ")
End If
sqlSet.Append("[")
sqlSet.Append(col.ColumnName)
sqlSet.Append("] = @new")
sqlSet.Append(col.ColumnName)
If sqlWhere.Length > 0 Then
sqlWhere.Append(" and ")
End If
sqlWhere.Append("([")
sqlWhere.Append(col.ColumnName)
sqlWhere.Append("] = @org")
sqlWhere.Append(col.ColumnName)
sqlWhere.Append(" or [")
sqlWhere.Append(col.ColumnName)
sqlWhere.Append("] = @new")
sqlWhere.Append(col.ColumnName)
sqlWhere.Append(")")
If Not
cmd.Parameters.Contains("old" +
col.ColumnName) Then
cmd.Parameters.AddWithValue("old" +
col.ColumnName, _
col.DataType)
End If
If Not
cmd.Parameters.Contains("new" +
col.ColumnName) Then
cmd.Parameters.AddWithValue("new" +
col.ColumnName, _
col.DataType)
End If
Dim commandText As
String
commandText = "Update [{0}] Set {1} Where
({2})"
cmd.CommandText = String.Format(commandText, _
table.TableName, sqlSet.ToString(), sqlWhere.ToString())
Next
Return cmd
End Function
Listing 6: The CreateUpdateCommand() function.
The CreateUpdateCommand() function creates a single SQL
update command for a group of fields. The SQL Where clause consists of the
primary key of the table as well as the fields being updated. Each field is
compared both against the old value as well as its new value. This is done in
order to avoid seeing the same change by two users as a conflict where it
really doesn’t matter that a field is overwritten with the same value.
Conclusion
While the technique described above isn’t a complete
solution for all update concurrency issues I believe it is a step in the right
direction. It is a work in progress and I feel that the best way to handle the
separate cases is still to be discovered. In the mean time the solution above
is a great help in creating applications that are friendlier and less
technology oriented in their nature.
Maurice de Beijer is an independent software
developer, beta tester and a recipient of the MVP award. He specializes in
.NET, object orientation, Visual FoxPro and solving technically challenging
problems. Maurice is The Problem Solver and can be reached at mauricedb@computer.org or at www.TheProblemSolver.nl.