Designing for Concurrency

by Kevin Garwood

Designing for concurrency helps ensure that when multiple threads of execution try to access the same resource at the same time, any contentions are resolved so that the behaviour of the code remains predictable. Without this guarantee, a test case that passes in one run may fail in another, and unexpected errors may occur in scenarios that cannot be easily replicated.

In single-threaded programs, it is easy to trace a flow of execution. We can rely on a predictable sequence of steps, which helps make a piece of code behave predictably. In multi-threaded programs, Thread A may execute a few steps before being paused to allow another thread, Thread B, to execute its own sequence of steps. The order in which the threads execute their code may become unpredictable. If Thread A and Thread B are both trying to access the same object X, the sequence of steps that is executed could leave X in an undetermined state.

In single-threaded programs, we can often rely on a body of test cases to help us identify problems in code we've written. If we change a piece of code and test cases suddenly begin to fail, we can assume that each time we run the same group of test cases, the same failures will occur. The consistent outcome of running the test case suites helps narrow the sources of problems until they can be identified and fixed.

In testing multi-threaded programs, the scenarios that lead to contention amongst multiple threads may not occur every time the same test suites are run. Some scenarios may never happen unless the software is tested with specific data sets and there are hundreds or thousands of execution threads running at the same time.

Fixing multi-threading errors can be very difficult because detecting them in the first place is very difficult. Furthermore, in a project with few developer resources, identifying and fixing concurrency bugs can take longer than it would in a project with more people. The difficulty of fixing multi-threading errors provides strong motivation to invest in a strategy that prevents them from happening in the first place. We break our approach down into three parts. First, we take a high-level view to examine how the different tools might be used together in ways that could encourage concurrent access problems to occur; the goal of this analysis is to identify how the way the tools are created and used can reduce the likelihood of problematic patterns of use. Second, we look at the code base in general and develop ways to minimise the likelihood that executing the code will cause concurrency problems. Third, we identify specific resources that may cause contention amongst threads and try to protect them.

Identifying scenarios of resource contention amongst tools in the RIF tool suite

For this analysis, we identify three types of concurrent access scenarios which might result in concurrency problems. These are:

simultaneous reads, where two tools read the same data at the same time
simultaneous writes, where two tools write to the same record at the same time
simultaneous read and write, where one tool reads data that another tool is updating

Two tools reading the same data at the same time shouldn't produce concurrency problems, provided the data are not being updated while they do it. Therefore, we exclude this scenario from further consideration. Simultaneous writes by different tools may leave a database record in an undetermined state. If one tool is updating a database that is being read by another tool, there is the possibility of a stale read problem.

For each of simultaneous writes and simultaneous read and write, we examine different pairings of tools.

Concurrency-1 : Identify scenarios for concurrent access that may occur in interactions between various pairings of tools in the tool suite. Systematically evaluate potential scenarios for concurrent writes and stale reads.

Initial Identification of potential stale read scenarios

In the table below, each cell should be read as: the tool named in the row is reading data that is being produced by the tool named in the column.

                             Study Submission    Study Result Viewer    IT Governance    Data Loader
Study Submission Tool        Y                   N                      N                Y
Study Result Viewer Tool     N                   Y                      Y                Y
IT Governance Tool           Y                   N                      Y                Y
Data Loader Tool             N                   N                      N                Y

An "N" means there is no potential for a stale read situation because the tools do not interact with each other in that way. A "Y" means there is potential for one tool to read data that are being produced by another. For example, the Study Submission Tool does not read data written by the Study Result Viewer, so that relationship is labeled "N". The Study Submission Tool could be reading data that are being updated by the Data Loader Tool, so that cell is assigned a "Y".

Initial Identification of potential concurrent write scenarios

In the table below, each cell should be read as: the tool named in the row and the tool named in the column are trying to write to the same record at the same time.

                             Study Submission    Study Result Viewer    IT Governance    Data Loader
Study Submission Tool        Y                   N                      N                N
Study Result Viewer Tool     N                   Y                      N                N
IT Governance Tool           N                   N                      Y                N
Data Loader Tool             N                   N                      N                Y

We can eliminate some problem areas by prescribing how the tool suite should be used. For example, it is common practice to suggest shutting down end-user tools while the database on which they depend is being significantly updated. We anticipate that data loading activities will happen infrequently throughout a year, but when they do happen they may last for a few days.

For some software projects, it is extremely important to develop high-availability systems with very little down-time. Developers may create sophisticated ways of allowing users to continue using data while the data sets are being altered. However, in our project, we do not expect that our users would be significantly inconvenienced if they could not access the system during a few periods of a few days each.

Concurrency-2 : When the Data Loader Tool is loading data sets, the other tools should not be in use. When end user tools are being used, the Data Loader Tool should not be in use.

By prescribing restrictions for how the tools should be used in practice, we are reducing the likelihood that certain kinds of stale read and concurrent write problems will occur.

Another way to reduce concurrency problem areas is to ensure that a new study has been completely created before it becomes visible to any part of either the Study Submission Tool or the Study Result Viewer Tool. The Study Submission Tool will allow users to create new studies by copying existing ones. If a study has only been partially created, there may be inconsistencies between the properties of a copied study and the properties of the original, which may still be in the process of changing. It would also be a problem if the Study Result Viewer could try to obtain results for studies which have only been partially created.

Concurrency-3 : The Study Submission Tool should submit a new study as part of a single atomic database transaction. Once a study has been created, it becomes immutable; however, it can be copied to make a new study. Until a study description has been completely created in the database, neither the Study Submission Tool nor the Study Result Viewer Tool should be able to access it.
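
As an illustration, a submission method in the middleware could wrap all of its inserts in a single transaction so that a study either appears completely or not at all. The sketch below only shows the commit/rollback pattern and assumes the database is reached through JDBC; the class, method and table names are hypothetical rather than taken from the RIF code base.

	// Minimal sketch: submit a study as one atomic transaction.
	// StudySubmissionExample and t_rif_studies are invented names.
	import java.sql.Connection;
	import java.sql.PreparedStatement;
	import java.sql.SQLException;

	public final class StudySubmissionExample {

		public void submitStudy(
			final Connection connection,
			final String studyName,
			final String description)
			throws SQLException {

			boolean originalAutoCommit = connection.getAutoCommit();
			try {
				// Group every insert into one transaction
				connection.setAutoCommit(false);

				String insertStudyQuery
					= "INSERT INTO t_rif_studies(name, description) VALUES(?, ?)";
				try (PreparedStatement statement
					= connection.prepareStatement(insertStudyQuery)) {
					statement.setString(1, studyName);
					statement.setString(2, description);
					statement.executeUpdate();
				}

				// ... further inserts for the rest of the study description ...

				// Only now does the study become visible to other tools
				connection.commit();
			}
			catch (SQLException sqlException) {
				// A partially created study never becomes visible
				connection.rollback();
				throw sqlException;
			}
			finally {
				connection.setAutoCommit(originalAutoCommit);
			}
		}
	}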

We can also limit the number of instances of any one tool that are running. For example, although we anticipate that many people may be using the Study Submission Tool and the Study Result Viewer Tool, it seems clear that only one instance of the Data Loader Tool and one instance of the IT Governance Tool would ever need to run at once.

Concurrency-4 : The tool suite should allow only one instance of the Data Loader Tool and one instance of the IT Governance Tool to run at the same time.
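
The design does not prescribe a mechanism for enforcing this, but one simple approach, sketched below, is for a desktop tool to take an exclusive lock on a well-known file at start-up; a second instance fails to obtain the lock and can exit immediately. The class and lock file name are illustrative only.

	// Illustrative sketch: allow only one running instance of a desktop tool
	// by holding an exclusive lock on a marker file for the life of the JVM.
	import java.io.File;
	import java.io.RandomAccessFile;
	import java.nio.channels.FileChannel;
	import java.nio.channels.FileLock;

	public final class SingleInstanceGuard {

		public static boolean tryAcquire(final String lockFileName) {
			try {
				File lockFile = new File(
					System.getProperty("java.io.tmpdir"), lockFileName);
				RandomAccessFile randomAccessFile
					= new RandomAccessFile(lockFile, "rw");
				FileChannel channel = randomAccessFile.getChannel();
				// tryLock() returns null if another instance already holds the lock
				FileLock lock = channel.tryLock();
				if (lock == null) {
					randomAccessFile.close();
					return false;
				}
				// Deliberately keep the file, channel and lock open until the JVM exits
				return true;
			}
			catch (Exception exception) {
				return false;
			}
		}
	}

A tool such as the Data Loader Tool could call SingleInstanceGuard.tryAcquire("rif_data_loader.lock") at start-up and exit if it returns false.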

By applying these decisions, our matrices of tool interactions now show fewer potential problem areas.

In the table below, each cell should be read as: the tool named in the row is reading data that is being produced by the tool named in the column.

                             Study Submission    Study Result Viewer    IT Governance    Data Loader
Study Submission Tool        N                   N                      N                N
Study Result Viewer Tool     N                   Y                      Y                N
IT Governance Tool           Y                   N                      N                N
Data Loader Tool             N                   N                      N                N

In the table below, each cell should be read as: the tool named in the row and the tool named in the column are trying to write to the same record at the same time.

                             Study Submission    Study Result Viewer    IT Governance    Data Loader
Study Submission Tool        N                   N                      N                N
Study Result Viewer Tool     N                   Y                      N                N
IT Governance Tool           N                   N                      N                N
Data Loader Tool             N                   N                      N                N

We can also limit some concurrency concerns if we recognise that some tools will be web applications and others will be desktop applications. Both the Study Submission Tool and the Study Result Viewer will be produced as web applications. The web services we develop cannot make assumptions about the order in which service methods are called. Calls from the web applications to the web services are made asynchronously, and the URLs submitted to the web services could easily be submitted in a different order by a malicious script.

In contrast, the Data Loader Tool and the IT Governance Tool will be Swing-based desktop applications. Whereas the web server cannot make assumptions about what is submitting URL requests, we can guarantee that end users of the RIF Manager tools will be forced to use the Swing-based electronic forms. We can rely on the Swing-based applications to use guided data entry features and modal dialogs to limit the order in which service methods are called. Furthermore, these tools can support many of their features without having to make asynchronous calls.

Concurrency-5 : In the Swing-based Data Loader Tool and Information Governance Tool, rely on guided data-entry features and modal dialogs in the GUI to help limit the likelihood of concurrency problems. For the web-based Study Submission and Study Result Viewer tools, assume that they may call any service method in any order.

Minimising Concurrency Problems in the Code Base

In this section, we set aside concurrent access activities at the tool level and focus on concurrent access activities at the level of the code base. Our goal here is to produce service methods which are thread-safe: they exhibit predictable behaviour in a multi-threaded environment, regardless of the order in which threads are run.

First, we should recall one of the aspects of our coding philosophy that governs the whole body of the middleware. Preventing concurrency problems from happening depends heavily on adopting the best-practice coding conventions described later in this section. These coding techniques must be adopted uniformly across the code base so that we can eliminate certain scenarios from consideration as we identify the sources of potential concurrency problems.

Seasoned developers may recognise that a coding technique is required in one situation but unnecessary in another, similar situation; experience can inform when to use a technique and when not to. However, in the interest of maintaining a thread-safe code base with developers who have little experience, it is best that, by default, a coding technique is applied to all similar cases for a given kind of scenario. By following a set of conventions, a novice developer can help make a method thread-safe without knowing why it works. Although this benefit carries some risk, for small-budget projects the team may have little choice.

General Design-8 : Apply the steps of the coding philosophy:
  1. get it working
  2. get it working for the next developer
  3. get it working well

As well, favour coding to convention over coding to circumstance.

The next step in our approach is to review another of our general decisions, which is to minimise the number of execution paths in the code and to force front-end applications to interact with the database via the service methods provided by the middleware. Taking these steps does not fix concurrency problems, but it helps limit which pathways through the code are likely to see contention amongst many threads.

General Design-4 : Wherever possible, limit the paths of execution that are likely to occur.

Here we summarise three techniques which are discussed in other areas of the design manual:

As another measure to help minimise the code execution scenarios that could go wrong, we should also decide how we support or restrict sub-classing. We have limited our commitment to software reuse by indicating that the services may be reused by other client applications. It must first be proven that the RIF code base works to support its current goals before we can give priority to making it reusable in use cases that were not initially envisioned for our project. We do not predict any urgent need for other developers to sub-class the Java classes in the middleware. It is likely that only people on the RIF team will need to subclass the class files, and they will do so only after they can justify extending them. Therefore, we again present one of the general design decisions for limiting subclassing.

General Design-9 : If a class is not abstract, it should be marked "final".

Once we have minimised the number of likely execution paths through the code base, we can focus on building a collection of test suites that exercise the service methods and business classes used by client applications. One important part of designing for concurrency is to ensure that the test cases pass in a single-threaded environment first. Maintaining testability in a single-threaded context allows us to eliminate other causes of error before we conclude that a failed test case is caused by contention amongst threads. If versions of the test cases pass in a single-threaded environment but fail in a multi-threaded environment, we can be more confident that a concurrency problem accounts for the difference.

Concurrency-6 : Maintain test suites that will pass in a single-threaded environment. Then adapt copies of those test suites for a multi-threaded environment. Use the differences in pass results to identify potential concurrency problems.
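
For example, a test that passes when a service method is called once on the main thread can be adapted to call the same method from many threads at the same moment. The sketch below is a hypothetical multi-threaded variant; the ExampleStudyService interface and getGeographies() method are invented stand-ins for a RIF service.

	import java.util.List;
	import java.util.concurrent.CountDownLatch;
	import java.util.concurrent.ExecutorService;
	import java.util.concurrent.Executors;
	import java.util.concurrent.TimeUnit;
	import java.util.concurrent.atomic.AtomicInteger;

	// Stand-in for one of the RIF service interfaces; invented for this sketch.
	interface ExampleStudyService {
		List<String> getGeographies() throws Exception;
	}

	public final class ConcurrentServiceTest {

		// Calls the same service method from many threads at once and reports
		// how many calls failed. The single-threaded version of the test would
		// use numberOfThreads = 1 and should always report zero failures.
		public static int countFailures(
			final ExampleStudyService service,
			final int numberOfThreads)
			throws InterruptedException {

			ExecutorService executor
				= Executors.newFixedThreadPool(numberOfThreads);
			CountDownLatch startSignal = new CountDownLatch(1);
			AtomicInteger failures = new AtomicInteger();

			for (int i = 0; i < numberOfThreads; i++) {
				executor.submit(() -> {
					try {
						// Make all threads call the service at the same moment
						startSignal.await();
						service.getGeographies();
					}
					catch (Exception exception) {
						failures.incrementAndGet();
					}
				});
			}
			startSignal.countDown();
			executor.shutdown();
			executor.awaitTermination(1, TimeUnit.MINUTES);
			return failures.get();
		}
	}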

We need to decide to what extent we should trust the database to manage concurrent access scenarios. Microsoft SQL Server, PostgreSQL and Oracle each have long-established mechanisms for handling scenarios where database queries are submitted simultaneously. However, it is unclear whether they all handle these scenarios in the same way.

If the RIF supported only one type of database, then we could choose a database driver that was considered thread-safe. We could then rely on the database to handle problems with two threads trying to concurrently access the same database connection object. However, we need to support multiple database vendors and cannot assume that the database driver will be considered thread-safe. Therefore, we must assume that the middleware is responsible for ensuring that there are no concurrent access problems for its shared resources.

Concurrency-7 : Assume that various database vendors may not support concurrent access to records in exactly the same way. Further assume that none of the JDBC database connections are thread-safe, even though for some databases they definitely are. Assume that the middleware must take its own measures to mediate concurrent access for queries it applies to the database.

It is also important that we make some assumptions about how the client applications and the RIF database will handle concurrency issues. Although we would hope that front-end client applications would not cause concurrency problems, it is best if we assume they will. We should assume that a client application may call any of the service methods in any order. We should also assume that a client application may have multiple threads which could attempt to alter parameter values it passes to service methods.

Concurrency-8 : Assume that the front-end client applications will have multiple threads that may each be trying to alter parameter values the applications pass to the middleware service methods.

One way to eliminate the side effects of concurrency errors induced by the client applications is to "safe-copy" the values they pass to service methods. In safe copying, a deep copy of each formal parameter is made as the first step in the method. All subsequent actions applied to the parameter values affect the local copies of the objects rather than the originals. If the client application has threads that change the values of the objects it passes to the middleware, the middleware will not be affected because it is using its own copies.

Concurrency-9 : Perform a deep-copy of any mutable object that is passed to a service method by a client application. This action will prevent the client from altering parameter values while they are being processed by the middleware.
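
In code, this means the first statement of a service method takes its own copy of each mutable parameter and never touches the original again. The sketch below assumes the HealthCode class shown later in this section, a createCopy(...) factory method (also described later) and a getDescription() accessor; the method itself is invented for illustration.

	// Sketch: safe-copy the mutable parameter before doing anything else.
	// The method and the getDescription() accessor are invented examples.
	public String getHealthCodeDescription(final HealthCode _healthCode) {
		// First statement: take a private deep copy of the caller's object
		HealthCode healthCode = HealthCode.createCopy(_healthCode);

		// From here on, only the local copy is used; if a client thread
		// mutates _healthCode, this method is unaffected
		return healthCode.getDescription();
	}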

If the middleware assumes that any of its methods could be called in any order by asynchronous threads, then it should try to minimise how much it needs to 'remember' from one call to the next. Each thing it remembers forms part of its state, and if multiple threads can access that state, they may contend for it simultaneously. Minimising the state of the middleware invites us to explore ways of reducing the scope of variables so they cannot be accessed simultaneously by multiple threads.

Suppose a variable X is declared as a class variable that is accessible through public methods. Multiple threads calling those methods may contend when trying to change or read the state of X. However, if X is declared as a local variable within a method, then each thread that calls the method maintains its own copy of the variable. If it is possible to move a variable declaration from the class to within a method, it is desirable to do so. The move helps guarantee that the variable will only be accessed by a single calling thread.

Concurrency-10 : Where it is possible, favour using method variables over using class variables. Class variables can be accessed by multiple threads, but local variables in methods will be used by only a single calling thread.

It is also desirable to limit the lifetime of variables. Class variables have a lifetime that is the lifetime of the object, whereas method variables exist only for the duration needed to execute the method. Limiting the lifetime of a variable reduces the opportunities that multiple threads could have to use it at the same time.

Concurrency-11 : Declare variables as close as possible to when they will be first used.
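
The contrast can be illustrated with a small, invented example. In the first (commented-out) form, a single StringBuilder would be shared by every thread that calls the method; declaring it inside the method, immediately before it is first used, gives each calling thread its own instance.

	// Invented example contrasting a shared class variable with a method variable.
	public final class ReportFormatter {

		// Risky: one buffer shared by every thread that calls formatLine()
		// private final StringBuilder sharedBuffer = new StringBuilder();

		public String formatLine(final String name, final int total) {
			// Safer: declared locally, immediately before first use,
			// so each calling thread works on its own instance
			StringBuilder buffer = new StringBuilder();
			buffer.append(name);
			buffer.append("=");
			buffer.append(total);
			return buffer.toString();
		}
	}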

Another way to reduce potential concurrency problems is to declare class variables as final wherever possible. A final variable is assigned exactly once, either as part of its declaration or, for static fields, in a static initialisation block that runs when the class is loaded by the JVM. Marking class variables as final effectively turns them into constants: if some part of the class code tries to reassign the value of a final variable, the compiler will reject it, so threads cannot race to change its value.
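
A generic illustration of the pattern is shown below; the class and the values it holds are invented rather than taken from the RIF code base.

	// Invented example: final class variables assigned once, at class-load time.
	import java.util.Collections;
	import java.util.HashSet;
	import java.util.Set;

	public final class DatabaseConstants {

		// Assigned as part of the declaration
		private static final int MAXIMUM_CONNECTIONS = 10;

		// Assigned in a static initialisation block when the class is loaded
		private static final Set<String> SUPPORTED_DATABASES;
		static {
			Set<String> databases = new HashSet<String>();
			databases.add("PostgreSQL");
			databases.add("SQL Server");
			// Wrapping the set also protects its contents from later changes
			SUPPORTED_DATABASES = Collections.unmodifiableSet(databases);
		}

		public static int getMaximumConnections() {
			return MAXIMUM_CONNECTIONS;
		}

		public static Set<String> getSupportedDatabases() {
			return SUPPORTED_DATABASES;
		}
	}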

One common source of concurrency problems is an object being used before it has finished being instantiated. In multi-threaded execution environments, it is considered best practice to ensure objects are "safely constructed". One common way of implementing safe construction is to create new instances through a static factory method rather than through the class's constructors. To facilitate safe construction, all the business classes have private constructors and a static method called newInstance().

The following example from the code base shows how this is done:

public class HealthCode 
	extends RIFAbstractConcept {

...

	// Private constructor: client code cannot instantiate the class directly
	private HealthCode() {
		isTopLevelTerm = false;
		code = "";
		nameSpace = "";
		description = "";
		numberOfSubTerms = 0;		
	}

	// Static factory method: the reference is only returned once the
	// object has been fully initialised
	public static HealthCode newInstance() {
		HealthCode healthCode = new HealthCode();
		return healthCode;
	}	
...
}

We use a similar approach for deep copying an object. Each class has a static method called createCopy(...) which also supports safe construction. These methods prevent the reference to the created object from becoming visible to other threads until it has finished being created and initialised.
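
The design manual does not reproduce a createCopy(...) method here, so the sketch below suggests what one might look like inside the HealthCode class from the earlier example; the actual RIF implementation may differ in its details.

	public static HealthCode createCopy(final HealthCode originalHealthCode) {
		if (originalHealthCode == null) {
			return null;
		}
		// Create and fully initialise the copy before returning the
		// reference, so other threads never see a partially built object
		HealthCode cloneHealthCode = new HealthCode();
		cloneHealthCode.isTopLevelTerm = originalHealthCode.isTopLevelTerm;
		cloneHealthCode.code = originalHealthCode.code;
		cloneHealthCode.nameSpace = originalHealthCode.nameSpace;
		cloneHealthCode.description = originalHealthCode.description;
		cloneHealthCode.numberOfSubTerms = originalHealthCode.numberOfSubTerms;
		return cloneHealthCode;
	}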

Concurrency-12 : Support "safe construction" of business objects. In business classes, make constructors private and rely on static factory methods newInstance(...) and createCopy(...) to control the creation of new objects.

We can use the same safe construction technique to help control the creation of manager class objects that may be used by multiple services.

Identifying and protecting resources that could experience contention by multiple threads

In the previous section, we took steps to help reduce the likelihood of concurrency problems happening in the code base in general. Now we will try to identify specific places within the code which may cause contention.

We begin by identifying class variables that are managed within the classes that implement the service APIs. We will focus on services supporting study submission and viewing study results because at this point in the project these services are the most well-developed.

Business Classes

Business classes are used to create input parameter values and return values for service methods. Once client applications pass business objects to service methods, the objects which could be mutable are safe-copied so that the middleware uses local copies. The safe copying activity relies on safe-construction methods which ensure that objects are properly instantiated before they are referenced by calling code.

Neither the classes that implement the service APIs nor the manager classes they use for delegation retain references to these business objects as class variables. The only exception is that the SQLConnectionManager retains a hash map of registered userIDs, which is discussed later.

References to safe-copied parameter values are passed to various methods that eventually delegate to the manager classes. In general, the business objects that are used as service parameter values are 'alive' for as long as it takes to execute a service method and are considered thread-bound.

The main exception is calls to the HealthCodeManager, which in some scenarios passes HealthCode objects to abstract taxonomy services. In theory these taxonomy services could hold onto references to the HealthCode objects. We do not control how other developers might implement rifServices.taxonomyServices.HealthCodeProviderInterface, but in our own implementation of taxonomy services, the classes do not retain the references.

Business class objects can also be created when the manager classes convert result sets into a form that can be returned to the client by the middleware. For example, the ResultSet returned from a SQL query for geo level select values is copied into a collection of GeoLevelSelect business objects. Again, the creation of the business objects uses the safe construction provided by the static factory method newInstance().
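
The conversion follows a pattern roughly like the sketch below; the table name, column and setName() method are stand-ins rather than the project's actual schema or API.

	// Sketch of copying a ResultSet into business objects via newInstance().
	// The SQL, column name and setName() method are illustrative stand-ins.
	import java.sql.Connection;
	import java.sql.PreparedStatement;
	import java.sql.ResultSet;
	import java.sql.SQLException;
	import java.util.ArrayList;

	public final class GeoLevelSelectExample {

		public ArrayList<GeoLevelSelect> getGeoLevelSelectValues(
			final Connection connection)
			throws SQLException {

			ArrayList<GeoLevelSelect> results = new ArrayList<GeoLevelSelect>();
			String query
				= "SELECT geolevel_name FROM t_rif_geolevels ORDER BY geolevel_name";
			try (PreparedStatement statement = connection.prepareStatement(query);
				ResultSet resultSet = statement.executeQuery()) {

				while (resultSet.next()) {
					// Safe construction: the object is fully initialised
					// before it is added to the result collection
					GeoLevelSelect geoLevelSelect = GeoLevelSelect.newInstance();
					geoLevelSelect.setName(resultSet.getString(1));
					results.add(geoLevelSelect);
				}
			}
			return results;
		}
	}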

In general, the lifetime of result objects will last as long as client applications hold references to them. For the web-based applications, the objects will not be referenced after web service methods have serialised them into JSON text. Therefore, we don't expect that references to old result objects will linger in memory and be vulnerable to be accessed by other threads.

The result objects may persist in the Swing-based Data Loader Tool and Information Governance Tool. Here it is possible the client applications could introduce multiple threads which may try to simultaneously access the same result object. However, wherever possible, we try to minimise this possibility.

If the client applications exhibit multi-threaded access to business objects, the middleware should not be affected. In the case of parameter values for service methods, the values are safe-copied and those copies are then thread-bound. In the case of return values, the middleware does not retain references to result objects it sends back to the client applications.

The remaining area of concern about concurrency problems with business objects is whether the middleware itself may create multi-threading activity that could contend for them. An inspection of the code shows that neither the service implementation classes nor the manager classes spawn and manage new threads that could cause a problem.

Concurrency-13 : The business classes do not appear to be used in any scenario which would cause them to be the source of contention in a way that would affect the middleware.

Data Storage Layer Classes

The data storage layer contains three major kinds of classes that we will examine: the query formatter classes, the manager classes, and the service classes that implement the service APIs.

Instances of the query formatter classes are always declared as local variables within methods of manager classes. They are thread-bound and are not exposed to any middleware code that spawns different threads. Therefore, objects from these classes will not present concurrency problems.

Instances of the manager classes in rifServices.dataStorageLayer are created in rifServices.dataStorageLayer.RIFServiceResources, which then shares them between the RIFStudySubmissionAPI and RIFStudyResultRetrievalAPI services. The RIFServiceResources class uses a static factory method to ensure that the manager objects are fully created before they can be used.

The manager classes have very little state that could be subject to contention by multiple threads. For most of them, their state is limited to references to other manager classes. Some also retain a reference to a RIFStartupOptions object, which contains configuration settings that help govern the runtime behaviour of the RIF services. RIFStartupOptions is created through a static factory method using the same safe construction technique, so it cannot be used before it has been initialised. Its startup properties are set only once, by reading them from a properties file before the services become active, and the class exposes no set methods to other threads. The public methods of the manager classes assume that most of their parameter values are objects that were safe-copied from the original parameter values passed by the client applications.

The only parameter that could be vulnerable to multi-threading problems is the Connection object. However, in any service method, the service class obtains a single Connection object from a pool and threads it through all of the manager methods it uses.
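
The shape of that pattern is sketched below; the User and RIFServiceException types, the connection manager method names and the manager call are representative stand-ins rather than the exact RIF API.

	// Sketch: one pooled Connection per service call, threaded through the
	// manager methods. The manager names and methods are stand-ins.
	public ArrayList<HealthCode> getTopLevelHealthCodes(final User _user)
		throws RIFServiceException {

		// Safe-copy the parameter before doing anything else
		User user = User.createCopy(_user);

		// One connection for the whole service call, taken from the pool
		Connection connection
			= sqlConnectionManager.assignPooledReadConnection(user);
		try {
			// The same connection object is passed to every manager method
			// used to answer this call
			return healthCodeManager.getTopLevelCodes(connection, user);
		}
		finally {
			// Return the connection to the pool, even if an error occurred
			sqlConnectionManager.reclaimPooledReadConnection(user, connection);
		}
	}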

Next, we examine the potential shared resources that could be used by the service classes. It is possible that the Study Submission and Study Result Retrieval services could be running at the same time. They share a lot of common code and access the same set of manager objects. To ensure that they both share the same manager objects and that these manager objects are not referenced before they have finished being initialised, we use the safe construction technique again.

The class rifServices.dataStorageLayer.RIFServiceResources is designed to create and properly initialise all the manager objects before the services use them. The class centralises initialisation code that would otherwise be vulnerable to concurrent access problems caused by multiple services trying to share manager objects.
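
In outline, the class applies the same newInstance() pattern as the business classes: the managers are created and wired together before any service can obtain a reference to them. The sketch below is a simplified, illustrative outline; the actual constructor arguments, fields and accessors in the RIF code base will differ.

	// Simplified outline of the RIFServiceResources pattern: all managers are
	// created and initialised before the instance becomes visible to services.
	public final class RIFServiceResources {

		private final SQLConnectionManager sqlConnectionManager;
		private final HealthCodeManager healthCodeManager;

		private RIFServiceResources(final RIFStartupOptions startupOptions) {
			// Every manager is fully initialised inside the constructor
			sqlConnectionManager = SQLConnectionManager.newInstance(startupOptions);
			healthCodeManager = HealthCodeManager.newInstance(startupOptions);
		}

		public static RIFServiceResources newInstance(
			final RIFStartupOptions startupOptions) {
			// The reference is only returned once construction has finished
			return new RIFServiceResources(startupOptions);
		}

		public SQLConnectionManager getSQLConnectionManager() {
			return sqlConnectionManager;
		}

		public HealthCodeManager getHealthCodeManager() {
			return healthCodeManager;
		}
	}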

Besides the manager classes, the other main source of contention would be the database connection objects. If we were only using PostgreSQL, then we could rely on the JDBC driver to create a thread-safe connection object. However, it is unclear whether the drivers that support Microsoft SQL Server behave in the same way.