In earlier posts I’ve written a lot about the various approaches to multiprocessing and the potential benefits. What I haven’t talked much about is the risk and commitment one must accept when embracing these technologies. The efforts go well beyond the initial investigation and implementation phases, to ongoing testing and maintenance. It is a continual obligation to assure correctness throughout the life-cycle of a multithreaded product.
What is the risk? In a nutshell, I would say non-deterministic behavior, which can show up subtly, as a slightly different result from an operation, drastically, as an application crash, or as just about anything in between. Whether it is caused by operations being performed in a different order or by obscure race conditions, this behavior is unacceptable.
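To make the "slightly different result" case concrete, here is a minimal Python sketch. It is not ACIS code, and the function names are made up for illustration. It shows how the same three numbers can add up to different floating-point totals depending only on how the work is grouped, which is what happens when a parallel reduction combines partial sums in a different order than a serial loop would.

```python
def left_to_right_sum(values):
    """Add values one at a time, the way a serial loop would."""
    total = 0.0
    for v in values:
        total += v
    return total


def two_worker_sum(values, split):
    """Imitate two workers: each sums its own slice, then the partials are combined.

    No threads are used here; the point is only that the grouping of the
    additions changes, which is enough to change the rounding.
    """
    left = left_to_right_sum(values[:split])
    right = left_to_right_sum(values[split:])
    return left + right


values = [0.1, 0.2, 0.3]
serial_total = left_to_right_sum(values)           # (0.1 + 0.2) + 0.3
regrouped_total = two_worker_sum(values, split=1)  # 0.1 + (0.2 + 0.3)

print(serial_total, regrouped_total, serial_total == regrouped_total)
```

The two totals differ only in the last bit, but a difference that small is enough to flip a tolerance comparison somewhere downstream, which is why order-dependent results are worth testing for explicitly.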
There is little more frustrating than trying to reproduce a critical customer issue that occurs randomly and only when multithreading is enabled. This is also a double whammy in that it not only takes a lot of effort to report such issues, but also to diagnose and correct them. Fortunately there are ways to significantly reduce the potential for these sorts of issues with a modicum of commitment.
What is the commitment? It is a gradual increase in checks and balances that ultimately lead to a high level of confidence. This ranges from increased awareness and understanding in the development organization to a rigorous testing infrastructure. Investments made here will certainly pay off in the long run.
Spatial has accepted the risks associated with making ACIS thread-safe and has firmly committed to assuring correctness. Although this was a difficult decision, with an ongoing burden, we knew it was necessary to allow our customers to take advantage of the multi-core revolution. We wanted to make sure that ACIS wasn’t the bottleneck.
To understand what it took for us to fully support our decision, consider the major areas we focused on:
Spreading the knowledge
In my experience, good developers are willing to tackle just about anything, from porting code to AIX, to writing unit tests, to authoring documentation. The same holds true for multiprocessing. I bet most developers would welcome the opportunity to learn about developing thread-safe code. We took this a step further: we put a group of developers in a room and let them learn how to add multiprocessing to ACIS functionality. The result of this exercise is our multithreaded entity-point-distance functionality, which scales nearly linearly with the number of available processors.
Developing test suites
A good test is worth a thousand … somethings? The truth is that we rely on our tests, so it behooves us to invest in good ones. We’ve developed a fair number of tests, not only to make sure things work correctly but also to make sure the results remain identical with and without multithreading.
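A test of the "identical with and without multithreading" kind might look roughly like the Python sketch below. This is only an illustration of the idea, not our actual test suite: `distance_to_origin` is a hypothetical stand-in for a real modeling operation, and the worker count and repeat count are arbitrary.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def distance_to_origin(point):
    """Hypothetical stand-in for an expensive per-item modeling operation."""
    x, y, z = point
    return math.sqrt(x * x + y * y + z * z)


def run_serial(points):
    return [distance_to_origin(p) for p in points]


def run_threaded(points, workers=4):
    # Executor.map yields results in input order, whatever order the
    # worker threads happen to finish in.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(distance_to_origin, points))


def results_identical(points, repeats=20):
    """Compare the threaded results to the serial baseline, exactly, several times.

    Repeating the run matters: a race may only show up occasionally.
    """
    expected = run_serial(points)
    return all(run_threaded(points) == expected for _ in range(repeats))


points = [(i * 0.5, i * 0.25, -i * 0.125) for i in range(1000)]
print(results_identical(points))
```

The comparison is deliberately exact rather than tolerance-based. If the threaded path is allowed to drift "a little", it becomes much harder to tell a harmless difference from the first symptom of a race.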
Employing dedicated hardware
Having machines with varying configurations is an important aspect of commercial testing environments. In the multiprocessing case, it is paramount. For example, dual core machines may never exhibit a problem that routinely occurs on 8 or 16 core machines. Spatial has made investments in systems that contain anywhere from two to 48 cores. Additionally, most of our developers have at least eight cores in their primary systems.
Developing in-house tools
Trust is a good thing, but apparently proof is better. Developing in-house tools to enforce coding guidelines is a common practice. For instance, we’ve developed a tool that routinely checks our code base for newly added global and static variables. After all, these are the root of the problem for thread-safety. The tool is another safety net to help us address issues efficiently.
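To give a flavor of what such a check can look like, here is a deliberately crude Python sketch. It is not our tool. It assumes C++ source, only looks for the `static` keyword (so globals declared without it go unnoticed), skips `const` and `constexpr` declarations, and uses a line-based regular expression where a real tool would want a proper parser. The names `find_static_variables` and `newly_added` are invented for this example.

```python
import re

# Matches lines that declare a mutable static variable:
#   - starts with `static`, not followed by const/constexpr
#   - reaches a `;` or `=` before any parenthesis, which filters out
#     static function declarations and definitions
STATIC_VAR = re.compile(r"^\s*static\b(?!\s+(?:const|constexpr)\b)\s+[^()]*[;=]")


def find_static_variables(source):
    """Return (line_number, stripped_text) for each suspicious line."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        if STATIC_VAR.match(line):
            hits.append((number, line.strip()))
    return hits


def newly_added(source, baseline):
    """Report declarations that are not in the already-reviewed baseline set."""
    return [text for _, text in find_static_variables(source) if text not in baseline]


SAMPLE = """\
static int call_count = 0;
static const double tolerance = 1e-6;
static void helper(int x);
int compute(int x) {
    static double cache[16];
    return x;
}
"""

reviewed = {"static int call_count = 0;"}
print(find_static_variables(SAMPLE))
print(newly_added(SAMPLE, reviewed))
```

Keeping a baseline of declarations that have already been reviewed means the check only complains about new ones, which is what makes it practical to run routinely against a large, existing code base.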
Utilizing commercial tools
In my opinion, a commercial grade race detector is an absolute necessity when developing multithreaded code. I’ve been especially vocal about this to the folks at Microsoft. However, Intel beat them to the punch with their Parallel Studio toolset. Specifically, Parallel Inspector is worth its weight in platinum. I use it regularly and we are now in the process of integrating it into our automated testing system to help discover data races.
Our commitment to multiprocessing has involved cross-team projects to spread knowledge, purchasing dedicated multi-core hardware, developing specialized test suites, developing in-house tools to analyze code, and making commercial tools such as race-detectors and performance analyzers available to our development staff. It may sound like a lot, but it has become a part of our DNA.
I think it’s fair to say that most rewards involve taking risks and making commitments. Improving your application’s performance by factors that go well beyond what is possible with even the best serial code is well worth the investment. What’s keeping you from taking the leap?