Software engineering for resilient systems

If the system architecture is known, what critical system components such as subsystems, hardware, and software are needed to provide these capabilities and services? Similarly, what critical system data must be protected in terms of availability, confidentiality, and integrity?

Are there any system-external assets on which the system services depend and for which the system is responsible?

Determine Potential Harm. What kind of harm can adversities cause to these critical assets that would result in a loss or degradation of critical system capabilities or services?

Determine Maximum Acceptable Harm. What are the maximum acceptable amounts of harm that adversities can cause to the assets needed to maintain critical capabilities and services? For capabilities and services, consider setting limits of several types. Capability and service degradation might be measured in terms of decreased performance and might depend on the mode of operation, such as operational, training, exercise, maintenance, and update.

Prioritize assets and harm. Sufficient resources are rarely available to ensure adequate resilience under all credible circumstances. Analysis must therefore be limited based on the prioritization of the assets and associated harm.

Develop associated resilience requirements. Based on the maximum acceptable harm to critical assets, develop high-level system resilience requirements; these are often written as templates with optional clauses enclosed in brackets.

Determine relevant adversities. What types of adverse conditions and events can cause critical services and capabilities to be unacceptably lost or significantly degraded?

For each subordinate quality attribute, consider the associated types of adverse events and conditions that can harm resilience-related critical assets.

Prioritize credible adversities. Because one rarely has the resources to ensure adequate resilience under all credible circumstances, requirements analysis must be limited based on the prioritization of the harm-causing adversities.

Derive associated quality attribute requirements.

Use the prioritized adversities to derive adversity-specific subordinate quality attribute requirements from the resilience requirements for each appropriate quality attribute. Top-level system resilience requirements can be used to derive component- and data-level resilience requirements, as well as to derive subordinate quality attribute requirements.

Many different credible adverse events and conditions can disrupt the same critical system capability. Some of these adversities are independent of each other and are of such low probability that one can reasonably overlook the possibility that these adversities happen simultaneously.
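For independent adversities, the reasoning above is just a product of probabilities. The sketch below illustrates the rule of thumb; the probabilities and the threshold are illustrative assumptions, not values from the text, and the independence assumption does not hold for common-cause adversities.

```python
# Sketch: deciding whether two *independent* adversities must be
# considered jointly. Probabilities and threshold are invented
# for illustration.

def must_consider_jointly(p_a: float, p_b: float,
                          threshold: float = 1e-6) -> bool:
    """For independent adversities, the chance of simultaneous
    occurrence is the product of their probabilities; below the
    threshold it may reasonably be ignored."""
    return p_a * p_b >= threshold

# Two unrelated, individually unlikely events (per day):
print(must_consider_jointly(1e-4, 1e-4))  # product 1e-8: safe to ignore
print(must_consider_jointly(1e-2, 1e-2))  # product 1e-4: must consider
```

Adversities with a common cause must instead be modeled together, since their joint probability can be far higher than the product suggests.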

On the other hand, other adversities can have a common cause, or a sufficiently high probability, that simultaneous occurrences must be considered.

Even though you want to avoid having your customers find your bugs first, you can't spend forever testing your system before releasing it. One reason people don't test before production is that it can be hard to configure all of the things your system depends on. So you need to figure out how to design a system that's easy to test.

It comes back to thinking about modularity and how your software systems fit together. The idea behind TDD is that you should write your tests first, then write stub code that fails the test, and finally write the implementation that makes your test pass.
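That cycle can be shown with a toy example; the `slugify` function and its expected behavior are invented purely for illustration:

```python
# Sketch of the TDD cycle with an invented toy function.

# Step 1: write the test first -- it pins down the behavior we want.
def test_slugify():
    assert slugify("Hello, World!") == "hello-world"

# Step 2: a stub that runs but fails the test (shown commented out):
# def slugify(title):
#     return ""

# Step 3: the implementation that makes the test pass.
def slugify(title: str) -> str:
    # Lowercase, replace non-alphanumerics with spaces, join with hyphens.
    words = "".join(c if c.isalnum() else " " for c in title.lower()).split()
    return "-".join(words)

test_slugify()  # passes silently once step 3 is in place
```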

While this might be ideal, it is still uncommon for teams to follow it strictly. Whether or not you use TDD, you need to write testable code.

So, how do you do it? One set of principles worth testing against, the fallacies of distributed computing, was first described at Sun Microsystems in the early '90s. The fallacies are assumptions that people take for granted when designing systems, and they come back to bite them in production.
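For instance, the first fallacy, "the network is reliable," becomes testable when the code that talks to the network is injected rather than hard-wired. This is only a sketch; `FakeTransport`, `fetch_profile`, and the fallback behavior are all invented names for illustration.

```python
# Sketch: making the "network is reliable" assumption testable by
# injecting the transport. All names here are invented.

class NetworkError(Exception):
    pass

def fetch_profile(user_id, transport):
    """Fetch a user profile, degrading gracefully on network failure."""
    try:
        return transport.get(f"/users/{user_id}")
    except NetworkError:
        # Degraded fallback instead of crashing the caller.
        return {"id": user_id, "name": "<unavailable>"}

class FakeTransport:
    """Test double that simulates an unreliable network."""
    def __init__(self, fail=False):
        self.fail = fail
    def get(self, path):
        if self.fail:
            raise NetworkError(path)
        return {"id": path.rsplit("/", 1)[-1], "name": "Ada"}

# In a test, no real network is needed:
assert fetch_profile("42", FakeTransport())["name"] == "Ada"
assert fetch_profile("42", FakeTransport(fail=True))["name"] == "<unavailable>"
```

The same injection point lets tests simulate high latency or constrained bandwidth without touching a real network.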

The first three in particular (the network is reliable; latency is zero; bandwidth is infinite) are ripe for testing.

Finally, let's talk about change. The blessing and curse of software is that it's easy to change. I love using bridges for analogies because they are so different from software. Changing physical infrastructure is slow, painful, and expensive. Once a bridge is built, no one comes back six months later and tells the builder they have a week to hook it up to a different road 10 miles down the river.

However, every software engineer can tell a story about adapting a working system or library to an unexpected component at the last minute. The only software systems that aren't modified are the ones that no one uses. You need to design your systems for change, or there are going to be weird failures when, not if, you are asked to do so. Luckily, all of the things we've talked about so far enable us to make changes.

If you've used well-understood technologies, properly isolated third-party dependencies, considered non-functional requirements, and have tests to validate that you've got everything right, this becomes vastly easier. Don't compromise these principles when the inevitable new features come along. It's easy to make one-off exceptions for changes, but you should figure out how to best make the new components and code fit into what's already there.
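Isolating a third-party dependency behind an interface you own is one concrete way to keep those principles intact when change arrives. The sketch below uses invented names; `StripeishGateway` stands in for any vendor SDK adapter, and the vendor's `create_charge` call is assumed, not real.

```python
# Sketch: isolating a third-party dependency behind a thin interface
# you own. Only the adapter knows about the vendor SDK, so swapping
# vendors later touches one class instead of the whole codebase.

class PaymentGateway:
    """The interface the rest of the system depends on."""
    def charge(self, cents: int, token: str) -> str:
        raise NotImplementedError

class StripeishGateway(PaymentGateway):
    """Adapter for a hypothetical vendor SDK (invented API)."""
    def __init__(self, sdk_client):
        self._client = sdk_client  # vendor object stays private
    def charge(self, cents: int, token: str) -> str:
        receipt = self._client.create_charge(amount=cents, source=token)
        return receipt["id"]

class InMemoryGateway(PaymentGateway):
    """Stub implementation for tests and local development."""
    def __init__(self):
        self.charges = []
    def charge(self, cents: int, token: str) -> str:
        self.charges.append((cents, token))
        return f"ch_{len(self.charges)}"

gw = InMemoryGateway()
assert gw.charge(500, "tok_visa") == "ch_1"
```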

Another reason to design for change is that you might need to change back. Here's something else that all of us will experience at some point: releasing an update that breaks production.

Once this happens, you need to redeploy the old system while you figure out what went wrong. If backwards compatibility is broken, you're going to have additional downtime. In order to ensure backwards compatibility, support old APIs alongside new ones.
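Serving both contracts side by side can look like the sketch below. The field rename (`fullname` to `name`) and the route table are invented for illustration; the point is that the v1 handler survives until every old client has moved off it.

```python
# Sketch: supporting an old API alongside a new one during a
# migration. The rename "fullname" -> "name" is an invented example.

RECORD = {"id": 7, "name": "Grace Hopper"}

def get_user_v1(user_id):
    """Old contract: clients expect the 'fullname' key."""
    rec = RECORD
    return {"id": rec["id"], "fullname": rec["name"]}

def get_user_v2(user_id):
    """New contract."""
    return dict(RECORD)

ROUTES = {
    "/v1/users": get_user_v1,  # kept until old clients are gone
    "/v2/users": get_user_v2,
}

assert ROUTES["/v1/users"](7)["fullname"] == "Grace Hopper"
assert ROUTES["/v2/users"](7)["name"] == "Grace Hopper"
```

Rolling back the new release then costs nothing: the old clients never stopped working.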

If you have a database, make sure that any database changes are backwards-compatible by never redefining or removing columns or tables.

There's one other change that you need to design for: the eventual replacement of your system. Software is such a new field that we are still figuring out how to do things correctly. Don't take it as an insult when someone replaces your code.

If you create a system that lasts five years, that's impressive. One day, every bit of code that you write will be turned off, and part of your job in designing your system is to make that as seamless as possible. Hopefully, the person replacing your system is following Chesterton's Fence and understands your system and what it does, so that the replacement is both justified and an actual improvement. If not, send them a link to this article.

Jon Bodner has been developing software professionally for over 20 years as a programmer, chief engineer, and architect.

The SERENE workshop has a long tradition of bringing together researchers and industry practitioners to discuss advances in engineering resilient systems.

The SERENE workshop will provide a forum for researchers and practitioners to exchange ideas on advances in all areas relevant to software engineering for resilient systems, including, but not limited to, the development of resilient systems. Submissions in all categories will be 7 pages long, except for technical papers, which will be 15 pages long.

Adverse events, each with associated quality attributes, can disrupt critical capabilities when they occur.

Adverse conditions are conditions that, due to their stressful nature, can disrupt or lead to the disruption of critical capabilities.

Note that anti-tamper (AT) is a special case that, at first glance, might appear to be unrelated to resilience. The goal of AT is to prevent an adversary from reverse engineering critical program information (CPI), such as classified software.

Anti-tamper experts typically assume that an adversary will obtain physical possession of the system containing the CPI to be reverse engineered, in which case ensuring that the system continues to function despite tampering would be irrelevant. However, tampering can also be attempted remotely, i.e., when the adversary does not have physical access to the system. In such situations, an AT countermeasure might be to detect an adversary's remote attempt to access and copy the CPI and then respond by zeroizing the CPI, at which point the system would no longer be operational.

Thus, remote tampering does have resilience ramifications.

This first post on system resilience provides a detailed and nuanced definition of the system resilience quality attribute. In the second post in this series, I will explain how this definition clarifies how system resilience relates to other closely related quality attributes. Read the third post in this series on system resilience, Engineering System Requirements.


Donald Firesmith, "System Resilience: What Exactly Is It?", SEI Blog, Software Engineering Institute.


