The New Math: Application Survivability + Operational Readiness = Business Continuity
By: Joe Farsetta
Mar. 12, 2002 12:00 AM
Business continuity. A new and exciting catchphrase? For some, perhaps. Traditional business continuity planning involves many aspects of corporate activities, from call center rerouting and alternative raw material suppliers to policies requiring executives to fly on separate planes. For the IT professional, business continuity usually equates to off-site data storage, along with contracting for data center space with recovery sites such as SunGard. For the seasoned IT professional, it has long been the somewhat elusive prize at the end of the day. It’s the delicate balance between data center hardiness and capital expenditure, between overkill and practicality, and between business needs and shrinking IT budgets. In the end, it’s an exercise in risk assessment. And, as with any insurance policy, the more protection you want, the more it’s going to cost.
As many businesses discovered after recent events, the data center space they thought they were guaranteed in the event of a disaster may not always be available, especially if the event has affected a number of businesses and your disaster recovery partner is oversubscribed. So, what steps can one reasonably take to help ensure IT business continuity and application survivability? The exercise requires detailed analysis. It may ultimately require dual data centers taking geographically dispersed and load-balanced traffic.
As such, this article will focus on business continuity from the standpoint of bulletproofing your network and server farms to help ensure application survivability. Although I point out some WebSphere product suites that offer added functionality in a WebSphere environment, most of the principles revealed here are truly universal and pertain to most high-availability scenarios.
Basic Concepts
Availability, as a concept, differs somewhat from the notion of “downtime.” Availability means that the system, or application, is functioning and able to process desired transactions. Individual servers, network segments, or storage devices may indeed be down without affecting the overall system or application to the point where transactions have stopped. For instance, a single server within a high-availability server cluster may have “tuned itself for maximum smoke.” (I like that phrase.) That is, it’s gone, finished, light a candle, it’s dead. Although this server is indeed down, the application survives within the remaining clustered servers. The server failure is indeed catastrophic, but the damage is limited to hardware, rather than system or application availability. Availability is unaffected and the system survives. This assumes, of course, that the failure was properly monitored, spare parts are available to replace the damaged component in a timely fashion, everything is documented, and a well-trained operations staff is on top of the situation before the next failure occurs.
So, what should the targeted availability threshold be? Well, that depends on the nature of your, or the client’s, business. For instance, if your business truly relies upon continuous system availability, including off hours, be prepared for a 99.99% availability threshold. This equates to approximately four minutes of actual allowable system unavailability per month. Impossible, you say? Not really. I, myself, was entrusted with bringing a large financial institution onto the Internet and delivering a guaranteed 99.99% site availability SLA. In the end, we actually delivered 99.999%. So don’t think it can’t be done. That client’s system is still running, two years later. And, if your business relies on the system or application as its lifeblood, then we are talking about that all-important concept: business continuity.
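To make the availability arithmetic concrete, here is a minimal sketch of how an availability target translates into an allowable downtime budget. The function name and the 30-day month are assumptions chosen for illustration; an actual SLA would define its own measurement window.

```python
# Illustrative arithmetic only: converts an availability target into a
# downtime budget. The 30-day month is an assumption, not an SLA definition.
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(availability_pct: float, days: int = 30) -> float:
    """Return the minutes of unavailability permitted over `days` days."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = days * MINUTES_PER_DAY
    return total_minutes * (100.0 - availability_pct) / 100.0


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        budget = allowed_downtime_minutes(target)
        print(f"{target}% -> {budget:.2f} minutes per 30-day month")
```

Under these assumptions, 99.99% leaves a little over four minutes per month, which matches the figure quoted above, and each additional "nine" cuts the budget by a factor of ten.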
Planning, Planning, Planning
Anticipated Traffic
Know Your Application
Operating Systems
Storage Requirements
Server hardware
Certain network infrastructure requirements may influence server hardware choices. Multiple processors within the server framework are typically required in a high-availability design. RAID arrays are typically included in base server design. SAN configuration and the physical proximity of the mass storage units to the server may influence connection methodologies. If you are connecting the server to the storage unit via optical fiber, dual fiber cards (for redundant links) are also a good idea.
Server Connectivity
Server Clustering
The function of the servers to be clustered is also an important consideration. Certain applications are “cluster-aware.” For example, WebSphere Application Server manages both load balancing within the cluster and failover. In database clusters, things are somewhat different. In this scenario it’s likely that some clustering software will work very closely with the server OS, and typically be transparent to the application. So, knowing your cluster is an important consideration. Most server manufacturers have their own version of cluster, or high-availability, software. Some software houses, such as Veritas, also have offerings. Regardless, it’s critical to have the cluster certified by the software manufacturer before putting it into a production state. This helps eliminate any finger-pointing later on.
Base Application Functionality
Network Design
Network Load Balancing within the Data Center
As an example, WebSphere Edge Server offers enhanced load balancing via NAT, content-based routing, and Edge Server Consultant for Cisco CSS Switches. Very powerful stuff! The product can also help ensure Quality of Service (QoS) by allocating computing and network resources via custom-defined policy rules.
And, since I’ve brought it up, remember I mentioned how application release methodology and change control are needed to ensure site availability? Well, here’s how it could work. Let’s say that your new application is evolving rather quickly. Six new versions are planned for the next 12 months. How can you ensure continuous application availability with little or no downtime, while providing a mechanism for easy and near-instantaneous restoral of the previous system?
Well, you install production and nonproduction groupings of servers. Common services that both groupings rely on, and that will typically be unaffected by an application upgrade, remain in their own separate, common utilities grouping. One of the application groupings remains in hot-standby mode, while the other is in production. Both groupings are loaded with the same application release. For illustrative purposes, let’s call the current production grouping “A,” and the standby grouping “B.” A new release is announced and loaded into grouping B. The evening of cutover, redirect network traffic to grouping B and stop the flow of traffic to grouping A. Perform all real-time production testing and leave A with the previous release. When user traffic hits the application, carefully monitor those metrics you’ve previously benchmarked. Ensure that enhancements and functionality are performing to spec. If things aren’t working as advertised, revert traffic back to A. If things are running as expected, upgrade A to contain the new release. Grouping B is now production, while A has now become the standby, at least until the next release.
Another benefit of this design covers you in the event of a catastrophic failure on the production side. So long as both groupings are on the same release, you can take the troubled production machines out of service and activate the hot-standby machines. This design goes a long way to ensure ultra-high application availability. Load balancers help you manage the flexibility you need to make this sort of design function correctly.
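The A/B grouping procedure described above can be sketched as a small state model. This is only an illustration of the sequence of steps: the `Grouping` and `Cutover` names are hypothetical, and nothing here represents a WebSphere Edge Server or load-balancer API. In practice the traffic switch would be a load-balancer configuration change.

```python
from dataclasses import dataclass


@dataclass
class Grouping:
    """A hypothetical server grouping and the release it currently runs."""
    name: str
    release: str


class Cutover:
    """Tracks which grouping takes production traffic (illustrative only)."""

    def __init__(self, production: Grouping, standby: Grouping) -> None:
        self.production = production
        self.standby = standby

    def stage(self, new_release: str) -> None:
        # Load the new release onto the standby grouping only.
        self.standby.release = new_release

    def switch(self) -> None:
        # Redirect traffic: the standby becomes production and vice versa.
        self.production, self.standby = self.standby, self.production

    def finish(self, healthy: bool) -> None:
        if healthy:
            # Upgrade the old production grouping so both match again.
            self.standby.release = self.production.release
        else:
            # Revert traffic to the grouping still on the previous release.
            # Rolling the failed grouping back is left out of this sketch.
            self.switch()


if __name__ == "__main__":
    plan = Cutover(production=Grouping("A", "1.0"), standby=Grouping("B", "1.0"))
    plan.stage("2.0")          # new release goes onto B
    plan.switch()              # B now takes production traffic
    plan.finish(healthy=True)  # metrics look good, so upgrade A
    print(plan.production, plan.standby)
```

Keeping the "revert" path as nothing more than a second traffic switch is the point of the design: the previous release is still running, untouched, on the other grouping.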
Geographically Dispersed Load Balancing Across Multiple Data Centers
Geographic load balancing isn’t a new concept, but one that should be carefully considered. Most enterprise-class businesses probably have the real estate readily available to support a second data center. These sites cannot be too close to each other. Ideally, they’ll reside several counties or states away from each other. A configuration of this nature is no easy task, but is the best high-availability solution available to help ensure business continuity. Logistical challenges present themselves at every turn. Common challenges include database storage and synchronization. Where physical distance between storage arrays is a factor, asynchronous transmission may be the only viable option. Be aware that WebSphere product suites, along with IBM’s SHARK storage solution, provide some interesting features and functionality for these types of environments. Be sure to check them out.
In the end, geographic load balancing will play into operational realities and procedural changes. Again, knowing all the cause-and-effect scenarios will help to correctly manage expectations as to how the application and systems will function as a whole. The major benefit of this design is that if one data center “goes away,” a refresh or reset will bring you to the other data center. You may have to restart whatever transaction, function, or query you were in the middle of, but your application has survived. There may even be a degradation in response time, but that is far superior to no response time at all.
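The failover behavior described above, where a retry lands on the surviving data center, can be illustrated with a very small sketch. The site names and the `is_healthy` callback are hypothetical placeholders; a real deployment would rely on the load balancer's or DNS service's own health checks rather than application code like this.

```python
from typing import Callable, Optional, Sequence


def pick_data_center(
    sites: Sequence[str], is_healthy: Callable[[str], bool]
) -> Optional[str]:
    """Return the first site that passes its health check, or None if all fail."""
    for site in sites:
        if is_healthy(site):
            return site
    return None


if __name__ == "__main__":
    # Hypothetical site names, listed in order of preference.
    sites = ["dc-east", "dc-west"]
    down = {"dc-east"}  # pretend the preferred site has "gone away"
    chosen = pick_data_center(sites, lambda site: site not in down)
    print(chosen)
```

The sketch covers only site selection on retry. It says nothing about in-flight transactions, which, as noted above, may need to be restarted after the switch.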
Redundant Wide-Area Links
Security
Operational Readiness
Operational readiness includes everything from full documentation to QA and troubleshooting procedures. It includes everything from on-call duties to network operations center configuration. It addresses monitoring and alert status. It covers all service-level objectives. It includes procedures for on-site spare parts, parts replacement, and outside support contracts. It covers staffing models, SLA requirements, and reporting procedures. It also covers the all-important procedural elements of incident documentation, notification, and crisis management that could someday save your job. It’s the mind behind the machine; the differentiator between your high-availability application and the competition’s.
Conclusion
Remember, an undertaking of this sort requires that all bases be covered. So, you must remember to view application survivability in a holistic manner, where a single problem or flaw may affect everything. Keep your eyes and ears open. Embrace change, and remember that no one has all the answers. An undertaking of this magnitude requires a lot of work and a true team effort.