The New Math: Application Survivability + Operational Readiness = Business Continuity

Business continuity. A new and exciting catchphrase? For some, perhaps. Traditional business continuity planning involves many aspects of corporate activities, from call center rerouting and alternative raw material suppliers to policies requiring executives to fly on separate planes.

For the IT professional, business continuity usually equates to off-site data storage, along with contracting for data center space with recovery sites such as SunGard. For the seasoned IT professional, it has long been the somewhat elusive prize at the end of the day. It’s the delicate balance between data center hardiness and capital expenditure, between overkill and practicality, and between business needs and shrinking IT budgets. In the end, it’s an exercise in risk assessment. And, as with any insurance policy, the more protection you want, the more it’s going to cost.

As many businesses discovered after recent events, the data center space they thought they were guaranteed in the event of a disaster may not always be available, especially if the event has affected a number of businesses and your disaster recovery partner is oversubscribed. So, what steps can one reasonably take to help ensure IT business continuity and application survivability? The exercise requires detailed analysis. It may ultimately require dual data centers taking geographically dispersed and load-balanced traffic. As such, this article will focus on business continuity from the standpoint of bulletproofing your network and server farms to help ensure application survivability. Although I point out some WebSphere product suites that offer some added functionality in a WebSphere environment, most of the principles revealed here are truly universal and pertain to most high-availability scenarios.

Basic Concepts
Business continuity, from the IT side at least, is as much an exercise in operational planning and readiness as it is in engineering and application development. Operational criteria are the driving force behind the need for fault tolerance and survivability in the first place. The establishment of service-level objectives is a key concept early in the planning phases. Service-level objectives will establish system availability criteria. Availability, the benchmarking threshold by which varying degrees of engineered fault tolerance are applied to the overall design, is the long-range goal. Availability thresholds will set the stage for policies and procedures addressing quality assurance, testing, application release methodologies, change control, monitoring, alerts, remote support, and, of course, the degree of fault tolerance required from the application, server, and network levels.

Availability, as a concept, differs somewhat from the notion of “downtime.” Availability assures that the system, or application, is functioning and able to process desired transactions. Individual servers, network segments, or storage devices may indeed be down, without affecting the overall system or application to the point where transactions have stopped. For instance: a single server within a high-availability server cluster may have “tuned itself for maximum smoke.” (I like that phrase.) That is, it’s gone, finished, light a candle – it’s dead. Although this server is indeed down, the application survives within the remaining clustered servers. The server failure is indeed catastrophic, but the damage is limited to hardware, rather than system or application availability. Availability is unaffected and the system survives. This assumes, of course, that the failure was properly monitored, spare parts are available to replace the damaged component in a timely fashion, everything is documented, and that a well-trained operations staff is on top of the situation before the next failure occurs.

So, what should the targeted availability threshold be? Well, that depends on the nature of your, or the client’s, business. For instance, if your business truly relies upon continuous system availability, including off hours, be prepared for a 99.99% availability threshold. This equates to approximately four minutes of actual allowable system unavailability per month. Impossible, you say? Not really. I, myself, was entrusted with bringing a large financial institution onto the Internet and delivering a guaranteed 99.99% site availability SLA. In the end, we ended up actually delivering 99.999%. So don’t think it can’t be done. That client’s system is still running, two years later. And, if your business relies on the system or application as its lifeblood, then we are talking about that all-important concept: business continuity.
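The arithmetic behind that four-minute figure is easy to reproduce. The following is a minimal sketch, assuming a 30-day month; the function name and the list of targets are illustrative and do not come from any product.

```python
def allowed_downtime_minutes(availability_percent: float,
                             period_days: float = 30.0) -> float:
    """Return the minutes of unavailability an availability target permits.

    Assumes the target is measured over a period of `period_days` days
    (a 30-day month by default, which is an assumption of this sketch).
    """
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - availability_percent / 100.0)


# Compare a few common targets over one 30-day month.
for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.2f} minutes per month")
```

At 99.99% this works out to about 4.3 minutes per month, and at 99.999% to roughly 26 seconds, which is why each extra "nine" has such a large effect on design and cost.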

Planning, Planning, Planning
Let’s say that your service-level objectives call for high system availability as the end goal. Management has determined that these lofty criteria are cost-justifiable, without running the numbers. You’ve got the green light, and are now expected to produce a near bulletproof solution. Let’s begin the planning phase…

Anticipated Traffic
Sometimes a bit difficult to really nail down, the formula for determining anticipated traffic patterns becomes a cross between standard accounting practices, wishful thinking, and magic. In the end, though, you need to really try for an accurate projection, then add some headroom just in case. Anticipated traffic will drive network and server design. The higher the transaction rate, the more network bandwidth and processing power required. The more processing and storage required, the larger the equipment footprint. This leads to even more dependencies, so you can see the importance of this first step.

Know Your Application
Familiarity with the application is key to peak performance, scalability, and supportability. For instance, knowing the size of a typical transaction will help determine network throughput requirements. Benchmarking a typical user transaction is also key in managing server and network performance, as well as correctly setting expectations as to how an application will react and respond to the average user during peak and nonpeak periods. Familiarity with your application will also help to determine which operating system and server platforms it runs on best. This dovetails into selecting clustering software, and all that’s entailed in that endeavor. For example, WebSphere Everyplace Server has some robust capabilities for this type of high-availability environment, including fault tolerance, load balancing at the cluster level, and caching. The product also runs on a variety of OS platforms, including AIX and Solaris.
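As a rough illustration of how transaction size and projected traffic translate into a throughput requirement, here is a minimal sketch. The sample figures (20,000 bytes per transaction, 200 transactions per second at peak, 50% headroom) are hypothetical, and the calculation ignores protocol overhead.

```python
def required_bandwidth_mbps(transaction_bytes: int,
                            peak_transactions_per_second: float,
                            headroom: float = 0.5) -> float:
    """Estimate the network throughput needed, in megabits per second.

    transaction_bytes: benchmarked size of a typical transaction on the wire.
    peak_transactions_per_second: projected peak transaction rate.
    headroom: extra capacity added "just in case" (0.5 means 50% more).
    Protocol overhead and retransmissions are deliberately not modeled.
    """
    bits_per_second = transaction_bytes * 8 * peak_transactions_per_second
    return bits_per_second * (1.0 + headroom) / 1_000_000


# Hypothetical benchmark: 20 KB transactions at 200 per second, 50% headroom.
print(required_bandwidth_mbps(20_000, 200, headroom=0.5), "Mbps")
```

With these made-up inputs the estimate is 48 Mbps. The real value comes from your own benchmarks; the point is that the projection, the transaction size, and the headroom are all explicit and can be revisited later.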

Operating Systems
Key, key, key. Decide which OS you’ll use, choose the most reliable and robust high-availability clustering solution, and go with it. Know how your app will run in this OS environment. WebSphere Application Server has cluster software bundled within the application suite, while a database cluster may require separate software to handle the cluster function. Also, find out if the OS supports dual network interface cards (NICs) or single NICs with dual interfaces. This is important for the purposes of dual network pathing. Spend some time and effort on this decision. It’s that important.

Storage Requirements
Beyond storage capacity, logistical characteristics are key when selecting a storage solution. For instance, will you be using high-capacity arrays in a SAN configuration? How far from the server farm will these arrays be? Knowing this may ultimately require you to purchase and install fiber cards and switches to link the servers to the storage boxes if distance limitations are exceeded. What about backup and restoral requirements and procedures? Will off-site storage be required? Will tape robots be deployed? How often will incremental and full backups be performed? What’s the expected restoral time if lost data needs to be reloaded? These are all critical factors that must be carefully thought out.

Server Hardware
As your understanding of what you are trying to achieve begins to gel, and all the interdependencies are identified, server hardware requirements come into focus. The first thing you must determine is whether you plan to use legacy platforms or purchase new ones. Remember the application benchmarking exercise I mentioned a few paragraphs back? This is where your analysis will determine if expected application performance criteria can be achieved on the processing platform you plan to use. Whatever the case, it is of paramount importance that you benchmark application performance early on. I remember being part of one of the first real SAP R/3 rollouts for an enterprise-class client. We actually benchmarked transaction size and true response time from the user experience. Based on that data, I was able to engineer the LAN, MAN, and WAN infrastructures to ensure expected application performance in peak and nonpeak traffic periods. Again, this is about managing expectations and building an information base for future troubleshooting.

Certain network infrastructure requirements may influence server hardware choices. Multiple processors within the server framework are typically required in a high-availability design. RAID arrays are typically included in base server design. SAN configuration and the physical proximity of the mass storage units to the server may influence connection methodologies. If you are connecting the server to the storage unit via optical fiber, dual fiber cards (for redundant links) are also a good idea.

Server Connectivity
Dual NICs at the server level are a must. Separate interface cards are the optimal choice, but carry their own unique set of problems. Two interfaces on a single card are also acceptable and may ultimately be your only viable option. The main idea is to provide dual pathing of network traffic to the cluster servers (yes, I mean the individual servers within the cluster) in the event a network switch fails. Dual-pathing to all servers in the infrastructure is a requirement. Application behavior and supportability of a dual-NIC scenario is an important component of your design due diligence, so don’t underestimate its importance.

Server Clustering
An absolute requirement for most high-availability server infrastructures, clustering helps ensure application survivability by load balancing across multiple servers within a cluster, or by pooling cluster resources. Depending on the application and platform of choice, various clustering solutions are available. Whichever you choose, it’s critical to understand how failover will occur, and how long the process will take. Will the cluster software support an active-active configuration? Will the cluster software configuration delay application restoral for more than five minutes? If so, and your service-level objectives call for 99.99% application availability on a two-server cluster, a single failover has just blown your number, since the monthly allowance is only about four minutes. So, you’d better be sure that your solution can meet the expected performance criteria.
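One way to sanity-check a clustering product against the service-level objective is to ask how many failovers of a given length fit inside the monthly downtime allowance. This is a minimal sketch, assuming a 30-day month and assuming the application is fully unavailable for the whole failover interval; the function name is illustrative.

```python
def failovers_within_budget(availability_percent: float,
                            failover_minutes: float,
                            period_days: float = 30.0) -> int:
    """Return how many complete failovers fit in the downtime allowance.

    Assumes every failover makes the application unavailable for
    `failover_minutes`, and that no other outages consume the budget.
    """
    if failover_minutes <= 0:
        raise ValueError("failover_minutes must be positive")
    budget = period_days * 24 * 60 * (1.0 - availability_percent / 100.0)
    return int(budget // failover_minutes)


# A five-minute failover against two different targets.
print(failovers_within_budget(99.99, 5.0))  # no failover fits the budget
print(failovers_within_budget(99.9, 5.0))   # several fit
```

At 99.99% the answer for a five-minute failover is zero, while a one-minute failover leaves room for four events in the month. Running the same check with the failover time you actually measure in testing tells you whether a candidate solution can meet the objective at all.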

The function of the servers to be clustered is also an important consideration. Certain applications are “cluster-aware.” For example, WebSphere Application Server manages both load balancing within the cluster and failover. In database clusters, things are somewhat different. In this scenario it’s likely that some clustering software will work very closely with the server OS, and typically be transparent to the application. So, knowing your cluster is an important consideration. Most server manufacturers have their own version of cluster, or high-availability, software. Some software houses, such as Veritas, also have offerings. Regardless, it’s critical to have the cluster certified by the software manufacturer before putting it into a production state. This helps eliminate any finger-pointing later on.

Base Application Functionality
Remember to ensure that your application will function correctly (or at all) across multiple processors in a single box, across multiple servers, and with storage in mind. If you plan to use some form of disk shadowing or mirroring, be sure that the app supports it. QA is really important; so is documentation. How the overall data flow works within the application and across the platform is key. Building, understanding, and documenting relevant and usable bug codes is also essential. If the application malfunctions, or merely hiccups, flag the user or monitoring system with a noncryptic bug code. This will go a long way toward supporting the application in the future.

Network Design
What good is having multiple servers and sophisticated storage solutions if your infrastructure relies on a single communications link between boxes, and that link goes away? It’s kind of pathetic to imagine yourself standing in the data center staring at a dead application, lots of blinking red lights, and your sole Ethernet switch sitting dark and silent. As you hear your boss screaming from down the hall, you wonder, “How long will it take to update my resume?” All kidding aside, dual network paths are a prerequisite for any high-availability framework. This dual-pathing follows throughout the network design, including NICs, switches, routers, and firewalls.

Network Load Balancing within the Data Center
A critical concept, network load balancing achieves several things. Aside from the obvious, load balancers (dual of course) also allow servers to be seamlessly rolled in and out of production. This is especially useful if you have servers in a hot-standby mode. Now, let’s imagine your new application is geared toward some retail sales function. Seasonal traffic may crush all available processing power. Wouldn’t it be nice to place your hot-standby machines into production by simply adding them to the group that the load balancers distribute traffic to? This is truly production on demand!
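The idea of rolling servers in and out of the group that receives traffic can be sketched in a few lines. This is a toy round-robin model, not the interface of any real load balancer; the class, its methods, and the server names are all hypothetical.

```python
class ServerPool:
    """Toy model of the server group a load balancer distributes traffic to."""

    def __init__(self, active, standby=()):
        self.active = list(active)     # servers currently taking traffic
        self.standby = list(standby)   # hot-standby servers, ready but idle
        self._counter = 0              # position for round-robin selection

    def next_server(self) -> str:
        """Pick the next active server in simple round-robin order."""
        if not self.active:
            raise RuntimeError("no active servers in the pool")
        server = self.active[self._counter % len(self.active)]
        self._counter += 1
        return server

    def activate(self, server: str) -> None:
        """Move a hot-standby server into production."""
        self.standby.remove(server)
        self.active.append(server)

    def drain(self, server: str) -> None:
        """Take a server out of production and park it as standby."""
        self.active.remove(server)
        self.standby.append(server)


pool = ServerPool(["web1", "web2"], standby=["web3"])
pool.activate("web3")   # seasonal peak: add capacity without an outage
print([pool.next_server() for _ in range(3)])
```

A real load balancer also runs health checks and lets in-flight sessions finish before a server is drained; the sketch only shows that adding or removing capacity is a membership change, not a redeployment.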

As an example, WebSphere Edge Server offers enhanced load balancing via NAT, content-based routing, and Edge Server Consultant for Cisco CSS Switches. Very powerful stuff! The product can also help ensure Quality of Service (QoS) by allocating computing and network resources via custom-defined policy rules.

And, since I’ve brought it up, remember I mentioned how application release methodology and change control are needed to ensure site availability? Well, here’s how it could work… Let’s say that your new application is evolving rather quickly. Six new versions are planned for the next 12 months. How can you ensure continuous application availability with little or no downtime, while providing a mechanism for easy and near instantaneous restoral of the previous system? Well, you install production and nonproduction groupings of servers. Common services that both groupings rely on, and that will typically be unaffected by an application upgrade, remain in their own separate “common utilities” grouping. One of the application groupings remains in hot-standby mode, while the other is in production. Both groupings are loaded with the same application release. For illustrative purposes, let’s call the current production grouping “A,” and the standby grouping “B.” A new release is announced and loaded into grouping B. The evening of cutover, redirect network traffic to grouping B and stop the flow of traffic to grouping A. Perform all real-time production testing and leave A with the previous release. When user traffic hits the application, carefully monitor those metrics you’ve previously benchmarked. Ensure that enhancements and functionality are performing to spec. If things aren’t working as advertised, revert traffic back to A. If things are running as expected, upgrade A to contain the new release. Grouping B is now production, while A has now become the standby, at least until the next release.
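The cutover sequence described above can be sketched as a small state change. This is a minimal illustration only: the `Grouping` type, the `cut_over` function, and the `healthy` callback (standing in for monitoring the benchmarked metrics) are hypothetical names, not part of any product.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Grouping:
    name: str      # for example "A" or "B"
    release: str   # application release currently loaded


def cut_over(production: Grouping,
             standby: Grouping,
             new_release: str,
             healthy: Callable[[Grouping], bool]) -> Tuple[Grouping, Grouping]:
    """Attempt a release cutover and return (production, standby) afterwards.

    `healthy` is a placeholder for real-time production testing against
    the previously benchmarked metrics.
    """
    standby.release = new_release          # load the new release into standby
    # Traffic is redirected to the standby grouping at this point, while the
    # old production grouping keeps the previous release untouched.
    if healthy(standby):
        production.release = new_release   # upgrade the old production side
        return standby, production         # the roles swap
    # Not working as advertised: revert traffic to the old production grouping.
    # What to do with the failed release on standby is left open in this sketch.
    return production, standby


a, b = Grouping("A", "1.0"), Grouping("B", "1.0")
prod, stby = cut_over(a, b, "1.1", healthy=lambda grouping: True)
print(prod.name, prod.release, "|", stby.name, stby.release)
```

The important property is that the previous release stays loaded and idle on one grouping until the new release has proven itself under real traffic, so reverting is a traffic redirect and not a reinstall.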

Another benefit of this design covers you in the event of a catastrophic failure on the production side. So long as both groupings are on the same release, you can take the troubled production machines out of service and activate the hot-standby machines. This design goes a long way to ensure ultra-high application availability. Load balancers help you manage the flexibility you need to make this sort of design function correctly.

Geographically Dispersed Load Balancing Across Multiple Data Centers
If you really want application survivability, you must plan on having it reside in more than one data center. This covers you in the event of the unthinkable – natural disasters or acts of terrorism. Grim, but nonetheless an unfortunate reality. So, how do you plan for something like this? The most surefire way, with no real downtime at all, is to have your traffic and transactions fed to two mirrored and live data centers.

Geographic load balancing isn’t a new concept, but one that should be carefully considered. Most enterprise-class businesses probably have the real estate readily available to support a second data center. These sites cannot be too close to each other. Ideally, they’ll reside several counties or states away from each other. A configuration of this nature is no easy task, but is the best high-availability solution available to help ensure business continuity.
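To see why a second live site helps so much, consider a simplified model in which each data center fails independently of the other. That independence is an assumption of this sketch (it is what geographic separation is meant to approximate), and the 99% per-site figure is a made-up input.

```python
def combined_availability(site_availabilities) -> float:
    """Availability of a service that stays up while any one site is up.

    Each value is a fraction between 0 and 1. Failures are assumed to be
    independent, which sites that share a region, a power grid, or a
    carrier will not fully satisfy.
    """
    all_sites_down = 1.0
    for availability in site_availabilities:
        all_sites_down *= (1.0 - availability)
    return 1.0 - all_sites_down


# Two hypothetical sites, each available 99% of the time on its own.
print(f"{combined_availability([0.99, 0.99]):.4%}")
```

Under this model, two sites that each manage only 99% combine to 99.99%. The closer the sites are to each other, the less the independence assumption holds and the smaller the real gain.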

Logistical challenges present themselves at every turn. Common challenges regularly faced include database storage and synchronization. Where physical distance between storage arrays is a factor, asynchronous transmission may be the only viable option. Be aware that WebSphere product suites, along with IBM’s SHARK storage solution, provide some interesting features and functionality for these types of environments. Be sure to check them out.

In the end, geographic load balancing will play into operational realities and procedural changes. Again, knowing all the cause-and-effect scenarios will help to correctly manage expectations as to how the application and systems will function as a whole. The major benefit of this design is that if one data center “goes away,” a refresh or reset will bring you to the other data center. You may have to restart whatever transaction, function, or query you were in the middle of, but your application has survived. There may even be a degradation in response time, but that is far superior to no response time at all.

Redundant Wide-Area Links
Depending on your particular application and who will use it, you mustn’t forget the importance of peering arrangements (in a Web environment), or conventional WAN links in a closed network. Lack of redundancy in these situations could lead to disaster should a link failure occur. It would be a shame to have a robust LAN design feeding your redundant servers only to have the faucet shut off at the door with no WAN connectivity.

Security
With all this flexibility, redundancy, dual pathing, and multiple clusters, let’s not forget the need for really tight network, server, and application security. A hacker can bring down your system faster than any lightning strike. Worse than that, viruses, Trojans, and God knows what else can leave you with a parade of pain for weeks to come. I can’t stress enough how important it is to find and close security holes. It’s an ongoing effort that includes continuous intrusion detection and an ever-vigilant attitude.

Operational Readiness
Another key to business continuity is operational readiness. It comprises the policies, procedures, and support infrastructures that distinguish planned availability from actual high availability. And, with all this investment in hardware, software, and application development, be sure to examine your data center design (WSDJ, Vol. 1, issue 1) to ensure that it’s up to high-availability standards.

Operational readiness includes everything from full documentation to QA and troubleshooting procedures. It includes everything from on-call duties to network operation center configuration. It addresses monitoring and alert status. It covers all service-level objectives. It includes procedures for on-site spare parts, parts replacement, and outside support contracts. It covers staffing models, SLA requirements, and reporting procedures. It also covers the all-important procedural elements of incident documentation, notification, and crisis management that could someday save your job. It’s the mind behind the machine; the differentiator between your high-availability application and the competition’s.

Conclusion
As you can see, you’ll encounter many interdependencies, variables, and unexpected twists and turns when planning and implementing a high-availability design.

Remember, an undertaking of this sort requires that all bases be covered. So, you must remember to view application survivability in a holistic manner, where a single problem or flaw may affect everything. Keep your eyes and ears open. Embrace change, and remember that no one has all the answers. An undertaking of this magnitude requires a lot of work and a true team effort.

About Joe Farsetta
Joe is an engineer with over 20 years of industry experience in telecommunications, networking, operations, business process architecture, applications, and support. An entrepreneur and inventor, Joe’s past engagements have included Unilever, NJ Transit, and a Regional Directorship at Bell Atlantic Network Integration. He is currently employed by one of the world's premier Web-hosting providers, as well as operating a consultancy in the New York metropolitan area. He can be reached at XXXXXXXX.
