The DBA Best Practices Series, Part 3: Reducing Organizational Risk
We learned in previous articles that it takes more than just being a great technician to keep your customers happy. The theme of this series is that if you want to be viewed as a strategic resource in your organization, being a technical expert isn’t enough. Because of the trade you have chosen, the DBA position provides you with an excellent opportunity to play a more strategic role in your organization.

In the first installment, we discussed the business value drivers that all organizations share: reduce costs, generate revenue, improve quality and reduce risk. The second article of the series focused on how you can reduce your organization’s cost of doing business by ensuring it gets full value from its database investment. Let’s move forward with our best practices series by talking about how DBAs can reduce organizational risk.

Risk management is the strategy of identifying, analyzing and prioritizing potential threats to the business, and it is often used as the foundation for business continuity and disaster recovery strategies. But the risks that organizations, their IT departments and DBA units in particular face are far more wide-ranging than just local and regional disasters.

IT risk management is a well-developed science with various methodologies available to personnel who have dedicated their careers to this profession. Much of the focus is on security. We’ll discuss reducing the risk of unauthorized data access in an upcoming article, but DBAs also face operational risks, in addition to security risks, every day. Let’s identify some of the most basic of them:
- Ensuring critical database-driven business application availability
- Preventing and reducing impact of processing errors
- Ensuring regulatory compliance
In this article, let’s focus on availability and on preventing and reducing the impact of processing errors. In upcoming articles in this series on risk reduction, we’ll focus on database security and regulatory compliance.

Critical Database-Driven Application Availability

Computer professionals, by the very essence of their job descriptions, are the protectors of their organization’s core data assets. They are tasked with ensuring key data stores are continuously available. However, ensuring that data is available on a 24 x 7 basis is a wonderfully complex task. Hardware failures, software problems, user errors and disasters all combine to make many technicians lie awake at night wondering whether their applications will stay available.

When a mission-critical application becomes unavailable, it can threaten the survival of the organization. The financial impact of downtime is not the only issue facing companies that have critical application failures. Loss of customer goodwill, bad press, idle employees and legal penalties (lawsuits, fines, etc.) must also be considered.

It is up to the database administrator to recommend and implement technical solutions that deal with these unforeseen “technology disruptions.” Modern database systems are gaining features to help avoid downtime altogether by enabling change, optimization and management tasks to be performed while databases remain online. It is up to the IT support organization to understand and implement these features to reduce the amount of downtime required to perform administrative activities.

Proactive Monitoring and Administration

Daily support requests and problem solving often overwhelm IT groups. Understaffing, over-commitment to supporting new and legacy applications, lack of repeatable processes and budgetary constraints are the most common causes of IT professionals being reactive instead of proactive.
The reactive IT professional can be compared to a firefighter: they resolve problems only after the problems occur. The biggest problems garner the most attention. Their time is dominated by these firefighting activities, which reduces the time they can spend implementing the processes and procedures required to switch from reactive firefighting to proactive problem prevention.

Proactively monitoring your environments doesn’t mean “set it and forget it.” The goal of your monitoring activities should be to predict, analyze and prevent database availability and performance problems before they occur. Monitoring is an activity that you need to spend dedicated time on.

There are numerous monitoring products to choose from. The market provides a wide variety of options, from open source tools to commercial products offered by big-name vendors. The decision becomes budgetary: how much is your organization willing and able to spend on a monitoring solution? Once that decision is made, it is your responsibility as the DBA to fully leverage the product’s capabilities.

Once those products are installed, there are numerous websites that will assist you in selecting the appropriate metrics to evaluate. My advice is to establish a base set of monitors that evaluate general, industry-recommended performance and availability metrics. Not only will you find the metrics to monitor, you’ll also find numerous tips, tricks and homegrown scripts for both open source and commercial products that will jumpstart the process.

After installation, the key is to “monitor the monitors.” The challenge in setting thresholds is to balance chatter reduction (fewer unwarranted alerts) with adequate forewarning that an unfortunate event is going to occur. Each application has a unique workload; the goal is to customize the thresholds accordingly.
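One way to picture the balance between chatter reduction and forewarning is a base threshold set with per-application overrides, where an alert fires only after a metric stays above its threshold for several consecutive samples. The following is a minimal sketch under those assumptions. The metric names, threshold values, application names and the `threshold_for` and `should_alert` helpers are all hypothetical and are not taken from any particular monitoring product.

```python
from typing import Dict, List

# Hypothetical base thresholds (percent); real values depend on product and workload.
BASE_THRESHOLDS: Dict[str, float] = {
    "cpu_pct": 85.0,
    "tablespace_used_pct": 90.0,
}

# Hypothetical per-application overrides tailored to each workload.
APP_OVERRIDES: Dict[str, Dict[str, float]] = {
    "nightly_batch": {"cpu_pct": 95.0},  # batch jobs are expected to run hot
}


def threshold_for(app: str, metric: str) -> float:
    """Return the application-specific threshold if one exists, else the base value."""
    return APP_OVERRIDES.get(app, {}).get(metric, BASE_THRESHOLDS[metric])


def should_alert(app: str, metric: str, samples: List[float], sustained: int = 3) -> bool:
    """Alert only if the last `sustained` samples all exceed the threshold."""
    if len(samples) < sustained:
        return False
    limit = threshold_for(app, metric)
    return all(value > limit for value in samples[-sustained:])


# Same readings, different workloads: only the untuned application alerts.
print(should_alert("order_entry", "cpu_pct", [70, 88, 91, 93]))    # True
print(should_alert("nightly_batch", "cpu_pct", [70, 88, 91, 93]))  # False
# A single dip below the threshold suppresses the alert (less chatter).
print(should_alert("order_entry", "cpu_pct", [91, 93, 60]))        # False
```

Requiring several consecutive breaches trades a little forewarning for fewer false alarms; the `sustained` count is itself a setting worth tuning per application.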
Start with the base threshold settings and tailor them to each application’s workload.

Change Management

Industry analysts estimate that as much as 80 percent of application failures can be attributed to human error. Database administrators usually support different business units, with each unit having its own set of unique procedural requirements. Formalizing and documenting the change request process minimizes the potential for miscommunication between the business units, application development areas and the database administration unit.

Standardized communication practices are a key ingredient of a trouble-free application environment. Effective communication ensures that there are no “surprises” when a change is implemented in production. If your organization doesn’t have a formal change request process in place (and many shops don’t), create your own!

There are dozens of change management and source code versioning tools available on the market today. The prices can range from thousands to tens of thousands of dollars. Although I highly recommend these types of products, I wouldn’t let the lack of one prevent me from formalizing the change management process. Do the best with what you have. Be creative and focus on procedural enforcement that results in high-quality implementations, not on “oh woe is me, I have no tools.” From creating a single index to implementing a complex, multi-tier database-driven application architecture, good communication is absolutely critical during the change management process.

Change Management Meetings

If you read some of my earlier articles, you know I’m a proponent of constant communication between all units that are involved in the change management process. Here at RDX, we make sure we attend as many of our customers’ change management meetings as we can. If it’s a database change, we want to be there to discuss it.

How often should you hold these change management meetings?
As often as you implement objects in production. If your organization makes changes to production environments on a daily basis, the meetings should be held daily. This is not as big an imposition on your time as you may think. We provide remote database services for several very large organizations that hold these change management meetings daily. The process takes about 15 to 20 minutes. It’s not a lot of time to spend to ensure that everyone knows what is happening.

To shorten these meetings and make them as productive as possible, the following discussion items should be a standard part of the meeting’s agenda:
- Application name being changed
- Date and time change will be implemented
- Change description
- Potential business impact if the changes don’t go as expected (include both units affected and how they will be affected)
- Backoff procedures
- Requestor
- Tested by
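The agenda items above map naturally onto a simple structured record, which can be handy when no commercial change management tool is available. Here is a minimal sketch; the `ChangeRequest` class, its field names and the sample values are illustrative assumptions, not part of any specific tool or of a process described in this series.

```python
from dataclasses import dataclass, fields
from datetime import datetime
from typing import List


@dataclass
class ChangeRequest:
    """One agenda entry for a change management meeting (hypothetical layout)."""
    application: str
    scheduled_for: datetime
    description: str
    business_impact: str     # units affected, and how, if the change goes wrong
    backoff_procedure: str
    requestor: str
    tested_by: str

    def missing_items(self) -> List[str]:
        """Return the names of text fields left blank, so gaps surface before the meeting."""
        return [
            f.name
            for f in fields(self)
            if isinstance(getattr(self, f.name), str)
            and not getattr(self, f.name).strip()
        ]


# Sample values are made up for illustration only.
request = ChangeRequest(
    application="order_entry",
    scheduled_for=datetime(2024, 1, 15, 22, 0),
    description="Add an index to speed up customer order lookups",
    business_impact="Customer service lookups slow down if the index build fails",
    backoff_procedure="",
    requestor="Application development",
    tested_by="",
)
print(request.missing_items())  # ['backoff_procedure', 'tested_by']
```

A request that reaches the meeting with blank backoff or testing entries is a natural candidate for deferral, which keeps the 15 to 20 minute meeting focused on changes that are actually ready.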
Repeatable Processes

Documenting processes, procedures and best practices is a task that is often considered to be boring and mundane. Most DBAs would rather perform virtually any other activity than sit in front of a screen using a word processor. As a result, creating documentation is often postponed until the DBA has a little free time to kill. Today’s database administration units are operating with smaller staffs, tighter budgets and ever-increasing workloads. The end result is that the documentation is either never created or created and not kept current.

But a robust, detailed documentation library creates an environment that is less complex and less error-prone. It reduces the amount of time DBAs spend learning new database environments and the overall time spent on day-to-day support activities. DBAs are able to spend more time administering the environment than finding the objects they are trying to support.

Repetition, even though it can be boring, is the foundation for a high-quality support environment. If the scripts and administrative processes worked correctly the first time, chances are they will continue to work correctly in the future.

Highly Available Architectures

Building HA architectures is, once again, mainly a matter of cost, like almost everything else in this profession. How much money is your organization willing, and able, to spend on highly available systems? All factors must be considered, not just hardware and software purchase costs and maintenance agreements. DBAs must also include the training and labor hours required to administer the highly available system. Your job as a DBA is to present the options with your best cost estimates to management, then assume full ownership of the quality of the chosen architecture.

Recoverability

When an unfortunate event does occur, do you have the correct components in place to ensure that your systems can be brought back to a usable state as quickly as possible?
Databases can become unavailable for a variety of reasons: hardware failures, database software bugs, and programs running haywire that change too much data, change too little data or change it incorrectly. Your job is to ensure that your database environment is architected to successfully meet all challenges.

Understand the recovery features provided by the database products you support. For example, not only can you roll changes forward in Oracle, the product also provides features that allow you to roll changes back. When I was consulting, I showed up early on my first day at a customer site. During my introduction they told me that key members of their DBA team were unavailable because they were preparing to restore a critical database to perform a point-in-time recovery. I asked if I could sit in, and found that the restore wasn’t needed and that Oracle would be able to “undo” the changes. When I announced that this was an option, I could tell from their comments that the customer’s DBA team was thinking “he has no idea what he’s talking about,” while I was thinking “you don’t even know the recovery features your database provides.”

Most botched recoveries can be attributed to human error. Make sure all backups have proper retention periods, verify that all backups are executing correctly and run test recoveries on a regular basis. Don’t let missing backups of data files and/or log files cause you to lose data. You don’t want to hear UNIX support say “The retention on that backup was supposed to be how long?” in the middle of a recovery.

COMMUNICATE on a regular basis with the others who are responsible for the other pieces of the recovery “pie” (system admins, operators) to ensure you have everything you need to recover a crashed database. Pick a database, identify the backup output files and verify that they are available when you need them.

Remember, YOU are the IT professional who is ultimately responsible for ensuring that your organization’s databases can be quickly recovered.
Not O/S support, operations, application developers….

Don’t let your recovery skills get rusty. The more test recoveries you do, the easier the production recoveries become. If you are a senior-level DBA, make sure you keep the junior folks on their toes. I have never personally seen the database make a mistake during the recovery process. That leaves incomplete backups and DBA error as the most likely causes of “good recoveries gone bad.”

RELAX and Plan Your Attack. When you are notified of a database failure, take a deep breath and relax. Don’t immediately begin to paste the database back together without a plan. Create a recovery plan, put it on paper, have others review it if you can, and then execute it. You shouldn’t be trying to determine what the next step is in the middle of the recovery process. I plan my attack on paper for all recoveries, no matter how simple they are.

We’ll continue our DBA Best Practices series in the next article, in which we’ll discuss database security and regulatory compliance.