Disaster Recovery Part Two: Achieving a 5-minute lag behind production with Oracle Data Guard
This is the second installment in a three-part series about my experiences with a DR project for a long-time client that was entering a new line of business.
Part one of the series provided some background on Disaster Recovery and where it fits into an enterprise application ecosystem.
Part two of the series will discuss how and why Oracle Data Guard was used to provide a robust database replica that maintained less than a 5-minute lag behind the production site.
Part three of the series will detail the reasons for a shift from Data Guard to Oracle GoldenGate.
As you know from reading the previous installment, Disaster Recovery is all about RTO and RPO. This particular customer agreed to a fairly typical RPO of 15 minutes and an RTO of 1 hour. To meet these requirements, we decided that a standby database managed with Oracle Data Guard was the most effective solution. The customer was already running Oracle Enterprise Edition and Grid Control, so it was a no-brainer to use Grid Control to set up and manage the standby databases. Grid Control also gave us a mechanism for monitoring the redo shipping and apply processes, as well as the lag between production and DR, to ensure we were meeting the 15-minute RPO.
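Grid Control surfaces these metrics in its console, but the same lag figures can also be checked directly on the standby by querying the `V$DATAGUARD_STATS` view. A minimal sketch of the kind of check we mean (this is illustrative, not the exact monitoring the project used):

```sql
-- Run on the standby database: shows how far redo transport and
-- redo apply are behind the primary, as day-to-second intervals
-- (e.g. '+00 00:00:03' means a 3-second lag).
SELECT name, value, time_computed
  FROM v$dataguard_stats
 WHERE name IN ('transport lag', 'apply lag');
```

Watching these two values over time is a quick sanity check that the standby is tracking well inside the RPO window.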
Because Data Guard requires Oracle Enterprise Edition, Eagle has developed a custom service that delivers similar functionality for Oracle Standard Edition databases, where Data Guard is not available.
This customer already had equipment in a data center in Tampa, Florida, and their DR site is located in Charlotte, North Carolina. Both data centers are operated by the same provider, so network connectivity was straightforward. The primary site already had the database and Grid Control running, so we added hardware in the DR environment to host the standby database and applications, plus another Grid Control server. We then migrated the primary Grid Control to the DR site, as we prefer to have Grid Control running in the DR site so that it is still available to us if the primary site goes down.
When planning a Disaster Recovery solution, it is important to minimize stress points and change at the time of failover. Relocating Grid Control is an easy proactive step that ensures the processes used to manage and monitor the database remain intact if a disaster is declared and a failover or switchover becomes necessary.
With all the hardware and tools in place, we simply used Grid Control to create the standby database, and by the next morning we had a fully functional standby that was managed and monitored through Grid Control. When Grid Control’s standby creation works, it really is a beautiful thing, and in our experience, even when it ‘fails’ to complete the build, it is relatively easy to fix the problem manually and finish the build.
With a working standby database in hand, we turned our attention to the core objectives: RTO and RPO. The RPO goal was 15 minutes, and we were running about a 10-minute lag with the default configuration, which clearly met the requirement for this customer, but we knew we could do better. After a few standard configuration tweaks to improve transport and apply performance, we were able to maintain a lag of less than 5 seconds, far exceeding the 15-minute RPO goal.
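The article does not name the exact tweaks, but the two standard changes that typically take a Data Guard lag from minutes down to seconds are asynchronous redo transport from the primary and real-time apply on the standby (which also requires standby redo logs). A hedged sketch, with the service name and `DB_UNIQUE_NAME` as placeholders:

```sql
-- On the primary: ship redo asynchronously as it is generated,
-- instead of waiting for whole archived logs at each log switch.
-- (SERVICE and DB_UNIQUE_NAME below are illustrative names.)
ALTER SYSTEM SET log_archive_dest_2 =
  'SERVICE=drsite_tns ASYNC NOAFFIRM
   VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE)
   DB_UNIQUE_NAME=drdb' SCOPE=BOTH;

-- On the standby: apply redo from the standby redo logs as it
-- arrives (real-time apply) rather than waiting for a log switch.
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE
  USING CURRENT LOGFILE DISCONNECT FROM SESSION;
```

With this combination, the standby is never more than a few seconds of redo behind, because nothing waits on a log switch at either end.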
The RTO is a little more difficult to engineer, since parts of it are out of our control. From a database standpoint, we were able to switch the roles of the databases in just under 5 minutes, measured from when the application was taken offline on the primary side until the database was open and available in the DR site. The next step was to start up the application servers and let the users connect and get back to work. In our testing, it took about 10 minutes to cleanly shut down the primary site, 5 minutes to perform the database switchover, and 15 minutes to bring up all the applications in the DR site and switch over the networking via BGP. A total RTO of about 30 minutes came in well under the 1-hour target the customer was seeking.
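With a Data Guard broker configuration in place (which Grid Control manages for you), the database portion of that switchover boils down to a single broker command. A sketch of the DGMGRL session, with illustrative database names:

```
$ dgmgrl
DGMGRL> CONNECT sys@primdb
DGMGRL> SWITCHOVER TO 'drdb';
```

The broker handles the role transition on both sides — converting the old primary into a standby and opening the old standby as the new primary — which is what keeps the database step to just a few minutes.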
Overall, the customer was very pleased with the outcome of the project, and we had another successful Disaster Recovery project in the books…but a few months later came a change in the RTO requirement. Due to the critical nature of the data, it was determined that our customer needed to make the application available in the DR site in under 10 minutes. Because of the time it takes to perform a switchover, an active-active solution would be required, meaning they needed to abandon Data Guard and switch to Oracle GoldenGate, the topic of the third installment in this series.