As many of you are well aware, CygNet does not have a great solution for productized failover. You can replicate CygNet data. You can set up a clustering environment to detect failures. You can write scripts to swap config files in order to change a service's role. I’ve talked to many of you over the last month or two, and for those who do have a working solution, it took a lot of work to set up and takes a lot of work to maintain.
For those of you who are in this boat, or are afraid to get into the boat, I’m happy to announce that we have a team assembled and are working to vastly improve this process!
If you want all the nitty-gritty details, you are going to have to come to the users group at the end of April (shameless plug, I know). For now, I wanted to share our main areas of focus and ask for any feedback you may have for us in this process. What I’m looking for specifically are any comments you have on the plan I’m proposing, as well as any issues or pain you are currently facing in the areas of replication and failover that this plan doesn’t address. You are welcome to email me directly, but it would be great if you could reply to the post so that everybody can chime in. (Plus, I want to see if we can make this the most-commented-on post in CygNet Blog history!)
Before getting into it, I feel the need to state the obvious, which is that we won’t be able to do everything. This is a very complex feature that will need to be refined over time, so don’t expect it to be complete in the next release. The reason is that there are so many configuration and data nuances to redundancy that we can’t hope to review and test them all. We are going to do our best to address the most common ones, but we will have to continue to iterate and improve as it gets adopted and you provide feedback. So at the very least, you can reply to this post by telling me which changes you are most excited about since not everything will make it in.
Now to the plan! There are going to be five main areas of focus (in no particular order):
- Centralized configuration
- Improved replication status
- Replicating more data
- Decreasing the time it takes to execute a failover
- Making CygNet more aware of a failover event
Before I go into any more detail on what we’re going to change, let me set the baseline and describe how it will generally work. First, the goal is to achieve high availability, whether locally or across data centers. You could have a couple of locally redundant servers, or two servers in different data centers, or locally redundant servers in multiple data centers which are themselves redundant. On top of that, you will be able to have multiple networks (e.g. production, business, test) take part in the redundancy. Each site in your environment will run on a unique domain, and each domain will have a specific role. When you perform a failover, your sites will swap domains and, therefore, assume different roles. The benefit of this model is that clients always connect to the same domain, regardless of where it’s running, so they don’t have to make any changes as a result of a failover. For now, the failover process will still include stopping all the services and restarting them on a different domain.
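To make the domain-swap model concrete, here is a minimal sketch in Python. Everything here is invented for illustration — the `Site` class, the role table, and the domain names are not CygNet APIs; the point is just that a site's role follows whichever domain it happens to be running, so clients that connect by domain name never need to change anything.

```python
# Hypothetical illustration of the domain-swap failover model.
# The class, role table, and domain names are invented, not CygNet APIs.

ROLES = {"PROD": "active", "PRODB": "standby"}  # domain name -> role

class Site:
    """A physical site (e.g. a data center) that hosts one domain."""
    def __init__(self, name, domain):
        self.name = name
        self.domain = domain  # the domain this site is currently running

    @property
    def role(self):
        # A site's role is determined by whichever domain it runs.
        return ROLES[self.domain]

def failover(site_a, site_b):
    """Swap domains between two sites. Clients keep connecting to the
    same domain name, so they follow the active role automatically."""
    site_a.domain, site_b.domain = site_b.domain, site_a.domain

primary = Site("DataCenter1", "PROD")
backup = Site("DataCenter2", "PRODB")
failover(primary, backup)
# DataCenter2 now runs the PROD domain and therefore has the active role.
```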
Goal: Centralized configuration
To start, our plan is to allow you to define all the domains and servers at play in one central place. This information will then be used by the system to make good decisions about how to execute a failover and what the downstream consequences will be. Additionally, a service’s configuration file will contain everything it needs to run in any mode. When a redundant service starts, the RSM will tell it what domain to run in and, if it’s replicating, where to replicate from. We thought about making the configuration files replicate so that you only have to change them on one domain, but there are lots of exceptions where they need to be different. So to make them more manageable, we are going to create a utility that will allow you to remotely manage all the configuration files in your system, even if you aren’t on the host machine. It will allow you to validate and mass-modify your configuration files so that you only have to make changes in one place.
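Here is a rough sketch of what a centralized topology definition might enable, assuming a coordinator in the role described for the RSM. The `TOPOLOGY` structure, server names, and `startup_instructions` function are all hypothetical; they only illustrate how one central definition can answer both questions a starting service has: which domain to run, and where to replicate from.

```python
# Illustrative only: a central registry of domains and servers, and the
# decision a coordinator (like the RSM described above) might hand to a
# redundant service at startup. None of these names are real CygNet APIs.

TOPOLOGY = {
    "domains": {
        "PROD":  {"server": "scadahost1", "role": "active"},
        "PRODB": {"server": "scadahost2", "role": "standby"},
    }
}

def startup_instructions(server):
    """Tell a starting service which domain to run in and, if it is a
    standby, which domain to replicate from."""
    domain = next(d for d, cfg in TOPOLOGY["domains"].items()
                  if cfg["server"] == server)
    if TOPOLOGY["domains"][domain]["role"] == "standby":
        source = next(d for d, cfg in TOPOLOGY["domains"].items()
                      if cfg["role"] == "active")
        return {"domain": domain, "replicate_from": source}
    return {"domain": domain, "replicate_from": None}
```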
Goal: Improve replication status
To improve the visibility into the replication process, we are going to start by creating a sample dashboard in CygNet Studio. It will show you every service in your system, what domain it’s running on, and, if it’s in standby mode, whether or not it’s ready to become the active service. One key metric we will display is how long it has been since each service was last fully in sync. We are going to build this dashboard in Studio so that you can modify it and add information that is meaningful in your environment. Another key change we are making in this area is that we hope to convert the ReplValidator utility into a service. This validation service would be able to run validation checks on a scheduled basis, store the results, and even fix problems that are discovered. It would then notify any services that have problems so that those services know they are out of sync. Although there is too much data in CygNet to validate every record in every service, our hope is that you will be able to schedule meaningful validation runs that will identify any problems with your replicated data.
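The "time since last fully in sync" metric could be computed along these lines. This is a sketch, not dashboard code: the `sync_status` function, its field names, and the five-minute threshold are assumptions made up for the example.

```python
# Hypothetical computation of the per-service sync metric the dashboard
# would display. Function and field names are invented for illustration.
from datetime import datetime, timedelta

def sync_status(last_full_sync, now, threshold=timedelta(minutes=5)):
    """Classify a standby service by how stale its replicated data is.
    'ready' means the lag is small enough to safely become active."""
    lag = now - last_full_sync
    return {"lag": lag, "ready": lag <= threshold}

now = datetime(2024, 1, 1, 12, 0)
fresh = sync_status(datetime(2024, 1, 1, 11, 58), now)  # 2 minutes behind
stale = sync_status(datetime(2024, 1, 1, 11, 0), now)   # 1 hour behind
```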
Goal: Replicate more data
You may be surprised to see that we are planning to replicate more data, so before you panic about what we do today, let me explain. The most common items in this category are the mostly static files that live in the service directories, like the time zone file in the ARS or the Admin.sec file in the ACS. Some of these don’t replicate at all, and others only replicate within the domain, so managing these files today is a challenge because you have to make your changes in multiple places. Another thing that doesn’t replicate today is the queue of current value records used to calculate alarms. This queue lives in the memory of the UIS and is used to calculate alarms that are only triggered if certain conditions are met over a period of time. It is refreshed pretty quickly, which is why it hasn’t been a big issue, but by replicating it, we will further reduce the impact of performing a failover. One more item we’ve identified is the list of executing commands triggered by the MSS. When you do a failover today, the UIS has to restart, so any commands that were executing get lost. That’s not usually an issue since most polling commands are on a pretty tight schedule, but if you happened to be in the middle of a command that executes only once a day, or even less often, it will be a while before the command runs again. Our plan is to track the set of commands executing in the UIS so that if you perform a failover, we can re-execute those commands once the UIS and MSS start back up.
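The command-tracking idea boils down to remembering what was in flight so it can be re-issued after the restart. A minimal sketch, assuming a tracker sitting alongside the command pipeline (the class and its method names are invented, not a CygNet interface):

```python
# Sketch of the command-tracking idea: remember which commands were in
# flight so they can be re-issued after a failover restart.
# (Invented names; not a CygNet interface.)

class CommandTracker:
    def __init__(self):
        self.in_flight = {}  # command id -> command description

    def started(self, cmd_id, command):
        self.in_flight[cmd_id] = command

    def completed(self, cmd_id):
        self.in_flight.pop(cmd_id, None)

    def replay_after_failover(self, execute):
        """Re-issue every command that never completed before the restart."""
        for command in self.in_flight.values():
            execute(command)

tracker = CommandTracker()
tracker.started(1, "poll_meter_A")      # finishes before the failover
tracker.started(2, "daily_report")      # interrupted by the failover
tracker.completed(1)

replayed = []
tracker.replay_after_failover(replayed.append)
# Only the interrupted once-a-day command gets re-executed.
```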
Goal: Decrease the time it takes to execute a failover
This really just comes down to service startup and shutdown times. We haven’t done the detailed analysis to give many specifics, but our goal is to be able to stop and then restart a medium-sized system in under a minute. There are a huge number of factors that go into this time, so please take it with a grain of salt. One specific idea is to start and stop the services in groups. Services do a lot of disk access on startup/shutdown, so by doing it in groups, you will be able to get the most important services running sooner.
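The grouped-startup idea could look something like this: start everything in a group concurrently, but wait for the whole group before moving on. The group membership and the choice of threads here are purely illustrative assumptions, not how the actual service manager will work.

```python
# Sketch of grouped service startup. Which services belong in which
# group is an invented example, not a recommendation.
import threading

START_GROUPS = [
    ["ACS", "UIS"],         # group 1: security and real-time data first
    ["PNT", "VHS"],         # group 2: point and history services next
    ["MSS", "ARS", "ELS"],  # group 3: everything else
]

def start_in_groups(groups, start_service):
    """Start the services within each group concurrently, but finish a
    whole group before starting the next. The most critical services
    come up first without every service hitting the disk at once."""
    for group in groups:
        threads = [threading.Thread(target=start_service, args=(svc,))
                   for svc in group]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

started = []
lock = threading.Lock()

def fake_start(svc):
    with lock:
        started.append(svc)  # stand-in for real service startup

start_in_groups(START_GROUPS, fake_start)
```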
Goal: Make CygNet more aware of a failover event
I’m sure it’s not a surprise to most of you to hear that our replication model was not designed to have the direction of replication change with any regularity. It is certainly possible, but often the downstream clients or systems have to do a full sync to recover. This is time consuming and error prone, so we are re-evaluating the entire solution to make sure every piece of it is aware and prepared for a change in the replication stream. Also, recovering from a failure right now is a manual process as it requires configuration file changes and possibly database maintenance. Our hope is to make the recovery of a failed service completely automated so that all you need to do is get the machine up and running again.
And that’s the plan. At least, that’s the plan right now. I’m sure it will be different tomorrow. I know that there are flaws and gaps, so please help us out and let us know what we’re missing. It is going to be a lot of work, but with your help, I’m confident that we can solve this redundancy challenge.