This article discusses sane system administration for a group of similarly-configured machines. It's also the basis for a talk that I'd like to give at the Spokane Linux User's Group.
Due credit: this is essentially my digested version of the concepts presented in the paper Bootstrapping an Infrastructure, and the derived material at http://www.infrastructures.org/. Some bits, like unified user accounts and NFS home directories, aren't important to me, so I haven't included them here. And I've been a lot more specific about choices of tools than the infrastructures.org folks have—you'll find a general Debian bias. =)
Due disclosure: I'm only in the initial stages of implementing for myself what I describe here. I'm writing this while the goals and principles are fresh in my mind. Actual mileage may vary, especially around the rewards described near the end.
Your current situation may look something like this: you manage several machines—could be a few or it could be hundreds. These machines could be servers, routers, or workstations. You have certain preferences and practices for what you like to have in common for all your machines.
- Security updates should run at 2:10am.
- All my web servers should have these bits of configuration.
- All my firewalls should have these particular rules.
- I really want curl and wget on all my machines, and pv is a neat little utility that I want everywhere, too.
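The first item above, for instance, boils down to a cron entry. This sketch is one way to do it (a plain apt-get invocation is an assumption on my part; tools like cron-apt can do the same job, and note that this upgrades everything from the configured repositories, not only security fixes):

```
# /etc/cron.d/security-updates (illustrative)
# m  h  dom mon dow  user  command
10   2  *   *   *    root  apt-get -qq update && apt-get -qq -y upgrade
```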
But there are problems:
- You find that your machines, in aggregate, easily diverge from your ideal vision, gradually decaying into chaos.
- You find yourself doing the same thing over and over again. Shell loops and SSH can only get you so far, and mistakes lead to even greater divergence among machines.
You want the things that should be common across your machines to actually be consistent. You want to quit repeating yourself.
Here, we'll be discussing what to do about your situation.
1. Figure out how to install a consistent machine image, automatically if you can.
I'm using Debian preseeding because I use Debian, and FAI is too complex and hard to learn. FAI can manage machine configuration, and to some extent it has to, but that overlaps with what I want Puppet to do (see below). Debian preseeding is "as simple as possible, but no simpler."
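To give a flavor of what preseeding looks like, here's a fragment of a preseed file. The values are illustrative assumptions; the Debian installation guide documents the full set of keys:

```
# Fragment of a Debian preseed file (values are illustrative).
d-i mirror/country string manual
d-i mirror/http/hostname string ftp.us.debian.org
d-i mirror/http/directory string /debian
d-i clock-setup/utc boolean true
d-i time/zone string US/Pacific
# Install the packages every machine should have,
# plus the configuration management client:
d-i pkgsel/include string curl wget pv puppet
```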
2. Learn the hierarchy of data.
Your infrastructure is divided into a number of machines. The data on each machine is divided into system and non-system data.
System data means installed programs, initialization scripts, and other stuff that comes from your OS distribution. System stuff is boilerplate. It's easy to replace this.
Non-system data is everything that makes a machine unique. It is further divided into configuration and local state.
Configuration is easy: parameters that define what a machine is supposed to do, and how it's supposed to do it. This includes the selection of installed software packages, and most of everything in /etc.
Local state is the non-configuration, non-distribution-supplied data that your applications need in order to operate. This would include the web root hierarchy of a web server, the mail spool and mailboxes on a mail server, or the zone files of a DNS server. Note that this includes both machine-generated state (mail spool directories) and human-generated state (web roots). The local state includes most of everything in /var.
3. Understand key goals when it comes to machine data.
System data can be thrown away and replaced easily, because you can reimage a new machine consistently.
Configuration data should largely be consistent across machines.
- Some things will always vary: hostname, IP address, and so forth.
- Some things you'll want to be standardized: NTP configuration, smarthost configuration for non-mail servers, OS package repository selection, a set of software packages you want to have available on all machines, and so forth.
- Some things will be standardized, but only for certain classes of machines: web servers will always have Apache with certain configuration bits, mail servers will always run Postfix with other configuration bits—or whatever you prefer, of course.
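In Puppet, for example, the three kinds of standardization above might be sketched like this (the class names, node name, and package choices are illustrative assumptions, not a prescription):

```
# Standardized everywhere:
class common {
  package { ["curl", "wget", "pv"]: ensure => installed }
}

# Standardized for a class of machines:
class webserver {
  include common
  package { "apache2": ensure => installed }
  service { "apache2": ensure => running }
}

# Per-machine variation lives in the node definition:
node "www1.example.com" {
  include webserver
}
```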
A machine's configuration is important—of course you don't want to lose it. But we'll be managing configuration from a central location, so if it gets physically lost on a particular machine, we can recreate it with the configuration management system (discussed below).
Local state is the crux of a machine's data. Your web server just won't work correctly unless you have the right HTML pages in place. You cannot lose this data, so you'll back it up regularly. With the methodology described here, it won't be hard to make this automatic.
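A minimal sketch of automating that backup, assuming a simple tar-to-dated-archive scheme (the paths and naming are my assumptions; rsync to a backup host would work just as well):

```shell
#!/bin/sh
# Archive one local-state directory into a dated tarball under a
# backup root. Both paths are illustrative assumptions.
backup_state() {
  state_dir="$1"    # e.g. /var/www on a web server
  backup_root="$2"  # e.g. an NFS mount, or a directory synced offsite
  mkdir -p "$backup_root"
  archive="$backup_root/$(basename "$state_dir")-$(date +%F).tar.gz"
  tar -czf "$archive" -C "$(dirname "$state_dir")" "$(basename "$state_dir")"
  echo "$archive"
}
```

Run something like this from cron nightly for each state directory, and the "automatic" part takes care of itself; add rotation to taste.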
4. Implement a system for centralized, managed configuration.
You have an ideal vision of what the common configuration of your machines should be. You need the ability to express that configuration, as well as the parts that differ on each machine. The differences are conditional on certain variables: what's the MAC address or the hostname of the machine? What's the role or class (web, mail, etc.) that I've assigned to it? And so forth.
You need to be able to express this information in one place. When you make a configuration change, your running machines will synchronize to reflect that change.
A configuration management system should include:
- A procedure to express configuration in this fashion, and
- The tools for your machines to apply the configuration.
Add a client for Puppet or Cfengine to your standard machine installation image. Your new machines will automatically configure themselves.
- Installing a machine from your installation image results in a plain, stripped-down configuration.
- The configuration management system updates the new machine's configuration to whatever it needs to be, as you define in the managed configuration.
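How the machines "apply the configuration" depends on the tool. With Puppet, one common arrangement is a periodic pull from each machine; a cron entry like this is one sketch (the exact flags vary by Puppet version):

```
# /etc/cron.d/puppet-run (illustrative)
# Pull and apply the managed configuration twice an hour:
*/30 * * * *  root  puppet agent --onetime --no-daemonize
```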
You'll also want to log, timestamp, and document your changes to the managed configuration. You'll want to see which of your admins made what change, and why. Which files were touched? What were the lines that were altered? What did the file look like before the change?
Applying a revision control system (such as Git, Darcs, or Subversion) to your managed configuration makes this possible. When you do this, you have a complete history of configuration changes, and can roll back changes that turn out to be mistakes.
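Bootstrapping that history with Git might look like this (the directory layout, identity, and commit message are assumptions for the sketch):

```shell
#!/bin/sh
# Put an existing managed-configuration tree under Git.
init_config_repo() {
  repo_dir="$1"   # e.g. /etc/puppet on the management server
  (
    cd "$repo_dir" &&
    git init -q &&
    git add -A &&
    git -c user.name="admin" -c user.email="admin@example.com" \
        commit -q -m "Import managed configuration"
  )
}
```

From then on, every change to the managed configuration is a commit, and `git log -p` answers "who changed what, when, and what did the file look like before."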
5. The golden rule.
Now that you have a configuration management system in place, follow our golden rule for infrastructures:
Never, ever change the configuration data or system data of a machine in a way which is outside the control of your configuration management system.
Violating this rule is also known as "cowboy sysadmining." It's very tempting to do, because it's the quick and lazy thing. It's also quite common in many (if not most) IT shops. However, it causes long-term damage.
A violation of this rule could be considered a defect or an error in the machine. When you break this rule, a machine diverges from your ideal vision. One of two things can happen:
- The configuration management system detects the error and corrects it, undoing your work.
- The error goes undetected, and the machine is permanently different from all the other machines in your infrastructure.
Possibility #2 is the more dangerous of the two. In this case:
- The machine behaves unpredictably. It's different from all the other machines of its type, so it's hard to reason about its behavior.
- Personnel trained to Do The Right Thing by using the configuration management system will probably not be aware of the variation. There's no automatic documentation or audit trail, like there is with a revision-controlled configuration management system.
- If the machine needs to be replaced, the new machine will not carry this change, since the change isn't tracked. So, the replacement of a machine is now a dangerous and risky undertaking.
I would go so far as to say it's not worth doing configuration management if you won't stick to this golden rule. After all, you wanted predictability in your infrastructure, right? You wanted to keep the important stuff in common, right?
If you find yourself with a compelling reason to compromise on this, think carefully about how to do it safely. Develop a procedure comparable to how you would update the managed configuration, to make sure that your infrastructure doesn't gradually descend into chaos once again.
6. Reap the rewards.
Need to add a new web server? It's as simple as:
- Imaging a machine,
- Updating your managed configuration to say "the machine with MAC address <foo> has IP address <bar>, and it's a web server,"
- And letting the machine configure itself.
Now just drop in your web content.
Need to replace the failed hard disk of a critical server? This, too, should be pretty easy:
- Image a machine.
- Restore the backup you made of the local server data onto the new machine.
- Update your managed configuration to say "this particular critical server no longer has the MAC address <foo>. Now it's <bar>."
- Watch the machine configure itself.
If you did everything right, you'll see your replacement machine come up and take off running. Just like that.
Welcome to the future.
A. "But that would be putting all my eggs in one basket. And then someone pwns my basket."
It's true. Implementing a configuration management system on a centralized server is a security risk. That's because each of your managed machines will do whatever your configuration management server tells them to do, automatically. If a bad guy cracks your management server, they could change the configuration to say "download r00tsh3ll-3.1.337.tar.gz from my web server, untar it in /tmp, move this executable file to this system binaries directory, and run it."
But this is a risk that can be managed. And if you manage it well, it's probably less risky than what you're doing now.
You have to log into your servers for maintenance somehow. In non-managed situations, that "somehow" is probably "with SSH from a system administrator's workstation or laptop." If a cracker gets access to the admin's computer, they can spread the compromise to every other machine it has access to.
With a centralized, formal configuration management server, you'll be able to lock that server down and make sure nobody's visiting Flash-laden porn web sites on the critical server.
See also: http://www.infrastructures.org/