boot or not: virtual interface vote

James Brigman ncsa-discussion@ncsysadmin.org
Tue, 1 Oct 2002 15:07:46 -0400


Lisa;

I have to second Itzok's virtual interfaces suggestion. How the server is
configured may change the details, though.

> I'm running Solaris 2.7 on a new Sun E220R.  This server is a hot backup
> for another server, but can't be connected to the network (same IP
> addresses configured on the NICs).

1) What services are being provided by the server and must be taken over by
the backup? Running Oracle would greatly influence the best answer: Oracle
has built-in features that can accomplish the same thing as virtual
interfaces.

2) How much failover do you need? Is it OK to require human intervention,
and is it OK for it to take as long as a machine reboot? Would it be good to
be able to use the box for other things (like sharing the load of the first
box) instead of making hot air?

> * Powered off.
...
> * Powered on,
...
> * Powered on,
...
> Which would you recommend, and why?

Powered on and, hopefully, sharing the load. I like the virtual interfaces
idea, running the hot backup machine and using it to spread the load. If you
can put together something that'll fail over unattended, you'll get to be
the Solaris MegaStud in the manager's eyes. If you can put both boxes in a
concurrent load-sharing situation, then you'll never have 100% of your users
down if one box fails. In that case, you've got a grey area to work with
since that does not meet the strict definition of an "outage". Politically,
having two machines running can really work in your favor in the event of a
failure.
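
To make the unattended failover idea a bit more concrete, here is a minimal
sketch in Python of the watchdog logic the backup box could run. It is a sketch
under assumptions, not a tested Solaris recipe: the address 192.0.2.10, the
port, the logical interface name hme0:1, the netmask, and the threshold of
three missed checks are all placeholders I made up, and the ifconfig command
line is only my assumption of how the shared address would be brought up on
your box. Check it against your own ifconfig man page before trusting it.

```python
import socket
import subprocess
import time

# All of these values are hypothetical placeholders, not from the original post.
SERVICE_ADDR = ("192.0.2.10", 80)   # shared service IP and a port to probe
TAKEOVER_CMD = ["ifconfig", "hme0:1", "192.0.2.10",
                "netmask", "255.255.255.0", "up"]  # assumed syntax
MAX_MISSES = 3                      # consecutive failed checks before takeover


def primary_alive(addr=SERVICE_ADDR, timeout=2.0):
    """Return True if a TCP connection to the primary's service succeeds."""
    try:
        with socket.create_connection(addr, timeout=timeout):
            return True
    except OSError:
        return False


def bring_up_service_ip():
    """Run the (assumed) command that claims the service IP on the backup."""
    subprocess.run(TAKEOVER_CMD, check=True)


def watch(check, take_over, max_misses=MAX_MISSES, interval=10,
          sleep=time.sleep):
    """Poll check(); after max_misses consecutive failures call take_over().

    Returns the number of checks performed before taking over.
    """
    misses = 0
    checks = 0
    while True:
        checks += 1
        # A single successful check resets the failure counter.
        misses = 0 if check() else misses + 1
        if misses >= max_misses:
            take_over()
            return checks
        sleep(interval)
```

On the backup you would call something like
watch(primary_alive, bring_up_service_ip). Requiring several misses in a row
keeps one dropped probe from triggering a takeover. It does not protect you
from a network fault that leaves both boxes up and both claiming the address,
so you would still want a way to fence off or power down the primary.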

> (I'm particularly curious whether disk issues are still considered that
> much of a concern when power-cycling a Sun box.  I know that was an
> issue for Solaris boxen several years ago, but I haven't seen those
> problems recently, and I thought Sun was over that.  Am I just lucky?

I positively dispute that Solaris has any cruft requiring machine reboots. I
regard that as a "Microsoft prejudice" and have experienced conflict with
users who actually demanded I schedule a weekly UNIX server reboot.

If you keep the "hot backup" cold (powered off), there are two problems that
Suns have had to deal with in the past, both of which got some mention from
the other folks who responded to you: a) The disk not spinning up. Seagate
used to have a problem with bearings. It's nice if you can keep a disk
running in a steady state, but the problem was a temporary one that I've not
seen in a "modern" disk drive (drives produced in the last 3 years).
b) Inability to apply patches and keep the machine synchronized. Both of
these issues are resolved by using virtual interfaces or, in the case of
Oracle and portal apps, by moving the Oracle web server off the database box
onto two small front-end UNIX systems.

Backup bonus points: you can bring one or the other offline to back it up
for bare metal recovery with no downtime visible to the users. More
importantly, you can restore the box with no visible downtime, and you can
TEST a bare metal recovery the same way. (wow...this just gets better and
better...;-)

Nothing we've told you can overcome the ultimate disaster recovery scenario
where the computer room is blasted to bits or loses power beyond the range
of the UPS/generator. But that scenario may not be realistic and is
certainly expensive to plan for. More relevant are the everyday, commonplace
"disasters", like user error, disk or disk bus failure, or some goober
kicking out a power cord.

Also keep in mind: if you must apply patches in the future, having load
sharing/virtual interfaces running really gives you a tremendous edge. You
can patch one system after 5pm and run it awhile to see what happens. If it
breaks something (common in the Oracle world) then you just bring the box
down the next day, back the patches out, and turn in your observed
results/complaint to the responsible party. Again, you get to be the Solaris
MegaStud.

If you can afford to keep a machine sitting there cold, then running in this
configuration will give you far greater results for the same money. I've run
HP-UX systems serving Oracle databases in a similar manner and it saved my
bacon several times. Applying both O/S and Oracle patches turned out to be a
joy and a routine task rather than the nightmare from hell it ends up being
with only one "mission critical" server. Turns out I wouldn't run a critical
Oracle database any other way. The political capital and user confidence I
was able to build were invaluable.

JKB