[TriLUG] Jihad! (was Remote server monitoring)

William Sutton william at trilug.org
Thu Sep 1 12:56:27 EDT 2005


A lot of good stuff....yes, there probably needs to be a better solution 
to the whole problem...after all, how do you determine whether a lack of 
httpd response stems from the server using a slow LDAP server for logins 
rather than from httpd being turned off or from local system processes 
putting a severe load on the box?

As I said, I've been on the development side of an in-house system that 
does client-based monitoring, so I have a particular viewpoint on the 
entire process.  You've helped me see that it's not the only (or 
necessarily best) solution.

I'm beginning to think that (even with all of the existing monitoring 
tools) something should be developed that provides data from both inside 
and outside a server, with a set of rule-based processes on the data 
collection/reporting server that intelligently ties that information 
together...

e.g., a high CPU load, low memory usage, and high network latency when 
connecting to a server, coupled with low network latency to another server 
in the same rack, could tell you that server A is having LDAP server 
issues (or whatever)....data you can't get just by looking at the services 
provided or at internal vmstat data.
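
Something like this toy rule set is roughly what I'm picturing (the metric 
names, thresholds, and rules below are invented purely for illustration):

def diagnose(sample):
    """sample: one server's inside + outside metrics, plus a rack neighbor."""
    rules = [
        # high load, plenty of free memory, slow to reach from outside but
        # the rack neighbor is fine -> probably blocked on a remote backend
        (lambda s: s["cpu_load"] > 4.0 and s["mem_used_pct"] < 50
                   and s["latency_ms"] > 500 and s["neighbor_latency_ms"] < 20,
         "server-local stall, likely a slow backend (LDAP/NFS/DB)"),
        # both boxes in the rack are slow to reach -> look at the network
        (lambda s: s["latency_ms"] > 500 and s["neighbor_latency_ms"] > 500,
         "probable network problem on the path to that rack"),
        # service dead but process alive -> wedged daemon, check its backends
        (lambda s: not s["httpd_responding"] and s["httpd_process_running"],
         "httpd running but not answering -- check what it depends on"),
    ]
    return [diagnosis for test, diagnosis in rules if test(sample)]

print(diagnose({
    "cpu_load": 6.2, "mem_used_pct": 30,
    "latency_ms": 900, "neighbor_latency_ms": 5,
    "httpd_responding": True, "httpd_process_running": True,
}))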

William

On Thu, 1 Sep 2005, Shane O'Donnell wrote:

> William -
> 
> Thanks for keeping us grounded in reality.
> 
> You're really discussing two separate issues -- network availability
> monitoring and systems performance monitoring.  To almost any company,
> these are at the core of their monitoring goals.  Unfortunately, many
> companies have grown to the point that organizationally they've split
> up the server folks from the network folks (as opposed to having a
> larger "services" focus, but that's another topic altogether).
> 
> What I've seen is that in companies where the function is split, the
> server folks take on a "server-centric" view of the universe and
> typically go about deploying agents to servers because, well, that's
> what they do--maintain software on servers.  They chalk up the spotty
> availability to "network problems" that are outside their scope or
> area of responsibility.  There is nothing wrong with this approach,
> until you get to the user's take on a situation.  When a user can't
> access a resource, there is a service-related problem that should not
> involve finger-pointing.  This means the server guys should have an
> idea as to what's going on on the network (meaning they need insight
> into simple and up-to-date availability reports) as well as data from
> their servers over the period of time during which they can't be
> reached.  The solution to the problem usually ends up leaning toward
> the expertise of the area that solves the problem; the network guys go
> with a polling approach while the server guys fall toward the agent
> side.
> 
> Personally, I tend toward the polling approach, for a few reasons:
> 
>  - Agents can be a bitch to maintain
>  - Agents arguably intrude on (and steal resources from) the
> systems/apps they're supposed to be managing/monitoring
>  - If you can't reach a box, there are usually bigger problems than
> what's going on on the box
>  - Good polling solutions collect data over multiple time periods, so
> small gaps can be interpolated for reporting purposes (a rough sketch of
> the idea follows right below)
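> 
> (Here's that gap-filling sketch -- nothing product-specific, just linear 
> interpolation across short runs of missed polls:)
> 
> def fill_gaps(samples, max_gap=3):
>     """samples: list of readings, None where a poll was missed."""
>     out = list(samples)
>     i = 0
>     while i < len(out):
>         if out[i] is None:
>             j = i
>             while j < len(out) and out[j] is None:
>                 j += 1
>             gap = j - i
>             # only fill short interior gaps; leave long outages visible
>             if 0 < i and j < len(out) and gap <= max_gap:
>                 lo, hi = out[i - 1], out[j]
>                 for k in range(gap):
>                     out[i + k] = lo + (hi - lo) * (k + 1) / (gap + 1)
>             i = j
>         else:
>             i += 1
>     return out
> 
> # fill_gaps([1.0, None, None, 4.0]) -> [1.0, 2.0, 3.0, 4.0]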
> 
> As an example, OpenNMS is configured so that if it discovers a box
> running the Net-SNMP agent (which ships by default with Red Hat, SuSE,
> etc.), it will collect performance metrics on CPUs, network
> performance, disks (IIRC), and most interestingly, it will collect the
> 1-5-15 minute load metrics.  All this data gets automagically slammed
> into JRobin databases and graphs are dynamically generated by the UI,
> on demand.
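> 
> For what it's worth, here's a quick sketch of pulling those load averages 
> yourself from a Net-SNMP agent by shelling out to snmpget (it assumes the 
> net-snmp command-line tools are installed and an SNMP v2c community of 
> "public" -- adjust for your site):
> 
> import subprocess
> 
> LA_OIDS = {                      # UCD-SNMP-MIB::laLoad.1/.2/.3
>     "load1":  ".1.3.6.1.4.1.2021.10.1.3.1",
>     "load5":  ".1.3.6.1.4.1.2021.10.1.3.2",
>     "load15": ".1.3.6.1.4.1.2021.10.1.3.3",
> }
> 
> def poll_load(host, community="public"):
>     loads = {}
>     for name, oid in LA_OIDS.items():
>         result = subprocess.run(
>             ["snmpget", "-v2c", "-c", community, "-Oqv", host, oid],
>             capture_output=True, text=True, check=True)
>         # laLoad comes back as a string; strip any quoting before converting
>         loads[name] = float(result.stdout.strip().strip('"'))
>     return loads
> 
> # print(poll_load("server-a.example.com"))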
> 
> For reporting purposes, this out-of-the-box functionality is usually
> sufficient.  If you need to augment this with logs of performance
> metrics from the remote machines, I'd recommend a lighter weight
> approach--cron jobs that capture df/netstat/load/proc data to a file
> for access if the network is unavailable.  If you need a solution that
> reports on data collected on a batch basis from machines that are
> regularly inaccessible, you'll probably want to look to a full-blown
> agent solution--and you should be prepared for the maintenance
> overhead that's associated therewith.
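> 
> Something along these lines dropped into cron (say, every five minutes) 
> is usually all it takes; the log location and command list here are just 
> examples:
> 
> #!/usr/bin/env python
> # append a timestamped df/netstat/load/meminfo snapshot to a local log,
> # e.g. from a crontab entry like: */5 * * * * /usr/local/bin/healthcheck.py
> import subprocess, time
> 
> LOGFILE = "/var/log/local-healthcheck.log"
> COMMANDS = [["df", "-h"], ["netstat", "-tn"], ["uptime"],
>             ["cat", "/proc/meminfo"]]
> 
> with open(LOGFILE, "a") as log:
>     log.write("==== %s ====\n" % time.strftime("%Y-%m-%d %H:%M:%S"))
>     for cmd in COMMANDS:
>         result = subprocess.run(cmd, capture_output=True, text=True)
>         log.write("--- %s ---\n%s\n" % (" ".join(cmd), result.stdout))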
> 
> Hope this helps,  
> 
> Shane O.
> On 9/1/05, William Sutton <william at trilug.org> wrote:
> > Hmmm...
> > 
> > The question wasn't entirely theoretical.  We have an in-house developed
> > system monitoring tool at $WORK to make sure that our servers aren't being
> > bogged down by manufacturing processes (a lot of back-end stuff going on
> > with databases and so on).  We also have a large worldwide VPN where
> > segments run over hardware we don't own or control.  Consequently, fixing
> > the outages isn't an option...
> > 
> > FWIW....
> > 
> > On Thu, 1 Sep 2005, Tarus Balog wrote:
> > 
> > >
> > > On Sep 1, 2005, at 12:14 PM, William Sutton wrote:
> > >
> > > > It seems like a more sensible alternative to polling is to have
> > > > separate
> > > > tools for monitoring and data collection/reporting:  Place the
> > > > monitor on
> > > > the servers, and allow them to queue up reports in the event of network
> > > > problems.
> > >
> > > Depends on what you want to monitor. I can have a program check if
> > > apache is running on the server, but does that mean that server is
> > > available in LA? New York? If all you care about is "is there an
> > > apache process running on this server that I can connect to, from
> > > this server" then, yeah. If you want to measure service availability,
> > > you need to measure it from the user's point of view. If Travelocity
> > > is slow, I go to Orbitz, whether or not the Travelocity server is
> > > actually up as far as they are concerned. In my case, I want to
> > > capture the user experience.
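> > >
> > > Something as simple as timing an actual fetch from the vantage points 
> > > you care about gets most of the way there; a rough sketch (the URL is 
> > > just a placeholder):
> > >
> > > import time
> > > import urllib.request
> > >
> > > def check(url, timeout=10):
> > >     """Return (HTTP status or None, seconds elapsed) as a user sees it."""
> > >     start = time.time()
> > >     try:
> > >         with urllib.request.urlopen(url, timeout=timeout) as resp:
> > >             resp.read()
> > >             return resp.status, time.time() - start
> > >     except Exception:
> > >         return None, time.time() - start
> > >
> > > # run it from LA, from New York, etc. -- not from the web server itself
> > > # print(check("http://www.example.com/"))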
> > >
> > > You can also place "agents" on systems, but agent management outside
> > > of what ships with an O/S can be problematic on an enterprise scale. I
> > > guess you could write an agent to store performance data, like CPU,
> > > disk, etc., and then report it up to an NMS, but many people would
> > > rather spend resources to fix issues with the "spotty" network and
> > > leave it at that.
> > >
> > > -T
> > >
> > > -----
> > >
> > > Tarus Balog
> > > The OpenNMS Group, Inc.
> > > Main  : +1 919 545 2553   Fax:   +1 503-961-7746
> > > Direct: +1 919 647 4749   Skype: tarusb
> > > Key Fingerprint: 8945 8521 9771 FEC9 5481  512B FECA 11D2 FD82 B45C
> > >
> > >
> > --
> > TriLUG mailing list        : http://www.trilug.org/mailman/listinfo/trilug
> > TriLUG Organizational FAQ  : http://trilug.org/faq/
> > TriLUG Member Services FAQ : http://members.trilug.org/services_faq/
> > TriLUG PGP Keyring         : http://trilug.org/~chrish/trilug.asc
> > 
> 