[TriLUG] What could be going on with my nameserver?

Tue Nov 1 18:06:30 EST 2005

Rick DeNatale wrote:

>I'm plagued by what looks like an intermittent problem with my nameserver setup.
>
>I'm running bind9 as a caching name server, and to resolve local
>addresses on my LAN.
>
>From time to time, resolution of internet names seems to stop for a
>while.  Sometimes it's all external names, and sometimes it's only
>some.  For example, right now I can resolve www.google.com, but not
>en.wikipedia.org.
>
>The bind configuration has a forward first directive, and a forwarders
>directive to forward to my netgear router which in turn forwards to
>the name servers it gets from my isp via dhcp. The router's local ip
>address is 192.168.0.11
>
>Here's some recent attempts to figure out what's going on using dig.
><trimmed>
>;; QUESTION SECTION:
>;www.google.com.                        IN      A
>
>;; ANSWER SECTION:
>www.google.com.         310     IN      CNAME   www.l.google.com.
>www.l.google.com.       270     IN      A       64.233.161.99
>www.l.google.com.       270     IN      A       64.233.161.104
>www.l.google.com.       270     IN      A       64.233.161.147
>
>;; QUESTION SECTION:
>;en.wikipedia.org.              IN      A
>
>;; ANSWER SECTION:
>en.wikipedia.org.       1288    IN      CNAME   rr.wikimedia.org.
>rr.wikimedia.org.       175     IN      CNAME   rr.pmtpa.wikimedia.org.
>rr.pmtpa.wikimedia.org. 1222    IN      A       207.142.131.246
>
>;; QUESTION SECTION:
>;www.google.com.                        IN      A
>
>;; ANSWER SECTION:
>www.google.com.         822     IN      CNAME   www.l.google.com.
>www.l.google.com.       231     IN      A       64.233.161.99
>www.l.google.com.       231     IN      A       64.233.161.104
>www.l.google.com.       231     IN      A       64.233.161.147
><end trimmed>
>So I can get google resolved via my local nameserver, but I can only
>resolve en.wikipedia.org if I bypass the local nameserver and go
>directly to the netgear router.
>  
>
The results you pasted above all have ANSWER sections with valid A
records, meaning that these were all successful DNS queries.  I don't
doubt that you're having a DNS problem; I just wanted to point out that
the output above doesn't show the failure itself, so my answers are
just speculation.

>As I said these problems seem to come and go.  Resolution of local
>names seems solid (they're all in a local subdomain
>local.denhaven2.com). Restarting bind doesn't seem to make a
>difference.
>
>Any ideas?
>  
>
I can make a pretty good educated guess.  A good way to test it would be
to measure how long the queries keep failing, although that's tricky
unless you happen to be using the server when the failure starts.  So
here's the guess.  Your NetGear router is imperfect, and has a pretty
slow CPU.  When you look up a name, your BIND server passes the query to
the NetGear router.  The router then attempts to forward that query to
the remote name server, get the response, and return it to BIND.  If
that process takes more than X seconds (I don't know X off the top of my
head), or fails for some other reason (specifically something like an
NXDOMAIN or SERVFAIL, which the NetGear may incorrectly return if *it*
gets a timeout), then BIND will negatively cache that record.  So for
roughly the next 5 to 10 minutes BIND won't try to look up that name,
because it just tried it and it failed, so there's no sense trying it
again right away (this is a debatable point, of course).
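
To measure how long a name keeps failing, something like the following
probe might do.  It is only a sketch: `probe_once` and `DIG_CMD` are
names I made up for illustration, the 2-second timeout is arbitrary, and
it assumes your `date` supports `+%s`.

```shell
# Print one line per probe: "<epoch-seconds> <status>".
# Hypothetical helper; DIG_CMD (an assumption, not a dig feature) lets a
# stub stand in for dig when testing.
probe_once() {
    name=$1
    server=$2
    # Extract the rcode (NOERROR, SERVFAIL, ...) from the header line.
    status=$(${DIG_CMD:-dig} "$name" "@$server" +time=2 +tries=1 2>&1 |
        sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | head -n 1)
    # No status line at all means a timeout, or dig failed to run.
    printf '%s %s\n' "$(date +%s)" "${status:-NORESPONSE}"
}
```

Run it in a loop, e.g.
`while :; do probe_once en.wikipedia.org 127.0.0.1; sleep 15; done >> probe.log`,
and the gap between the first and last non-NOERROR lines gives a rough
idea of how long the failure lasted.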

So how can you detect this type of failure?  Well, you can dig the local
nameserver right away, while it's failing, and look at the output.  You'd
do this with something like `dig bad.domain.tld @localhost` on the BIND
server.  You would see either a response with no ANSWER section or a
"connection timed out" error.  A timeout would indicate that the name
is not negatively cached, but that BIND is unable to look it up
upstream.  An empty answer section is the likely result of a negatively
cached answer.  Unfortunately, it's really hard to chase down and prove
that something is negatively cached in BIND, as it doesn't create an
entry in a dumpdb (that I've found), but it does fail quickly on such
queries, without generating an upstream query.  One way to check is to
dig locally while running `tcpdump -ni eth0 port 53` and see if any
traffic goes out during the dig; I don't know of a better way to sift
for that information.
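
To tell those cases apart quickly, the dig output can be run through a
small filter.  This is a rough sketch, not a definitive test:
`classify_dig` and its three labels are my own invention, and it only
looks at the status field and whether an ANSWER section is present.

```shell
# Classify dig output read from stdin. Hypothetical helper; the three
# labels are illustrative, not anything dig or BIND prints itself.
classify_dig() {
    awk '
        BEGIN { status = "UNKNOWN" }
        /connection timed out/ { timeout = 1 }
        /status: / {
            # Pull the rcode out of the ";; ->>HEADER<<-" line.
            s = $0
            sub(/.*status: /, "", s)
            sub(/,.*/, "", s)
            status = s
        }
        /ANSWER SECTION/ { answer = 1 }
        END {
            if (timeout)     print "timeout"
            else if (answer) print "answered " status
            else             print "no-answer " status
        }
    '
}
```

Usage would be something like
`dig bad.domain.tld @localhost 2>&1 | classify_dig`.  A "no-answer"
result that comes back instantly, with nothing showing up in tcpdump, is
the pattern I'd expect from a negatively cached name.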

Generally, there's no good reason to have that NetGear box in the
middle, and my gut instinct is that it's the problem.  Configure a few
fast forwarders directly in your local BIND nameserver, and go on with
life.  If my suspicions aren't correct, and you can gather some more
definitive queries, perhaps I can help chase further into the problem.
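
For the forwarders change, a small helper like this can print the
statements to merge by hand into the existing options block of
named.conf.  Treat it as a sketch: `print_forwarders` is a made-up name,
and the addresses in the usage line are documentation placeholders, not
real resolvers; substitute the name servers your ISP hands out.

```shell
# Print "forward"/"forwarders" statements for the given resolver
# addresses. Hypothetical helper: the output is meant to be merged by
# hand into the existing options { ... } block of named.conf.
print_forwarders() {
    printf 'forward first;\n'
    printf 'forwarders {\n'
    for ip in "$@"; do
        printf '    %s;\n' "$ip"
    done
    printf '};\n'
}
```

For example, `print_forwarders 192.0.2.1 192.0.2.2` prints a
`forward first;` line followed by a `forwarders { ... };` block with one
address per line, which would replace the entry that currently points at
the router.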

Best of luck!
Aaron S. Joyner