[TriLUG] (no subject)

Len Boyle Len.Boyle at sas.com
Mon Feb 27 15:10:57 EST 2012


One possible issue is that you are running out of available socket connections.
This could be at the global level, or at a smaller level if you have restrictions on which port ranges can be used.
We have seen this with backup servers and had to make changes to reduce the number of sockets tied up in the
TCP FIN_WAIT_2 state. On Solaris there is a setting for this:

    /usr/sbin/ndd -set /dev/tcp tcp_fin_wait_2_flush_interval 67500

I do not believe that this exists in the same form on Linux.
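
On the Linux side, the nearest settings I am aware of are net.ipv4.tcp_fin_timeout (how long an orphaned connection may sit in FIN_WAIT_2) and net.ipv4.ip_local_port_range (the ephemeral port range used for outgoing connections). Treating these as the relevant knobs is an assumption on my part, not a confirmed diagnosis of the problem. A minimal Python sketch that reads both from /proc and reports how many ephemeral ports the range allows; the helper names are made up for illustration:

```python
from pathlib import Path

# Standard location of the IPv4 sysctls on Linux (assumed to be mounted).
PROC_IPV4 = Path("/proc/sys/net/ipv4")


def parse_port_range(text):
    """Parse the contents of ip_local_port_range, e.g. '32768\t60999'."""
    low, high = (int(field) for field in text.split())
    return low, high


def ephemeral_port_count(low, high):
    """Number of ports in the inclusive range low..high."""
    return high - low + 1


def read_setting(name, base=PROC_IPV4):
    """Return the stripped contents of a sysctl file, or None if unreadable."""
    try:
        return (base / name).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    raw_range = read_setting("ip_local_port_range")
    if raw_range is None:
        print("ip_local_port_range: not readable on this system")
    else:
        low, high = parse_port_range(raw_range)
        print(f"ephemeral ports: {low}-{high} "
              f"({ephemeral_port_count(low, high)} available)")

    fin_timeout = read_setting("tcp_fin_timeout")
    if fin_timeout is None:
        print("tcp_fin_timeout: not readable on this system")
    else:
        print(f"tcp_fin_timeout: {fin_timeout} seconds")
```

If the port range turns out to be narrow compared with the number of sockets in use at peak, that would be consistent with the "restricted port range" case above. It would not prove it.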



-----Original Message-----
From: trilug-bounces at trilug.org [mailto:trilug-bounces at trilug.org] On Behalf Of Joseph S. Tate
Sent: Monday, February 27, 2012 1:51 AM
To: TriLUG
Subject: [TriLUG] (no subject)

I'm running into a problem that's really kicking my tail; when my server gets under high network load, I'm getting connection timeouts.  These don't just happen on the port with the high load, but even on ssh's port too.
What are some things to do to track down why the connections are failing?

I've got a web server that's running varnish + nginx + a python app on the same box.

netstat -nt shows the number of connections topping out at about 500.  Lots of them in TIME_WAIT (75% or more).
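
To put numbers on the per-state breakdown, the output of netstat -nt can be tallied by its State column. The following is a minimal Python sketch, assuming the usual Linux net-tools layout where the state is the sixth whitespace-separated field; the sample rows are invented for illustration and use documentation-only addresses, not data from this server:

```python
from collections import Counter

# Invented sample in the layout printed by `netstat -nt` on Linux.
SAMPLE = """\
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 192.0.2.10:80           198.51.100.7:51234      TIME_WAIT
tcp        0      0 192.0.2.10:80           198.51.100.8:40022      TIME_WAIT
tcp        0      0 192.0.2.10:22           203.0.113.5:60111       ESTABLISHED
tcp        0      0 192.0.2.10:80           198.51.100.9:38790      FIN_WAIT2
tcp        0      0 192.0.2.10:80           198.51.100.9:38791      TIME_WAIT
"""


def count_tcp_states(netstat_output):
    """Count connections per TCP state in `netstat -nt` style text."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Data rows start with tcp or tcp6 and carry the state in column six;
        # the two header lines do not match and are skipped.
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            counts[fields[5]] += 1
    return counts


def state_share(counts, state):
    """Fraction of all counted connections that are in the given state."""
    total = sum(counts.values())
    return counts[state] / total if total else 0.0


counts = count_tcp_states(SAMPLE)
for state, number in counts.most_common():
    print(f"{state:12s} {number}")
print(f"TIME_WAIT share: {state_share(counts, 'TIME_WAIT'):.0%}")
```

Feeding the function real netstat -nt output captured during a load spike, instead of SAMPLE, would show whether FIN_WAIT2 or TIME_WAIT dominates when the timeouts occur.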

vmstat shows CPU to be ok, with some interrupt and context switching going on.  Swapping is negligible, and IO is at "barely there" levels.

# vmstat -S M 2
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0    232   2419    133   2775    0    0     1     5    0    0  6  1 93  0
 0  0    232   2419    133   2775    0    0     0     0  549  619  3  0 97  0
 0  0    232   2419    133   2775    0    0     0   402 1051 1044 18  0 81  0
 0  0    232   2419    133   2775    0    0     0    56  961 1034  9  0 89  1
 0  0    232   2419    133   2775    0    0     1     5    0    0  6  1 93  0
 0  0    232   2419    133   2775    0    0     0     0  549  619  3  0 97  0
 0  0    232   2419    133   2775    0    0     0   402 1051 1044 18  0 81  0
 0  0    232   2419    133   2775    0    0     0    56  961 1034  9  0 89  1
 1  0    232   2419    133   2775    0    0     0     0  898  962 14  0 86  0

ifconfig shows no errors on my public network interface.

I've got shorewall as a firewall management tool.

The app's database is on a separate server, but that server doesn't look loaded either.  Varnish is showing 99% or better cache hit ratios, so not much is hitting my Python app.  Whatever does go to the backend universally returns within 30 seconds according to varnishhist.

Any other things I can look at?  More information you need to help diagnose?


--
Joseph Tate
Personal e-mail: jtate AT dragonstrider DOT com
Web: http://www.dragonstrider.com