I have a Windows NT Service that hosts several WCF end-points listening on TCP on a single application server. An ASP.NET Web API application running on several other web servers, in a load-balanced configuration, connects to these end-points. The system runs perfectly for a few days, and then connections from the web servers to precisely one end-point start failing.
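
For context, each end-point is hosted roughly as in the minimal sketch below. The contract, implementation and Main wrapper are illustrative placeholders; only the port and relative address are taken from the logged URI:

    using System;
    using System.ServiceModel;

    [ServiceContract]                     // hypothetical contract for the cache end-point
    public interface ICacheService
    {
        [OperationContract]
        string GetPreferences(string userName);
    }

    public class CacheService : ICacheService
    {
        public string GetPreferences(string userName) { return "placeholder"; }
    }

    public static class HostingSketch
    {
        public static void Main()
        {
            // In the real system this runs from the NT Service's OnStart, and
            // several hosts/end-points are opened side by side in one process.
            var host = new ServiceHost(typeof(CacheService),
                new Uri("net.tcp://localhost:23000"));

            host.AddServiceEndpoint(typeof(ICacheService),
                new NetTcpBinding(SecurityMode.None),
                "Cache/BlahBlahBlah");    // relative address, as in the failing URI

            host.Open();
            Console.ReadLine();           // the NT Service keeps the host open until OnStop
            host.Close();
        }
    }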

The Web API application logs two exceptions. The salient one appears to be this:

Connecting to via net.tcp://blahblahblah:23000/Cache/BlahBlahBlah timed out after 00:00:00. Connection attempts were made to 0 of 1 available addresses (). Check the RemoteAddress of your channel and verify that the DNS records for this endpoint correspond to valid IP Addresses. The time allotted to this operation may have been a portion of a longer timeout.
EXCEPTION Stack Trace:
   at System.ServiceModel.Channels.SocketConnectionInitiator.CreateTimeoutException(Uri uri, TimeSpan timeout, IPAddress[] addresses, Int32 invalidAddressCount, SocketException innerException)
   at System.ServiceModel.Channels.SocketConnectionInitiator.Connect(Uri uri, TimeSpan timeout)
   at System.ServiceModel.Channels.BufferedConnectionInitiator.Connect(Uri uri, TimeSpan timeout)
   at System.ServiceModel.Channels.ConnectionPoolHelper.EstablishConnection(TimeSpan timeout)

The second one appears to be a consequence of the above:

A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
EXCEPTION Stack Trace:
   at System.Net.Sockets.Socket.Receive(Byte[] buffer, Int32 offset, Int32 size, SocketFlags socketFlags)
   at System.ServiceModel.Channels.SocketConnection.ReadCore(Byte[] buffer, Int32 offset, Int32 size, TimeSpan timeout, Boolean closing)

After the failure begins, features of the web application that make controllers on the web servers connect to the other end-points continue to work without issue; it is always the same end-point that fails. Restarting the Windows NT Service (which stops and restarts all the hosted end-points, not just the one that failed) resolves the problem. The problem then recurs every few days, although it sometimes runs stably for weeks.

I have tried the following to resolve the problem:

  • I have attempted, unsuccessfully, to reproduce the problem in development and U.A.T. environments.
  • I have attempted, unsuccessfully, to reproduce the problem on demand in the production environment - it continues to happen at random but we can't seem to trigger it.
  • I have tried to work around the problem by recycling the ASP.NET application pools on the web servers instead of restarting the Windows NT Service that hosts the problematic end-point. Unfortunately, we botched that test: we mistakenly concluded that the recycle had not resolved the issue and restarted the NT Service anyway, only realising afterwards that the test was invalid. We are waiting for the problem to recur so that we can repeat the experiment.
  • I have ensured that all WCF client proxies and channels created by the Web API controllers (and SignalR hubs) are disposed of correctly and promptly, using the usual close-or-abort pattern (see the first sketch after this list).
  • I have examined the code of the problematic service itself. It is really primitive: a simple Dictionary-based cache of the active users on the site and their associated viewing preferences (roughly the shape of the second sketch after this list). I do not think its implementation is the cause of this problem.
  • I have checked DNS resolution of the hostname in the net.tcp URI on the web servers (along the lines of the third sketch after this list). The problem might be DNS-related, but I do not think it is.
  • I have searched the web and found little helpful information, apart from the two links listed below.
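
The first sketch, referred to from the proxy-disposal bullet above, shows the usual close-or-abort pattern. The ChannelFactory, the ICacheService contract and the address are assumptions for illustration, not the actual client code:

    using System;
    using System.ServiceModel;

    [ServiceContract]                        // hypothetical contract, as in the hosting sketch
    public interface ICacheService
    {
        [OperationContract]
        string GetPreferences(string userName);
    }

    // Close() on success, Abort() on any failure; a faulted channel must never be Closed.
    public static class CacheClient
    {
        private static readonly ChannelFactory<ICacheService> Factory =
            new ChannelFactory<ICacheService>(
                new NetTcpBinding(SecurityMode.None),
                new EndpointAddress("net.tcp://blahblahblah:23000/Cache/BlahBlahBlah"));

        public static TResult Call<TResult>(Func<ICacheService, TResult> operation)
        {
            ICacheService channel = Factory.CreateChannel();
            try
            {
                TResult result = operation(channel);
                ((IClientChannel)channel).Close();   // graceful close returns the pooled connection
                return result;
            }
            catch
            {
                ((IClientChannel)channel).Abort();   // tear the channel down on failure
                throw;
            }
        }
    }

A controller then calls something like CacheClient.Call(c => c.GetPreferences(userName)) per request, rather than holding an open proxy across requests.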
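
The second sketch, referred to from the bullet about the service's implementation, is only a guess at its shape: a singleton service wrapping a Dictionary of active users and their viewing preferences. Names and members are placeholders:

    using System.Collections.Generic;
    using System.ServiceModel;

    // Hypothetical shape of the Dictionary-based cache service. The lock is shown
    // because a plain Dictionary shared across concurrent net.tcp calls needs
    // external synchronisation.
    [ServiceContract]
    [ServiceBehavior(InstanceContextMode = InstanceContextMode.Single,
                     ConcurrencyMode = ConcurrencyMode.Multiple)]
    public class UserPreferenceCache
    {
        private readonly object _sync = new object();
        private readonly Dictionary<string, string> _preferences =
            new Dictionary<string, string>();

        [OperationContract]
        public void SetPreference(string userName, string preference)
        {
            lock (_sync) { _preferences[userName] = preference; }
        }

        [OperationContract]
        public string GetPreference(string userName)
        {
            lock (_sync)
            {
                string value;
                return _preferences.TryGetValue(userName, out value) ? value : null;
            }
        }
    }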
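
The third sketch, referred to from the DNS bullet, is the kind of probe that can be run on a web server while the failure is in progress: it resolves the hostname from the net.tcp URI and then attempts a raw TCP connect to the port, to separate name-resolution problems from listener problems. The hostname below is the placeholder from the logged exception:

    using System;
    using System.Net;
    using System.Net.Sockets;

    public static class EndpointProbe
    {
        public static void Main()
        {
            const string host = "blahblahblah";   // placeholder host from the exception
            const int port = 23000;

            // Step 1: does the name still resolve, and to what?
            IPAddress[] addresses = Dns.GetHostAddresses(host);
            Console.WriteLine("Resolved {0} to: {1}", host,
                string.Join(", ", Array.ConvertAll(addresses, a => a.ToString())));

            // Step 2: is anything accepting TCP connections on the end-point's port?
            using (var client = new TcpClient())
            {
                client.Connect(host, port);       // throws SocketException if nothing is listening
                Console.WriteLine("TCP connect to {0}:{1} succeeded.", host, port);
            }
        }
    }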

I am also currently reading https://support.microsoft.com/en-us/kb/2504602 (and waiting for Operations to tell me whether that hotfix is applicable to the servers in question) and http://forums.iis.net/t/1167668.aspx?net+tcp+listener+adapter+stops+responding (to decide whether it is relevant to my issue).

If you have any insight into this odd behaviour, please help! If you can suggest another line of attack, please do!
