I have a Windows NT Service that hosts several WCF end-points listening on TCP on a single application server. An ASP.NET Web API application running on several other web servers, in a load-balanced configuration, is connecting to these end-points. The system runs perfectly for a few days and then connections from the web servers to precisely one end-point start failing.
The Web API application logs two exceptions. The salient one appears to be this:
Connecting to via net.tcp://blahblahblah:23000/Cache/BlahBlahBlah timed out after 00:00:00. Connection attempts were made to 0 of 1 available addresses (). Check the RemoteAddress of your channel and verify that the DNS records for this endpoint correspond to valid IP Addresses. The time allotted to this operation may have been a portion of a longer timeout.
EXCEPTION Stack Trace:
at System.ServiceModel.Channels.SocketConnectionInitiator.CreateTimeoutException(Uri uri, TimeSpan timeout, IPAddress[] addresses, Int32 invalidAddressCount, SocketException innerException)
at System.ServiceModel.Channels.SocketConnectionInitiator.Connect(Uri uri, TimeSpan timeout)
at System.ServiceModel.Channels.BufferedConnectionInitiator.Connect(Uri uri, TimeSpan timeout)
at System.ServiceModel.Channels.ConnectionPoolHelper.EstablishConnection(TimeSpan timeout)
The second one appears to be a consequence of the above:
A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
EXCEPTION Stack Trace:
at System.Net.Sockets.Socket.Receive(Byte[] buffer, Int32 offset, Int32 size, SocketFlags socketFlags)
at System.ServiceModel.Channels.SocketConnection.ReadCore(Byte[] buffer, Int32 offset, Int32 size, TimeSpan timeout, Boolean closing)
After the failure begins, features of the web application that cause controllers on the web servers to connect to other end-points continue to work without issue. It is always the same end-point that fails. Restarting the Windows NT Service (which stops and restarts all the hosted end-points, not just the one that failed) resolves the problem. The problem usually recurs every few days, although the system sometimes runs stably for weeks.
I have tried the following to resolve the problem:
Dictionary-based cache of the active users on the site and their associated viewing preferences. I do not think its implementation is the cause of this problem.
net.tcp URI on the web servers. The problem might be DNS related but I do not think it is.
I am also currently reading https://support.microsoft.com/en-us/kb/2504602 and waiting for Operations to tell me whether that hot-fix might be applicable on the servers in question, and reading http://forums.iis.net/t/1167668.aspx?net+tcp+listener+adapter+stops+responding to decide whether it is relevant to my issue.
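One way to separate the DNS question from the listener question would be to run a small probe from a web server while the failure is happening. The following is only a minimal sketch in Python, not part of the system described above: probe_endpoint is a hypothetical helper, and the host and port passed to it are placeholders. It resolves the name first and then attempts a plain TCP connect to each resolved address, so a resolution failure and a connect failure show up separately.

```python
import socket
import time


def probe_endpoint(host, port, timeout=5.0):
    """Resolve host, then try a plain TCP connect to each resolved address.

    Hypothetical diagnostic helper. It only checks name resolution and
    whether a socket can be opened; it does not speak the WCF net.tcp
    framing protocol.
    """
    attempts = []
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        # Name resolution itself failed, which would point at DNS.
        return {"resolved": False, "error": str(exc), "attempts": attempts}

    for family, socktype, proto, _canonname, sockaddr in infos:
        started = time.monotonic()
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(timeout)
        try:
            sock.connect(sockaddr)
            outcome = "connected"
        except OSError as exc:
            # Covers timeouts, refused connections, and unreachable hosts.
            outcome = f"failed: {exc}"
        finally:
            sock.close()
        attempts.append({
            "address": sockaddr[0],
            "outcome": outcome,
            "seconds": round(time.monotonic() - started, 3),
        })
    return {"resolved": True, "error": None, "attempts": attempts}
```

Calling probe_endpoint("blahblahblah", 23000) during a failure would show whether the name still resolves and whether the port still accepts sockets. A successful connect would not prove the WCF end-point is healthy, since the listener can accept a socket and still never service it, but a resolution failure or a refused connection would narrow things down.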
If you have any input into this odd behaviour, please help! If you can suggest another line of attack, please do!