TCP要保证在所有可能的情况下使得所有的数据都能够被投递。当你关闭一个socket时,主动关闭一端的socket将进入TIME_WAIT状态,而被动关闭一方则转入CLOSED状态,这的确能够保证所有的数据都被传输。当一个socket关闭的时候,是通过两端互发信息的四次握手过程完成的,当一端调用close()时,就说明本端没有数据再要发送了。这好似看来在握手完成以后,socket就都应该处于关闭CLOSED状态了。但这有两个问题,首先,我们没有任何机制保证最后的一个ACK能够正常传输,第二,网络上仍然有可能有残余的数据包(wandering duplicates),我们也必须能够正常处理。
通过正确的状态机,我们知道双方的关闭过程如下
假设最后一个ACK丢失了,服务器会重发它发送的最后一个FIN,所以客户端必须维持一个状态信息,以便能够重发ACK;如果不维持这种状态,客户端在接收到FIN后将会响应一个RST,服务器端接收到RST后会认为这是一个错误。如果TCP协议能够正常完成必要的操作而终止双方的数据流传输,就必须完全正确的传输四次握手的四个节,不能有任何的丢失。这就是为什么socket在关闭后,仍然处于 TIME_WAIT状态,因为他要等待以便重发ACK。
如果目前连接的通信双方都已经调用了close(),假定双方都到达CLOSED状态,而没有TIME_WAIT状态时,就会出现如下的情况。现在有一个新的连接被建立起来,使用的IP地址与端口与先前的完全相同,后建立的连接又称作是原先连接的一个化身。还假定原先的连接中有数据报残存于网络之中,这样新的连接收到的数据报中有可能是先前连接的数据报。为了防止这一点,TCP不允许从处于TIME_WAIT状态的socket建立一个连接。处于TIME_WAIT状态的socket在等待两倍的MSL时间以后(之所以是两倍的MSL,是由于MSL是一个数据报在网络中单向发出到认定丢失的时间,一个数据报有可能在发送图中或是其响应过程中成为残余数据报,确认一个数据报及其响应的丢弃的需要两倍的MSL),将会转变为CLOSED状态。这就意味着,一个成功建立的连接,必然使得先前网络中残余的数据报都丢失了。
由于TIME_WAIT状态所带来的相关问题,我们可以通过设置SO_LINGER标志来避免socket进入TIME_WAIT状态,这可以通过发送RST而取代正常的TCP四次握手的终止方式。但这并不是一个很好的主意,TIME_WAIT对于我们来说往往是有利的。
客户端与服务器端建立TCP/IP连接后关闭SOCKET后,服务器端连接的端口
状态为TIME_WAIT
是不是所有执行主动关闭的socket都会进入TIME_WAIT状态呢?
有没有什么情况使主动关闭的socket直接进入CLOSED状态呢?
主动关闭的一方在发送最后一个 ack 后
就会进入 TIME_WAIT 状态 停留2MSL(max segment lifetime)时间
这个是TCP/IP必不可少的,也就是“解决”不了的。
也就是TCP/IP设计者本来是这么设计的
主要有两个原因
1。防止上一次连接中的包,迷路后重新出现,影响新连接
(经过2MSL,上一次连接中所有的重复包都会消失)
2。可靠的关闭TCP连接
在主动关闭方发送的最后一个 ack(fin) ,有可能丢失,这时被动方会重新发
fin, 如果这时主动方处于 CLOSED 状态 ,就会响应 rst 而不是 ack。所以
主动方要处于 TIME_WAIT 状态,而不能是 CLOSED 。
TIME_WAIT 并不会占用很大资源的,除非受到攻击。
还有,如果一方 send 或 recv 超时,就会直接进入 CLOSED 状态
socket-faq中的这一段讲的也很好,摘录如下:
2.7. Please explain the TIME_WAIT state.
Remember that TCP guarantees all data transmitted will be delivered,
if at all possible. When you close a socket, the server goes into a
TIME_WAIT state, just to be really really sure that all the data has
gone through. When a socket is closed, both sides agree by sending
messages to each other that they will send no more data. This, it
seemed to me was good enough, and after the handshaking is done, the
socket should be closed. The problem is two-fold. First, there is no
way to be sure that the last ack was communicated successfully.
Second, there may be "wandering duplicates" left on the net that must
be dealt with if they are delivered.
Andrew Gierth (andrew@erlenstar.demon.co.uk) helped to explain the
closing sequence in the following usenet posting:
Assume that a connection is in ESTABLISHED state, and the client is
about to do an orderly release. The client's sequence no. is Sc, and
the server's is Ss. Client Server
====== ======
ESTABLISHED ESTABLISHED
(client closes)
ESTABLISHED ESTABLISHED
------->>
FIN_WAIT_1
<<--------
FIN_WAIT_2 CLOSE_WAIT
<<-------- (server closes)
LAST_ACK
, ------->>
TIME_WAIT CLOSED
(2*msl elapses...)
CLOSED
Note: the +1 on the sequence numbers is because the FIN counts as one
byte of data. (The above diagram is equivalent to fig. 13 from RFC
793).
Now consider what happens if the last of those packets is dropped in
the network. The client has done with the connection; it has no more
data or control info to send, and never will have. But the server does
not know whether the client received all the data correctly; that's
what the last ACK segment is for. Now the server may or may not care
whether the client got the data, but that is not an issue for TCP; TCP
is a reliable rotocol, and must distinguish between an orderly
connection close where all data is transferred, and a connection abort
where data may or may not have been lost.
So, if that last packet is dropped, the server will retransmit it (it
is, after all, an unacknowledged segment) and will expect to see a
suitable ACK segment in reply. If the client went straight to CLOSED,
the only possible response to that retransmit would be a RST, which
would indicate to the server that data had been lost, when in fact it
had not been.
(Bear in mind that the server's FIN segment may, additionally, contain
data.)
DISCLAIMER: This is my interpretation of the RFCs (I have read all the
TCP-related ones I could find), but I have not attempted to examine
implementation source code or trace actual connections in order to
verify it. I am satisfied that the logic is correct, though.
More commentarty from Vic:
The second issue was addressed by Richard Stevens (rstevens@noao.edu,
author of "Unix Network Programming", see ``1.5 Where can I get source
code for the book [book title]?''). I have put together quotes from
some of his postings and email which explain this. I have brought
together paragraphs from different postings, and have made as few
changes as possible.
From Richard Stevens (rstevens@noao.edu):
If the duration of the TIME_WAIT state were just to handle TCP's full-
duplex close, then the time would be much smaller, and it would be
some function of the current RTO (retransmission timeout), not the MSL
(the packet lifetime).
A couple of points about the TIME_WAIT state.
o The end that sends the first FIN goes into the TIME_WAIT state,
because that is the end that sends the final ACK. If the other
end's FIN is lost, or if the final ACK is lost, having the end that
sends the first FIN maintain state about the connection guarantees
that it has enough information to retransmit the final ACK.
o Realize that TCP sequence numbers wrap around after 2**32 bytes
have been transferred. Assume a connection between A.1500 (host A,
port 1500) and B.2000. During the connection one segment is lost
and retransmitted. But the segment is not really lost, it is held
by some intermediate router and then re-injected into the network.
(This is called a "wandering duplicate".) But in the time between
the packet being lost & retransmitted, and then reappearing, the
connection is closed (without any problems) and then another
connection is established between the same host, same port (that
is, A.1500 and B.2000; this is called another "incarnation" of the
connection). But the sequence numbers chosen for the new
incarnation just happen to overlap with the sequence number of the
wandering duplicate that is about to reappear. (This is indeed
possible, given the way sequence numbers are chosen for TCP
connections.) Bingo, you are about to deliver the data from the
wandering duplicate (the previous incarnation of the connection) to
the new incarnation of the connection. To avoid this, you do not
allow the same incarnation of the connection to be reestablished
until the TIME_WAIT state terminates.
Even the TIME_WAIT state doesn't complete solve the second problem,
given what is called TIME_WAIT assassination. RFC 1337 has more
details.
o The reason that the duration of the TIME_WAIT state is 2*MSL is
that the maximum amount of time a packet can wander around a
network is assumed to be MSL seconds. The factor of 2 is for the
round-trip. The recommended value for MSL is 120 seconds, but
Berkeley-derived implementations normally use 30 seconds instead.
This means a TIME_WAIT delay between 1 and 4 minutes. Solaris 2.x
does indeed use the recommended MSL of 120 seconds.
A wandering duplicate is a packet that appeared to be lost and was
retransmitted. But it wasn't really lost ... some router had
problems, held on to the packet for a while (order of seconds, could
be a minute if the TTL is large enough) and then re-injects the packet
back into the network. But by the time it reappears, the application
that sent it originally has already retransmitted the data contained
in that packet.
Because of these potential problems with TIME_WAIT assassinations, one
should not avoid the TIME_WAIT state by setting the SO_LINGER option
to send an RST instead of the normal TCP connection termination
(FIN/ACK/FIN/ACK). The TIME_WAIT state is there for a reason; it's
your friend and it's there to help you :-)
I have a long discussion of just this topic in my just-released
"TCP/IP Illustrated, Volume 3". The TIME_WAIT state is indeed, one of
the most misunderstood features of TCP.
I'm currently rewriting "Unix Network Programming" (see ``1.5 Where
can I get source code for the book [book title]?''). and will include
lots more on this topic, as it is often confusing and misunderstood.
An additional note from Andrew:
Closing a socket: if SO_LINGER has not been called on a socket, then
close() is not supposed to discard data. This is true on SVR4.2 (and,
apparently, on all non-SVR4 systems) but apparently not on SVR4; the
use of either shutdown() or SO_LINGER seems to be required to
guarantee delivery of all data.
-------------------------------------------------------------------------------------------
讨厌的 Socket TIME_WAIT 问题
netstat -n | awk '/^tcp/ {++state[$NF]} END {for(key in state) print key,"\t",state[key]}'
会得到类似下面的结果,具体数字会有所不同:
LAST_ACK 1
SYN_RECV 14
ESTABLISHED 79
FIN_WAIT1 28
FIN_WAIT2 3
CLOSING 5
TIME_WAIT 1669
状态:描述
CLOSED:无连接是活动的或正在进行
LISTEN:服务器在等待进入呼叫
SYN_RECV:一个连接请求已经到达,等待确认
SYN_SENT:应用已经开始,打开一个连接
ESTABLISHED:正常数据传输状态
FIN_WAIT1:应用说它已经完成
FIN_WAIT2:另一边已同意释放
ITMED_WAIT:等待所有分组死掉
CLOSING:两边同时尝试关闭
TIME_WAIT:另一边已初始化一个释放
LAST_ACK:等待所有分组死掉
也就是说,这条命令可以把当前系统的网络连接状态分类汇总。
下面解释一下为啥要这样写:
一个简单的管道符连接了netstat和awk命令。
------------------------------------------------------------------
每个TCP报文在网络内的最长时间,就称为MSL(Maximum Segment Lifetime),它的作用和IP数据包的TTL类似。
RFC793指出,MSL的值是2分钟,但是在实际的实现中,常用的值有以下三种:30秒,1分钟,2分钟。
注意一个问题,进入TIME_WAIT状态的一般情况下是客户端,大多数服务器端一般执行被动关闭,不会进入TIME_WAIT状态,当在服务
器端关闭某个服务再重新启动时,它是会进入TIME_WAIT状态的。
举例:
1.客户端连接服务器的80服务,这时客户端会启用一个本地的端口访问服务器的80,访问完成后关闭此连接,立刻再次访问服务器的
80,这时客户端会启用另一个本地的端口,而不是刚才使用的那个本地端口。原因就是刚才的那个连接还处于TIME_WAIT状态。
2.客户端连接服务器的80服务,这时服务器关闭80端口,立即再次重启80端口的服务,这时可能不会成功启动,原因也是服务器的连
接还处于TIME_WAIT状态。
检查net.ipv4.tcp_tw当前值,将当前的值更改为1分钟:
[root@aaa1 ~]# sysctl -a|grep net.ipv4.tcp_tw
net.ipv4.tcp_tw_reuse = 0
net.ipv4.tcp_tw_recycle = 0
[root@aaa1 ~]#
vi /etc/sysctl
增加或修改net.ipv4.tcp_tw值:
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
使内核参数生效:
[root@aaa1 ~]# sysctl -p
[root@aaa1 ~]# sysctl -a|grep net.ipv4.tcp_tw
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
用netstat再观察正常
这里解决问题的关键是如何能够重复利用time_wait的值,我们可以设置时检查一下time和wait的值
#sysctl -a | grep time | grep wait
net.ipv4.netfilter.ip_conntrack_tcp_timeout_time_wait = 120
net.ipv4.netfilter.ip_conntrack_tcp_timeout_close_wait = 60
net.ipv4.netfilter.ip_conntrack_tcp_timeout_fin_wait = 120
问一下TIME_WAIT有什么问题,是闲置而且内存不回收吗?
是的,这样的现象实际是正常的,有时和访问量大有关,设置这两个参数: reuse是表示是否允许重新应用处于TIME-WAIT状态的
socket用于新的TCP连接; recyse是加速TIME-WAIT sockets回收
Q: 我正在写一个unix server程序,不是daemon,经常需要在命令行上重启它,绝大
多数时候工作正常,但是某些时候会报告"bind: address in use",于是重启失
败。
A: Andrew Gierth
server程序总是应该在调用bind()之前设置SO_REUSEADDR套接字选项。至于
TIME_WAIT状态,你无法避免,那是TCP协议的一部分。
Q: 如何避免等待60秒之后才能重启服务
A: Erik Max Francis
使用setsockopt,比如
--------------------------------------------------------------------------
int option = 1;
if ( setsockopt ( masterSocket, SOL_SOCKET, SO_REUSEADDR, &option,
sizeof( option ) ) < 0 )
{
die( "setsockopt" );
}
--------------------------------------------------------------------------
Q: 编写 TCP/SOCK_STREAM 服务程序时,SO_REUSEADDR到底什么意思?
A: 这个套接字选项通知内核,如果端口忙,但TCP状态位于 TIME_WAIT ,可以重用
端口。如果端口忙,而TCP状态位于其他状态,重用端口时依旧得到一个错误信息,
指明"地址已经使用中"。如果你的服务程序停止后想立即重启,而新套接字依旧
使用同一端口,此时 SO_REUSEADDR 选项非常有用。必须意识到,此时任何非期
望数据到达,都可能导致服务程序反应混乱,不过这只是一种可能,事实上很不
可能。
一个套接字由相关五元组构成,协议、本地地址、本地端口、远程地址、远程端
口。SO_REUSEADDR 仅仅表示可以重用本地本地地址、本地端口,整个相关五元组
还是唯一确定的。所以,重启后的服务程序有可能收到非期望数据。必须慎重使
用 SO_REUSEADDR 选项。
Q: 在客户机/服务器编程中(TCP/SOCK_STREAM),如何理解TCP自动机 TIME_WAIT 状
态?
A: W. Richard Stevens <1999年逝世,享年49岁>
下面我来解释一下 TIME_WAIT 状态,这些在<>
中2.6节解释很清楚了。
MSL(最大分段生存期)指明TCP报文在Internet上最长生存时间,每个具体的TCP实现
都必须选择一个确定的MSL值。RFC 1122建议是2分钟,但BSD传统实现采用了30秒。
TIME_WAIT 状态最大保持时间是2 * MSL,也就是1-4分钟。
IP头部有一个TTL,最大值255。尽管TTL的单位不是秒(根本和时间无关),我们仍需
假设,TTL为255的TCP报文在Internet上生存时间不能超过MSL。
TCP报文在传送过程中可能因为路由故障被迫缓冲延迟、选择非最优路径等等,结果
发送方TCP机制开始超时重传。前一个TCP报文可以称为"漫游TCP重复报文",后一个
TCP报文可以称为"超时重传TCP重复报文",作为面向连接的可靠协议,TCP实现必须
正确处理这种重复报文,因为二者可能最终都到达。
一个通常的TCP连接终止可以用图描述如下:
client server
FIN M
close -----------------> (被动关闭)
ACK M+1
<-----------------
FIN N
<----------------- close
ACK N+1
----------------->
为什么需要 TIME_WAIT 状态?
假设最终的ACK丢失,server将重发FIN,client必须维护TCP状态信息以便可以重发
最终的ACK,否则会发送RST,结果server认为发生错误。TCP实现必须可靠地终止连
接的两个方向(全双工关闭),client必须进入 TIME_WAIT 状态,因为client可能面
临重发最终ACK的情形。
{
scz 2001-08-31 13:28
先调用close()的一方会进入TIME_WAIT状态
}
此外,考虑一种情况,TCP实现可能面临先后两个同样的相关五元组。如果前一个连
接处在 TIME_WAIT 状态,而允许另一个拥有相同相关五元组的连接出现,可能处理
TCP报文时,两个连接互相干扰。使用 SO_REUSEADDR 选项就需要考虑这种情况。
为什么 TIME_WAIT 状态需要保持 2MSL 这么长的时间?
如果 TIME_WAIT 状态保持时间不足够长(比如小于2MSL),第一个连接就正常终止了。
第二个拥有相同相关五元组的连接出现,而第一个连接的重复报文到达,干扰了第二
个连接。TCP实现必须防止某个连接的重复报文在连接终止后出现,所以让TIME_WAIT
状态保持时间足够长(2MSL),连接相应方向上的TCP报文要么完全响应完毕,要么被
丢弃。建立第二个连接的时候,不会混淆。
A: 小四
在Solaris 7下有内核参数对应 TIME_WAIT 状态保持时间
# ndd -get /dev/tcp tcp_time_wait_interval
240000
# ndd -set /dev/tcp tcp_time_wait_interval 1000
缺省设置是240000ms,也就是4分钟。如果用ndd修改这个值,最小只能设置到1000ms,
也就是1秒。显然内核做了限制,需要Kernel Hacking。
# echo "tcp_param_arr/W 0t0" | adb -kw /dev/ksyms /dev/mem
physmem 3b72
tcp_param_arr: 0x3e8 = 0x0
# ndd -set /dev/tcp tcp_time_wait_interval 0
我不知道这样做有什么灾难性后果,参看<>的声明。
Q: TIME_WAIT 状态保持时间为0会有什么灾难性后果?在普遍的现实应用中,好象也
就是服务器不稳定点,不见得有什么灾难性后果吧?
D: rain@bbs.whnet.edu.cn
Linux 内核源码 /usr/src/linux/include/net/tcp.h 中
#define TCP_TIMEWAIT_LEN (60*HZ) /* how long to wait to successfully
* close the socket, about 60 seconds */
最好不要改为0,改成1。端口分配是从上一次分配的端口号+1开始分配的,所以一般
不会有什么问题。端口分配算法在tcp_ipv4.c中tcp_v4_get_port中。