Solaris 2.x - tuning your TCP/IP stack and more

Source: http://www.rvs.uni-hannover.de/people/voeckler/tune

1. History and introduction
2. TCP connection initiation
3. Retransmission related parameters
4. Path MTU discovery
5. Further advice, hints and remarks
6. Windows, buffers and watermarks
7. Tuning your system
8. Recommended patches
9. Related books and software
10. Uncovered material
11. Startup scripts

1. History and introduction

This page and the related work have a long history. I started out peeking wide-eyed over the shoulders of two people from a search engine provider while they were installing the German server of a customer of my former employer. My only other source of tuning information was the brilliant book TCP/IP Illustrated, Volume 1 by Stevens. I then started gathering all the information about tuning I could get my hands on. The accumulated results are what you find on these pages.

Solaris allows you to tune, tweak, set and reset various parameters related to the TCP/IP stack while the system is running. Back in the SunOS 4.x days, one had to change various C files in the kernel source tree, generate a new kernel, reboot the machine and try out the changes. The Solaris feature of changing the important parameters on the fly is very convenient.

Many of the parameters mentioned in the rest of this document are time intervals. All intervals are measured in milliseconds. Most other parameters are byte counts; where a different unit of measurement is used, it is documented. A few items appear totally unrelated to TCP/IP, but for lack of a better framework they ended up on this page.

Most tunings can be achieved using the program ndd. Any user may execute this program to read the current settings, depending on the readability of the respective device files. But only the super user is allowed to execute ndd -set to change values. This makes sense considering the sensitive parameters you are tuning. Details on the use of ndd can be obtained from the respective manual page.

ndd /dev/tcp \?                  # show all parameter keys
ndd /dev/tcp tcp_mss_def         # show the value to this key
ndd -set /dev/ip ip_forwarding 0 # switch off forwarding

All keys starting with ip_ have to be used with the pseudo device /dev/ip. The same holds for keys starting with tcp_ and the pseudo device /dev/tcp, and so on. Andres Kroonmaa kindly supplied a nifty script to check all existing values for a network component (tcp, udp, ip, icmp, etc.). Usually I do the same thing using Perl.
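
The prefix rule can be sketched in a few lines of Python. This is only an illustration: the helper names ndd_device, ndd_get_command and ndd_set_command are made up for this example, and the list of known prefixes is limited to the components named above. The functions merely build the argument list; actually running ndd -set still requires super user rights on a Solaris host.

```python
KNOWN_PREFIXES = ("tcp", "udp", "ip", "icmp")  # components named in the text


def ndd_device(key):
    """Return the pseudo device for a parameter key, judged by its prefix."""
    prefix = key.split("_", 1)[0]
    if prefix not in KNOWN_PREFIXES:
        raise ValueError("unknown parameter prefix: " + key)
    return "/dev/" + prefix


def ndd_get_command(key):
    """Argument list for reading a value, e.g. ndd /dev/tcp tcp_mss_def."""
    return ["ndd", ndd_device(key), key]


def ndd_set_command(key, value):
    """Argument list for changing a value (super user only)."""
    return ["ndd", "-set", ndd_device(key), key, str(value)]


if __name__ == "__main__":
    print(" ".join(ndd_get_command("tcp_mss_def")))
    print(" ".join(ndd_set_command("ip_forwarding", 0)))
```

The resulting lists could be handed to subprocess.run() on a Solaris machine; keeping command construction separate from execution makes it easy to review what would be changed before changing it.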

2. TCP connection initiation

This section is dedicated exclusively to the various queues and tunable variable(s) used during connection instantiation. The socket API maintains some control over the queues. But in order to tune anything, you have to understand how listen and accept interact with the queues. For details, see the various Stevens books mentioned in the literature section.

When the server calls listen, the kernel moves the socket from the TCP state CLOSED into the state LISTEN, thus doing a passive open. All TCP servers work like this. Also, the kernel creates and initializes various data structures, among them the socket buffers and two queues:

incomplete connection queue: This queue contains an entry for every SYN that has arrived. BSD sources assign so_q0len entries to this queue. The server sends off the ACK of the client's SYN together with the server side SYN. The connection gets queued and the kernel now awaits the completion of the TCP three way handshake to open the connection. The socket is in the SYN_RCVD state. The entry stays in this queue for about one round trip time (RTT), until the client's ACK to the server's SYN arrives; the kernel then moves the entry into the completed connection queue.
completed connection queue: This queue contains an entry for each connection for which the three way handshake is completed. The socket is in the ESTABLISHED state. Each call to accept() removes the front entry of the queue. If there are no entries in the queue, the call to accept usually blocks. BSD sources assign a length of so_qlen to this queue.

Both queues are limited in their number of entries. By calling listen(), the server can specify the size of the second queue for completed connections. If the server is for whatever reason unable to remove entries from the completed connection queue, the kernel is not supposed to queue any more connections. A timeout is associated with each received and queued SYN segment. If the server never receives an acknowledgement for a queued SYN segment (TCP state SYN_RCVD), the timer runs out and the connection is thrown away. The timeout is an important defence against SYN flood attacks.


Figure 1: Queues maintained for listening sockets.
Figure 2: TCP three way handshake, connection initiation.

Historically, the argument to the listen function specified the maximum number of entries for the sum of both queues. Many BSD derived implementations multiply the argument by a fudge factor of 3/2. Solaris <= 2.5.1 does not use the fudge factor, but adds 1, while Solaris 2.6 does use the fudge factor, though with a slightly different rounding mechanism than the one BSD uses. With a backlog argument of 14, a Solaris 2.5.1 server can queue 15 connections, and a Solaris 2.6 server can queue 22 connections.
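
The two Solaris rules can be written down as a small Python sketch. The function names are made up for this example. The formula for Solaris 2.6 is an assumption: the text only says that the fudge factor is applied with a different rounding, so the sketch uses integer arithmetic plus one, which reproduces the figure of 22 for a backlog of 14 but is not confirmed for other values.

```python
def queue_limit_solaris_251(backlog):
    """Solaris <= 2.5.1: no fudge factor, the kernel just adds 1."""
    return backlog + 1


def queue_limit_solaris_26(backlog):
    """Solaris 2.6: fudge factor 3/2.

    ASSUMPTION: integer division plus one. This matches the single data
    point given in the text (14 -> 22); the exact rounding is not stated.
    """
    return (3 * backlog) // 2 + 1


if __name__ == "__main__":
    for backlog in (5, 14, 128):
        print(backlog, queue_limit_solaris_251(backlog), queue_limit_solaris_26(backlog))
```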

Stevens shows that on busy servers the incomplete connection queue needs more entries than the completed connection queue. The only reason for specifying a large backlog value is to enable the incomplete connection queue to grow as SYNs arrive from clients. Stevens shows that a moderately busy webserver has an empty completed connection queue 99 % of the time, whereas the incomplete connection queue needed 15 or fewer entries 98 % of the time! Just try to imagine what this would mean for a really busy webcache like Squid.

Data for an established connection which arrives before the connection is accept()ed should be stored in the socket buffer. If the queues are full when a SYN arrives, the SYN is dropped in the hope that the client will resend it and find room in the queues then.
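
The behaviour of the completed connection queue can be observed on any host with a loopback interface. The following Python sketch is an illustration only, not something specific to Solaris: the function name data_before_accept, the backlog of 5 and the timeouts are arbitrary choices for this example. The client connects and sends before the server ever calls accept(), and the data is still there afterwards.

```python
import socket


def data_before_accept(payload=b"hello"):
    """Connect and send over loopback before the server calls accept()."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        server.settimeout(5.0)
        client.settimeout(5.0)
        server.bind(("127.0.0.1", 0))          # port 0: kernel picks a free port
        server.listen(5)                       # passive open with a small backlog
        client.connect(server.getsockname())   # handshake is completed by the kernel
        client.sendall(payload)                # server has not called accept() yet
        conn, _ = server.accept()              # takes the front entry of the completed queue
        try:
            conn.settimeout(5.0)
            received = b""
            while len(received) < len(payload):
                chunk = conn.recv(4096)        # data waited in the socket buffer
                if not chunk:
                    break
                received += chunk
            return received
        finally:
            conn.close()
    finally:
        client.close()
        server.close()


if __name__ == "__main__":
    print(data_before_accept())
```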

According to Cockroft, there was only one listen queue in unpatched Solaris releases <= 2.5.1. Solaris >= 2.6, or a release with TCP patch 103582-12 or above applied, splits the single queue into the two shown in figure 1. The system administrator is allowed to tweak and tune the various maxima of the queue or queues with Solaris. Depending on whether there are one or two queues, there are different sets of tweakable parameters.

The old semantics contained just one tunable parameter, tcp_conn_req_max, which specified the maximum accepted value of the backlog argument to listen(). The patched versions and Solaris 2.6 replaced this parameter with the two new parameters tcp_conn_req_max_q0 and tcp_conn_req_max_q. A SunWorld article on 2.6 by Adrian Cockroft says the following about the new parameters:

tcp_conn_req_max [is] replaced. This value is well-known as it normally needs to be increased for Web servers in older releases of Solaris 2. It no longer exists in Solaris 2.6, and patch 103582-12 adds this feature to Solaris 2.5.1. The change is part of a fix that prevents denial of service from SYN flood attacks. There are now two separate queues of partially complete connections instead of one.
tcp_conn_req_max_q0 is the maximum number of connections with handshake incomplete. A SYN flood attack could only affect this queue, and a special algorithm makes sure that valid connections can still get through.
tcp_conn_req_max_q is the maximum number of completed connections waiting to return from an accept call as soon as the right process gets some CPU time.

In other words, the first parameter specifies the size of the incomplete connection queue while the second parameter sets the maximum length of the completed connection queue. All three parameters are covered below.

You can determine whether you need to tweak this set of parameters by watching the output of netstat -sP tcp. Look for the value of tcpListenDrop, if available on your version of Solaris. Older versions don't have this counter. Any non-zero value might indicate something wrong with your server, but then again, killing a busy server (like Squid) shuts down its listening socket and might increase this counter (and others). If you get many drops, you might need to increase the appropriate parameter. Since connections can also be dropped because listen() was called with too small an argument, you have to be careful interpreting the counter value. On old versions, a SYN flood attack might also increase this counter.

Newer or patched versions of Solaris, with both queues available, also have the additional counters tcpListenDropQ0 and tcpHalfOpenDrop. There, the original counter tcpListenDrop counts only connections dropped from the completed connection queue, and the counter ending in Q0 counts the drops from the incomplete connection queue. Killing a busy server application might increase either or both counters. If tcpHalfOpenDrop shows non-zero values, your server was likely the victim of a SYN flood. The counter is only incremented for dropping noxious connection attempts. I do not know whether those also show up in the Q0 counter.
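
Watching these counters by hand gets tedious, so here is a small Python sketch that pulls them out of the netstat text. It assumes the usual "name = value" layout of netstat -s output. The SAMPLE text and its numbers are invented for illustration and are not measurements; the function names are likewise made up for this example.

```python
import re

# One "name = value" pair; several pairs may share a line.
COUNTER = re.compile(r"(\w+)\s*=\s*(-?\d+)")

DROP_COUNTERS = ("tcpListenDrop", "tcpListenDropQ0", "tcpHalfOpenDrop")


def parse_counters(text):
    """Turn netstat -s style text into a dict of integer counters."""
    return {name: int(value) for name, value in COUNTER.findall(text)}


def listen_drop_summary(counters):
    """Keep only the drop counters that this Solaris version provides."""
    return {name: counters[name] for name in DROP_COUNTERS if name in counters}


# Invented sample in the assumed layout, for illustration only.
SAMPLE = """\
TCP     tcpListenDrop       =    12     tcpListenDropQ0     =     0
        tcpHalfOpenDrop     =     0
"""

if __name__ == "__main__":
    print(listen_drop_summary(parse_counters(SAMPLE)))
```

On a real system the text would come from running netstat -sP tcp, for instance through subprocess.run() with capture_output=True. Comparing two snapshots taken some time apart says more than a single absolute value, because the counters only ever grow.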

tcp_conn_req_max: default 8 (max. 32), since 2.5 32 (max. 1024), recommended 128 <= x <= 1024
tcp_conn_req_max_q0: since 2.5.1 with patches 103630-09 and 103582-12 or above applied: default 1024;
tcp_conn_req_max_q: since 2.5.1 with patches 103630-09 and 103582-12 or above applied: default 128;
tcp_conn_req_min: Since 2.6: default 1, recommended: don't touch

3. Retransmission related parameters

The retransmission timeout values used by Solaris are way too aggressive for wide area networks, although they can be considered appropriate for local area networks. SUN thus did not follow the suggestions mentioned in RFC 1122. Newer releases of the Solaris kernel correct the values in question:

The recommended upper and lower bounds on the RTO are known to be inadequate on large internets. The lower bound SHOULD be measured in fractions of a second (to accommodate high speed LANs) and the upper bound should be 2*MSL, i.e., 240 seconds.

Besides the retransmit timeout (RTO) value, two further parameters R1 and R2 may be of interest. These don't seem to be tunable via any interface Solaris offers that I know of.

The value of R1 SHOULD correspond to at least 3 retransmissions, at the current RTO. The value of R2 SHOULD correspond to at least 100 seconds.
[...]
However, the values of R1 and R2 may be different for SYN and data segments. In particular, R2 for a SYN segment MUST be set large enough to provide retransmission of the segment for at least 3 minutes. The application can close the connection (i.e., give up on the open attempt) sooner, of course.

A great many internet servers running Solaris retransmit segments unnecessarily often. The current condition of European networks indicates that a connection to the US may take up to 2 seconds. All parameters mentioned in the first part of this section relate to each other!

As a starter, take this little example. Consider a picture of 1440 bytes, LZW compressed, which is to be transferred over a serial link with 14400 bps using an MTU of 1500. In the ideal case only one PDU gets transmitted. The ACK segment can only be sent after the complete PDU is received. The transmission takes about 1 second. These values seem low, but they are meant as 'food for thought'. Now consider something going awry...
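
For reference, the arithmetic behind the one-second figure can be sketched as follows. The assumptions are mine: 40 bytes of IP and TCP headers without options, no link layer framing overhead, and either 8 bits per byte on the wire or 10 bits when start and stop bits of an asynchronous line are counted. The function name is made up for this example.

```python
def transfer_seconds(payload_bytes, line_bps, header_bytes=40, bits_per_byte=8):
    """Time to clock one PDU onto a serial line.

    header_bytes: 20 bytes IP plus 20 bytes TCP, no options (assumption).
    bits_per_byte: 8 for raw bits, 10 with start and stop bit (assumption).
    """
    return (payload_bytes + header_bytes) * bits_per_byte / line_bps


if __name__ == "__main__":
    # 1440 + 40 = 1480 bytes, which fits into a single PDU with an MTU of 1500.
    print(round(transfer_seconds(1440, 14400), 2))
    print(round(transfer_seconds(1440, 14400, bits_per_byte=10), 2))
```

Both variants land in the neighbourhood of one second, which is already a good part of a retransmission interval of a few hundred milliseconds to a few seconds.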

Solaris 2.5.1 behaves strangely if the initial SYN segment from the host doing the active open is lost. The initial SYN gets retransmitted only after a period of 4 * tcp_rexmit_interval_initial plus a constant C. The time is 12 seconds with the default settings. More information is being prepared on the retransmission test page.

The initial lost SYN may or may not be of importance in your environment. For instance, if you are connected via ATM SVCs, the initial PDU might initiate a logical connection (ATM works point to point) in less than 0.3 seconds, but will still be lost in the process. It is rather annoying for a user of 2.5.1 to wait 12 seconds until something happens.

tcp_rexmit_interval_initial: default 500, since 2.5.1 3000, recommended >= 2000 (500 for special purposes)
tcp_rexmit_interval_min: default 200, recommended >= 1000 (200 for special purposes)
tcp_ip_abort_interval: default 120000, since 2.5 480000, recommended 600000
tcp_ip_abort_cinterval: default 240000, since 2.5 180000, recommended ?
tcp_rexmit_interval_max: default 60000, RFC 1122 recommends 240000 (2MSL), recommended 1...2 * tcp_close_wait_interval
tcp_deferred_ack_interval: default 50, BSD 200, recommended 200 or 500
tcp_deferred_acks_max: Since 2.6: default 8, recommended ?

Good values for retransmission tuning don't appear out of thin air. Rather, you should carefully plan an experiment to get decent values. Intervals from one site cannot be carried over unchanged to another Solaris system. But they might give you an idea where to start when choosing your own values.

The next part looks at a few parameters having to do with retransmissions, as well.

tcp_slow_start_initial: since 2.5.1 with patch 103582-15 applied: default 1
tcp_dupack_fast_retransmit: default 3

4. Path MTU discovery

Whenever a connection is about to be established, that is during the three-way handshake, the segment size used will be set to the minimum of (a) the MSS derived from the smallest MTU of an outgoing interface, and (b) the MSS announced by the peer. If the remote peer does not announce an MSS, usually the value 536 will be assumed. If path MTU discovery is active, all outgoing PDUs have the DF (don't fragment) bit set in the IP header.
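
The selection rule fits into a few lines of Python. This is a sketch under stated assumptions: the interface MTU is first reduced by 40 bytes of IPv4 and TCP headers to obtain a local MSS, and a missing announcement from the peer is replaced by 536. The names segment_size, DEFAULT_MSS and IP_TCP_HEADERS are made up for this example.

```python
DEFAULT_MSS = 536      # assumed when the peer announces no MSS
IP_TCP_HEADERS = 40    # 20 bytes IPv4 header plus 20 bytes TCP header, no options


def segment_size(outgoing_mtu, peer_mss=None):
    """Segment size chosen at connection setup.

    outgoing_mtu: MTU of the outgoing interface in bytes.
    peer_mss: MSS announced by the peer, or None if it announced none.
    """
    local_mss = outgoing_mtu - IP_TCP_HEADERS
    remote_mss = DEFAULT_MSS if peer_mss is None else peer_mss
    return min(local_mss, remote_mss)


if __name__ == "__main__":
    print(segment_size(1500, 1460))   # two ethernet hosts
    print(segment_size(9180, 1460))   # ATM classical IP talking to ethernet
    print(segment_size(1500))         # peer announced nothing
```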

If the ICMP error message fragmentation needed is received, a router on the way to the destination needed to fragment the PDU, but was not allowed to do so. Therefore the router discarded the PDU and sent back the ICMP error. Newer router implementations include the MTU of the next hop in the error message. If it is not included, the correct size must be determined by a trial and error algorithm.

Due to the internet being a packet switching network, the route a PDU travels along a TCP virtual circuit may change with time. For this reason RFC 1191 recommends rediscovering the path MTU of an active connection after 10 minutes. Improvements of the route can only be noticed by repeated rediscoveries. Unfortunately, Solaris aggressively tries to rediscover the path MTU every 30 seconds. While this is o.k. for LAN environments, it is grossly impolite behaviour in WANs. Since routes may not change that often, aggressive repetition of path MTU discovery leads to unnecessary consumption of channel capacity and elongated service times.

Path MTU discovery is a far reaching and controversial topic when discussing it with local ISPs. But keep in mind that the discovery is at the foundation of IPv6. The PSC tuning page argues in favour of path MTU discovery, especially if you maintain a high-speed or long-delay (e.g. satellite) link.

The recommendation I can give you is not to use the defaults of Solaris < 2.5. Please use path MTU discovery, but tune your system to conform to the RFC. You may alternatively want to switch off path MTU discovery altogether, though there are few situations where this is necessary.

I was made aware of the fact that in certain circumstances bridges connecting data link layers of differing MTU sizes defeat pMTU discovery. I have to put some more investigation into this matter. If a frame with maximum MTU size is to be transported into the network with the smaller MTU size, it is truncated silently. A bridge does not know anything about the upper protocol levels: A bridge neither fragments IP nor sends an ICMP error.

There may be work-arounds, and the tcp_mss_def is one of them. Setting all interfaces to the minimum shared MTU might help, at the cost of losing performance on the larger MTU network. Using what RFC 1122 calls an IP gateway is a possible, yet expensive solution.

ip_ire_pathmtu_interval: default 30000, since 2.5 600000, recommended 600000
ip_path_mtu_discovery: default 1, recommended 1
tcp_ignore_path_mtu: default 0, recommended 0
tcp_mss_def: default 536, recommended >= 536

5. Further advice, hints and remarks

This section covers a variety of topics, starting with various TCP timers which do not relate to previously mentioned issues. The next subsection throws a quick glance at some erratic behaviour. The final section looks at a variety of parameters which deal with the reservation of resources.

5.1 Common TCP timers

The current subsection covers three important TCP timers. First I will have a look at the keepalive timer. The timer is rather controversial, and some Solaris releases implement it incorrectly. The next parameter limits the twice maximum segment lifetime (2MSL) value, which is connected to the time a socket spends in the TCP state TIME_WAIT. The final entry looks at the time spent in the TCP state FIN_WAIT_2.

tcp_keepalive_interval: default 7200000, recommended 0 <= x <= oo
tcp_close_wait_interval: default 240000 (according to RFC 1122, 2MSL), recommended 60000, possibly lower
tcp_fin_wait_2_flush_interval: BSD 675000, default 675000, recommended 67500 (one zero less)

Sometimes, a Squid running on Solaris (2.5.1) confuses the system utterly. A large and varying number of connections are in CLOSE_WAIT, for reasons beyond me. During this phase the proxy is virtually unreachable for HTTP requests though, obnoxiously, it still answers ICP requests. Although lowering the value for tcp_close_wait_interval only fixes symptoms indirectly, not the cause, it may help to overcome those periods of erratic behaviour faster than the default does. What is really needed is some means to influence the CLOSE_WAIT interval directly.

5.2 Erratic IPX behaviour

I noticed that Solaris releases < 2.6 behave erratically under some conditions if the IPX ethernet MTU of 1500 is used. Maybe there is an error in the frame assembly algorithm. If you limit yourself to the IEEE 802.3 MTU of 1492 bytes, the problem does not seem to appear. A sample startup script with a link in /etc/rc2.d can be used to change the MTU of ethernet interfaces after their initialization. Remember to set the MTU for every virtual interface, too!

Note that with a patched Solaris 2.5.1 or Solaris 2.6, the problem does not seem to appear. Limiting your MTU to a non-standard value might introduce problems with truncated PDUs in certain (admittedly very special) environments. Thus you may want to refrain from using the above mentioned script (always called the second script in this document).

Additionally, I strongly suggest the use of a file /etc/init.d/your-tune (always called the first script) which changes the tunable parameters. /etc/rcS.d/S31your-tune is a hardlink to this file. The script will be executed during bootup when the system is in single user mode. A kill script is not necessary. The section about startup scripts below covers this topic in greater depth.

5.3 Common TCP parameters

The following parameters have little impact on performance, nevertheless I reckon them worth noting here:

ip_forward_src_routed: default 1, recommended 0
ip_icmp_err_interval: default 500, recommended 0
udp_smallest_anon_port
tcp_smallest_anon_port: default 32768, recommended 8192
udp_largest_anon_port: default 65535, recommended: see text

6. Windows, buffers and watermarks

This section is about windows, buffers and watermarks. It is still work in progress. The explanations available to me were very confusing (sigh), though the new Stevens helped to clear up a few things. If you have corrections to this section, please let me know and contribute to an update of the page. Many readers will thank you!

Figure 4: buffers and related issues

Here is just a short trip through the network layers in order to explain what happens where. Your application is able to send almost any size of data to the transport layer. The transport layer is either UDP or TCP. The socket buffers are implemented on the transport layer. Depending on your choice of transport protocol, different actions are taken on this level.

TCP: All application data is copied into the socket buffer. If there is insufficient space, the application will be put to sleep. From the socket buffer, TCP will create segments. No chunk exceeds the MSS.
UDP: The socket buffer size of UDP is simply the maximum size of datagram UDP is able to transmit. Larger datagrams ought to elicit the EMSGSIZE error response from the socket layer. With UDP implementing an unreliable service, there is no need to keep the datagram in the socket buffer.

The IP layer needs to fragment chunks which are too large. Among the reasons TCP prechunks its segments is the need to avoid fragmentation. IP searches the routing tables for the appropriate interface in order to determine the fragment size and interface.

If the output queue of the datalink layer interface is full, the datagram will be discarded and an error will be returned to IP and back to the transport layer. If the transport protocol was TCP, TCP will try to resend the segment at a later time. UDP should return the ENOBUFS error, but some implementations don't.

To determine the MTU sizes, use the ifconfig -a command. The MTUs are needed for some calculations done later in this section. With IPv4 you can determine the MSS from the interface MTU by subtracting 20 bytes for the TCP header and 20 bytes for the IP header. Keep this in mind, as the calculation will be repeatedly necessary in the text following below.

$ ifconfig -a
lo0: flags=849 mtu 8232
        inet 127.0.0.1 netmask ff000000 
el0: flags=863 mtu 1500
        inet 130.75.215.xxx netmask ffffff00 broadcast 130.75.215.255
        ether xx:xx:xx:xx:xx:xx
hme0: flags=863 mtu 1500
        inet 130.75.5.xxx netmask ffffff00 broadcast 130.75.5.255
qaa0: flags=863 mtu 9180
        inet 130.75.214.xxx netmask ffffff00 broadcast 130.75.214.255
        ether xx:xx:xx:xx:xx:xx
fa0: flags=842 mtu 9188
        inet 0.0.0.0 netmask 0 
        ether xx:xx:xx:xx:xx:xx

I removed the uninteresting things. hme0 is the regular 100 Mbps ethernet interface. The 10 Mbps ethernet interface is called le0. el0 is the ATM LAN emulation (lane) interface. qaa0 is the ATM classical IP (clip) interface. fa0 is the interface that supports Fore's proprietary implementation of native ATM. Fore is the vendor of the installed ATM card. AFAIK you can use this interface to build PVCs or, if you are also using Fore switches, SVCs. You see an unconfigured interface there.
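
Applied to the listing above, the MTU to MSS calculation can be sketched in Python as shown below. The parser assumes that each interface starts a line of the form "name: flags=... mtu N", as in the listing; the function name and the abbreviated LISTING constant are only for this example.

```python
import re

# Interface header line as printed by ifconfig -a in the listing above.
IFLINE = re.compile(r"^(\w+): flags=\S+ mtu (\d+)", re.MULTILINE)


def mss_by_interface(ifconfig_output, header_bytes=40):
    """Map interface name to MSS, that is MTU minus IPv4 and TCP headers."""
    return {name: int(mtu) - header_bytes
            for name, mtu in IFLINE.findall(ifconfig_output)}


# Abbreviated copy of the listing above.
LISTING = """\
lo0: flags=849 mtu 8232
        inet 127.0.0.1 netmask ff000000
hme0: flags=863 mtu 1500
        inet 130.75.5.xxx netmask ffffff00 broadcast 130.75.5.255
qaa0: flags=863 mtu 9180
        inet 130.75.214.xxx netmask ffffff00 broadcast 130.75.214.255
"""

if __name__ == "__main__":
    for name, mss in mss_by_interface(LISTING).items():
        print(name, mss)
```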

The buffer sizes for sending and receiving TCP segments and UDP datagrams can be tuned with Solaris. With the help of the netstat command you can obtain output similar to, but not identical with, the following. The data was obtained on a server which runs a Squid with five dnsserver children. Since the interprocess communication is accomplished via localhost sockets, you see both the client side and the server side of each dnsserver child socket.

$ netstat -f inet

 TCP
   Local Address        Remote Address    Swind Send-Q Rwind Recv-Q  State
-------------------- -------------------- ----- ------ ----- ------ -------
blau-clip.ssh        challenger-clip.1023 57344     19 63980      0 ESTABLISHED
localhost.38437      localhost.38436      57344      0 57344      0 ESTABLISHED
localhost.38436      localhost.38437      57344      0 57344      0 ESTABLISHED
localhost.38439      localhost.38438      57344      0 57344      0 ESTABLISHED
localhost.38438      localhost.38439      57344      0 57344      0 ESTABLISHED
localhost.38441      localhost.38440      57344      0 57344      0 ESTABLISHED
localhost.38440      localhost.38441      57344      0 57344      0 ESTABLISHED
localhost.38443      localhost.38442      57344      0 57344      0 ESTABLISHED
localhost.38442      localhost.38443      57344      0 57344      0 ESTABLISHED
localhost.38445      localhost.38444      57344      0 57344      0 ESTABLISHED
localhost.38444      localhost.38445      57344      0 57344      0 ESTABLISHED

The columns titled Swind and Rwind contain the sizes of the respective send and receive windows, based on the free space available in the receive buffer at each peer. The Swind column contains the offered window size as reported by the remote peer. The Rwind column displays the advertised window size being transmitted to the remote peer.

An application can change the size of the socket layer buffers with calls to setsockopt with the parameter SO_SNDBUF or SO_RCVBUF. Windows and buffers are not interchangeable. Just remember: The buffers have a fixed size, unless you use setsockopt to change it. Windows on the other hand depend on the free space available in the input buffer. The minimum and maximum requirements for buffer sizes are tunable watermarks.
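
A minimal Python sketch of such a call follows. The socket module passes the request straight to setsockopt; the function name set_buffers and the size of 16384 bytes are just choices for this example. Note that the value read back need not equal the value requested: some kernels round it or, like Linux, report a larger number, and the system-wide maxima still apply.

```python
import socket


def set_buffers(sock, send_bytes, recv_bytes):
    """Request socket buffer sizes and return what the kernel reports back."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, send_bytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, recv_bytes)
    return (sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
            sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Set the sizes before connect() or listen().
        print(set_buffers(s, 16384, 16384))
    finally:
        s.close()
```

Setting the sizes before the connection is established matters for the receive side in particular, because the window advertised during the handshake is derived from the receive buffer.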

Figure 5: buffers, watermarks and window sizes.

Figure 5 shows the relation of the different buffers, windows and watermarks. I decided to let the send buffer grow from the maximum towards zero, which is just a way of showing things, and does probably not represent the real implementation. I left out the different socket options as the picture is confusing enough.

receive high watermark: *_recv_hiwat; related: SO_RCVBUF
minimum number of maximum sized segments: tcp_recv_hiwat_minmss (TCP only)
receive low watermark: tcp_recv_lowat (TCP only); related: select, poll, SO_RCVLOWAT
transmit high watermark: *_xmit_hiwat; related: SO_SNDBUF
transmit low watermark: *_xmit_lowat; related: select, poll, SO_SNDLOWAT
offered window (TCP only): Swind in netstat; to_send := MIN( cwnd, offered window )
advertised window (TCP only): Rwind in netstat
cwnd (TCP only): congestion window; related: congestion avoidance, slow start
congestion window maximum size: tcp_cwnd_max (TCP only)

Squid users should note the following behaviour seen with Solaris 2.6. The default socket buffer sizes detected during the configuration phase are representative of the values for tcp_recv_hiwat, udp_recv_hiwat, tcp_xmit_hiwat and udp_xmit_hiwat. Also note that enabling the hit object feature still limits hit object size to 16384 bytes, regardless of what your system is able to achieve.

Output from the Squid 1.1.19 configuration script on a Solaris 2.6 host with the previously mentioned parameters all set to 64000. Please mind that these values are not optimal sizes in most environments:
checking Default UDP send buffer size... 64000
checking Default UDP receive buffer size... 64000
checking Default TCP send buffer size... 64000
checking Default TCP receive buffer size... 64000

Buffers and windows are very important if you link via satellite. Due to the high data rate possible combined with the extremely high round-trip delay of a satellite link, you will need very large TCP windows and possibly the TCP timestamp option. Only RFC 1323 conformant systems can provide these. In other words, get a Solaris 2.6. For 2.5 systems, RFC 1323 compliance can be purchased as a Sun Consulting Special.

Window sizes are important for maximum throughput calculations, too. As Stevens shows, you cannot go faster than the window size offered by your peer, divided by the round-trip time (RTT). The lower your RTT, the faster you can transmit. The larger your window, the faster you can transmit. If you intend to employ maximum window sizes, you might want to give tcp_deferred_acks_max another look.

The network research laboratory of the German research network did measurements on satellite links. The RTT for a 10 Mbps link (if I remember correctly) was about 500 ms. A regular system was able to transmit 600 kbps whereas an RFC 1323 conformant system was able to transmit about 7 Mbps. Only bulk data transfer will do that for you.

 (1)   10 Mbps * 0.5 s = 5 Mbit = 625 KB
 (2)   512 KB = 4 Mbit = 0.5 s * 8 Mbps
 (3)   64 KB / 0.5 s = 128 KBps = 1 Mbps

The bandwidth-delay product can be used to estimate the initial value when tweaking buffer sizes. The buffers then represent the capacity of the link. If we apply the bandwidth-delay product calculations to the satellite link above, we get the following results: Equation 1 estimates the buffer sizes necessary to completely fill the 10 Mbps link. Equation 2 assumes that the buffer sizes were set to 512 KB, which would yield 8 Mbps. Slight deviation in the experiment may have been caused by retransmissions. Finally, equation 3 estimates the maximum data rate we can use on the satellite link if limited to 64 KB buffers, e.g. Solaris <= 2.5.1. The 1 Mbps constitutes an upper limit, as can be seen from the measured 600 kbps.
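
The three equations can be replayed with two tiny Python functions. The function names are made up for this example. The code uses 1 Mbps = 1,000,000 bit/s and 1 KB = 1024 bytes, whereas the equations above round more freely, so the results agree only approximately with equations 2 and 3.

```python
def bdp_bytes(bandwidth_bps, rtt_s):
    """Bandwidth-delay product: bytes in flight needed to fill the link."""
    return bandwidth_bps * rtt_s / 8


def max_rate_bps(window_bytes, rtt_s):
    """Upper bound on throughput for a given window: one window per RTT."""
    return window_bytes * 8 / rtt_s


if __name__ == "__main__":
    rtt = 0.5  # seconds, the satellite link from the text
    print(bdp_bytes(10_000_000, rtt))        # equation 1: buffer to fill 10 Mbps
    print(max_rate_bps(512 * 1024, rtt))     # equation 2: rate with 512 KB buffers
    print(max_rate_bps(64 * 1024, rtt))      # equation 3: rate with 64 KB buffers
```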

Squid users beware: As long as Squid does not implement HTTP/1.1 persistent connections, you will not get any decent HTTP transmissions via satellite. The average cached object size is about 13 kbyte, thus you almost never get past TCP slow start. This may or may not be a big deal with terrestrial links, but you will never be able to fill a satellite pipe to a satisfactory degree. Doing things in parallel might help. Only when reaching TCP congestion avoidance will you see any filling of the pipe.
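
How little of slow start a 13 kbyte object gets to see can be estimated with an idealized model, sketched here in Python. The assumptions are deliberately crude: the congestion window starts at one segment and doubles every round trip, nothing is lost, delayed ACKs and the receiver's window play no role, and the MSS is 1460 bytes. The function name is made up for this example.

```python
import math


def slow_start_round_trips(object_bytes, mss=1460, initial_cwnd=1):
    """Round trips needed to send an object while cwnd doubles per RTT.

    Idealized: no loss, no delayed ACKs, receiver window never limits.
    """
    segments = math.ceil(object_bytes / mss)
    cwnd, sent, rounds = initial_cwnd, 0, 0
    while sent < segments:
        sent += cwnd      # one window of segments per round trip
        cwnd *= 2         # slow start doubles the window
        rounds += 1
    return rounds


if __name__ == "__main__":
    rounds = slow_start_round_trips(13 * 1024)
    print(rounds, "round trips,", rounds * 0.5, "seconds at 500 ms RTT")
```

Under these assumptions the object is gone after a handful of round trips, long before the window has grown to anything near the bandwidth-delay product of the satellite link, and the connection setup costs a further round trip on top.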

tcp_cwnd_max: default 32768, since 2.? 65535, recommended 65535 for Solaris <= 2.5.1
tcp_recv_hiwat: default 8192, recommended 16384 (see text), Cockroft 32768, maximum 65535
tcp_recv_hiwat_minmss: default 4, no recommendations
udp_recv_hiwat: default 8192, recommended 16384 (see text), maximum 65535
tcp_xmit_hiwat: default 8192, recommended 16384 (see text), Cockroft 32768, maximum 65535
udp_xmit_hiwat: default 8192, recommended 16384, maximum 65535
tcp_xmit_lowat: default 2048, no recommendations
udp_xmit_lowat: default 1024, no recommendations
tcp_max_buf: default 262144, minimum 65536, no immediate recommendations
udp_max_buf: default 262144 (since 2.5), minimum 65536, no immediate recommendations

A few odd remarks at this point, concerning the recommendations given for the transmission buffer sizes. I decreased the recommendations of Adrian Cockroft in favour of a more conservative memory consumption. Also, with an average HTTP object size of 13 KByte, you can expect to fit over 50 % of all objects into the transmission buffer. On the other hand, larger objects which are to be transmitted by a cache or webserver may suffer in certain circumstances. Furthermore, I would recommend a generic transmission buffer size which is double the reception buffer size. This recommendation is based on the fact that unacknowledged segments occupy the send buffer until they are acknowledged.

7. Tuning your system

7.1 Things to watch

Did you reserve enough swap space? You should have at least as much swap as you have main memory. If you have little main memory, even double your swap. Do not be fooled by the result of the vmstat command - read the manpage and realize that the small value for free memory shown there is (usually) correct.

With Solaris there seems to exist a difference between virtually generated processes and real processes. The latter is extremely dependent on the amount of virtual memory. To test the amount of both kinds of processes, try a small program of mine. Start it at the console, without X and not as a privileged user. The first value is the hard limit of processes, and the second value the number of processes you can really create given your virtual memory configuration. Tweaking your ulimit values may or may not help.

7.2 General entries in the file /etc/system

The file /etc/system contains various very important configurable resource parameters for your system. You use these tunings to give a heavily loaded system more resources of a certain kind. Unfortunately a reboot is necessary after changing anything. Though one could schedule reboots after midnight, I advise against it. You should always check whether your changes have the desired effect and won't tear down the system.

Adrian Cockroft severely warns against transporting an /etc/system from one system onto another, or even worse, onto another hardware platform:

Clean out your /etc/system when you upgrade.

The most frequent changes are limited to the number of file descriptors, because the socket API uses file descriptors for handling internet connectivity. You may want to look at the hard limit of file descriptors available to you. Proxies like Squid have to count two descriptors for each connection: the open request descriptor, and either an open file or an open forwarding request descriptor.

You influence the tuning with the reserved word set. Use whitespace to separate the keyword from the key, and an equals sign to separate the key from its value. There are a few examples in the comments of the file.

Please, before you start, make a backup copy of your initial /etc/system. The backup should be located on your root filesystem. Thus, if some parameters fail, you can always supply the alternative, original system file on the boot prompt. The following shows two typically entered parameters:

* these are the defaults of the system
set rlim_fd_max=1024
set rlim_fd_cur=64
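
The backup advised above could be scripted roughly as follows. This is only a sketch: the function name backup_system_file and the .orig suffix are assumptions chosen for illustration, not anything Solaris prescribes.

```shell
# Copy a system file aside before editing it, refusing to overwrite an
# existing backup. The function name and the ".orig" suffix are
# illustrative assumptions.
backup_system_file() {
    src="${1:-/etc/system}"
    dst="${2:-${src}.orig}"
    if [ -f "$dst" ]; then
        echo "backup $dst already exists, leaving it alone" >&2
        return 1
    fi
    # -p keeps mode and timestamps of the original
    cp -p "$src" "$dst"
}

# Typical use, as super user:
#   backup_system_file /etc/system
```

Should a changed /etc/system keep the machine from coming up, an interactive boot (boot -a) asks for the name of the system file, and the backup copy can be named there.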

WARNING! SUN does not make any guarantees for the correct working of your system if you use more than 4096 file descriptors. Personally, my old fvwm window manager quit working altogether. In my case, I had compiled it on a Solaris 2.3 or 2.4 system and always carried the binary onwards, up to a 2.5 system. After compiling fvwm95, it worked to my satisfaction.

If you experience SEGV core dumps from select(3c) after increasing your file descriptors above 4096, you have to recompile the affected programs. The select(3c) call in particular is known to Squid users for its bad temper concerning the maximum number of file descriptors. SUN remarks on this topic:

The default value for FD_SETSIZE (currently 1024) is larger than the default limit on the number of open files. In order to accommodate programs that may use a larger number of open files with select(), it is possible to increase this size within a program by providing a larger definition of FD_SETSIZE before the inclusion of <sys/types.h>.
Note: This does not work as expected. See text below.

I did test this suggestion by SUN, and a friend of mine tried it with Squid caches. The result was a complete success or disaster both times, depending on your point of view: If you can live with supplying naked women to your customers instead of bouncing company logos, go ahead and try it. If you really need to access file descriptors above 1024, don't use select(), use poll() instead! poll() is supposed to be faster on Solaris, anyway. A different source mentions that the redefinition workaround mentioned above works satisfactorily; it did not for me, nor with Squid.

The pages of VJ contain some tricks which I incorporated into this paper, too. Personally, I am of the opinion that the VJ pages are not as up to date as they could be.

Many parameters of interest can be determined using the sysdef -i command. Please keep in mind that many values are in hexadecimal notation without the 0x prefix. Another very good program to inspect your system's configuration is sysinfo. Refer to the manpages on how to invoke it.
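
Since the hexadecimal values come without the 0x prefix, a small helper can save some mental arithmetic. The function name hex_to_dec is an assumption made for this sketch; it relies only on the printf utility accepting C-style hexadecimal constants.

```shell
# Convert a bare hexadecimal value, as printed by sysdef -i, to decimal.
# The function name is an illustrative assumption.
hex_to_dec() {
    printf '%d\n' "0x$1"
}

# 10e3 hexadecimal is 4323 decimal
hex_to_dec 10e3
```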

rlim_fd_cur: default 64, recommended >= 1024
rlim_fd_max: default 1024, recommended >=4096
maxusers: default 249 ~= Megs RAM (Ultra-2/2 CPUs/256 MB), min 8, max 2048, no recommendations
max_nprocs: default 3994 (Ultra-2/2 CPUs/256 MB), no recommendations
maxuprc: default -5 (here: 3989), no recommendations
npty: default 48, no recommendations
pt_cnt: default 48, min 48, max 3000, no recommendations
vac_size: default 16384 (with maxusers 249), recommended: don't set
ufs_ninode: default 4323 = 17*maxusers+90 (with maxusers 249), min 226, max 34906,
ncsize: default 4323 = 17*maxusers+90 (with maxusers 249), min 226, max 34906,
ndquot: default 6484, no recommendations
nstrpush: default 9, no recommendations
strmsgsz: default 65536, no recommendations
strctlsz: default 1024, no recommendations
bufhwm: default 2 % of main memory, no immediate recommendations
use_mxcc_prefetch: default 0 (sun4d) or 1 (sun4m), recommended: see text
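
The defaults of ufs_ninode and ncsize follow from maxusers by the formula given in the list. A minimal sketch of that arithmetic, with the function name default_ninode being an assumption for illustration:

```shell
# Default of ufs_ninode and ncsize as stated in the list above:
# 17 * maxusers + 90. expr is used so that a plain Bourne shell suffices.
default_ninode() {
    expr 17 \* "$1" + 90
}

default_ninode 249     # the Ultra-2 example from the list
```

The listed minimum of 226 and maximum of 34906 are what the same formula yields for the minimum (8) and maximum (2048) of maxusers.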

Some services, like Squid or some news servers, use a multitude of cache files, where names (URLs or articles) are mapped by a hash function to a shallow directory tree. This helps the buffer cache and inode caches of the host file system (compared to using unlimited subdirectories like the CERN cache). As is well known in software engineering, the speedup from using the right algorithm usually far exceeds anything you can achieve by fiddling with the hardware or tweaking system parameters. Thus, a new storage scheme for mapped caches should provide food for thought.

7.3 100 Mbit ethernet related entries

Mr. Nebel and Mr. Hüsemann were so kind as to give me a few hints concerning 100 Mbit ethernet interfaces and Solaris. It looks as if these cards default to half-duplex operation. Before you switch to full-duplex mode, make sure your router can also work full duplex.

hme:hme_adv_100fdx_cap: default 0, recommended 1
hme:hme_adv_100hdx_cap: default 1, recommended 0
hme:hme_adv_autoneg_cap: default 1, recommended ?

A few failure conditions of 100 Mbit interfaces result in a downgrade to 10 Mbit ethernet or to half-duplex mode. Thus check at all available ends whether you are really getting the data rate you are expecting.
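
One way to check the local end is to ask the driver with ndd. The sketch below assumes that the hme driver offers the read-only parameters link_speed and link_mode with the meanings noted in the comments; verify the names with ndd /dev/hme \? on your system. The function name describe_hme_link is made up for this example.

```shell
# Translate the numeric answers of the hme driver into words.
# Assumed meaning (check against your driver documentation):
#   link_speed  0 = 10 Mbit,     1 = 100 Mbit
#   link_mode   0 = half duplex, 1 = full duplex
describe_hme_link() {
    case "$1" in
        1) speed="100 Mbit" ;;
        0) speed="10 Mbit" ;;
        *) speed="unknown speed" ;;
    esac
    case "$2" in
        1) mode="full duplex" ;;
        0) mode="half duplex" ;;
        *) mode="unknown duplex" ;;
    esac
    echo "$speed, $mode"
}

# On a live system (needs read access to /dev/hme):
#   describe_hme_link "`ndd /dev/hme link_speed`" "`ndd /dev/hme link_mode`"
```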

7.4 How to find further entries

There are thousands of further items you can adjust. Every module which has a device in the /dev directory and a module file somewhere in the kernel tree underneath /kernel can be configured with the help of ndd. Whether you need superuser privileges depends on the access mode of the device file.

For instance, there exists a device /dev/hme and a kernel module /kernel/drv/hme. This driver is connected, as you might know, to the 100 Mbit ethernet interface. If you want to know what value you can tweak, you can ask ndd:

ndd /dev/hme \?

Of course, you can only change entries marked for read and write. If you tweaked enough and want to store some configuration as a default at boot time, you can enter your preferred values into the /etc/system file. Just prefix the key with the module name and separate both with a colon. You did see this earlier in the subsection on 100 Mbit ethernet and the System V IPC page.
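
The naming rule for /etc/system entries of a module can be written down as a one-line helper. The function name system_entry is an assumption for illustration; the output is meant to be reviewed and pasted into /etc/system by hand, not appended blindly.

```shell
# Build an /etc/system line of the form "set module:key=value".
# The function name is an illustrative assumption.
system_entry() {
    printf 'set %s:%s=%s\n' "$1" "$2" "$3"
}

# Entry for the full duplex recommendation from the subsection on
# 100 Mbit ethernet:
system_entry hme hme_adv_100fdx_cap 1
```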

There is another way to get your hands on the names of keys to tweak. For instance, the System V IPC modules don't have a related device file. This implies that you cannot tweak things with the help of ndd. Nevertheless, you can obtain all clear text strings from the module file in the kernel.

strings -a /kernel/sys/shmsys # possible
nm /kernel/sys/shmsys # recommended

You will see a number of strings. Most of them are either names of functions within the module or clear text string passages defined within. Strings starting with shminfo are the names of user tuneable parameters, though. Now, how do you separate tuneable parameters from the other stuff? I really don't know. If you have some knowledge about Sun DDI, you may be able to help me find a recommendable way, e.g. using _info(9E) and mod_info.
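
Until a better way turns up, the candidates can at least be narrowed down by their prefix. The following is a minimal sketch; the function name filter_prefix is an assumption, and the filter only knows about names, not about whether a symbol is really tuneable.

```shell
# Keep only the names starting with the given prefix, sorted and without
# duplicates. Reads one name per line from standard input.
filter_prefix() {
    grep "^$1" | sort -u
}

# On a live system:
#   strings -a /kernel/sys/shmsys | filter_prefix shminfo
```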

8. Recommended patches

It is utterly necessary to patch your Solaris system, if you didn't already do so! Have a look at the DFN CERT patch mirror or the original source from SUN. There may be a mirror closer to you, e.g. EUNet and FUNET have their own mirrors, if I am informed correctly.

In order to increase your TCP performance, security of websites and fix several severe bugs, do patch! Whoever still runs a Solaris below 2.5 should upgrade to 2.5.1 at least. I am about to find out how good Solaris 2.6 really is, and it is looking very promising.

Please remember to press the Shift key in your Netscape Navigator while selecting a link. If a patch cannot be loaded, a new release has probably appeared in the meantime. In that case, have a look at the directories of DFN CERT or SUN. The README file on the DFN CERT server is kept without a version number and is thus always up to date.

ip and ifconfig patch: 103630-09 for Solaris 2.5.1 (README)
tcp patch (only with ip patches): 103582-15 for Solaris 2.5.1 (README)

Any system administrator should know the contents of SUN's patch page. Besides previously mentioned patches for a good TCP/IP performance, you should always consider the security related patches. Also, SUN recommends a set of further patches to complete the support for large IP addresses. You should really include any DNS related patch.

The SUN supplied patches to fix multicast problems with 2.5.1 are incompatible with the TCP patch. Unfortunately, you have to decide between an unbroken multicast and a fixed TCP module. Yes, I am aware that multicast is only possible via UDP, nevertheless the multicast patch replaces the installed TCP module. If you have problems here, ask your SUN partner for a workaround - he will probably suggest upgrading to 2.6.

9. Related books and software

This section started after receiving some information from Christian Grimm and Franz Haberhauer on TCP/IP and performance related literature.

Books

Adrian Cockcroft; Sun Performance and Tuning; SUN Microsystems Inc.; 1995; ISBN 0-13-149642-5. Regrettably only up to Solaris 2.4, but most information is still valid for current Solaris systems. The Heise Verlag offers a German translation.
[must read] Adrian Cockcroft; Sun Performance and Tuning; 2nd edition; SUN Microsystems Inc.; 04'1998; ISBN 0-13-095249-4. The improved version on performance and tuning, covers quick tips and Solaris 2.6 as well as Java server technologies.
W. Richard Stevens; Advanced Programming in the UNIX Environment; Addison-Wesley Publishing Company; Reading, MA; 1992; ISBN 0-201-56317-7. A German translation is available as: Programmieren in der UNIX-Umgebung; ISBN 3-9319-814-8, 1995.
[must read] W. Richard Stevens; TCP/IP Illustrated, Volume 1 - The Protocols; Addison-Wesley Publishing Company; Reading, MA; 1994; ISBN 0-201-63346-9. A German translation is available.
W. Richard Stevens; TCP/IP Illustrated, Volume 2 - The Implementation; Addison-Wesley Publishing Company; Reading, MA; 1995; ISBN 0-201-63354-X. A German translation is available.
W. Richard Stevens; TCP/IP Illustrated, Volume 3 - T/TCP, HTTP, NNTP, Unix Domain Sockets; Addison-Wesley Publishing Company; Reading, MA; 1994; ISBN 0-201-63495-3. A German translation is available.
W. Richard Stevens; Unix Network Programming, Network APIs: Sockets and XTI; Prentice-Hall Inc.; Upper Saddle River, NJ; 1998; ISBN 0-13-490012-X. A German translation is not available.
Brian Wong; Configuration and Capacity Planning for Solaris Servers; SUN Microsystems Inc.; 199?; ISBN 0-13-349952-9.

Andrew S. Tanenbaum; Computer Networks; I use the 2nd edition; Prentice Hall Inc., 1989, ISBN 0-13162959-X (2nd) and 1996, ISBN 0-13349945-6 (3rd). A German translation of the 2nd edition is available: Computer Netzwerke; Wolfram's Fachverlag, 1990, ISBN 3-925328-79-3.
Andrew S. Tanenbaum; Modern Operating Systems; Prentice Hall Inc., 1992, ISBN 0-13588187-0.
Maurice Bach; Design of the Unix Operating System; Prentice Hall, 1986, ISBN 0-13201799-7. A German translation is available.

Internet resources

There is an RFC mirror from MERIT near you, too.
The IANA assignments and related documents constitute resources you should also know about.
Especially to this theme: TCP Performance Monitoring und -Tuning.
Sun on the Net article about The slow start and delayed ACK.
Several articles by Adrian Cockcroft on Solaris Performance and Tuning:

Monitoring of webservers.
The slides concerning Solaris 2.5 Performance Update.
An overview of the SunWorld columns including the contributions by Adrian Cockcroft concerning performance (he is the guru!).
How does Solaris 2.6 improve performance stats and Web performance?.
What are the tunable kernel parameters for Solaris 2?

From Sun's TCP/IP and Data Communications Administration Guide:

SunWorld articles by Jim Mauro on System-V-IPC:

The Papers published by the Pittsburgh Supercomputing Center Networking Group constitute a highly recommended reading, for instance (see previous link for a complete list):

J. Semke, J. Mahdavi, M. Mathis, "Automatic TCP Buffer Tuning", Computer Communication Review, a publication of ACM SIGCOMM, volume 28, number 4, October 1998 (abstract).
J. Mahdavi, "Enabling High Performance Data Transfers on Hosts: (Notes for Users and System Administrators)", Technical note, Revised: December 1997.
M. Mathis, J. Semke, J. Mahdavi, T. Ott, "The Macroscopic Behavior of the TCP Congestion Avoidance Algorithm", Computer Communication Review, volume 27, number 3, July 1997 (abstract).
M. Mathis, J. Mahdavi, S. Floyd, and A. Romanow, "TCP Selective Acknowledgement Options," Internet Request for Comments 2018 (rfc2018.txt) October 1996.

VJ contains descriptions, but may be a trifle outdated. A good starting point for hints on tuning.
Hints about the sizing of Java servers and simple webservers. More about application issues than generic performance.

Mentioned RFCs, also Internet resources:

RFC 793 Transmission Control Protocol (TCP, STD 7).
RFC 1122 Requirements for Internet hosts - communication layers (STD 3).
RFC 1123 Requirements for Internet hosts - application and support (STD 3), updated by RFC 2181.
RFC 1323 TCP Extensions for High Performance.
RFC 1700 Assigned numbers (STD 2), outdated, use the IANA assignments instead!
RFC 1918 Address Allocation for Private Internets.
RFC 2001 TCP Slow Start, Congestion Avoidance, Fast Retransmit, and Fast Recovery Algorithms.
RFC 2181 Clarifications to the DNS Specification.
RFC 2324 Hyper Text Coffee Pot Control Protocol (HTCPCP/1.0 :-)

Unmentioned, but important Internet resources for Web services. Compare with Duane Wessels' required reading list for Squid developers, and W3C's change history of HTTP:

RFC 1945 Hypertext Transfer Protocol - HTTP/1.0.
RFC 2068 Hypertext Transfer Protocol - HTTP/1.1 (PostScript.gz, Text.gz, Word97.gz).
RFC 2069 An Extension to HTTP : Digest Access Authentication.
RFC 2186 Internet Cache Protocol (ICP), version 2.
RFC 2187 Application of Internet Cache Protocol (ICP), version 2.
RFC 2227 Simple Hit-Metering and Usage-Limiting for HTTP.

Also of interest with regard to web services may be a bunch of related drafts, partially expired, but still sprouting with ideas. Compare with the IETF - Hypertext Transfer Protocol (HTTP) Working Group published documents and W3C's change history of HTTP:

Access-restricted, HTTP/1.1 Cache Control Extension (draft-melve-cachecontrol-00.txt).
HTTP Connection Management (draft-ietf-http-connection-00.txt).

Freeware by SUN - no support offered!

A description of the SE-Toolkits including zoom, a tool with a traffic light like display of system states and the percollator.
For your SE-Toolkit exists an additional TCP tool.
The SUN proctool.

more software

The Solstice product SyMON 1.1 for Enterprise Server E2, Ex000, SS1000, SC2000 can be obtained for free from SUN.

other servers

An overview of the configuration and load on www.sun.com.
www.infoseek.com - an example of what is possible with an E4000 configured as a search engine.

10. Uncovered material

There are a bunch of parameters which I didn't cover in the sections above, but some of which may be worth looking at:

tcp_ip_abort_linterval: default 180000
tcp_ip_notify_cinterval: default 10000
tcp_ip_notify_interval: default 10000
tcp_rexmit_interval_extra: Since 2.6: default 0

11. Startup scripts

Startup scripts for Solaris exist for the important tweakable parameters. Only the first script is really necessary.

The first script changes all parameters deemed necessary and described in the previous sections. The file should be called something like /etc/init.d/your-tune and you must link (hardlinks preferred, symbolic links are o.k.) /etc/rcS.d/S31your-tune to the init.d file.
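
The linking step could look like the following sketch. The function name link_rc_script is an assumption for illustration; it tries a hard link first and falls back to a symbolic link, e.g. when both directories are not on the same file system.

```shell
# Link an init script into a run level directory: hard link preferred,
# symbolic link as fallback. The function name is an illustrative
# assumption.
link_rc_script() {
    script="$1"
    target="$2"
    ln "$script" "$target" 2>/dev/null || ln -s "$script" "$target"
}

# As super user:
#   link_rc_script /etc/init.d/your-tune /etc/rcS.d/S31your-tune
```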

Please read the script carefully before installing. It is a rather easy shell script. The piping and awking isn't as bad as it looks:

The first line sets the PATH to standard values and prints a message. For all messages which are not to contain a linefeed, we have to use the UCB echo.
The variable $osver is set with the operating system major and minor version number times ten: Solaris 2.6 will set $osver to 260 and Solaris 2.5.1 will be counted as 251.
$patch looks into the installed kernel TCP module, because it mustn't be assumed that /var is already mounted. The result is either 0 for an unpatched system (or some error in the pipeline), or the applied TCP patch level. For non-2.5.1 systems, you have to change this line to your needs. All 2.5.1 systems (Sparc, x86 and PPC) will be recognized.
The next if tree just prints a message about the patch found.
Depending on your OS version and the installed patch, we either have a single value of tcp_conn_req_max to tweak, or the twin values of tcp_conn_req_max_q and tcp_conn_req_max_q0.
The section about retransmission related tweaks follows.
Depending on your OS version and the installed patch level, we may have a parameter tcp_slow_start_initial to tweak.
The final sections are indicated by their respective comments and messages.
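
To illustrate the $osver convention described above, here is a minimal sketch of one way to derive such a number. It is not taken from the script itself: the function name osver_from_release is an assumption, and it relies on uname -r reporting SunOS 5.x for Solaris 2.x.

```shell
# Map a SunOS release string as printed by uname -r to the Solaris
# version times one hundred: 5.6 becomes 260, 5.5.1 becomes 251.
# A missing third component counts as zero.
osver_from_release() {
    echo "$1" | awk -F. '{ printf "%d\n", 200 + 10 * $2 + $3 }'
}

# On a live system:
#   osver=`osver_from_release "\`uname -r\`"`
```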

your needs

The second script just changes the MTU of le0 from the IPX to the IEEE 802.3 size. The meaning is shown further up. The script is not strictly necessary, and reports about odd behaviour may have ceased with a patched 2.5.1 or a 2.6.

If you intend to go ahead with this script, the file is called /etc/init.d/your-tune2 and you need to create a link to it (hard or soft, as above) as /etc/rc2.d/S90your-tune2. Please mind that GNU awk is used in the script, normal awk does not seem to work satisfactorily.