In this chapter we describe NFS, the Network File System, another popular application that provides transparent file access for client applications. The building block of NFS is Sun RPC: Remote Procedure Call, which we must describe first.
Nothing special need be done by the client program to use NFS. The kernel detects that the file being accessed is on an NFS server and automatically generates the RPC calls to access the file.
Our interest in NFS is not in all the details on file access, but in its use of the Internet protocols, especially UDP.
29.2 Sun Remote Procedure Call
Most network programming is done by writing application programs that call system-provided functions to perform specific network operations. For example, one function performs a TCP active open, another performs a TCP passive open, another sends data across a TCP connection, another sets specific protocol options (enable TCP's keepalive timer), and so on. In Section 1.15 we mentioned that two popular sets of functions for network programming (called APIs) are sockets and TLI. The API used by the client and the API used by the server can be different, as can the operating systems running on the client and server. It is the communication protocol and application protocol that determine if a given client and server can communicate with each other. A Unix client written in C using sockets and TCP can communicate with a mainframe server written in COBOL using some other API and TCP, if both hosts are connected across a network and both have a TCP/IP implementation.
Typically the client sends commands to the server, and the server sends replies back to the client. All the applications we've looked at so far - Ping, Traceroute, routing daemons, and the clients and servers for the DNS, TFTP, BOOTP, SNMP, Telnet, FTP, and SMTP - are built this way.
RPC, Remote Procedure Call, is a different way of doing network programming. A client program is written that just calls functions in the server program. This is how it appears to the programmer, but the following steps actually take place.
The network programming done by the stubs and the RPC library routines uses an API such as sockets or TLI, but the user application-the client program, and the server procedures called by the client-never deal with this API. The client application just calls the server procedures and all the network programming details are hidden by the RPC package, the client stub, and the server stub. An RPC package provides numerous benefits.
Details of RPC programming are provided in Chapter 18 of [Stevens 1990]. Two popular RPC packages are Sun RPC and the RPC package in the Open Software Foundation's (OSF) Distributed Computing Environment (DCE). Our interest in RPC is to see what the procedure call and procedure return messages look like for the Sun RPC package, since it's used by the Network File System, which we describe in this chapter. Version 2 of Sun RPC is defined in RFC 1057 [Sun Microsystems 1988a].
Sun RPC comes in two flavors. One version is built using the sockets API and works with TCP and UDP. Another, called TI-RPC (for "transport independent"), is built using the TLI API and works with any transport layer provided by the kernel. From our perspective the two are the same, although we talk only about TCP and UDP in this chapter.
Figure 29.1 shows the format of an RPC procedure call message, when UDP is used.
The IP and UDP headers are the standard ones we showed earlier (Figures 3.1 and 11.2). What follows after the UDP header is defined by the RPC package.
The transaction ID (XID) is set by the client and returned by the server. When the client receives a reply it compares the XID returned by the server with the XID of the request it sent. If they don't match, the client discards the message and waits for the next one from the server. Each time the client issues a new RPC, it changes the XID. But if the client retransmits a previously sent RPC (because it hasn't received a reply), the XID does not change.
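The XID rules described above can be sketched in a few lines. This is a hypothetical illustration (the class and method names are ours, not part of Sun RPC): a new call gets a fresh XID, a retransmission reuses it, and a reply with a nonmatching XID is discarded.

```python
import itertools

class RpcClient:
    """Hypothetical sketch of a client's XID handling."""
    _next_xid = itertools.count(0x7aa6)   # any starting value works

    def __init__(self):
        self.pending_xid = None

    def new_call(self):
        # A new RPC gets a fresh XID ...
        self.pending_xid = next(self._next_xid)
        return self.pending_xid

    def retransmit(self):
        # ... but a retransmission reuses the XID of the unanswered call.
        return self.pending_xid

    def accept_reply(self, xid):
        # Replies whose XID doesn't match the outstanding call are discarded.
        return xid == self.pending_xid

c = RpcClient()
first = c.new_call()
assert c.retransmit() == first          # same XID on retransmission
assert not c.accept_reply(first + 99)   # stale or foreign reply: dropped
assert c.accept_reply(first)            # matching reply: accepted
```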
The call field is 0 for a call, and 1 for a reply. The current RPC version is 2. The next three fields, program number, version number, and procedure number, identify the specific procedure on the server to be called.
The credentials identify the client. In some instances nothing is sent here, and in other instances the numeric user ID and group IDs of the client are sent. The server can look at the credentials and determine if it will perform the request or not. The verifier is used with Secure RPC, which uses DES encryption. Although the credentials and verifier are variable-length fields, their length is encoded as part of the field.
Following this are the procedure parameters. The format of these depends on the definition of the remote procedure by the application. How does the receiver (the server stub) know the size of the parameters? Since UDP is being used, the size of the UDP datagram, minus the length of all the fields up through the verifier, is the size of the parameters. When TCP is used instead of UDP, there is no inherent length, since TCP is a byte stream protocol, without record boundaries. To handle this, a 4-byte length field appears between the TCP header and the XID, telling the receiver how many bytes comprise the RPC call. This allows the RPC call message to be sent in multiple TCP segments, if necessary. (The DNS uses a similar technique; see Exercise 14.4.)
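The layout just described can be built with a few `struct.pack` calls. This is a sketch, not a complete RPC library: it assumes null credentials and a null verifier (flavor 0, length 0), and every field is a big-endian 4-byte integer as required by XDR. The NFS program number (100003) and the GETATTR procedure number (1) come from the protocol specifications.

```python
import struct

CALL, RPC_VERSION, AUTH_NULL = 0, 2, 0

def pack_rpc_call(xid, prog, vers, proc, params=b""):
    """Build the UDP payload of an RPC call (RFC 1057), using null
    credentials and verifier.  All fields are big-endian 4-byte integers."""
    hdr = struct.pack(">6I", xid, CALL, RPC_VERSION, prog, vers, proc)
    cred = struct.pack(">2I", AUTH_NULL, 0)   # flavor, body length 0
    verf = struct.pack(">2I", AUTH_NULL, 0)
    return hdr + cred + verf + params

msg = pack_rpc_call(xid=0x7aa6, prog=100003, vers=2, proc=1)  # NFS GETATTR
assert len(msg) == 40     # six header fields + 2-word cred + 2-word verf

# Over TCP the same message is preceded by a 4-byte record length
# (the high bit marks the last fragment of the record):
record = struct.pack(">I", len(msg) | 0x80000000) + msg
```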
Figure 29.2 shows the format of an RPC reply. This is sent by the server stub to the client stub, when the remote procedure returns.
The XID in the reply is just copied from the XID in the call. The reply field is 1, which we said differentiates this message from a call. The status is 0 if the call message was accepted. (The message can be rejected if the RPC version number isn't 2, or if the server cannot authenticate the client.) The verifier is used with Secure RPC to identify the server.
The accept status is 0 on success. A nonzero
value can indicate an invalid version number or an invalid procedure
number, for example. As with the RPC call message, if TCP is used
instead of UDP, a 4-byte length field is sent between the TCP
header and the XID.
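Parsing a reply mirrors the call format. The sketch below handles only the common case, an accepted reply, and skips over the verifier using its encoded length; real replies can also be rejected, which this simplification does not decode.

```python
import struct

REPLY, MSG_ACCEPTED, SUCCESS = 1, 0, 0

def unpack_rpc_reply(data):
    """Parse an accepted RPC reply (a simplification: rejected replies
    and non-null verifiers are not handled beyond skipping the length)."""
    xid, mtype, stat = struct.unpack_from(">3I", data, 0)
    assert mtype == REPLY and stat == MSG_ACCEPTED
    flavor, vlen = struct.unpack_from(">2I", data, 12)   # verifier
    accept_stat, = struct.unpack_from(">I", data, 20 + vlen)
    return xid, accept_stat, data[24 + vlen:]            # results follow

reply = struct.pack(">6I", 0x7aa6, REPLY, MSG_ACCEPTED, 0, 0, SUCCESS) + b"results"
xid, status, results = unpack_rpc_reply(reply)
assert (xid, status, results) == (0x7aa6, SUCCESS, b"results")
```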
29.3 XDR: External Data Representation
XDR, External Data Representation, is the standard used to encode the values in the RPC call and reply messages-the RPC header fields (XID, program number, accept status, etc.), the procedure parameters, and the procedure results. Having a standard way of encoding all these values is what lets a client on one system call a procedure on a system with a different architecture. XDR is defined in RFC 1014 [Sun Microsystems 1987].
XDR defines numerous data types and exactly how they
are transmitted in an RPC message (bit order, byte order, etc.).
The sender must build an RPC message in XDR format, then the receiver
converts the XDR format into its native representation. We see,
for example, in Figures 29.1 and 29.2, that all the integer values
we show (XID, call, program number, etc.) are 4-byte integers.
Indeed, all integers occupy 4 bytes in XDR. Other data types supported
by XDR include unsigned integers, booleans, floating point numbers,
fixed-length arrays, variable-length arrays, and structures.
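Two of the XDR encoding rules are easy to show directly: integers always occupy 4 big-endian bytes, and variable-length data carries a 4-byte length and is zero-padded to a multiple of 4. A minimal sketch:

```python
import struct

def xdr_int(n):
    # Every integer occupies 4 bytes, big-endian (RFC 1014).
    return struct.pack(">i", n)

def xdr_opaque(data):
    # Variable-length data: a 4-byte length, then the bytes,
    # padded with zero bytes to a multiple of 4.
    pad = (-len(data)) % 4
    return struct.pack(">I", len(data)) + data + b"\0" * pad

assert xdr_int(1) == b"\x00\x00\x00\x01"
assert len(xdr_opaque(b"hello.c")) == 12   # 4-byte length + 7 bytes + 1 pad
```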
29.4 Port Mapper
The RPC server programs containing the remote procedures use ephemeral ports, not well-known ports. This requires a "registrar" of some form that keeps track of which RPC programs are using which ephemeral ports. In Sun RPC this registrar is called the port mapper.
The term "port" in this name originates from the TCP and UDP port numbers, features of the Internet protocol suite. Since TI-RPC works over any transport layer, and not just TCP and UDP, the name of the port mapper in systems using TI-RPC (SVR4 and Solaris 2.2, for example) has become rpcbind. We'll continue to use the more familiar name of port mapper.
Naturally, the port mapper itself must have a well-known port: UDP port 111 and TCP port 111. The port mapper is also just an RPC server program. It has a program number (100000), a version number (2), a TCP port of 111, and a UDP port of 111. Servers register themselves with the port mapper using RPC calls, and clients query the port mapper using RPC calls. The port mapper provides four server procedures:
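One of these procedures, PMAPPROC_GETPORT (procedure 3 in RFC 1057), is the client's query. The sketch below only builds the UDP payload of such a query; actually sending it to port 111 and decoding the 4-byte port number in the reply is omitted. The protocol value is the IP protocol number (17 for UDP, 6 for TCP).

```python
import struct

PMAP_PROG, PMAP_VERS, PMAPPROC_GETPORT = 100000, 2, 3
IPPROTO_UDP = 17

def pack_getport(xid, prog, vers, proto=IPPROTO_UDP):
    """UDP payload asking the port mapper (port 111) which port the given
    program/version/protocol is registered on (RFC 1057)."""
    hdr = struct.pack(">6I", xid, 0, 2, PMAP_PROG, PMAP_VERS, PMAPPROC_GETPORT)
    auth = struct.pack(">4I", 0, 0, 0, 0)            # null cred + null verifier
    args = struct.pack(">4I", prog, vers, proto, 0)  # last word is unused
    return hdr + auth + args

query = pack_getport(0x1234, prog=100005, vers=1)    # where is the mount daemon?
assert len(query) == 56
```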
When an RPC server program starts, and is later called by an RPC client program, the following steps take place.
If TCP is being used, the client does an active open to the server's TCP port number, and then sends an RPC call message across the connection. The server responds with an RPC reply message across the connection.
The program rpcinfo(8) prints out the port mapper's current mappings. (It calls the port mapper's PMAPPROC_DUMP procedure.) Here is some typical output:
sun % /usr/etc/rpcinfo -p
   program vers proto   port
    100005    1   tcp    702  mountd      mount daemon for NFS
    100021    1   tcp    709  nlockmgr    NFS lock manager
We see that some programs do support multiple versions, and each combination of a program number, version number, and protocol has its own port number mapping maintained by the port mapper.
Both versions of the mount daemon are accessed through
the same TCP port number (702) and the same UDP port number (699),
but each version of the lock manager has its own port number.
29.5 NFS Protocol
NFS provides transparent file access for clients to files and filesystems on a server. This differs from FTP (Chapter 27), which provides file transfer. With FTP a complete copy of the file is made. NFS accesses only the portions of a file that a process references, and a goal of NFS is to make this access transparent. This means that any client application that works with a local file should work with an NFS file, without any program changes whatsoever.
NFS is a client-server application built using Sun RPC. NFS clients access files on an NFS server by sending RPC requests to the server. While this could be done using normal user processes - that is, the NFS client could be a user process that makes explicit RPC calls to the server, and the server could also be a user process - NFS is normally not implemented this way for two reasons. First, accessing an NFS file must be transparent to the client. Therefore the NFS client calls are performed by the client operating system, on behalf of client user processes. Second, NFS servers are implemented within the operating system on the server for efficiency. If the NFS server were a user process, every client request and server reply (including the data being read or written) would have to cross the boundary between the kernel and the user process, which is expensive.
In this section we look at version 2 of NFS, as documented in RFC 1094 [Sun Microsystems 1988b]. A better description of Sun RPC, XDR, and NFS is given in [X/Open 1991]. Details on using and administering NFS are in [Stern 1991]. The specifications for version 3 of the NFS protocol were released in 1993, which we cover in Section 29.7.
Figure 29.3 shows the typical arrangement of an NFS client and an NFS server. There are many subtle points in this figure.
Most Unix hosts can operate as an NFS client, an NFS server, or both. Most PC implementations (MS-DOS) only provide NFS client implementations. Most IBM mainframe implementations only provide NFS server functions.
NFS really consists of more than just the NFS protocol. Figure 29.4 shows the various RPC programs normally used with NFS.
[Figure 29.4: a table listing, for each RPC program used with NFS, its program number, version numbers, and number of procedures.]
The versions we show in this figure are the ones found on systems such as SunOS 4.1.3. Newer implementations are providing newer versions of some of the programs. Solaris 2.2, for example, also supports versions 3 and 4 of the port mapper, and version 2 of the mount daemon. SVR4 also supports version 3 of the port mapper.
The mount daemon is called by the NFS client host before the client can access a filesystem on the server. We discuss this below.
The lock manager and status monitor allow clients to lock portions of files that reside on an NFS server. These two programs are independent of the NFS protocol because locking requires state on both the client and server, and NFS itself is stateless on the server. (We say more about NFS's statelessness later.) Chapters 9, 10, and 11 of [X/Open 1991] document the procedures used by the lock manager and status monitor for file locking with NFS.
A fundamental concept in NFS is the file handle. It is an opaque object used to reference a file or directory on the server. The term opaque denotes that the server creates the file handle, passes it back to the client, and then the client uses the file handle when accessing the file. The client never looks at the contents of the file handle-its contents only make sense to the server.
Each time a client process opens a file that is really a file on an NFS server, the NFS client obtains a file handle for that file from the NFS server. Each time the NFS client reads or writes that file for the user process, the file handle is sent back to the server to identify the file being accessed.
Normal user processes never deal with file handles - it is the NFS client code and the NFS server code that pass them back and forth. In version 2 of NFS a file handle occupies 32 bytes, although this increases with version 3 to 64 bytes.
Unix servers normally store the following information in the file handle: the filesystem identifier (the major and minor device numbers of the filesystem), the i-node number (a unique number within a filesystem), and an i-node generation number (a number that changes each time an i-node is reused for a different file).
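One plausible packing of those three pieces of information into a 32-byte V2 handle is sketched below. The exact layout is private to the server (that is the point of opaqueness), so this is an illustration, not the format any particular implementation uses.

```python
import struct

def make_handle(major, minor, inode, generation):
    """One plausible 32-byte V2 file handle layout.  The real layout is
    server-private; the client must treat the handle as opaque bytes."""
    return struct.pack(">4I", major, minor, inode, generation).ljust(32, b"\0")

h = make_handle(major=7, minor=0, inode=4871, generation=3)
assert len(h) == 32        # V2 handles are exactly 32 bytes
# The client never decodes h; it only sends it back verbatim in later requests.
```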
The client must use the NFS mount protocol to mount a server's filesystem, before the client can access files on that filesystem. This is normally done when the client is bootstrapped. The end result is for the client to obtain a file handle for the server's filesystem.
Figure 29.5 shows the sequence of steps that takes place when a Unix client issues the mount(8) command, specifying an NFS mount.
The following steps take place.
This implementation technique puts all the mount processing, other than the mount system call on the client, in user processes, instead of the kernel. The three programs we show-the mount command, the port mapper, and the mount daemon-are all user processes. As an example, on our host sun (the NFS client) we execute
sun # mount -t nfs bsdi:/usr /nfs/bsdi/usr
This mounts the directory /usr on the host bsdi (the NFS server) as the local filesystem /nfs/bsdi/usr. Figure 29.6 shows the result.
When we reference the file /nfs/bsdi/usr/rstevens/hello.c on the client sun we are really referencing the file /usr/rstevens/hello.c on the server bsdi.
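The kernel's translation from the client's pathname to the server's pathname can be sketched as a simple prefix match against the mount table. This is a hypothetical illustration; the names and the table format are ours.

```python
# Hypothetical sketch of the kernel's mount-point translation.
MOUNTS = {"/nfs/bsdi/usr": ("bsdi", "/usr")}   # client path -> (server, server path)

def translate(path):
    for mount_point, (server, remote_root) in MOUNTS.items():
        if path == mount_point or path.startswith(mount_point + "/"):
            return server, remote_root + path[len(mount_point):]
    return None, path   # a purely local file

assert translate("/nfs/bsdi/usr/rstevens/hello.c") == ("bsdi", "/usr/rstevens/hello.c")
assert translate("/etc/passwd") == (None, "/etc/passwd")
```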
The NFS server provides 15 procedures, which we now describe. (The numbers we use are not the same as the NFS procedure numbers, since we have grouped them according to functionality.) Although NFS was designed to work between different operating systems, and not just Unix systems, some of the procedures provide Unix functionality that might not be supported by other operating systems (e.g., hard links, symbolic links, group owner, execute permission, etc.). Chapter 4 of [Stevens 1992] contains additional information on the properties of Unix filesystems, some of which are assumed by NFS.
These procedure names actually begin with the prefix NFSPROC_, which we've dropped.
NFS was originally written to use UDP, and that's what all vendors provide. Newer implementations, however, also support TCP. TCP support is provided for use on wide area networks, which are getting faster over time. NFS is no longer restricted to local area use.
The network dynamics can change drastically when going from a LAN to a WAN. The round-trip times can vary widely and congestion is more frequent. These characteristics of WANs led to the algorithms we examined with TCP - slow start and congestion avoidance. Since UDP does not provide anything like these algorithms, either the same algorithms must be put into the NFS client and server or TCP should be used.
The Berkeley Net/2 implementation of NFS supports either UDP or TCP. [Macklem 1991] describes this implementation. Let's look at the differences when TCP is used.
Over time, additional vendors plan to support NFS over TCP.
29.6 NFS Examples
Let's use tcpdump to see which NFS procedures are invoked by the client for typical file operations. When tcpdump detects a UDP datagram containing an RPC call (call equals 0 in Figure 29.1) with a destination port of 2049, it decodes the datagram as an NFS request. Similarly if the UDP datagram is an RPC reply (reply equals 1 in Figure 29.2) with a source port of 2049, it decodes the datagram as an NFS reply.
Our first example just copies a file to the terminal using the cat(1) command, but the file is on an NFS server:
sun % cat /nfs/bsdi/usr/rstevens/hello.c        copy file to terminal
main()
{
        printf("hello, world\n");
}
On the host sun (the NFS client) the filesystem /nfs/bsdi/usr is really the /usr file-system on the host bsdi (the NFS server), as shown in Figure 29.6. The kernel on sun detects this when cat opens the file, and uses NFS to access the file. Figure 29.7 shows the tcpdump output.
 1  0.0                  sun.7aa6 > bsdi.nfs: 104 getattr
 2  0.003587 (0.0036)    bsdi.nfs > sun.7aa6: reply ok 96
 3  0.005390 (0.0018)    sun.7aa7 > bsdi.nfs: 116 lookup "rstevens"
 4  0.009570 (0.0042)    bsdi.nfs > sun.7aa7: reply ok 128
 5  0.011413 (0.0018)    sun.7aa8 > bsdi.nfs: 116 lookup "hello.c"
 6  0.015512 (0.0041)    bsdi.nfs > sun.7aa8: reply ok 128
 7  0.018843 (0.0033)    sun.7aa9 > bsdi.nfs: 104 getattr
 8  0.022377 (0.0035)    bsdi.nfs > sun.7aa9: reply ok 96
 9  0.027621 (0.0052)    sun.7aaa > bsdi.nfs: 116 read 1024 bytes @ 0
10  0.032170 (0.0045)    bsdi.nfs > sun.7aaa: reply ok 140
When tcpdump decodes an NFS request or reply, it prints the XID field for the client, instead of the port number. The XID field in lines 1 and 2 is 0x7aa6.
The filename /nfs/bsdi/usr/rstevens/hello.c is processed by the open function in the client kernel one element at a time. When it reaches /nfs/bsdi/usr it detects that this is a mount point to an NFS mounted filesystem.
In line 1 the client calls the GETATTR procedure to fetch the attributes of the server's directory that the client has mounted (/usr). This RPC request contains 104 bytes of data, exclusive of the IP and UDP headers. The reply in line 2 has a return value of OK and contains 96 bytes of data, exclusive of the IP and UDP headers. We see in this figure that the minimum NFS message contains around 100 bytes of data.
In line 3 the client calls the LOOKUP procedure for the file rstevens and receives an OK reply in line 4. The LOOKUP specifies the filename rstevens and the file handle that was saved by the kernel when the remote filesystem was mounted. The reply contains a new file handle that is used in the next step.
In line 5 the client does a LOOKUP of hello.c using the file handle from line 4. It receives another file handle in line 6. This new file handle is what the client uses in lines 7 and 9 to reference the file /nfs/bsdi/usr/rstevens/hello.c. We see that the client does a LOOKUP for each component of the pathname that is being opened.
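The component-at-a-time lookup can be sketched as a short loop. Here `nfs_lookup` is a stand-in for an RPC to the server's LOOKUP procedure, and the string "handles" are fakes for illustration; the real handles are opaque 32-byte values.

```python
# Sketch of the client's component-at-a-time pathname lookup.
def nfs_lookup(dir_handle, name):
    """Stand-in for a LOOKUP RPC; returns a fake handle for illustration."""
    return dir_handle + "/" + name

def open_remote(mount_handle, relative_path):
    handle = mount_handle                        # handle saved at mount time
    for component in relative_path.split("/"):
        handle = nfs_lookup(handle, component)   # one LOOKUP RPC per component
    return handle

h = open_remote("FH(/usr)", "rstevens/hello.c")
assert h == "FH(/usr)/rstevens/hello.c"          # two LOOKUPs, as in lines 3 and 5
```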
In line 7 the client does another GETATTR, followed by a READ in line 9. The client asks for 1024 bytes, starting at offset 0, but receives less. (After subtracting the sizes of the RPC fields, and the other values returned by the READ procedure, 38 bytes of data are returned in line 10. This is indeed the size of the file hello.c.)
In this example the user process knows nothing about these NFS requests and replies that are being done by the kernel. The application just calls the kernel's open function, which causes 3 requests and 3 replies to be exchanged (lines 1-6), and then calls the kernel's read function, which causes 2 requests and 2 replies (lines 7-10). It is transparent to the client application that the file is on an NFS server.
As another simple example we'll change our working directory to a directory that's on an NFS server, and then create a new directory:
sun % cd /nfs/bsdi/usr/rstevens         change working directory
sun % mkdir Mail                        and create a directory
Figure 29.8 shows the tcpdump output.
 1   0.0                   sun.7ad2 > bsdi.nfs: 104 getattr
 2   0.004912 ( 0.0049)    bsdi.nfs > sun.7ad2: reply ok 96
 3   0.007266 ( 0.0024)    sun.7ad3 > bsdi.nfs: 104 getattr
 4   0.010846 ( 0.0036)    bsdi.nfs > sun.7ad3: reply ok 96
 5  35.769875 (35.7590)    sun.7ad4 > bsdi.nfs: 104 getattr
 6  35.773432 ( 0.0036)    bsdi.nfs > sun.7ad4: reply ok 96
 7  35.775236 ( 0.0018)    sun.7ad5 > bsdi.nfs: 112 lookup "Mail"
 8  35.780914 ( 0.0057)    bsdi.nfs > sun.7ad5: reply ok 28
 9  35.782339 ( 0.0014)    sun.7ad6 > bsdi.nfs: 144 mkdir "Mail"
10  35.992354 ( 0.2100)    bsdi.nfs > sun.7ad6: reply ok 128
Changing our directory causes the client to call the GETATTR procedure twice (lines 1-4). When we create the new directory, the client calls the GETATTR procedure (lines 5 and 6), followed by a LOOKUP (lines 7 and 8, to verify that the directory doesn't already exist), followed by a MKDIR to create the directory (lines 9 and 10). The reply of OK in line 8 doesn't mean that the directory exists. It just means the procedure returned; tcpdump doesn't interpret the return values from the NFS procedures. It normally prints OK and the number of bytes of data in the reply.
One of the features of NFS (critics of NFS would call this a wart, not a feature) is that the NFS server is stateless. The server does not keep track of which clients are accessing which files. Notice in the list of NFS procedures shown earlier that there is no open procedure or close procedure. The LOOKUP procedure is similar to an open, but the server never knows if the client is really going to reference the file after the client does a LOOKUP.
The reason for a stateless design is to simplify the server's crash recovery after it crashes and reboots.
In the following example we are reading a file from an NFS server when the server crashes and reboots. This shows how the stateless server approach lets the client "not know" that the server crashes. Other than a time pause while the server crashes and reboots, the client is unaware of the problem, and the client application is not affected.
On the client sun we start a cat of a long file (/usr/share/lib/termcap on the NFS server svr4), disconnect the Ethernet cable during the transfer, shut down and reboot the server, then reconnect the cable. The client was configured to read 1024 bytes per NFS read. Figure 29.9 shows the tcpdump output.
  1    0.0                   sun.7ade > svr4.nfs: 104 getattr
  2    0.007653 ( 0.0077)    svr4.nfs > sun.7ade: reply ok 96
  3    0.009041 ( 0.0014)    sun.7adf > svr4.nfs: 116 lookup "share"
  4    0.017237 ( 0.0082)    svr4.nfs > sun.7adf: reply ok 128
  5    0.018518 ( 0.0013)    sun.7ae0 > svr4.nfs: 112 lookup "lib"
  6    0.026802 ( 0.0083)    svr4.nfs > sun.7ae0: reply ok 128
  7    0.028096 ( 0.0013)    sun.7ae1 > svr4.nfs: 116 lookup "termcap"
  8    0.036434 ( 0.0083)    svr4.nfs > sun.7ae1: reply ok 128
  9    0.038060 ( 0.0016)    sun.7ae2 > svr4.nfs: 104 getattr
 10    0.045821 ( 0.0078)    svr4.nfs > sun.7ae2: reply ok 96
 11    0.050984 ( 0.0052)    sun.7ae3 > svr4.nfs: 116 read 1024 bytes @ 0
 12    0.084995 ( 0.0340)    svr4.nfs > sun.7ae3: reply ok 1124
                             ...
128    3.430313 ( 0.0013)    sun.7b22 > svr4.nfs: 116 read 1024 bytes @ 64512
129    3.441828 ( 0.0115)    svr4.nfs > sun.7b22: reply ok 1124
130    4.125031 ( 0.6832)    sun.7b23 > svr4.nfs: 116 read 1024 bytes @ 65536
131    4.868593 ( 0.7436)    sun.7b24 > svr4.nfs: 116 read 1024 bytes @ 73728
132    4.993021 ( 0.1244)    sun.7b23 > svr4.nfs: 116 read 1024 bytes @ 65536
133    5.732217 ( 0.7392)    sun.7b24 > svr4.nfs: 116 read 1024 bytes @ 73728
134    6.732084 ( 0.9999)    sun.7b23 > svr4.nfs: 116 read 1024 bytes @ 65536
135    7.472098 ( 0.7400)    sun.7b24 > svr4.nfs: 116 read 1024 bytes @ 73728
136   10.211964 ( 2.7399)    sun.7b23 > svr4.nfs: 116 read 1024 bytes @ 65536
137   10.951960 ( 0.7400)    sun.7b24 > svr4.nfs: 116 read 1024 bytes @ 73728
138   17.171767 ( 6.2198)    sun.7b23 > svr4.nfs: 116 read 1024 bytes @ 65536
139   17.911762 ( 0.7400)    sun.7b24 > svr4.nfs: 116 read 1024 bytes @ 73728
140   31.092136 (13.1804)    sun.7b23 > svr4.nfs: 116 read 1024 bytes @ 65536
141   31.831432 ( 0.7393)    sun.7b24 > svr4.nfs: 116 read 1024 bytes @ 73728
142   51.090854 (19.2594)    sun.7b23 > svr4.nfs: 116 read 1024 bytes @ 65536
143   51.830939 ( 0.7401)    sun.7b24 > svr4.nfs: 116 read 1024 bytes @ 73728
144   71.090305 (19.2594)    sun.7b23 > svr4.nfs: 116 read 1024 bytes @ 65536
145   71.830155 ( 0.7398)    sun.7b24 > svr4.nfs: 116 read 1024 bytes @ 73728
                             ...
167  291.824285 ( 0.7400)    sun.7b24 > svr4.nfs: 116 read 1024 bytes @ 73728
168  311.083676 (19.2594)    sun.7b23 > svr4.nfs: 116 read 1024 bytes @ 65536
169  311.149476 ( 0.0658)    arp who-has sun tell svr4
170  311.150004 ( 0.0005)    arp reply sun is-at 8:0:20:3:f6:42
171  311.154852 ( 0.0048)    svr4.nfs > sun.7b23: reply ok 1124
172  311.156671 ( 0.0018)    sun.7b25 > svr4.nfs: 116 read 1024 bytes @ 66560
173  311.168926 ( 0.0123)    svr4.nfs > sun.7b25: reply ok 1124
Lines 1-10 correspond to the client opening the file. The operations are similar to those shown in Figure 29.7. In line 11 we see the first READ of the file, with 1024 bytes of data returned in line 12. This continues (a READ of 1024 followed by a reply of OK) through line 129.
In lines 130 and 131 we see two requests that time out and are retransmitted in lines 132 and 133. The first question is why are there two read requests, one starting at offset 65536 and the other starting at 73728? The client kernel has detected that the client application is performing sequential reads, and is trying to prefetch data blocks. (Most Unix kernels do this read-ahead.) The client kernel is also running multiple NFS block I/O daemons (biod processes) that try to generate multiple RPC requests on behalf of clients. One daemon is reading 8192 bytes starting at 65536 (in 1024-byte chunks) and the other is performing the read-ahead of 8192 bytes starting at 73728.
Client retransmissions occur in lines 130-168. In line 169 we see the server has rebooted, and it sends an ARP request before it can reply to the client's NFS request in line 168. The response to line 168 is sent in line 171. The client READ requests continue.
The client application never knows that the server crashes and reboots, and except for the 5-minute pause between lines 129 and 171, this server crash is transparent to the client.
To examine the timeout and retransmission interval in this example, realize that there are two client daemons with their own timeouts. The intervals for the first daemon (reading at offset 65536), rounded to two decimal places, are: 0.68, 0.87, 1.74, 3.48, 6.96, 13.92, 20.0, 20.0, 20.0, and so on. The intervals for the second daemon (reading at offset 73728) are the same (to two decimal places). It appears that these NFS clients are using a timeout that is a multiple of 0.875 seconds, with an upper bound of 20 seconds. After each timeout the retransmission interval is doubled: 0.875, 1.75, 3.5, 7.0, and 14.0.
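This bounded exponential backoff is easy to sketch. The function below (our own, for illustration) generates the doubling sequence with the 20-second cap that matches the intervals observed in the trace.

```python
def retransmit_intervals(base=0.875, cap=20.0, n=8):
    """Doubling retransmission backoff with an upper bound, matching the
    intervals observed in Figure 29.9 (0.875, 1.75, ... capped at 20 s)."""
    out, t = [], base
    for _ in range(n):
        out.append(min(t, cap))
        t *= 2                 # double after each timeout
    return out

assert retransmit_intervals() == [0.875, 1.75, 3.5, 7.0, 14.0, 20.0, 20.0, 20.0]
```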
How long does the client retransmit? The client has two options that affect this. First, if the server filesystem is mounted hard, the client retransmits forever, but if the server filesystem is mounted soft, the client gives up after a fixed number of retransmissions. Also, with a hard mount the client has the option of whether to let the user interrupt the infinite retransmissions or not. If the client host specifies interruptibility when it mounts the server's filesystem, then we don't have to wait 5 minutes for the server to reboot after it crashes: we can type our interrupt key to abort the client application.
An RPC procedure is called idempotent if it can be executed more than once by the server and still return the same result. For example, the NFS read procedure is idempotent. As we saw in Figure 29.9, the client just reissues a given READ call until it gets a response. In our example the reason for the retransmission was that the server had crashed. If the server hasn't crashed, and the RPC reply message is lost (since UDP is unreliable), the client just retransmits and the server performs the same READ again. The same portion of the same file is read again and sent back to the client.
This works because each READ request specifies the starting offset of the read. If there were an NFS procedure asking the server to read the next N bytes of a file, this wouldn't work. Unless the server is made stateful (as opposed to stateless), if a reply is lost and the client reissues the READ for the next N bytes, the result is different. This is why the NFS READ and WRITE procedures have the client specify the starting offset. The client maintains the state (the current offset of each file), not the server.
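The contrast can be shown concretely. In the sketch below (our own illustration, with in-memory "file" data), the stateless read with an explicit offset returns the same bytes when replayed, while a hypothetical stateful "read the next N bytes" returns different data on the retransmission:

```python
# Why explicit offsets make READ idempotent: replaying the same request
# yields the same bytes, while a stateful "read next N" would not.
FILE = b"abcdefghijklmnopqrstuvwxyz"

def nfs_read(offset, count):               # stateless: offset comes from the client
    return FILE[offset:offset + count]

first = nfs_read(0, 10)
assert nfs_read(0, 10) == first            # a retransmitted READ is harmless

class StatefulReader:                      # the alternative that NFS avoids
    def __init__(self):
        self.pos = 0                       # server-side state
    def read_next(self, count):
        data = FILE[self.pos:self.pos + count]
        self.pos += count
        return data

r = StatefulReader()
a = r.read_next(10)
assert r.read_next(10) != a                # a replayed call returns different data
```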
Unfortunately, not all filesystem operations are idempotent. For example, consider the following steps: the client NFS issues the REMOVE request to delete a file; the server NFS deletes the file and responds OK; the server's response is lost; the client NFS times out and retransmits the request; the server NFS can't find the file and responds with an error; the client application receives an error saying the file doesn't exist. This error return to the client application is wrong-the file did exist and was deleted.
The NFS operations that are idempotent are: GETATTR, STATFS, LOOKUP, READ, WRITE, READLINK, and READDIR. The procedures that are not idempotent are: CREATE, REMOVE, RENAME, LINK, SYMLINK, MKDIR, and RMDIR. SETATTR is normally idempotent, unless it's being used to truncate a file.
Since lost responses can always happen with UDP, NFS servers need a way to handle the nonidempotent operations. Most servers implement a recent-reply cache in which they store recent replies for the nonidempotent operations. Each time the server receives a request, it first checks this cache, and if a match is found, returns the previous reply instead of calling the NFS procedure again. [Juszczak 1989] provides details on this type of cache.
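A minimal sketch of such a cache, keyed only by XID for simplicity (a real cache, after [Juszczak 1989], also keys on the client's address and ages entries out), shows how the REMOVE scenario above is fixed: the retransmission replays the cached OK instead of re-executing the procedure and returning an error.

```python
# Minimal sketch of a recent-reply ("duplicate request") cache keyed by XID.
class NfsServer:
    def __init__(self):
        self.files = {"hello.c"}
        self.replies = {}                   # xid -> cached reply

    def remove(self, xid, name):
        if xid in self.replies:             # retransmission: replay, don't re-run
            return self.replies[xid]
        status = "OK" if name in self.files else "ENOENT"
        self.files.discard(name)            # perform the (nonidempotent) operation
        self.replies[xid] = status          # remember the reply
        return status

s = NfsServer()
assert s.remove(1, "hello.c") == "OK"       # first request deletes the file
assert s.remove(1, "hello.c") == "OK"       # lost reply, retransmit: cached OK
assert s.remove(2, "hello.c") == "ENOENT"   # a genuinely new request fails
```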
This concept of idempotent server procedures applies to any UDP-based application, not just NFS. The DNS, for example, provides an idempotent service. A DNS server can execute a resolver's request any number of times with no ill effects (other than wasted network resources).
29.7 NFS Version 3
During 1993 the specifications for version 3 of the NFS protocol were released [Sun Microsystems 1994]. Implementations are expected to become available during 1994.
Here we summarize the major differences between versions 2 and 3. We'll refer to the two as V2 and V3.
29.8 Summary

RPC is a way to build a client-server application so that it appears that the client just calls server procedures. All the networking details are hidden in the client and server stubs, which are generated for an application by the RPC package, and in the RPC library routines. We showed the format of the RPC call and reply messages, and mentioned that XDR is used to encode the values, allowing RPC clients and servers to run on machines with different architectures.
One of the most widely used RPC applications is Sun's NFS, a heterogeneous file access protocol that is widely implemented on hosts of all sizes. We looked at NFS and the way that it uses UDP and TCP. Fifteen procedures define the NFS Version 2 protocol.
A client's access to an NFS server starts with the mount protocol, returning a file handle to the client. The client can then access files on the server's filesystem using that file handle. Filenames are looked up on the server one element at a time, returning a new file handle for each element. The end result is a file handle for the file being referenced, which is used in subsequent reads and writes.
NFS tries to make all its procedures idempotent, so that the client can just reissue a request if the response gets lost. We saw an example of this with a client reading a file while the server crashed and rebooted.
Exercises

29.1 In Figure 29.7 we saw that tcpdump interpreted the packets as NFS requests and replies, printing the XID. Can tcpdump do this for any RPC request or reply?
29.2 On a Unix system, why do you think RPC server programs use ephemeral ports and not well-known ports?
29.3 An RPC client calls two server procedures. The first server procedure takes 5 seconds to execute, and the second procedure takes 1 second to execute. The client has a timeout of 4 seconds. Draw a time line of what's exchanged between the client and server. (Assume it takes no time for messages from the client to the server, and vice versa.)
29.4 What would happen in the example shown in Figure 29.9 if, while the NFS server were down, its Ethernet card were replaced?
29.5 When the server reboots in Figure 29.9, it handles the request starting at byte offset 65536 (lines 168 and 171), and then handles the next request starting at offset 66560 (lines 172 and 173). What happened to the request starting at offset 73728 (line 167)?
29.6 When we described idempotent NFS procedures we gave an example of a REMOVE reply being lost in the network. What happens in this case if TCP is used, instead of UDP?
29.7 If the NFS server used an ephemeral port instead of 2049, what would happen to an NFS client when the server crashes and reboots?
29.8 Reserved port numbers (Section 1.9) are scarce, since there are a maximum of 1023 per host. If an NFS server requires its clients to have reserved ports (which is common) and an NFS client using TCP mounts N filesystems on N different servers, does the client need a different reserved port number for each connection?