When we replaced our NetApp filer cluster with a RHEL-based NFS cluster, we ran into a few surprises.
Export options

On our filer, we exported one mount point to most of our hosts as read-only, while a select group of clients could mount that same export read-write. We also had a second mount point that was read-write for all hosts.
This was not possible with our Linux-based NFS server. We could not have different export options for directories on the same filesystem. This was disappointing, but we were able to work around it.
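For what it's worth, the per-client split by itself is expressible in a stock /etc/exports; what we couldn't reproduce was differing options for directories on the same filesystem. A sketch, with hypothetical paths and host names:

```
# /etc/exports -- paths and host names are hypothetical
# One export: read-write for a select client, read-only for everyone else.
/vol/data    admin1.example.com(rw,sync)  *(ro,sync)
# A second export, read-write for all hosts.
/vol/shared  *(rw,sync)
```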
acpid and fencing
Fencing refers to one node in a cluster isolating another node. When the secondary node needs to pick up the NFS service because the primary has malfunctioned, you don’t want the primary to continue reading and writing to the disk array. So Red Hat Cluster Suite provides mechanisms for the secondary node to fence the primary. In our case, we’re using IPMI as the fencing mechanism. The secondary issues an IPMI power-off command to the primary, which should shut it down hard, the equivalent of pulling the power cord out of the primary.
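Outside the cluster software, the same hard power-off can be issued by hand with ipmitool; the BMC address and credentials below are hypothetical:

```shell
# Hard power-off via the primary's BMC -- roughly what the fence agent does
ipmitool -I lanplus -H 10.0.0.21 -U fenceuser -P secret chassis power off
# Confirm the node actually stayed down
ipmitool -I lanplus -H 10.0.0.21 -U fenceuser -P secret chassis power status
```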
We found that when the secondary would try to fence the primary, the primary would go down, but it would come back up, more like a soft reset than a hard power off.
The culprit? acpid. When we turned off this service, the IPMI commands worked as expected.
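On a RHEL system of that era, turning the service off looks like this:

```shell
service acpid stop     # stop it on the running system
chkconfig acpid off    # keep it from starting at boot
```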
Low default number of NFS processes
RHEL defaults to 8 nfsd processes, and we needed our server to handle about 30-40 clients. We bumped the number of processes up to 32 by editing /etc/sysconfig/nfs to set RPCNFSDCOUNT=32.
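The change itself is one line, followed by an NFS restart; you can sanity-check the running thread count afterwards via /proc:

```shell
# In /etc/sysconfig/nfs:
#   RPCNFSDCOUNT=32
service nfs restart
cat /proc/fs/nfsd/threads   # should now report 32
```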
Ethernet flow control

This was a huge problem for us. Our NFS server was chugging along just fine until we tried to do an rsync of the entire volume from our production server to our backup server, with the two machines on the same 1Gbps switch. As soon as the rsync process started sending data, the server stopped serving NFS data — all the connected clients hung trying to read data. We killed the rsync process, and the system became responsive again within about 30 seconds.
We later repeated the experiment using ‘cp’ instead of ‘rsync’. After a while, we saw the same thing — the server stopped sending NFS data.
On the advice of a Penguin engineer, we went into our switch management interface and turned off flow control on the NFS server’s switch ports.
Once we did that, we were able to run our rsync smoothly. We saw that we were pushing about 480Mbps during the rsync — that’s a ton of data any way you slice it, so it’s not hard to imagine that flow control might have kicked in at those data rates.
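We made the change on the switch side, but the equivalent knob exists on the Linux side as well (the interface name here is hypothetical):

```shell
ethtool -a eth0                            # show current pause-frame settings
ethtool -A eth0 autoneg off rx off tx off  # disable flow control on the NIC
```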
atime updates

By default, Linux filesystems record the last access time of every file that is read. Over NFS, this means the client writes back to the file server every time it reads a file. For a system that is very read-heavy and does not need accurate atime values, this leads to a lot of unnecessary writing, which incurs a big performance cost.
To keep the NFS client from sending atime updates back to the NFS server, you specify noatime and nodiratime in the /etc/fstab mount options.
One thing that might not be so obvious is that your NFS server will need these options specified when it mounts the local volume that it will be serving. Otherwise, you won’t get the full benefit of the optimization — you’ll eliminate the atime updates between NFS client and server, but the server will still be processing all the atime updates to disk, which will cut into your throughput.
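Putting both halves together, with hypothetical devices, paths, and a server name:

```
# /etc/fstab on the NFS server -- the local volume being exported
/dev/sdb1               /export/data  ext3  defaults,noatime,nodiratime  1 2

# /etc/fstab on each NFS client
nfsserver:/export/data  /mnt/data     nfs   defaults,noatime,nodiratime  0 0
```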
Caching of file attributes
The NFS specification allows clients to cache file attributes (like owner, modification time, etc.) for short periods of time. That means that not all of your clients will see changes made to a file at exactly the same moment. Generally, though, they will all see the changes within a few seconds (maybe 30 seconds max).
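That window is tunable per mount: actimeo sets all four attribute-cache timeouts (acregmin/acregmax/acdirmin/acdirmax) at once, and noac disables the cache entirely at a real performance cost. A hypothetical client fstab entry shrinking the window to three seconds:

```
nfsserver:/export/data  /mnt/data  nfs  defaults,actimeo=3  0 0
```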
But what we saw was much worse — clients reading stale data for 60 minutes or more. This seems to stem from a bug in the Linux NFS implementation: if you write to a file in one directory and then move it into another directory, on top of an older file with the same name, the client will never see the changes made to the file.
We had structured an application to build a set of files in a working directory, and then when they were all built, quickly move them into the live directory (we did this because it takes a while to write all the files, and we didn’t want clients seeing a partially written set of files). We obviously had to refactor the way this application did its file writing.