upstream

At work we’ve been struggling with a very annoying problem. We have a server running Windows Unified Data Storage Server 2003. This server holds all of the network home directories for our users, and is mounted via CIFS by all of the Windows XP computers in our computer labs. We have a computational server running Red Hat Enterprise Linux v4 AS which mounts the home directories from our file server using NFS. We plan to also have GNU/Linux clients in our computer labs at some point, and these will also (likely) use NFS to mount the home directories.

On the file server, we have a volume containing three directories: Faculty, Staff, and Student. Each of the three is a separate NFS share, so that NFS clients can mount only the home directories they need. On our computational server, we mount all three, as /home/faculty, /home/staff, and /home/student. We had some difficulty getting permissions to work correctly, so that a user sees the same permissions whether accessing the home directory from a Windows client or from a GNU/Linux client. That was ultimately resolved through a lot of trial and error.

The real problem we’re experiencing has to do with quotas. We enforce a 20 GB quota per user. Quotas work as expected from the Windows clients using CIFS. The GNU/Linux clients using NFS, however, exhibit really weird behavior: one can easily exceed the quota, but the data in excess of the quota is truncated, seemingly at random! Whether I was using dd if=/dev/zero of=zero to generate enormous files or simply copying the same batch of Knoppix DVD ISOs, the GNU/Linux clients permitted me to exceed my quota. On the GNU/Linux client, every file appeared to be the expected size. The directory listing on the file server, though, showed that the files were not the expected size! After some period (usually about 10 minutes) the GNU/Linux client “caught up” with the file server and correctly reported the truncated sizes (logging off and then logging on again also “refreshed” the GNU/Linux client).
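For anyone who wants to repeat this kind of test without dd, here is a minimal Python sketch of the same idea: write a known number of zero bytes, then compare that count with the size the filesystem reports. The function names write_zeros and size_mismatch are made up for this illustration, and the path in the note below is only an example.

```python
import os


def write_zeros(path, total_bytes, chunk_size=1024 * 1024):
    """Write total_bytes of zeros to path, roughly like dd if=/dev/zero.

    Returns the number of bytes handed to the file object.
    """
    chunk = bytes(chunk_size)  # a reusable buffer of zero bytes
    written = 0
    with open(path, "wb") as f:
        while written < total_bytes:
            n = min(chunk_size, total_bytes - written)
            written += f.write(chunk[:n])
    return written


def size_mismatch(path, expected_bytes):
    """Return (expected, reported) if the reported size differs, else None."""
    reported = os.stat(path).st_size
    if reported == expected_bytes:
        return None
    return (expected_bytes, reported)
```

Calling write_zeros on a file under the NFS-mounted home directory (for example a file named zero, as in the dd command) and then size_mismatch with the same byte count reproduces the check I did by hand. One caveat: in our case the size reported on the GNU/Linux client was exactly the thing that was wrong for about 10 minutes, so the comparison is only meaningful if it is repeated after the client has caught up, or run against the listing on the file server itself.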

The Windows clients terminate file write operations when the quota is hit, and this is the behavior we expected of the GNU/Linux clients, too. The silent truncation is a major problem for us. Many of our graduate students use the computational server for long-running jobs with extremely large data sets. Imagine spending several hours on a job that appears to work, when in fact the quota was exceeded and the data was silently discarded. Better to terminate early, when the quota is reached, than to fail silently and give the false appearance of success.
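A job can at least try to fail early on its own. The sketch below is one defensive pattern, not a fix for the underlying bug: it writes the data, calls fsync so that any error the server is willing to report surfaces as an exception, and then checks the file size before declaring success. NFS clients generally buffer writes, so errors such as an exceeded quota often show up only at fsync or close time. The name write_or_fail is hypothetical, and whether this would have caught our particular problem is an open question, since the server itself was misreporting how much data it had written.

```python
import os


def write_or_fail(path, data):
    """Write data to path and raise OSError if the write cannot be confirmed."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            n = os.write(fd, view)
            view = view[n:]
        # Push buffered data to the server; deferred write errors
        # (for example an exceeded quota) are raised here if reported.
        os.fsync(fd)
        size = os.fstat(fd).st_size
    finally:
        os.close(fd)
    if size != len(data):
        raise OSError(
            f"short write: expected {len(data)} bytes, file has {size}"
        )
    return size
```

A long-running job that calls something like this at each checkpoint would stop at the first write that cannot be confirmed, instead of running for hours and discovering the loss afterwards.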

I reported this problem to HP, the manufacturer of the server we purchased. After a few false starts getting routed to the appropriate engineering group, I finally started working with someone who took the time to really investigate the problem. We exchanged numerous emails. I sent him a variety of screenshots and packet captures, and patiently explained again and again what was happening. Finally, today, I received the following email:

"Our Engineers have identified the problem: The windows nfs server does not correctly report how much data has been written to the disk. We'll be reporting that to Microsoft."

If I’m lucky, I’ll be able to follow the status of the problem through HP, though to be honest I’m not expecting any feedback at all at this point. It’ll likely be included in some hotfix or service pack, and I won’t know about it until I read the release notes, or search microsoft.com for the appropriate terms. If only I could have access to a public bug tracker, with an RSS feed, so I could keep up-to-date on all the development related to this bug.

