Microsoft Windows Unified Data Storage Server 2003

published 2008-03-04

Last summer, my employer purchased a Hewlett Packard DL585 server running Microsoft Windows Unified Data Storage Server 2003. This server was selected and purchased to provide unified home directories to all our users across Windows and GNU/Linux machines. Windows users would connect via CIFS, and GNU/Linux users would connect via NFS. The idea was that a user logging in to a GNU/Linux machine would get the exact same home directory as when logging in from a Windows XP workstation.

I followed these instructions for integrating our GNU/Linux server into the Active Directory domain. The server is a computation server, to which our users connect via ssh. We had some trouble initially getting Active Directory authentication working. The VAR from whom we purchased the file server sent out a technician who spent two days on-site with us, one of which was spent almost entirely (on hold) on the phone with Microsoft support. At long last, we got all the pieces of the puzzle assembled, and users could authenticate to the GNU/Linux server using their Active Directory credentials.

After a few weeks, I had reports that a user lost some of their files while using the GNU/Linux server. It took me a couple of days of investigation to finally figure out what the problem was. When I did, I was horrified. You see, all of our users have a 20GB disk quota, meaning that they can save only 20GB of data into their home directory. The file server enforces this quota, and we had no problems with this configuration for our Windows workstations. Remember, Windows workstations communicate with the file server via the CIFS protocol. What I discovered was that the file server did not seem to be enforcing the quota for GNU/Linux machines connecting via the NFS protocol. At first, it looked like a user could exceed their quota – and indeed, they could for a short period of time, provided they stayed logged in. If the user logged out, or after a few minutes of inactivity, the file server would truncate to zero bytes those files that had exceeded the quota. What was worse, a file opened for writing before the quota was reached would also be truncated if it grew enough to push the user over their quota. What we had was silent data loss. The purpose of this server is long-running batch computations, some of which generate enormous amounts of data: the possibility that someone’s research job running for days or weeks could silently lose its dataset was catastrophically bad.
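For anyone writing job output to an NFS mount, here is a minimal sketch of a defensive write in Python. With default (asynchronous) NFS mounts, a write can appear to succeed and the error, such as "quota exceeded", may only be reported later, when the data is flushed by fsync or close. The function names `write_checked` and `is_quota_error` are my own illustrations, not part of any tool mentioned here, and I can't promise this would have caught the server bug described above; it only shows where a well-behaved client can learn about a failed write.

```python
import errno
import os


def is_quota_error(exc):
    """Return True if an OSError means 'quota exceeded' or 'no space left'."""
    return exc.errno in (errno.EDQUOT, errno.ENOSPC)


def write_checked(path, data):
    """Write bytes to path, flush them, and verify the resulting file size.

    Raises OSError if write, fsync, or close reports a failure, or if the
    file on disk is not the size we expect afterwards.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while len(view) > 0:
            written = os.write(fd, view)  # may write fewer bytes than asked
            view = view[written:]
        os.fsync(fd)  # force the data out; deferred errors can surface here
    finally:
        os.close(fd)  # close can also report a deferred write error

    size = os.stat(path).st_size
    if size != len(data):
        raise OSError(errno.EIO, "size mismatch after write", path)
    return size
```

A batch job could wrap each checkpoint in `try: write_checked(...)` and stop (or alert someone) when `is_quota_error` returns True, instead of carrying on for days against a file that may never reach the disk.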

We had a few problems getting HP support engaged properly, mostly due to the front-line support folks not really understanding my problem report, and dispatching me to the wrong group. After that was cleared up, we finally connected with a couple of level 3 support engineers who took complete ownership of the problem. They issued a few hotfixes for us (one to resolve bluescreens when you enable logging for NFS on the file server – yeah, that was a fun day), and walked us through a number of diagnostic steps. In the end, we collected several gigs worth of packet traces and memory dumps. Along the way, we identified a few more problems, for which additional hotfixes were issued. After a few weeks, the HP engineer confirmed for me that they were able to reproduce the quota problem for which I had originally called: the Microsoft NFS server code did not enforce quotas properly, resulting in silent data loss.

The issue was escalated upstream to Microsoft on November 6, 2007. As of today, Microsoft has confirmed that they can reproduce the problem, but they’re telling HP that they will not commit to a specific date by which the problem will be fixed. Microsoft suggests that we work around the problem by using the “sync” mount option for our NFS clients. Yes, that works, but it imposes a non-trivial performance penalty, which can be a real problem for the intended use of this server: lots of grad students crunching numbers and spooling data sets to disk for analysis.
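If you do adopt the “sync” workaround, it is worth confirming that every client really has it in effect. Below is a small sketch, assuming Linux clients where `/proc/mounts` lists one mount per line as device, mount point, filesystem type, and comma-separated options. The function name `nfs_mounts_without_sync` and the sample server name are hypothetical.

```python
def nfs_mounts_without_sync(mounts_text):
    """Return mount points of NFS mounts whose options lack 'sync'.

    mounts_text is expected to be in /proc/mounts format:
    device mount_point fs_type options dump pass
    """
    missing = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, fs_type, options = fields[1], fields[2], fields[3]
        # Split on commas so that 'async' is not mistaken for 'sync'.
        if fs_type in ("nfs", "nfs4") and "sync" not in options.split(","):
            missing.append(mount_point)
    return missing


# Illustrative input; 'fileserver' is a made-up host name.
sample = (
    "fileserver:/home /home nfs rw,sync,vers=3 0 0\n"
    "fileserver:/scratch /scratch nfs rw,vers=3 0 0\n"
    "/dev/sda1 / ext4 rw 0 0\n"
)
print(nfs_mounts_without_sync(sample))
```

On a real client you would pass in the contents of `/proc/mounts`, for example `nfs_mounts_without_sync(open("/proc/mounts").read())`, and treat any mount point it returns as still exposed to the problem.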

I’ve spent much of the last 24 hours in meetings with my coworkers and on conference calls with our VAR, HP’s escalation support manager, the level 3 HP engineer who owns this issue, HP pre-sales engineers, and HP product engineers trying to figure out how Hewlett Packard can resolve this situation for us. It’s clear they sold us something that doesn’t do what we all thought it would – and should – do. I’m relieved that HP is involved to this level to make things right. It’s a shame, though, that the real problem can’t be fixed upstream and pushed out to us. Instead, we’re looking at a complete overhaul of our storage solution, and a substantial new investment of time and energy.

We have two options on the table: an HP MSA or EVA series SAN with clustered GNU/Linux servers running Polyserve clustered file system for “Enterprise File System Clustered Gateway”, or one of the SAN backends with two independent servers running Windows and GNU/Linux for CIFS and NFS shares, respectively. The former option preserves our unified home directory configuration, while the latter pairs each client operating system with a server running the same operating system. The drawback to the separate server solution, though, is that users would have two separate home directories, and double the quota we had originally intended to give them (we’re ruling out the notion that GNU/Linux users split their quota capacity between servers, or that only folks who ask for it get a Linux account, or other such management nightmares). We don’t yet have final pricing on either option, because we’re looking at a buy-back situation for our current hardware, plus hopefully some modest discount to make up for the time and energy we’ve spent fighting this problem for so long. Price may well be the deciding factor.

What would you do, if you had to support 1,000+ students and about 100 faculty and staff?