When I started working at OSU, our department had three physical facilities connected by a single virtual LAN. Network performance was acceptable most of the time, because most stations had only 100 megabit ethernet cards, and our uplink was no faster.
Last October we opened our new building, and phased out two of the previous buildings. This left us with two buildings on our VLAN. Our new building has all brand-new HP ProCurve switches providing gigabit ethernet to every port, 10 gig fiber between floors, and a gig uplink to the campus network. The new computers for our student labs all have gig ethernet cards. Unfortunately, the link connecting our two buildings was only 100 megabit.
Our new building houses all our public computer labs, plus all the faculty and staff. The network appliance providing our storage was in the other building, so all of our connections were funneled through the 100 megabit link, leading to extremely long login times and occasional bursts of network latency. Last week my boss and I finally moved the filer and remaining servers from the old building into the new building. One would think this would significantly improve our network performance, and it did, but one piece remained a significant bottleneck.
Prior to the move, we had a single /22 network for all our hosts. After the move, we acquired an additional /24 network (discontiguous from our primary pool). Our plan was to move all the student lab computers to the new /24, freeing up a large swath of IP addresses in our primary space. The campus network admins provisioned the new network for us and configured routing on our upstream router. It all worked as it needed to, but in a rather inefficient way: the /24 hosts need to connect to our filer in the /22 network, which means packets travel to the upstream router and then immediately back into our network. The switch that connects our building to the upstream router is handling twice as much traffic as it needs to, since it sends packets up to the router and then immediately receives those same packets again. Indeed, it was showing 100% utilization for most of the day today. I can only assume it’s been at 100% utilization since we moved the filer into our building (and likely even before that).
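To make the hairpin concrete: because the new /24 is not contained in the /22, a lab machine sees the filer as off-subnet and hands every packet to its gateway. Here's a quick sketch with Python's ipaddress module; all of the prefixes and addresses below are hypothetical stand-ins, not our real addressing.

```python
import ipaddress

# Hypothetical stand-ins for our real prefixes.
main_net = ipaddress.ip_network("10.20.0.0/22")  # original /22 (the filer lives here)
lab_net = ipaddress.ip_network("10.20.8.0/24")   # new, discontiguous /24 for the labs

filer = ipaddress.ip_address("10.20.1.10")       # hypothetical filer address
lab_host = ipaddress.ip_address("10.20.8.25")    # hypothetical lab machine

# The lab host and the filer are in different networks, so traffic between
# them must be routed. With only the upstream router doing layer 3, every
# packet hairpins out of the building and straight back in.
print(filer in main_net)           # True
print(lab_host in main_net)        # False -- off-subnet, so the gateway gets it
print(lab_net.overlaps(main_net))  # False -- the /24 really is discontiguous
```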
Our switches can do layer 3 routing, so it seems kind of silly to send all the traffic up a hop to a router only to have it come right back. Configuring our core switch to do the routing wasn’t hard at all. Getting this configuration deployed throughout our building proved slightly more complicated. My first attempt knocked all the student lab machines offline. I thought I could finalize the configuration quickly enough not to cause too many problems, but a swelling body of irate students outside my office door convinced me otherwise. My boss was not thrilled with my enthusiasm. I aborted my plans and set about preparing for a more graceful cut-over.
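For the curious, the heart of the change is small. On a ProCurve switch it looks something like the sketch below; the VLAN IDs, addresses, and next hop here are all hypothetical, not our actual configuration.

```
; Enable inter-VLAN (layer 3) routing on the core switch -- hypothetical values
ip routing

vlan 10
   ip address 10.20.0.1 255.255.252.0    ; gateway for the original /22
   exit

vlan 20
   ip address 10.20.8.1 255.255.255.0    ; gateway for the new lab /24
   exit

; Anything not local still goes up to the campus router
ip route 0.0.0.0 0.0.0.0 10.20.0.254
```

With the switch owning a gateway address on both VLANs, lab-to-filer traffic gets routed inside the building and never touches the uplink.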
Around 5:40 PM tonight we made the switch. I executed a few commands on each of the affected switches, and then updated a few routes on our primary servers. With the exception of one minor (and easily corrected) oversight, everything worked great. Within moments the switch connecting us to our upstream router was reporting less than 20% utilization. Now that we’re not hammering our upstream connection all day, every day, we’ll be able to detect and react to abnormal traffic spikes. And most importantly for our student population, they’re no longer all trying to share a single gigabit connection to our storage appliance!
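The server-side half of the cut-over was just repointing a route. On a Linux host it would be something along these lines (again, both addresses are hypothetical placeholders):

```
# Hypothetical: send traffic for the new lab /24 to the core switch's
# VLAN interface instead of letting it default via the upstream router.
ip route add 10.20.8.0/24 via 10.20.0.1
```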