OK, that's a good diagram.
Now you need to indicate where the hottest links are.
Where do your servers connect in? Mark the number of hosts per uplink (this is the sort of thing a switch port mapper can do).
For instance, your diagram shows Core1 Sw1 as having 8 switches hanging off the link! That's potentially 192 hosts.
How big is that link, and what type of traffic runs over it?
Where is your Internet gateway/proxy in relation to the diagram? Obviously this is where the largest amount of runt and odd-sized packets will flow. As you say, all of Site B has to cross the link to get to the internet, but how many more hops does a packet have to go through to reach it?
Obviously the shortest route with the fewest hops is the preferred one.
So, Client Node > Switch > Distribution Switch > Server is 3 hops, which is ideal, but your topology implies that some packets may need to make 5 or 6 hops before they reach a server or the gateway. That's a lot of latency.
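To make the hop counting concrete, here's a toy sketch (the node names and topology are made up for illustration, not taken from your diagram) that counts hops over an adjacency map with a breadth-first search:

```python
from collections import deque

def hop_count(links, src, dst):
    """Shortest number of hops (links crossed) from src to dst via BFS,
    or -1 if the two nodes are not connected."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, hops = queue.popleft()
        if node == dst:
            return hops
        for neighbour in links.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + 1))
    return -1

# Hypothetical topology: the ideal 3-hop path vs a daisy-chained 5-hop one.
topology = {
    "client":     ["edge"],
    "edge":       ["client", "dist"],
    "dist":       ["edge", "server", "sw1"],
    "server":     ["dist"],
    "sw1":        ["dist", "sw2"],
    "sw2":        ["sw1", "sw3"],
    "sw3":        ["sw2", "far-client"],
    "far-client": ["sw3"],
}
print(hop_count(topology, "client", "server"))      # 3
print(hop_count(topology, "far-client", "server"))  # 5
```

Every extra entry in the daisy chain is another store-and-forward delay on every packet.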
As you're all L2 it's also one big broadcast domain, so if one node throws its toys out of the pram you have the Blackpool Illuminations to deal with.
The 48Gb throughput capability you referred to is just 24 x 1Gb ports at full duplex. That's not really throughput, is it? That's the capacity.
Switching throughput is determined by the number of simultaneous conversations the switch can pass.
A single 24-port switch can hold 12 separate conversations at the same time, e.g. ports 1-2, 3-4, 5-6 and so on, or any combination of the active ports.
This is where chassis switches and stacks come into their own, because the fabric between all of the ports allows more conversations between available ports before you need to break out onto an uplink.
(Note: some switch vendors claim switch fabrics of greater than 700-800Gb/s, but of course that is only with the chassis fitted with the maximum number of ports, and with 10GbE slots the numbers go even higher, so don't believe everything you read in the brochure!)
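The capacity-vs-throughput distinction boils down to simple arithmetic. A sketch with illustrative numbers (not any particular vendor's spec):

```python
def full_duplex_capacity_gbps(ports, port_speed_gbps=1):
    """The brochure number: every port transmitting and receiving at
    line rate simultaneously, so ports x speed x 2."""
    return ports * port_speed_gbps * 2

def max_conversations(ports):
    """Ports pair off one-to-one, so a standalone switch can carry at
    most ports // 2 simultaneous port-to-port conversations."""
    return ports // 2

print(full_duplex_capacity_gbps(24))  # 48 -- the "48Gb" capacity figure
print(max_conversations(24))          # 12 -- simultaneous conversations
```

The capacity figure doubles with port count; the conversation count is what actually limits a busy switch.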
Your L2 topology will always be blighted by your interconnecting links and the broadcast traffic.
Maximising the trunks (normally limited to 4 links per LAG) will help, but say you have 3 switches all linked together: that's 72 ports.
4 ports lost to trunk 1 on Switch A, 8 ports lost to trunks 2 & 3 on Switch B, and another 4 lost to trunk 4 on Switch C.
You have lost 16 ports to switch interconnects before connecting a server or distributing packets elsewhere in the campus!
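The port arithmetic for that three-switch daisy chain, as a quick sketch:

```python
ports_per_switch = 24
switches = 3
ports_per_lag = 4

# Daisy chain A - B - C: one LAG on A, two on B (one towards each
# neighbour), one on C.
lags_on_each = [1, 2, 1]
ports_lost = ports_per_lag * sum(lags_on_each)
usable = switches * ports_per_switch - ports_lost

print(ports_lost)  # 16 ports burned on interconnects
print(usable)      # 56 ports left for hosts and servers
```

Add a fourth switch to the chain and you lose another 8 ports; the overhead grows with every box you link in.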
That's why you really need an L3 core/aggregation switch, preferably stacked or in a chassis.
All ports then appear to be on the same switch, and as many ports as you like are available to uplink to all corners of the campus and to your servers.
For inter-site connections between your two switch arrays use a 4 x 1Gb trunk or 2 x 10GbE, and watch out for the physical limits of short-reach fibre modules and multimode cabling.
As for the burned-out D-Links, these suffered the same fate as many vendors who got caught out by the old capacitor cancer.
Caps would leak and blow; on the next power cycle they would either crowbar the PSU and self-destruct or send all sorts of spurious AC into the switch's delicate circuits, causing untold mayhem.
Fortunately most of the vendors have learned from their capacitor mistakes, and the later offerings from the budget vendors are much improved, backed by lifetime warranties. I call it the Skoda effect...
Heh, thankfully none of the D-Links are actually burned out - I'm a bit of a self-confessed expert on capacitor plague and can spot it a mile off! These switches are only circa 6 months old and were bought in on no budget to replace the shedload of failing ProCurves (at the time we didn't realise they were on a lifetime warranty) and 3Com 4400s (which oddly enough were stackable). Fortunately they were just released at the time of purchase, so I suspect the capacitor plague issue won't exist for the DGS1210 models - however the 1224Ts may be another story.
Basically the selling point was that we could buy 30 of them for under 3 grand, fibre GBICs thrown in. At the time the idea was to sort the failures, which it did, and keep us running ideally for a year until we could afford to sort the cores out. They will be absolutely fine as edge devices.
It looks like we may be able to free some of our training budget to get a couple of decent core switches, although that's a "hope", so I'm looking at some of the more basic Cisco L3 stuff. Another concern is that this basic L3 kit is Linksys - manufactured cheaper, by cheaper people, with a Cisco badge on. Unlike the Skoda, though, I'm concerned about their quality at £800 a pop.
The internet router is on the right-hand site (Site 2). The link between the sites is a single gigabit fibre, although it is 8 cores and we have recently had the ends terminated; however there's a break at an end somewhere and we don't have the funds to get it solved properly. Once that's fixed we'd at least trunk it up.
Thanks for the description of throughput - that's something that's been bugging me, as it's sometimes next to impossible to get any figures out of sales/manufacturer sites. The Cisco switches I'm looking at are the SG300-52, thinking one per site as the main core. Throughput compared to the other switches at the small-business end is a fair way higher (104Gbps as opposed to 17.6) at a reasonable price. Any opinions on these?
Only to say that our core is 4x stacked HP E5500GIs. These have now been superseded by the HP E4800s, which we run as near-edge switches. IIRC the 4800s have a switching capacity of 190Gb/s and a throughput of 100 million packets per second. I think you should be looking at higher than that for reliable 1Gb/s to the desktop (depending upon the size of your network). Having said that, one of the things we do to keep traffic away from the core is to do some of the routing on the near edge; that is to say, the near-edge switches are the default route for the subsection VLANs that they control, so only traffic destined for other VLANs ever reaches the core and the core needs to do less VLAN routing.
Originally Posted by synaesthesia
Dammit, it's difficult trying to find that compromise between price and performance/reliability.
Spent the evening narrowing down to the following:
HP V1910-48G - Layer 3 switch with 104Gbps switching capacity, ~£400. Seems a bit cheap, especially for L3. I know the line between modern L2 kit and L3 is a fine one, but that still concerns me.
D-Link DGS-3100-48 - 116Gbps switching capacity, Layer 2. I know it's D-Link, but really the reliability so far has been good other than this little issue. About £730 for this one.
ZyXEL GS2200-48 - again only a Layer 2 switch, with 100Gbps capacity and a 74.4Mpps forwarding rate. Around £700.
And the Cisco SG300-52. Around £650, specs as mentioned previously.
As you can see the prices are very constrained - but we need to work with what we have. Trust me, both myself and my NM have seen some of the budgets thrown around these forums, and we've been shocked at just how much money some of you guys have to spend for far smaller schools - and some still complain it's not enough!
I intend to spend evenings for the rest of this week learning the art of VLANs - although my understanding so far is basic (let's call it a virtual switch and be done with it), I keep hearing that they can help us cut traffic going to irrelevant places, yet I don't understand how they bridge the correct traffic. Much to learn, so little time to learn it in. And of course, those budget restrictions also throttle any training ;)
Hmm. After a bit of reading up on various posts here and elsewhere, I'm beginning to suspect that we probably need to do a bit of VLANning. I know there's a lot of broadcast traffic generated by our printers, so I might spend some time tomorrow with Wireshark, mirroring the server ports and one of the printers. It's all flat currently, and it may well make sense to VLAN off the servers and printers. But before I even consider that, I/we need to actually learn about it as above!
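For what it's worth, the "virtual switch" mental model cashes out on the wire as a 4-byte 802.1Q tag inserted into each Ethernet frame after the source MAC; a VLAN-aware switch then floods broadcasts only to ports carrying the same VLAN ID. A minimal sketch of building that tag (the field layout is from the 802.1Q standard; the helper name is my own):

```python
import struct

TPID_8021Q = 0x8100  # tag protocol identifier marking a tagged frame

def vlan_tag(vid, pcp=0, dei=0):
    """4-byte 802.1Q tag: 16-bit TPID, then 3-bit priority (PCP),
    1-bit drop-eligible (DEI) and 12-bit VLAN ID packed into 16 bits."""
    if not 1 <= vid <= 4094:          # 0 and 4095 are reserved
        raise ValueError("VLAN ID must be 1-4094")
    tci = (pcp << 13) | (dei << 12) | vid
    return struct.pack("!HH", TPID_8021Q, tci)

print(vlan_tag(10).hex())          # 8100000a -- frame tagged for VLAN 10
print(vlan_tag(100, pcp=5).hex())  # 8100a064 -- VLAN 100 at priority 5
```

So a broadcast from a printer on VLAN 10 never leaves the set of ports tagged or untagged for VLAN 10, which is exactly how the segmentation cuts traffic going to irrelevant places.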
Speaking my mind, but I honestly think you will need to fess up to SLT and tell them that they will need to put in more in order to get the reliability. My mantra has always been to invest in the core network infrastructure first and everything else second. Stuff like major Microsoft upgrades, and the hardware updates that precede them, should (IMO) always play second fiddle to the core network - chances are you'd be able to afford it as well with less regular MS payments, for example by perpetual licensing.
Originally Posted by synaesthesia
VLANs will help in the long run, but it is swings and roundabouts - you'll need a decent core to cope with the VLANs, because the core switches still have to make some of the decisions about the broadcast traffic. As I said in my earlier post, you can offload some of that onto the near edge, but you will then need near-edge and core switches that can cope with VLANs and routing protocols.
Originally Posted by synaesthesia
I'd want to start by keeping the servers and printers on the existing network and putting different areas of the school on separate VLANs; that keeps things simpler than trying to move all the printers/servers to another network.
If the money really isn't available now then I would try and choose the best switches you can (I've been brought up on Cisco, 3Com and now HP) with a view to moving them to the near edge and replacing them at the earliest opportunity. Keeping the same manufacturer is always good, but I would be inclined to leave the D-Links at the edge in this case and look towards HP/Cisco/Juniper for the core/near edge.
How deep into Site 2?
Originally Posted by synaesthesia
This is typical of the type of issues you have described.
The fact that your site-to-site link is a single fibre is at the very root of your problem. A gigabit it may be, but that won't help you much with 50,000 conversations running across it, all throwing 2k packets into the mix.
Like I said, now mark on your map exactly where your file servers and your gateway are located, and imagine you were a node on the switch farthest from the internet.
How many hops is it from your node to the gateway?
Then look at a node that is farthest from your file servers: how many hops from your workstation to the profiles folder?
Finally, imagine a packet from the node farthest from the internet gateway and a packet from the node farthest from the file server: where do they collide? How many segments do they have to share to reach each other's destinations?
By trunking the inter-site links you not only get much higher bandwidth but also the ability to carry more bi-directional traffic across those links. You have effectively widened the M1 from London to Birmingham :D easing congestion and allowing more traffic to get from A to B without collisions. File server traffic can pass at the same time as internet traffic.
With your single fibre, despite it being 1Gb, packets will be queuing up to pass each other on the ingress/egress ports, and VLANs will not help you as they are just tagged packets waiting in the same queues.
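A back-of-envelope sketch of why the trunk helps (idealised: real LAGs hash traffic per flow, so a single flow still tops out at one member link's speed, but many concurrent flows spread out):

```python
def transfer_seconds(gigabytes, links, link_gbps=1.0):
    """Best-case transfer time when many flows hash evenly across all
    LAG member links; aggregate bandwidth is links x link speed."""
    return gigabytes * 8 / (links * link_gbps)

print(transfer_seconds(100, links=1))  # 800.0 s over the single fibre
print(transfer_seconds(100, links=4))  # 200.0 s over a 4 x 1Gb trunk
```

The queuing delay on a congested single link is what your users feel as latency, and tagging the packets with VLAN IDs does nothing to shorten that queue.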
I would probably use something like a Netgear GSM7328S or 7352S on each side of the campus and stack them using 10GbE (existing fibre permitting); they also support up to 8 links per trunk for really busy uplinks without going fibre.
The HP 5400zl is another good option but costs a lot more per port than the Netgear solution.
Use 10GbE direct-attach cables to get your physical and virtual host servers into the core; these work out a lot cheaper than buying SFP+ optical modules and OM3 patch cords.
The D-Link issue we had was not capacitor related; they worked fine for a short while after a power reset in all temperature ranges (within reason), and we opened a couple up to check. It is a quality issue.
The HP V series is the budget line with no lifetime warranty, but probably your best bet from a bad bunch. I agree with Cybernerd though: go to SMT and tell them they are joking. You are buying yourself more problems down the line by not doing it properly, and if they are going to force you down that road they need to understand that the fail that follows is their fault.
I would not touch Linksys or D-Link. We have also had the same random port-fail issues over time with D-Link's higher-end devices like the 3100 series.
If you still have a stack of ProCurves that you got replaced under warranty, it may be worth looking at putting more of them into service, depending on what model they are.
VLANs are only going to help if your traffic can be easily segmented (servers in multiple areas) and could even lead to an increase in traffic. The real benefit is in limiting the scope of failures like loops, and in limiting broadcasts. If you still have D-Links in the mix resetting their configs every so often, be ready for things to collapse if they are important to your VLAN structure.
Ok, cheers. We could moan at SMT as much as we liked, but it's completely beside the point - it's not that *we* as an ICT department don't have the money, it's that *we* as an entire school do not have the money. I suspect we may plump for the HP Layer 3 (budget) switches - as said, we're basically paying for this out of money that we have personally made the school by supporting others. Effectively that's money *I* personally make the school that we allocate for training, so I can assure you this is something we're not taking lightly.
Part of the plan with our new network (vanilla system) is to split DHCP responsibilities per site, so the DC at Site 2 dishes out DHCP there and the PDC dishes it out at Site 1.
All core services are directly connected to the cores at both sites - that goes for both sites and servers, with the exception of a couple of our backup fileservers. Routers, main servers, virtualised systems etc. are at the core.
I may well expand that diagram (it's made in D-Link's management software, mostly automatically) to include a rough idea of nodes. There are around 480 stations in total (both sites), 3 DCs, 2 ESXi systems and about 4 other random servers - which reminds me, I need to be clearing off to upgrade the RAM in one of the DCs!
Good. So many times it is a case of the school not prioritising the funds to get the systems that it needs, and then complaining that the stuff they got for the meagre budget they offered is not quite up to it.
Get a good idea of your traffic patterns, and consider allocating resources closer to their consumption - a media server near media studies, for example, will take a lot of the load off your network.
You could also look into SNMP monitoring of the switches (if they support it to a decent level) to tell you how much load each link is experiencing and how much memory/CPU is being used on the switches throughout the day, to look for obvious bottlenecks. I think Nagios does this, and lots of the monitoring packages do it for a cost.
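If you do go the SNMP route, the per-link load figure is just the delta between two samples of the interface octet counters, being mindful of 32-bit counter wrap. A sketch of the calculation only (actually polling the counters would need an SNMP library; this is just the maths):

```python
def link_utilisation_pct(octets_t0, octets_t1, interval_s,
                         link_bps=1_000_000_000, counter_bits=32):
    """Percent utilisation from two samples of an SNMP ifInOctets-style
    counter; the modulo copes with a single counter wrap between samples."""
    delta_octets = (octets_t1 - octets_t0) % (2 ** counter_bits)
    return 100.0 * delta_octets * 8 / (interval_s * link_bps)

# e.g. 1.5 GB of traffic in a 5-minute polling interval on a 1Gb/s link:
print(link_utilisation_pct(0, 1_500_000_000, 300))  # 4.0 (percent)
```

Poll every few minutes, graph the result per uplink, and the bottleneck links stand out quickly.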
I'm surprised that 480 stations are taking such a toll on the network, as in the greater scheme of things it's not that many.
I'd update your diagram with how many stations are in each area and then look at where those stations have to go to get their data. It could be suboptimal pathing. If you have implemented STP and have multiple redundant links but have not set up proper priorities on the switches, it could also be creating a shambles, with traffic being directed through many switches unnecessarily if STP thinks that some random edge switch is the root (and thus tries to make all traffic flow through there).
Cheers Synack, much appreciated. The D-Link software (D-View) I mentioned works via SNMP and automatically tells us which switch goes where. I've just pulled a single switch from a location that was losing its settings (notably SNMP) and am using it on my own station now to keep an eye on things - wonder if that was the actual problem! If so, I will be going to D-Link and saying that as we've invested so heavily, could we please be furnished with a full copy of D-View ;) It's actually a damned good bit of software.
I shall be spending today using that as a starting point and building up exactly what is connected where. It's not a lot of stations, but they're very well spread out, hence why we're having such a hard time narrowing it all down; being split-site doesn't help the matter much either!
I'm currently Wiresharking the entire traffic of the old core switch (which still has the PDC on it) via a dedicated monitoring card, to see what's going to and from it. Could be an interesting view!
Thanks for all the help so far everyone, it's been a slow plod through systematically checking everything but I think we'll get there!
I'm confused - have you sorted the auto-negotiate problem out? I can't see why too big a broadcast domain, slow backbone links, or any other factor (given that he hasn't had this problem in the past) would affect the negotiation of port speed between an edge switch and a workstation, unless I'm seriously missing something.
Have you tried isolating a switch and seeing if the problem persists? Personally my money's on a dodgy switch, cable, config, etc, rather than the topology.
That is almost a given - we're fairly certain that is the case, but it's a bit of a pain having to wait until the next morning to find out whether it works or not! We will be isolating entire sections as a last resort.
As helpful as D-Link are being, I'm now a bit annoyed that they are shrugging their shoulders saying "We've never heard of such problems", yet their own forums are chock-full of identical symptoms even on tiny networks. I'm now threatening them big time with a mass RMA - wonder how they'll take that!
Oh OK, I was just a bit confused why the discussion had moved away from your problem and onto broadcast domains etc.
What about your spare switches - any scope for replacing an affected area with one and then experimenting on the "known faulty" switch? Then try breaking the link to the network, remaking it, making a trunk, etc.