10th May 2008, 09:52 AM #1
[SOLVED] Entire Network Down - 100% Network Utilization - Please Help!
Alright guys. So as you can imagine, network goes down, I'm hoping to get this resolved soon. Here's what happened.
It's 12:30 P.M. [Do YOU know what your network's doing?]. Everything is running great. 3 main switches in the server room, 48-port managed switches [Dell PowerConnect 3348s], and all is well. Everything functions normally, all servers are online, all desktops are happily happenin'. I go out to get a shipment of 50 new machines that arrived and start piling them outside my office. Next thing I know, I have lots of requests saying the entire network is down and nobody can access anything. I quickly head back to the server room wondering if a UPS went down, if a server restarted, if a switch turned off, anything. But what do I see? Absolutely nothing out of the ordinary. Everything looks like it's functioning great.
But wait...no it's not. I can't get on the internet, I can't ping ANY computers, I can't remote desktop into the servers, the RDC sessions I DO have open with servers all fail, and everything is extremely slow. So I call the school board office. They head over with their handy $18,000 Fluke meter, plug it into one of our switches, and it quickly reports 100% network utilization. The guy from the board office goes WHOA!!! I've never EVER seen it that high before.
So we try swapping the first switch in the stack under suspicion it may be bad. We put in a 3448, Dell's next model of the 48 port 10/100 PowerConnect switch and take out the 3348. We use patch cables to link them together in a chain setup and see if that works.
In the end, the switch switch [lol] did nothing. I still can't ping any machine in the school or get out of the network. I checked our main router, and it's functioning normally. I restarted the servers and they all appear to be functioning normally. So I think to myself: what would cause 100% network utilization? I noticed that only 1 out of 4 pings got a reply, so I knew the infrastructure itself was probably OK, but I suspected that something was looping back.
So off I go around the network, documenting every single wall jack and port in the school [took 7 hours] and checking EVERY switch we have for any possible loopback, like a jack plugged into a switch with another port on that switch plugged into another jack. I also shut down every machine and every network printer I came across so the network would have essentially nothing broadcasting [except NICs that stay powered, but with the machine off there's less chance of anything malicious running on it]. Nothing in any of the labs was wired like that. Every jack had either a direct connection to a PC, or a connection to a switch that contained only other connections to PCs, not back to a wall jack.
So here I am guys. It's Friday night, school's not in for the weekend, so I've got 2 days to work and go in for some overtime. Anyone have any suggestions of what to try next? Thanks a ton you guys. I really appreciate this community and hope we can get this resolved!
-Windows NT Based Network running Windows Server 2003 servers and 250 XP Professional client stations.
-6 Windows Server 2003 servers total
-3 main switches in server room, 7 others around school. All checked thoroughly and restarted.
Take care and have a good weekend. We have multiple IPs at the school, so since I can't access the internet inside the school because of the extreme lag, if I go in I'll plug my laptop into the main external switch connected to the modem and grab an IP so I can check EduGeek.
Last edited by link470; 11th July 2008 at 03:29 AM.
10th May 2008, 10:05 AM #2
First thing I would do is chuck Wireshark (open source) on a laptop and set it up to sniff network packets. This will tell you what kind of packets are flooding the network, and then you will have a better idea of how to deal with them.
It could be a faulty NIC in one of the servers, or even in one of the powered-off PCs, as the NICs remain live. The good news is that if it is a faulty NIC, there is a good chance its gibberish broadcasts will contain its MAC address, which may help you narrow it down.
If it is just spewing rubbish then I would separate the switches and do a Wireshark packet sniff on each one individually to see if you can track down which switch contains the offending device.
This should help you narrow it down and make the task easier. If your switches are managed you may be able to grab a port utilization reading off them to see if one is generating lots of traffic. You could also check for transmission errors on each port, as a faulty NIC or device may generate errors when it garbles a packet too badly.
Hope this helps. Good Luck with your hunt.
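To illustrate the idea of narrowing it down by source MAC: here is a minimal sketch (not anything from the thread) that tallies broadcast frames per sender, e.g. over src/dst pairs exported from a Wireshark capture. The MAC addresses below are made up for illustration.

```python
# Sketch (assumption): tally broadcast frames per source MAC from a
# capture; a flooding NIC will dominate the count. The toy "capture"
# below is invented for illustration.
from collections import Counter

BROADCAST = "ff:ff:ff:ff:ff:ff"

def top_broadcasters(frames, n=5):
    """frames: iterable of (src_mac, dst_mac) pairs from the capture."""
    counts = Counter(src for src, dst in frames if dst.lower() == BROADCAST)
    return counts.most_common(n)

capture = [
    ("00:11:22:33:44:55", BROADCAST),
    ("00:11:22:33:44:55", BROADCAST),
    ("aa:bb:cc:00:00:01", "00:11:22:33:44:55"),
]
print(top_broadcasters(capture))  # the chatty 00:11:... NIC tops the list
```

In a real storm the top entry typically dwarfs everything else, which is exactly the MAC you then chase back through the switches.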
10th May 2008, 10:15 AM #3
Thank you very much! I'll try that tomorrow. Much appreciated!
Any other ideas anyone to add to the list?
10th May 2008, 10:18 AM #4
Start from the simplest point and work outwards. Get some monitoring software on your laptop if you can't borrow the Fluke again. Something like Snoop is good, but I would also look at using The Dude to help monitor devices as they come online.
All network hardware, desktops and servers turned off.
Turn on your core switch(es) ... plug in your laptop. Send a few pings to it and your router. Turn on your DCs ... and leave snoop running to see what traffic there is.
Then start up each server ... monitor for 5 minutes between each one.
Now that your servers are up, I would remove the uplinks to each edge switch before turning them all on. Plug in one uplink at a time and monitor. Some people prefer to have all the desktops / devices turned on at the same time so you can check both devices and network hardware at once; others prefer to go slowly so they have a benchmark of what 'normal' traffic looks like.
Again, some prefer to test one edge switch and then unplug it to test another ... others prefer to leave the tested ones connected.
As you slowly start everything up you will see whatever is causing the problem jump in. The above is just a logical way of narrowing down the issue, but the Fluke should have been able to tell you what devices the traffic was originating from or what the destination is. Snoop will also do this for you ... it could save you some time.
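The "ping it between each power-on step" part of the staged bring-up lends itself to a small helper. This is only a sketch: the addresses are placeholders, and the `-c`/`-W` flags assume Linux iputils ping (they differ on other platforms).

```python
# Sketch (assumption): between each power-on step, ping the hosts that
# should already be up. Addresses are placeholders; ping flags assume
# Linux iputils.
import shutil
import subprocess

def reachable(host, count=2, timeout=2):
    """Return True if `host` answers ping (shells out to the system ping)."""
    result = subprocess.run(
        ["ping", "-c", str(count), "-W", str(timeout), host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0

if shutil.which("ping"):  # skip the demo if no ping binary is available
    for host in ("192.168.0.1", "192.168.0.10"):  # e.g. router, first DC
        print(host, "up" if reachable(host) else "DOWN")
```

Run it after each server or uplink comes online; the first step where a previously "up" host goes "DOWN" brackets the offending device.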
Things to check ... Spanning Tree ... make sure it is on. If you have some bright spark who has plugged in a loopback then this can cause problems ... I did visit one school where students (after reading up on network design and a teacher mentioning this) decided to loop over 100 ports. Not fun.
STP is a good way to stop the problem if this is the case, but it does not help you find where exactly the loop is ... systematic checks such as the above will.
Other causes ... virus attacks, failing NICs broadcasting like hell, switches needing firmware / OS upgrades.
10th May 2008, 10:54 AM #5
lol, awesome. Sounds good, thanks for the advice! I'll add that to the to-do list.
10th May 2008, 11:07 AM #6
Might be looking in the wrong area, but I've found this has caused me problems before (not quite like yours, but close).
Your core switches, if they are managed... check no bugger is using their IP. (We had someone manually set their IP once; it clashed with one of our switches, and that switch just threw a paddy and practically died.)
11th May 2008, 12:58 AM #7
Can you ping two machines that have static ip addresses?
I mean go to one server and see if you can ping another.
Last edited by FN-GM; 11th May 2008 at 01:00 AM.
11th May 2008, 06:10 AM #8
I couldn't originally.
Thank you all for your replies. Much appreciated! I ended up thinking over what you all said, took a laptop into work, ran Wireshark, and found a TON of packets, like, in the 100,000 range almost instantly. I ended up separating our switch stacks, isolated it to one switch, and that switch was looped into another switch...twice. Everything is back up and running after disconnecting just one of those cables.
What's strange is I think it's been like that for quite a while and nothing ever happened before. I may be wrong, but does this sound possible? As of now, the entire network is up and running again, and I thank you all so much for your support and quick suggestions and replies. I'm just chillin' at home now, very happy, but still wondering whether there could have been a delay and the broadcast storm just didn't take hold until later. The setup was that the main switch [switch 1 of 3 switches connected together via gigabit uplinks] was plugged into a spare 4th switch down below that the previous tech had used because there weren't enough ports to plug everything in [the patch panel had more ports than the switches in that room could support]. Only 4 things were plugged into it: 2 were patch panel connections for wall jacks around the school, and 2 were redundant connections into switch 1. After removing 1 of those, everything worked again.
Any ideas whether it's possible for a delay like that to happen and for it not to become a problem until now? Any idea what triggered it so suddenly?
Either way, it's up and running. Thanks a ton! I love this place.
11th May 2008, 08:52 AM #9
It can take a while for enough broadcasts to build up to cause a problem; if your network is well segmented and has a low amount of broadcast traffic, it could take some time for the system to drown.
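A toy model of that buildup, with made-up numbers: with a physical loop and no Spanning Tree, each forwarding pass around the loop re-forwards every broadcast frame already circulating, so even a trickle of new broadcasts eventually saturates the wire.

```python
# Toy model (assumption): in a simple two-link loop with no STP, each
# forwarding pass re-floods everything already circulating (modelled as
# a doubling) plus whatever new broadcasts arrived. Numbers are invented
# purely to show the growth shape.
def frames_on_wire(new_per_pass, passes):
    """Frames in flight after `passes` forwarding passes round the loop."""
    total = 0
    for _ in range(passes):
        total = total * 2 + new_per_pass  # loop re-floods, plus fresh frames
    return total

# Even 1 new broadcast per pass explodes quickly:
for p in (1, 5, 10, 20):
    print(p, frames_on_wire(1, p))  # 1, 31, 1023, 1048575
```

The point of the sketch is the shape, not the numbers: growth is exponential, so a quiet, well-segmented network can sit on a dormant loop for a long time and then drown very fast once broadcast traffic picks up.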
This kind of thing happened on one of my networks after the school accepted the supplier's offer to install a switch themselves for free (arghh). They replaced the existing switch that was linked by two trunked 1Gb ports, but managed to wipe the configuration of the main switch (hard reset). I had a rather purposeful chat with both the school in question and the so-called 'professional' suppliers about that one.
If one of your switches did have Spanning Tree on previously but some event reset the switch to defaults, this could have occurred.
Good to hear that you got it solved.
12th May 2008, 10:23 AM #10
We had a similar issue on Thursday of last week: the core switch was locked solid, but pings were getting through sometimes.
We have fibre links back to the core from all other stacks, so it was a quick fix to isolate the area.
After examining the affected area we traced the issue to a wall port that had nothing connected to it. On further investigation we found that RATS had chewed the cables and caused them to short.
This freaked out the teacher, and pest controllers were called in.
12th May 2008, 10:49 AM #11
If one of our delightful pupils manages to create a loopback on one of the small 4-port switches in a room, it can take hours to disrupt the entire system.
13th May 2008, 10:32 AM #12
LOL. That's the kind of thing you don't want to laugh at while you're trying to find it, and it's actually quite annoying, but once you do, that's a totally awesome story to keep lol.
14th November 2008, 06:08 PM #13
When this happened to us the other day we went to our central backbone switches and did the following:
1. Unplugged each network cable one at a time from the backbones. About 7 cables in, the backbone switches calmed down. So we figured it was number 7, which linked to another switch (the Design Tech building).
2. The Design Tech switch was still going like the clappers, so we went over to that building.
3. In that building we took each cable out one at a time. About 14 into the 24 ports, the switch calmed down, so we traced cable 14 back to one of the rooms.
4. In that room we discovered a network cable doubled back on itself. We pulled it out and plugged things back in gradually.
This process of starting from the inside and working outwards seemed to work for us: narrowing down the buildings and ruling out other switches.
I may have missed some stuff out - cannot remember exactly what we did, but it was along these lines.
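The steps above amount to a linear search per switch: pull cables one at a time until utilization drops, then follow that cable to the next switch and repeat. A minimal sketch of that logic, where `carries_storm(port)` stands in for the physical test (pull the cable, watch the LEDs) and the port numbers are just the ones from this story:

```python
# Sketch (assumption): the unplug-one-cable-at-a-time hunt as a function.
# `carries_storm(port)` stands in for the manual check: pull the cable
# and see whether the switch calms down.
def find_offending_port(ports, carries_storm):
    """Return the first port whose cable carries the storm, else None."""
    for port in ports:
        if carries_storm(port):
            return port
    return None

# Toy run mirroring the post: the storm enters the backbone on port 7,
# then the Design Tech switch on port 14.
print(find_offending_port(range(1, 25), lambda p: p == 7))   # -> 7
print(find_offending_port(range(1, 25), lambda p: p == 14))  # -> 14
```

Each hop of the search moves you one switch closer to the loop, which is why the hunt in the post took only two buildings rather than a tour of every jack in the school.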
14th November 2008, 06:12 PM #14
Might be worth reading up on Spanning Tree Gareth, might have helped you there.
14th November 2008, 07:10 PM #15
We've been told by the LEA that Spanning Tree has to be turned off because it affects the Cisco switches that are used to connect us to the broadband network.
I shall find out more - and report the facts and reasons. I cannot remember what they said.