Ok, I haven't really made progress with this issue, but I have noticed a potential error in my DNS that I'd like clarified, please.
My DNS is hosted on server 1 (the one that hosts my redirection shares, drives etc., and the one I'm having issues with) and replicated to server 2.
When I look at the forwarders on server 1, the forwarder that is set up to my ISP can't resolve... Then when I look at server 2, the forwarder is set to point to server 1's IP address. I'm no DNS expert (as you can tell...) but I believe this is incorrect. Surely both servers' forwarders should be set to either my ISP or a public server?
Could the fact that this seems incorrect be contributing to the issues I've been experiencing with server 1? I haven't changed the forwarders since they were set up last year, way before I started having the issues (last week), but I don't know how long the forwarder to my ISP has been unable to resolve.
This would make sense because server 1 is then so busy trying to resolve external DNS requests that it is overloaded (just my thoughts; correct me if I'm wrong).
I'm considering changing both servers' (server 1 and server 2) forwarders to 22.214.171.124
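Before changing anything, it's worth checking whether a given forwarder is actually answering at all. As a rough sketch (the server IP and test name in the usage line are placeholders, not your real config), a minimal DNS A-record query can be sent with nothing but the Python standard library:

```python
import socket
import struct

def build_query(name: str, txn_id: int = 0x1234) -> bytes:
    """Build a minimal DNS A-record query in RFC 1035 wire format."""
    # Header: ID, flags (RD=1), 1 question, 0 answer/authority/additional
    header = struct.pack(">HHHHHH", txn_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte
    qname = b"".join(bytes([len(p)]) + p.encode() for p in name.split("."))
    question = qname + b"\x00" + struct.pack(">HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question

def forwarder_answers(server_ip: str, name: str = "example.com",
                      timeout: float = 3.0) -> bool:
    """Return True if the DNS server at server_ip sends any reply at all."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        s.sendto(build_query(name), (server_ip, 53))
        try:
            s.recvfrom(512)
            return True
        except socket.timeout:
            return False
```

Usage would be something like `forwarder_answers("203.0.113.1")` against each configured forwarder in turn; a server that never replies is a forwarder worth removing.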
Can anyone advise me please?
Dan.

I hate to say this, but you are really jumping all over the place.
You should read up on the OSI model (OSI model - Wikipedia, the free encyclopedia) and apply it to your problem.
First rule out cabling and hardware issues, like you have done. Then look at potential layer 2 issues that could result from STP problems etc., then layer 3 routing problems. When you've got to the top of the OSI model and nothing has fixed it, then start looking at layer 7 (DNS) problems.
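That bottom-up order can be sketched as an ordered list of checks walked from layer 1 upwards, stopping at the first one that fails. The layer names and check callables here are placeholders you'd fill with real tests (link lights, ping to the gateway, a TCP connect, a DNS lookup):

```python
def first_failing_layer(checks):
    """checks: list of (layer_name, check_fn) pairs, ordered bottom-up.

    Walks the checks in OSI order and returns the name of the first
    layer whose check fails, or None if every layer passes.
    """
    for layer, check in checks:
        if not check():
            return layer
    return None
```

The point of the structure is that you never investigate a higher layer until everything beneath it has passed, which is exactly the discipline being recommended above.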
Thanks Cybernerd, I shall have a read :)
We had this problem a while back and we had to do the following.
Plan how to eliminate what's OK.
Are all your switches flashing instead of twinkling? This may suggest you have a loop; spanning tree will help stop this.
Check the event viewer on the server. If you have a 2008 server, the OS does extra things, and if your hardware does not like it, it loses the connection. Check you have the latest drivers and firmware. Try disabling the bits you don't use.
Check the servers are replicating; dcdiag and netdiag may help.
Disable different branches of the network to see if the problem goes away.
Sniff the network traffic and analyse it with an older version of Capsa. This will show you all sorts of things like ARPing machines, conflicts and other things. We found all sorts, including machines looking for servers that had gone, bad switches, NIC cards etc.
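On the ARP-conflict point: the core of what a capture tool like Capsa flags there is simply one IP address being claimed by more than one MAC. A minimal sketch of that check, assuming you can feed it (ip, mac) pairs pulled from captured ARP replies (how you capture them is up to you):

```python
from collections import defaultdict

def find_ip_conflicts(arp_replies):
    """arp_replies: iterable of (ip, mac) pairs observed on the wire.

    Returns {ip: set_of_macs} for every IP address claimed by more
    than one distinct MAC address -- the signature of a duplicate IP.
    """
    claims = defaultdict(set)
    for ip, mac in arp_replies:
        claims[ip].add(mac.lower())  # normalise case before comparing
    return {ip: macs for ip, macs in claims.items() if len(macs) > 1}
```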
Dan, I agree here, you need to be systematic in tracing this problem. You need to isolate it to resolve it and that's going to require focus.
Originally Posted by CyberNerd
If I were in your shoes, I would first see if I could isolate the problem either to the server itself (it could in fact be the problem if it is the only thing affected by networking issues) or the switch (if it is not just the server that's affected). If you find that it does not seem to be isolated to the switch or server, then the job gets much harder and you will need to snoop your network (wireshark) and also use a tool to measure bandwidth and traffic on the network (source, destination, type, etc.) such as MRTG, Cacti, or PRTG.
From your recent posts, I'm beginning to think the problem may be the server itself. For one thing, you have an AD server also acting as a file server. That's generally not something you want to do. It could be a faulty NIC on the server. Then, there is the possibility that Sophos is causing problems on the server. We experienced so many problems with Sophos last year that we abandoned it for Avast! But only go down this road looking for issues with the server if you can either isolate the issues to this server OR exclude the switch as a problem (through testing); otherwise, you're just chasing rabbits down a hole.
OT: Is there an EduGeek guide to troubleshooting? Assuming one has a firm grasp of all the various components in your system (i.e. can map the software/hardware to the OSI model), I'm quite fond of the half-splitting technique. See "Secrets of a super geek: Use half splitting to solve difficult problems" on TechRepublic.
But the first question always to ask is "what has changed", followed by "what happens when we roll back the change".
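Half-splitting is really just a binary search over an ordered chain of components (server NIC, patch lead, edge switch, core switch, client, ...). A sketch, assuming you can test whether the path works up to a given point in the chain:

```python
def half_split(segments, works_up_to):
    """Binary-search for the first broken segment in an ordered chain.

    segments: components ordered from the known-good end outwards.
    works_up_to(i): True if the path including segments[0..i] works.
    Returns the index of the first segment at which things break,
    or None if the whole chain works.
    """
    lo, hi = 0, len(segments) - 1
    if works_up_to(hi):
        return None  # whole chain is fine
    while lo < hi:
        mid = (lo + hi) // 2
        if works_up_to(mid):
            lo = mid + 1  # fault lies beyond the midpoint
        else:
            hi = mid      # fault is at or before the midpoint
    return lo
```

The payoff is that a chain of N components needs only about log2(N) tests instead of N, which matters when each "test" means re-patching cables and waiting for a backup to fail.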
After the words of wisdom I took time out to reflect on the issue. I was getting nowhere because, I agree, due to worrying etc. I was not checking things logically.
I've managed to kick everyone off the system today from 5pm for testing, so I am going to plug the servers into a new gigabit switch with a few clients for testing. One of the things that fails (only since last Thursday) is the overnight backup, which runs from server 2 to back up files on server 1. It's been failing because the communication to server 1 drops whilst it is in progress, so it will be interesting to see if it runs tonight without going through all our network infrastructure.
Thanks for all your support guys, I'll keep you posted :)
Right, I think I've made some progress and determined that it is switch/cabling related or a dodgy NIC.
I plugged the servers and a few clients into their own gigabit switch and left them running overnight. Our backup between servers had been failing due to the connection dropping; this would happen after about 10 minutes of the backup job starting. Got in this morning to a full backup complete message :)
On Tuesday I'm going to further investigate the switches etc and will keep you updated.
Many thanks for all your advice guys.
If you used the same patch leads, then it is likely the switch (since unless you swapped out the NICs on your servers and clients, they won't have changed either).
Try PSPING (SysInternals) in TCP mode, for a reasonably large number of pings. Run a few times across your isolated 'known good' switch from clients to servers, recording the results. Then run it across the suspect infrastructure (using the same clients and servers). Compare the results.
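Once you have the two sets of PSPING results, the comparison boils down to loss rate and average latency per run. A small sketch for summarising one run, assuming you've recorded the round-trip times as a list in ms with None for a dropped ping (PSPING prints its own summary, so this is just for comparing runs side by side):

```python
def summarise(pings):
    """pings: list of round-trip times in ms, None for a dropped ping.

    Returns the loss percentage and the average latency of the
    pings that did get through (None if every ping was dropped).
    """
    dropped = sum(1 for p in pings if p is None)
    ok = [p for p in pings if p is not None]
    return {
        "loss_pct": 100.0 * dropped / len(pings),
        "avg_ms": sum(ok) / len(ok) if ok else None,
    }
```

A clean run over the known-good switch with near-zero loss, against a run over the suspect path with a visible loss percentage, is exactly the before/after evidence worth putting in front of the switch vendor.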
Meanwhile have a putty session into the relevant switch (with session logging enabled). Make sure the switch is logging to the console. With a bit of luck you should be able to correlate packet drops with ports flapping, or spanning tree recalculations, or just general errors.
If this is impacting T+L, I would be tempted to negotiate with SLT for access over the weekend to investigate, and ensure that there is at least a workaround solution in place for Tuesday morning (i.e. the switch that worked in last night's testing). It might also be worth contacting the manufacturer of the potentially faulty switch today for support, opening the call with something like: "When I use your switch I get dropped connections; when I use a different, isolated one with the same clients and servers, I do not get dropped connections. This is impacting all of our [nnnn] users, who are unable to work when the connections drop. Please help!"
Originally Posted by IT_Man_Dan
The testing you have done would seem to show that the server and the server's NIC itself are OK. It doesn't prove that the switch is bad, but it does appear to indicate the problem is network related.
You only have a limited subset of servers (I assume you have more than 2 servers) and clients in your temporary test environment. The switch may be fine and you may actually have a client with a faulty NIC or that is compromised and flooding the network with traffic.
My recommendation is to setup a network monitor (MRTG, Cacti, or free trial versions of PRTG or OpManager) so that you can see what is going on in your network. If you find nothing abnormal with that, then your switch may be faulty or the firmware may be buggy. At that point, I would reboot the switch, and run tests. Then upgrade the firmware and test again.
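Under the hood, tools like MRTG and Cacti just sample interface octet counters over SNMP and turn the deltas into rates, allowing for counter wrap-around. The arithmetic is roughly this (a sketch of the standard technique, not any one tool's implementation):

```python
def octets_per_second(count1, t1, count2, t2, counter_bits=32):
    """Rate from two readings of a wrapping SNMP octet counter.

    count1/count2: counter values at times t1/t2 (seconds).
    counter_bits: width of the counter (32 for ifInOctets,
    64 for the high-capacity ifHCInOctets variants).
    """
    delta = count2 - count1
    if delta < 0:
        # Counter rolled over between samples; add back the modulus.
        delta += 1 << counter_bits
    return delta / (t2 - t1)
```

A client with a faulty or compromised NIC flooding the network shows up as one port's rate sitting far above every comparable port, which is the graph these tools exist to draw.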
That's what weekends are made for in IT - testing and troubleshooting....
Thanks for your suggestions and advice; I have now resolved this issue, which turned out to have a very simple cause. It turns out that when we experienced the power cut, the main server building stayed offline for a few hours after the other areas of the school came back up.
We have a few UniFi access points around the school, and one wasn't properly configured with a static IP. It therefore took an IP address from DHCP on another server, the same IP as the main server (which stayed offline the longest)! To my horror, I found no reservation in DHCP for the server IP (now rectified) and have now statically assigned the correct IP to the AP. Problem solved!
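For the DHCP audit, the check that would have caught this is: any statically configured address that sits inside a DHCP scope must have a matching reservation (or exclusion). A sketch using only the standard library (the scope and addresses in the test values are made up for illustration):

```python
import ipaddress

def unprotected_statics(static_ips, scope_cidr, reserved_ips):
    """Find static IPs that DHCP could hand out again.

    static_ips: addresses statically assigned to servers, APs, printers etc.
    scope_cidr: the DHCP scope's network, e.g. "192.168.1.0/24".
    reserved_ips: addresses with a DHCP reservation or exclusion.
    Returns the static addresses inside the scope with no reservation --
    exactly the duplicate-IP trap described above.
    """
    scope = ipaddress.ip_network(scope_cidr)
    reserved = {ipaddress.ip_address(r) for r in reserved_ips}
    return [ip for ip in static_ips
            if ipaddress.ip_address(ip) in scope
            and ipaddress.ip_address(ip) not in reserved]
```

Running something like this against an export of your static assignments and reservations turns the "go through all the DHCP settings" job into a mechanical diff rather than an eyeball check.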
I can't believe it was a duplicate IP, or that there was no reservation in DHCP for the server.
I'm now going through all the DHCP settings etc for this network to ensure it's all correctly configured.