Last weekend I finally got around to setting up Layer 3 routing on my core switch (a ProCurve 5308xl) and re-addressed the entire network to split it up into sensible(?) VLANs, with one subnet per VLAN and the switch routing traffic between them.
Since then, everything works perfectly about 95% of the time: clients pick up DHCP addresses, everything uses the switch as the default gateway, and traffic passes to and fro normally. Then every so often (once an hour or so?), something inexplicable happens. Basically, the switch will suddenly decide it cannot route to some (but not all) addresses from one VLAN to another.
Example 1: I'm sitting on a client on VLAN 6, and I'm browsing BBC News (via our gateway router on VLAN 7). Then I go to EduGeek and find it won't load. I start a PING, and this is what I see:
Code:
Pinging www.edugeek.net [176.9.42.234] with 32 bytes of data:
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Reply from 176.9.42.234: bytes=32 time=39ms TTL=55
Reply from 176.9.42.234: bytes=32 time=40ms TTL=55
Reply from 176.9.42.234: bytes=32 time=39ms TTL=55
Reply from 176.9.42.234: bytes=32 time=41ms TTL=55
Reply from 176.9.42.234: bytes=32 time=39ms TTL=55
Reply from 176.9.42.234: bytes=32 time=39ms TTL=55
Reply from 176.9.42.234: bytes=32 time=39ms TTL=55
Reply from 176.9.42.234: bytes=32 time=40ms TTL=55
Reply from 176.9.42.234: bytes=32 time=39ms TTL=55
Reply from 176.9.42.234: bytes=32 time=39ms TTL=55
Basically, after less than a minute of PING, the remote address will start responding and everything works again. It always comes back quite quickly; so far I've never had a delay of more than about a minute. It will then be fine again for some time - hours or more - before the exact same thing happens again, maybe with the same address, maybe with a different one.
It's like it just forgets how to reach an address, then after being nagged for up to a minute, it suddenly remembers and starts passing traffic.
This also happens (less frequently) with LAN addresses, so it's not just Internet routing that is affected. Basically, anything that has to cross to a different VLAN or be routed out by the switch seems to be at risk, but not in any pattern I can identify.
Here's what my investigations have uncovered so far:
- At all times when this is happening, I can initiate a PING from the core switch via a telnet console session, and the affected address responds to the switch instantly, while still not responding to the client.
- I can always PING other addresses in the same subnet/VLAN as the client, including the switch, which is acting as the default gateway for each subnet.
- It seems to be only individual addresses that are affected at any one time, rather than whole VLANs or subnets: I've had instances where I can't ping 192.168.2.1, but I can ping 192.168.2.2.
- When an address becomes unreachable, it becomes unreachable for ALL clients on all VLANs (including the default VLAN) simultaneously, except for the core switch (which can always reach everything).
- I can't tell if it literally only affects a single IP at a time. I don't think I've had a situation yet where multiple addresses are affected, but it lasts for such a short time that it's tricky to do much testing before the problem disappears again.
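One way to settle that last point would be to ping several addresses in every round and log when each one drops, so overlapping outages show up even if they only last a few seconds. A minimal Python sketch, assuming a Windows client (the `ping -n 1 -w` flags) and an illustrative target list taken from addresses mentioned in this post; the function names are made up for the example:

```python
import subprocess
import time

# Assumed targets: addresses from this post, one or two per VLAN.
TARGETS = ["192.168.2.1", "192.168.2.2", "192.168.3.22", "192.168.7.1"]


def ping_once(target, timeout_ms=1000):
    """Send one echo request with Windows ping; True if a reply came back."""
    result = subprocess.run(
        ["ping", "-n", "1", "-w", str(timeout_ms), target],
        capture_output=True, text=True,
    )
    # Windows ping can exit 0 on "Destination host unreachable",
    # so also look for the TTL field that only a real reply contains.
    return result.returncode == 0 and "TTL=" in result.stdout


def find_outages(samples):
    """Turn (timestamp, target, ok) samples into (target, start, end) outages.

    An outage starts at the first failed sample and ends at the next
    successful one; outages still open at the end get end=None.
    """
    down_since = {}
    outages = []
    for ts, target, ok in sorted(samples):
        if not ok and target not in down_since:
            down_since[target] = ts
        elif ok and target in down_since:
            outages.append((target, down_since.pop(target), ts))
    outages.extend((t, start, None) for t, start in sorted(down_since.items()))
    return outages


def overlapping(outages):
    """Return pairs of different targets whose outage windows overlap."""
    inf = float("inf")
    pairs = []
    for i, (t1, s1, e1) in enumerate(outages):
        end1 = inf if e1 is None else e1
        for t2, s2, e2 in outages[i + 1:]:
            end2 = inf if e2 is None else e2
            if t1 != t2 and s1 < end2 and s2 < end1:
                pairs.append((t1, t2))
    return pairs


def monitor(targets=TARGETS, rounds=3600, interval=1.0):
    """Ping every target once per round and return the outages seen."""
    samples = []
    for _ in range(rounds):
        now = time.time()
        for target in targets:
            samples.append((now, target, ping_once(target)))
        time.sleep(interval)
    return find_outages(samples)
```

Left running on a client for an hour or two, `overlapping(monitor())` coming back empty would suggest it really is one address at a time; any pairs in the result would say otherwise.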
After 2 days of poring over forums and switch config, I am almost at my wit's end. This switch was previously in use without L3 routing and had no issues, suggesting it's something to do with my config, but I can't work out what, especially as it works as designed most of the time.
Here is the config I'm using:
Code:
; J4819A Configuration Editor; Created on release #E.11.21
hostname "Core"
max-vlans 16
time daylight-time-rule Western-Europe
module 7 type J4821A
module 8 type J4821A
module 6 type J4878A
module 4 type J4820A
module 3 type J4820A
module 1 type J4820A
module 2 type J4820A
module 5 type J4878A
interface H1
no lacp
exit
interface H2
no lacp
exit
interface H3
no lacp
exit
interface H4
no lacp
exit
trunk H1-H4 Trk1 LACP
sntp server 192.168.3.1
ip routing
ip udp-bcast-forward
timesync sntp
sntp unicast
snmp-server community "public" Unrestricted
vlan 1
name "DEFAULT_VLAN"
untagged A1-A5,A7-A24,B1-B5,B7-B10,B12-B15,B17-B24,C3-C24,D1-D21,D23,E1-E4,F1-F4,G2-G3,Trk1
ip address 192.168.11.254 255.255.252.0
ip helper-address 192.168.3.22
no untagged A6,B6,B11,B16,C1-C2,D22,D24,G1,G4
ip igmp
exit
vlan 50
name "iSCSI"
no ip address
tagged F2,Trk1
exit
vlan 100
name "NM"
ip address 192.168.100.254 255.255.255.0
ip helper-address 192.168.3.22
tagged E1-E4,F1-F4,G3,Trk1
ip igmp
exit
vlan 200
name "Backup WAN"
untagged D24
no ip address
tagged F4
exit
vlan 2
name "Networking"
untagged C1-C2
ip address 192.168.2.254 255.255.255.0
ip helper-address 192.168.3.22
tagged A15,E1-E4,F1-F4,G3,Trk1
exit
vlan 3
name "Servers"
untagged B6,D22
ip address 192.168.3.254 255.255.255.0
ip helper-address 192.168.3.22
ip forward-protocol udp 192.168.11.255 7
tagged F2,Trk1
ip igmp
exit
vlan 6
name "IT Support"
untagged G4
ip address 192.168.6.254 255.255.255.0
ip helper-address 192.168.3.22
ip forward-protocol udp 192.168.11.255 8992
tagged E1-E4,F1-F4
ip igmp
exit
vlan 4
name "Printers"
untagged A6,B11
ip address 192.168.4.254 255.255.255.0
ip helper-address 192.168.3.22
tagged E1-E4,F1-F4
exit
vlan 7
name "Perimeter"
untagged B16,G1
ip address 192.168.7.254 255.255.255.0
exit
ip route 0.0.0.0 0.0.0.0 192.168.7.1
spanning-tree
spanning-tree Trk1 priority 4
ip multicast-routing
router pim
exit
vlan 1
ip pim all
exit
vlan 6
ip pim all
exit
password manager
Apart from removing a couple of identifying lines (the snmp-server contact and snmp-server location lines), this is the entire config. As you can see, the VLANs are entirely port-based at the moment, with no ACLs - all of that is coming later. The VLANs that don't have IP addresses configured on the switch are meant to be that way, as there are some VLANs I don't want to route. I've eliminated the multicast routing/PIM as a cause, as the problem happens even with that section removed.
Oh, and yes, VLAN 1 is meant to have a 192.168.8.0/22 range rather than /24. I know it looks odd but it's meant to be that way, and again, the problem still occurs if it's a /24 range.
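For anyone who wants to double-check the addressing, the /22 arithmetic can be verified with Python's standard `ipaddress` module. This is only a sanity check of the numbers in the config above (the dictionary of the other VLAN subnets is transcribed from it), not a diagnosis:

```python
import ipaddress

# VLAN 1 interface exactly as configured on the switch.
vlan1 = ipaddress.ip_interface("192.168.11.254/255.255.252.0")

# The other routed VLAN subnets, transcribed from the config.
others = {
    "Networking": "192.168.2.0/24",
    "Servers": "192.168.3.0/24",
    "Printers": "192.168.4.0/24",
    "IT Support": "192.168.6.0/24",
    "Perimeter": "192.168.7.0/24",
    "NM": "192.168.100.0/24",
}

print(vlan1.network)                    # 192.168.8.0/22
print(vlan1.network.broadcast_address)  # 192.168.11.255

# Any subnet that overlaps VLAN 1 would be an addressing mistake.
clashes = [
    name for name, net in others.items()
    if vlan1.network.overlaps(ipaddress.ip_network(net))
]
print(clashes)                          # []
```

So 192.168.11.254 with a 255.255.252.0 mask does sit in 192.168.8.0/22 (192.168.8.0 to 192.168.11.255), and none of the /24s collide with it.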
Here's the routing table, which to me looks as expected:
Code:
IP Route Entries
Destination Gateway VLAN Type Sub-Type Metric Dist.
------------------ --------------- ---- --------- ---------- ---------- -----
0.0.0.0/0 192.168.7.1 7 static 1 1
127.0.0.0/8 reject static 0 250
127.0.0.1/32 lo0 connected 0 0
192.168.2.0/24 Networking 2 connected 0 0
192.168.3.0/24 Servers 3 connected 0 0
192.168.4.0/24 Printers 4 connected 0 0
192.168.6.0/24 IT Support 6 connected 0 0
192.168.7.0/24 Perimeter 7 connected 0 0
192.168.8.0/22 DEFAULT_VLAN 1 connected 0 0
192.168.100.0/24 NM 100 connected 0 0
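To spell out what "as expected" means here, this is a small Python model of longest-prefix matching over the table above. It only illustrates which entry each destination should select on paper; it is not a claim about how the 5308xl does its lookups internally, and the loopback entries are left out for brevity:

```python
import ipaddress

# (prefix, gateway or VLAN name) pairs copied from the routing table.
ROUTES = [
    ("0.0.0.0/0", "192.168.7.1"),
    ("192.168.2.0/24", "Networking"),
    ("192.168.3.0/24", "Servers"),
    ("192.168.4.0/24", "Printers"),
    ("192.168.6.0/24", "IT Support"),
    ("192.168.7.0/24", "Perimeter"),
    ("192.168.8.0/22", "DEFAULT_VLAN"),
    ("192.168.100.0/24", "NM"),
]


def lookup(dest):
    """Return the (prefix, gateway) of the most specific route matching dest."""
    addr = ipaddress.ip_address(dest)
    matches = [
        (ipaddress.ip_network(prefix), gateway)
        for prefix, gateway in ROUTES
        if addr in ipaddress.ip_network(prefix)
    ]
    best = max(matches, key=lambda m: m[0].prefixlen)
    return str(best[0]), best[1]


print(lookup("176.9.42.234"))  # ('0.0.0.0/0', '192.168.7.1')
print(lookup("192.168.2.1"))   # ('192.168.2.0/24', 'Networking')
print(lookup("192.168.9.10"))  # ('192.168.8.0/22', 'DEFAULT_VLAN')
```

On paper, then, Internet destinations should always leave via 192.168.7.1 on VLAN 7 and LAN destinations should hit their connected routes, which is why the intermittent failures don't look like a routing table problem to me.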
CPU and memory usage seem fine: typically more than 50% of memory is free, and CPU fluctuates between 2% and 40%; I've not seen it go higher.
I'm about a day away from calling in a consultant at this point, which will make me extremely grumpy. If anyone here can give me a clue, it would be very very welcome.