  1. #1

    AngryTechnician's Avatar
    Join Date
    Oct 2008
    Posts
    3,730
    Thank Post
    698
    Thanked 1,211 Times in 761 Posts
    Rep Power
    394

    What's wrong with my ProCurve 5308xl (or its config)?

    Last weekend I finally got around to setting up Layer 3 routing on my core switch (a ProCurve 5308xl) and re-addressed the entire network to split it up into sensible(?) VLANs, with 1 subnet per VLAN, and the switch routing traffic between them.

    Since then, everything works perfectly about 95% of the time: clients pick up DHCP addresses, everything uses the switch as the default gateway, and traffic passes to and fro normally. Then every so often (once an hour or so?), something inexplicable happens. Basically, the switch will suddenly decide it cannot route to some (but not all) addresses from one VLAN to another.

    Example 1: I'm sitting on a client on VLAN 6, and I'm browsing BBC News (via our gateway router on VLAN 7). Then I go to EduGeek and find it won't load. I start a PING, and this is what I see:

    Code:
    Pinging www.edugeek.net [176.9.42.234] with 32 bytes of data:
    Request timed out.
    Request timed out.
    Request timed out.
    Request timed out.
    Request timed out.
    Request timed out.
    Reply from 176.9.42.234: bytes=32 time=39ms TTL=55
    Reply from 176.9.42.234: bytes=32 time=40ms TTL=55
    Reply from 176.9.42.234: bytes=32 time=39ms TTL=55
    Reply from 176.9.42.234: bytes=32 time=41ms TTL=55
    Reply from 176.9.42.234: bytes=32 time=39ms TTL=55
    Reply from 176.9.42.234: bytes=32 time=39ms TTL=55
    Reply from 176.9.42.234: bytes=32 time=39ms TTL=55
    Reply from 176.9.42.234: bytes=32 time=40ms TTL=55
    Reply from 176.9.42.234: bytes=32 time=39ms TTL=55
    Reply from 176.9.42.234: bytes=32 time=39ms TTL=55
    Basically, after less than a minute of PING, the remote address will start responding and everything works again. It always comes back quite quickly; so far I've never had a delay of more than about a minute. It will then be fine again for some time - hours or more - before the exact same thing happens again, maybe with the same address, maybe with a different one.

    It's like it just forgets how to reach an address, then after nagging it for a minute or two, it suddenly remembers and starts passing traffic.
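
    To put a number on each outage, the ping transcript can be summarised by counting consecutive timeouts. This is only a sketch: `timeout_runs` is a hypothetical helper, the sample lines mirror the output above, and how many seconds each timeout represents depends on the ping timeout setting, which the post does not show.

```python
def timeout_runs(lines):
    """Return the lengths of consecutive 'Request timed out.' runs in ping output."""
    runs, current = [], 0
    for line in lines:
        if line.strip() == "Request timed out.":
            current += 1
        elif current:
            # A non-timeout line ends the current run of timeouts.
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


# Sample shaped like the transcript above: six timeouts, then replies.
sample = (
    ["Pinging www.edugeek.net [176.9.42.234] with 32 bytes of data:"]
    + ["Request timed out."] * 6
    + ["Reply from 176.9.42.234: bytes=32 time=39ms TTL=55"] * 10
)
print(timeout_runs(sample))  # [6]
```

    Running a continuous ping against several addresses at once and feeding each log through the same function would also show whether more than one address drops out at the same moment.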

    This also happens (less frequently) with LAN addresses, so it's not just Internet routing that is affected. Basically, anything that has to cross to a different VLAN or be routed out by the switch seems to be at risk, but not in any pattern I can identify.

    Here's what my investigations have uncovered so far:

    1. At all times when this is happening, I can initiate a PING from the core switch via a telnet console session, and the affected address responds to the switch instantly, while still not responding to the client.
    2. I can always PING other addresses in the same subnet/VLAN as the client, including the switch, which is acting as the default gateway for each subnet.
    3. It seems to be only individual addresses that are affected at any one time, rather than whole VLANs or subnets: I've had instances where I can't ping 192.168.2.1, but I can ping 192.168.2.2.
    4. When an address becomes unreachable, it becomes unreachable for ALL clients on all VLANs (including the default VLAN) simultaneously, except for the core switch (which can always reach everything).
    5. I can't tell if it literally only affects a single IP at a time. I don't think I've had a situation yet where multiple addresses are affected, but it lasts for such a short time that it's tricky to do much testing before the problem disappears again.


    After 2 days of poring over forums and switch config, I am almost at my wit's end. This switch was in use without L3 routing without any issues, suggesting it's something to do with my config, but I can't work out what, especially as it works as designed most of the time.

    Here is the config I'm using:

    Code:
    ; J4819A Configuration Editor; Created on release #E.11.21
    
    hostname "Core"
    max-vlans 16
    time daylight-time-rule Western-Europe
    module 7 type J4821A
    module 8 type J4821A
    module 6 type J4878A
    module 4 type J4820A
    module 3 type J4820A
    module 1 type J4820A
    module 2 type J4820A
    module 5 type J4878A
    interface H1
       no lacp
    exit
    interface H2
       no lacp
    exit
    interface H3
       no lacp
    exit
    interface H4
       no lacp
    exit
    trunk H1-H4 Trk1 LACP
    sntp server 192.168.3.1
    ip routing
    ip udp-bcast-forward
    timesync sntp
    sntp unicast
    snmp-server community "public" Unrestricted
    vlan 1
       name "DEFAULT_VLAN"
       untagged A1-A5,A7-A24,B1-B5,B7-B10,B12-B15,B17-B24,C3-C24,D1-D21,D23,E1-E4,F1-F4,G2-G3,Trk1
       ip address 192.168.11.254 255.255.252.0
       ip helper-address 192.168.3.22
       no untagged A6,B6,B11,B16,C1-C2,D22,D24,G1,G4
       ip igmp
       exit
    vlan 50
       name "iSCSI"
       no ip address
       tagged F2,Trk1
       exit
    vlan 100
       name "NM"
       ip address 192.168.100.254 255.255.255.0
       ip helper-address 192.168.3.22
       tagged E1-E4,F1-F4,G3,Trk1
       ip igmp
       exit
    vlan 200
       name "Backup WAN"
       untagged D24
       no ip address
       tagged F4
       exit
    vlan 2
       name "Networking"
       untagged C1-C2
       ip address 192.168.2.254 255.255.255.0
       ip helper-address 192.168.3.22
       tagged A15,E1-E4,F1-F4,G3,Trk1
       exit
    vlan 3
       name "Servers"
       untagged B6,D22
       ip address 192.168.3.254 255.255.255.0
       ip helper-address 192.168.3.22
       ip forward-protocol udp 192.168.11.255 7
       tagged F2,Trk1
       ip igmp
       exit
    vlan 6
       name "IT Support"
       untagged G4
       ip address 192.168.6.254 255.255.255.0
       ip helper-address 192.168.3.22
       ip forward-protocol udp 192.168.11.255 8992
       tagged E1-E4,F1-F4
       ip igmp
       exit
    vlan 4
       name "Printers"
       untagged A6,B11
       ip address 192.168.4.254 255.255.255.0
       ip helper-address 192.168.3.22
       tagged E1-E4,F1-F4
       exit
    vlan 7
       name "Perimeter"
       untagged B16,G1
       ip address 192.168.7.254 255.255.255.0
       exit
    ip route 0.0.0.0 0.0.0.0 192.168.7.1
    spanning-tree
    spanning-tree Trk1 priority 4
    ip multicast-routing
    router pim
       exit
    vlan 1
       ip pim all
       exit
    vlan 6
       ip pim all
       exit
    password manager
    Apart from removing a couple of identifying lines (the snmp-server contact and snmp-server location lines), this is the entire config. As you can see, the VLANs are entirely port-based at the moment, with no ACLs - all of that is coming later. The VLANs that don't have IP addresses configured on the switch are meant to be that way, as there are some VLANs I don't want to route. I've eliminated the multicast routing/PIM as a cause, as the problem happens even with that section removed.

    Oh, and yes, VLAN 1 is meant to have a 192.168.8.0/22 range rather than /24. I know it looks odd but it's meant to be that way, and again, the problem still occurs if it's a /24 range.
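
    The addressing can be sanity-checked with Python's standard `ipaddress` module. This sketch uses the interface addresses and masks from the config above (the dictionary layout is just for illustration) and confirms that the VLAN 1 interface falls in 192.168.8.0/22 and that none of the routed subnets overlap.

```python
import ipaddress
from itertools import combinations

# Interface addresses and masks copied from the switch config above.
interfaces = {
    "DEFAULT_VLAN": ("192.168.11.254", "255.255.252.0"),
    "NM": ("192.168.100.254", "255.255.255.0"),
    "Networking": ("192.168.2.254", "255.255.255.0"),
    "Servers": ("192.168.3.254", "255.255.255.0"),
    "IT Support": ("192.168.6.254", "255.255.255.0"),
    "Printers": ("192.168.4.254", "255.255.255.0"),
    "Perimeter": ("192.168.7.254", "255.255.255.0"),
}

# Derive the connected network for each VLAN interface.
networks = {
    name: ipaddress.ip_interface(f"{addr}/{mask}").network
    for name, (addr, mask) in interfaces.items()
}

# Collect every pair of VLANs whose subnets overlap.
overlaps = [
    (a, b)
    for (a, net_a), (b, net_b) in combinations(networks.items(), 2)
    if net_a.overlaps(net_b)
]

print(networks["DEFAULT_VLAN"])  # 192.168.8.0/22
print(overlaps)                  # []
```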

    Here's the routing table, which to me looks as expected:

    Code:
                                  IP Route Entries
    
      Destination        Gateway         VLAN Type      Sub-Type   Metric     Dist.
      ------------------ --------------- ---- --------- ---------- ---------- -----
      0.0.0.0/0          192.168.7.1     7    static               1          1
      127.0.0.0/8        reject               static               0          250
      127.0.0.1/32       lo0                  connected            0          0
      192.168.2.0/24     Networking      2    connected            0          0
      192.168.3.0/24     Servers         3    connected            0          0
      192.168.4.0/24     Printers        4    connected            0          0
      192.168.6.0/24     IT Support      6    connected            0          0
      192.168.7.0/24     Perimeter       7    connected            0          0
      192.168.8.0/22     DEFAULT_VLAN    1    connected            0          0
      192.168.100.0/24   NM              100  connected            0          0
    CPU and memory usage seem fine: typically more than 50% of memory is free, and CPU fluctuates between 2% and 40%; I haven't seen it go higher.
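
    As a rough illustration of how the routing table above should resolve the destinations mentioned in this post, here is a longest-prefix-match sketch in Python. The routes are copied from the "IP Route Entries" output (loopback entries omitted); `lookup` is a hypothetical helper, not anything the switch exposes.

```python
import ipaddress

# (prefix, gateway or VLAN name) pairs from the routing table above.
ROUTES = [
    ("0.0.0.0/0", "192.168.7.1"),
    ("192.168.2.0/24", "Networking"),
    ("192.168.3.0/24", "Servers"),
    ("192.168.4.0/24", "Printers"),
    ("192.168.6.0/24", "IT Support"),
    ("192.168.7.0/24", "Perimeter"),
    ("192.168.8.0/22", "DEFAULT_VLAN"),
    ("192.168.100.0/24", "NM"),
]


def lookup(destination):
    """Return (prefix, gateway) for the most specific route containing destination."""
    addr = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(prefix), gateway)
        for prefix, gateway in ROUTES
        if addr in ipaddress.ip_network(prefix)
    ]
    # The default route always matches, so max() never sees an empty list.
    network, gateway = max(matches, key=lambda m: m[0].prefixlen)
    return str(network), gateway


print(lookup("176.9.42.234"))   # ('0.0.0.0/0', '192.168.7.1')
print(lookup("192.168.2.1"))    # ('192.168.2.0/24', 'Networking')
print(lookup("192.168.11.20"))  # ('192.168.8.0/22', 'DEFAULT_VLAN')
```

    Every destination resolves to a sensible route here, which fits the observation that the table looks as expected and suggests the intermittent failures come from somewhere other than the static route entries.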

    I'm about a day away from calling in a consultant at this point, which will make me extremely grumpy. If anyone here can give me a clue, it would be very very welcome.
    Last edited by AngryTechnician; 24th July 2012 at 09:02 PM.

  2. #2

    Michael's Avatar
    Join Date
    Dec 2005
    Location
    Birmingham
    Posts
    9,262
    Thank Post
    242
    Thanked 1,568 Times in 1,250 Posts
    Rep Power
    340
    I can't see anything wrong with your config at first look.

    Have you tried updating the firmware and/or resetting the switch?

  3. #3

    AngryTechnician's Avatar
    Firmware is the latest (E.11.21). Have done a reload a couple of times now and the problem remains.

  4. #4


    Join Date
    Jan 2006
    Posts
    8,202
    Thank Post
    442
    Thanked 1,032 Times in 812 Posts
    Rep Power
    339
    I've seen similar problems which turned out to be a loop somewhere.

    I don't know the commands on ProCurves, but basically what you should do is:

    1) From the core switch, ping something that occasionally drops off the network.
    2) Use a command on the ProCurve to determine its MAC address (in Comware this is "display arp").
    3) Then find the port or aggregation group that is showing that address (in Comware, "display mac-address xxxx" shows the port associated with that MAC address).
    4) Now ping the offending device when it goes offline.
    5) Again do the equivalent of "display mac-address xxxx" to determine the port or aggregation group while the problem device is offline. Make a note if the port or aggregation group changes. This will show you the ports/aggregation groups that are causing issues.
    6) Repeat on the other switches to locate the loop.
    7) Fix the loop or shut down the port.

    edit: xxxx is the MAC address shown in display arp - not the IP address.
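
    The comparison in steps 3 and 5 can be sketched in Python: take one snapshot of the MAC-to-port mapping while the device is reachable and one during an outage, then list the addresses whose port changed. The MAC addresses, ports and the `moved_macs` helper are all hypothetical. On a ProCurve the counterparts of the Comware commands are probably "show arp" and "show mac-address", but treat that as an assumption to check against the switch's CLI help.

```python
def moved_macs(before, after):
    """Return {mac: (old_port, new_port)} for MACs whose learned port changed."""
    return {
        mac: (before[mac], after[mac])
        for mac in before.keys() & after.keys()
        if before[mac] != after[mac]
    }


# Hypothetical snapshots: one taken while the device responds, one during an outage.
working = {"001122-334455": "B16", "00aabb-ccddee": "Trk1"}
failing = {"001122-334455": "A15", "00aabb-ccddee": "Trk1"}

print(moved_macs(working, failing))  # {'001122-334455': ('B16', 'A15')}
```

    A MAC address that flips between ports from one snapshot to the next is the classic sign of a loop or of two paths to the same device.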
    Last edited by CyberNerd; 24th July 2012 at 10:38 PM.

  5. 2 Thanks to CyberNerd:

    AngryTechnician (25th July 2012), speckytecky (27th July 2012)

  6. #5
    DMcCoy's Avatar
    Join Date
    Oct 2005
    Location
    Isle of Wight
    Posts
    3,432
    Thank Post
    10
    Thanked 488 Times in 428 Posts
    Rep Power
    111
    Have a play with mac-age-time; try something like 600 seconds.

  7. Thanks to DMcCoy from:

    AngryTechnician (25th July 2012)

  8. #6
    Abaddon's Avatar
    Join Date
    Mar 2006
    Location
    Middlesex
    Posts
    593
    Thank Post
    70
    Thanked 68 Times in 63 Posts
    Rep Power
    59
    I've GOT a consultant in at the moment - I'll pass this over his desk in the morning and see if he can spot anything obvious (to him). He is extremely good, and I've had him in every year to help out with some switch configuration or other. He's doing some stuff that means he probably has time for this, so I'll let you know if he has any ideas.

  9. #7

    AngryTechnician's Avatar
    @CyberNerd: A loop had crossed my mind already - wouldn't STP catch that? I have it enabled on all switches. And wouldn't I have seen problems before enabling L3 routing?
    @DMcCoy: Will have a look at mac-age-time tomorrow, thanks.
    @Abaddon: Hopefully he won't charge you extra for looking at someone else's problems, but thanks!

  10. #8
    Jona's Avatar
    Join Date
    May 2007
    Location
    Cranleigh
    Posts
    467
    Thank Post
    14
    Thanked 50 Times in 48 Posts
    Rep Power
    23
    If you're losing connectivity for between 30 and 60 seconds, it's often because Spanning Tree is putting a port into the blocking state after detecting a loop somewhere.

    Depending on how your VLAN ports, trunks, etc. are set up, this can appear to be a layer 3 issue when it's actually layer 2.

    When a machine is in a failure state, can it ping its gateway IP on the switch? Can other places?

    Have you checked your switch logs? The usual command on a ProCurve is: show log
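
    For reference, the 30 to 60 second window is in the same range as classic spanning tree convergence. The sketch below works through the arithmetic, assuming the standard IEEE 802.1D default timers; which STP version and timer values the switches in this thread actually run is not shown, so the figures are illustrative only.

```python
# Standard IEEE 802.1D default timers in seconds (assumed, not read from the switch).
MAX_AGE = 20
FORWARD_DELAY = 15

# A port moving to forwarding spends one forward delay in listening and one in learning.
direct = 2 * FORWARD_DELAY

# After an indirect failure, the stored BPDU has to age out before that transition starts.
indirect = MAX_AGE + 2 * FORWARD_DELAY

print(direct, indirect)  # 30 50
```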

  11. #9

    Join Date
    Jan 2009
    Posts
    109
    Thank Post
    3
    Thanked 21 Times in 16 Posts
    Rep Power
    15
    I'd also suspect STP, but noticing you have some iSCSI traffic on there as well, I'd check for multicast issues.

  12. #10


    Join Date
    Jan 2006
    Posts
    8,202
    Thank Post
    442
    Thanked 1,032 Times in 812 Posts
    Rep Power
    339
    Quote Originally Posted by AngryTechnician View Post
    @CyberNerd: A loop had crossed my mind already - wouldn't STP catch that? I have it enabled on all switches. And wouldn't I have seen problems before enabling L3 routing?
    As @Destinova and @Jona said - it could be STP causing the issue. The procedure that I proposed would help determine this.

  13. #11

    Theblacksheep's Avatar
    Join Date
    Feb 2008
    Location
    In a house.
    Posts
    1,934
    Thank Post
    138
    Thanked 290 Times in 210 Posts
    Rep Power
    193
    For what it's worth, we don't use STP on our HP switches (5304s and a 5412). I seem to remember one of the consultants (a while back now) recommended mdix on ports instead of STP for loopback prevention.
    Last edited by Theblacksheep; 25th July 2012 at 08:21 AM.

  14. #12
    Jona's Avatar
    @Theblacksheep I'm pretty sure auto-mdix won't deal with a loopback properly; it's usually used to support straight vs. crossover cables. See: Medium Dependent Interface - Wikipedia, the free encyclopedia

    It may mitigate to some extent against the scenario where a student plugs two network points together, if the two network points are on the same switch. I can't see any way it can mitigate against an actual switching loop, where you accidentally connect a loop of switches, which in a complex network is surprisingly easy to do!

  15. #13

    Theblacksheep's Avatar
    Quote Originally Posted by Jona View Post
    @Theblacksheep I'm pretty sure auto-mdix won't deal with a loopback properly; it's usually used to support straight vs. crossover cables. See: Medium Dependent Interface - Wikipedia, the free encyclopedia

    It may mitigate to some extent against the scenario where a student plugs two network points together, if the two network points are on the same switch. I can't see any way it can mitigate against an actual switching loop, where you accidentally connect a loop of switches, which in a complex network is surprisingly easy to do!
    AFAIK auto is the default. Sorry, class end-points are set to mdix. Maybe it's coincidental we've not had any loopbacks.

  16. #14

    AngryTechnician's Avatar
    Quote Originally Posted by Jona View Post
    If you're losing connectivity for between 30 and 60 seconds, it's often because Spanning Tree is putting a port into the blocking state after detecting a loop somewhere.

    Depending on how your VLAN ports, trunks, etc. are set up, this can appear to be a layer 3 issue when it's actually layer 2.

    When a machine is in a failure state, can it ping its gateway IP on the switch? Can other places?

    Have you checked your switch logs? The usual command on a ProCurve is: show log
    Thing is, when I lose access to a website (e.g. EduGeek), I still get access to other websites (e.g. BBC) and I can still ping our Internet gateway, AND the site is still pingable from the switch via telnet. Unless I drastically misunderstand STP, isn't the only way STP could block websites to block the port the gateway is on, which would affect all websites simultaneously, so that even the switch wouldn't be able to get through?

    I've checked the logs and they don't indicate any ports being blocked. When a client fails, it can still ping its gateway IP.

    I will take a careful look at which ports a MAC address is resolving to when this happens as @CyberNerd suggests, but I'm fairly sure I've looked at this already when websites have been affected and the gateway MAC appears on the same port every time. Will look closely at what happens when it's a LAN address.

    @Destinova: What sort of multicast issues are you suspicious of? I do have a multicast IPTV server running (though I actually shut it off temporarily last night as I began to scrape the bottom of the barrel for ideas of what the problem could be). When it's running, it's on a separate VLAN to the iSCSI and there is no routing in or out of the iSCSI VLAN.
    Last edited by AngryTechnician; 25th July 2012 at 08:35 AM.

  17. #15

    SYNACK's Avatar
    Join Date
    Oct 2007
    Posts
    11,078
    Thank Post
    853
    Thanked 2,677 Times in 2,271 Posts
    Blog Entries
    9
    Rep Power
    769
    Have you tried setting up extended logging for layer 3 operations and ACLs? It might be that DNS is getting blocked or that routing is dropping the routes randomly.
