Hi folks, please help me regain a little sanity, and reduce further hair loss!

I have a case open with Netgear about this but thought it worth asking here in case anyone has seen anything similar.

We've just replaced a single core switch with two, split over two sites: an M5300-52G3 and an M5300-28GF3, linked via 10Gb fibre and managed as a single remote stack. Each has a number of GS7xxTP PoE switches attached for wifi access and some GS748Ts for wired access. The ports that the access points plug into on the 748TPs are tagged for VLANs 10, 20, 24, 28 and 32, untagged for VLAN 3, and have a PVID of 3. The 748TP trunks back to the stack have all VLANs tagged, and the stack trunk ports connected to the 748TPs also have all VLANs tagged.

VLAN  SSID/Auth type             Subnet          Purpose
3     N/A                                        WiFi AP management
10    Dom/PSK and Dom2/RADIUS    10.xx.yy.0/22   Main campus VLAN
20    Student/RADIUS                             Student wifi
24    Staff/RADIUS                               Staff wifi
28    Guest/PSK                                  Guest wifi
32    Tablet/PSK                                 College Tablets
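
In case it helps anyone reason about the port setup, here is a rough Python sketch of the VLAN membership described above. It is only a model for sanity-checking that every VLAN on an AP port is also carried on the uplink, not anything exported from the switches. The `Port` class and `missing_on_uplink` helper are made-up names, and I'm assuming "all VLANs" on the trunks means the six VLANs in the table.

```python
from dataclasses import dataclass

# VLAN IDs and purposes, copied from the table above.
VLANS = {
    3: "WiFi AP management",
    10: "Main campus VLAN",
    20: "Student wifi",
    24: "Staff wifi",
    28: "Guest wifi",
    32: "College Tablets",
}


@dataclass(frozen=True)
class Port:
    """Hypothetical model of one switch port's VLAN membership."""
    name: str
    tagged: frozenset
    untagged: frozenset = frozenset()
    pvid: int = 1

    def carries(self, vlan: int) -> bool:
        # A port carries a VLAN if it is a tagged or untagged member of it.
        return vlan in self.tagged or vlan in self.untagged


def missing_on_uplink(edge: Port, uplink: Port) -> list:
    """Return the VLANs present on the edge port that the uplink does not carry."""
    return sorted(v for v in (edge.tagged | edge.untagged) if not uplink.carries(v))


# AP-facing port on a 748TP: tagged 10/20/24/28/32, untagged 3, PVID 3.
ap_port = Port("748TP AP port",
               tagged=frozenset({10, 20, 24, 28, 32}),
               untagged=frozenset({3}),
               pvid=3)

# Trunks on both ends: all VLANs tagged (assumed to be the six in the table).
trunk_748 = Port("748TP trunk to stack", tagged=frozenset(VLANS))
stack_trunk = Port("stack trunk to 748TP", tagged=frozenset(VLANS))

if __name__ == "__main__":
    print(missing_on_uplink(ap_port, trunk_748))    # prints []
    print(missing_on_uplink(ap_port, stack_trunk))  # prints []
```

An empty list for both trunks just confirms that, on paper, nothing tagged on the AP ports is dropped at the uplinks, which matches what I see in the switch configuration.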

There is a Windows Server 2008 R2 DC running DHCP for all scopes and NAP for RADIUS authentication. It is plugged into a port on the 52G3 that is a member of VLAN 10 only and has a PVID of 10.

The switch stack is configured for DHCP relay and is using static routing between all of the VLANs mentioned so far.

The issue I'm having is that everything on the PSK-authenticated networks, and also on the RADIUS-authenticated part of VLAN 10, is fine and dandy, but on VLANs 20 and 24 a lot of RADIUS traffic seems to be going missing. Access points connected via the 52G3 mostly seem to be working okay (95% of clients authenticate successfully), but RADIUS requests from access points connected via the 28GF3 don't seem to be making it over the stack link to the server; most, but not all, fail. Consequently almost nobody at that site can connect to the correct wifi. Bizarrely, a few RADIUS requests do manage to make it across, so an outright misconfiguration seems unlikely, otherwise none would get through. We're currently working around the issue by putting staff on a less secure wifi VLAN that uses PSK.

Our previous core switch, a GSM7352Sv1 (a forerunner of the M5300-52G3), had exactly the same configuration apart from stacking and worked flawlessly, so I'm convinced that the server and wifi side of things is not at fault, since nothing has changed on them. We replaced the single core switch to reduce the number of fibre runs that traveled a long way over a main road and back again, and we terminated most of the fibre on the M5300-28GF3.

Anyone got any ideas what might be going on here? I'm completely stumped!