Got a bit of an ongoing issue with our 2 Hyper-V (SCVMM) hosts connected to our Dell MD3200i SAN.
I'm getting thousands of errors (event IDs 9, 39 and 139) on both Hyper-V servers, plus the SAN itself is kicking out disconnection errors.
Everything seems to be functioning okay, but services (cluster drives) drop out briefly, usually out of hours thankfully.
Both of our servers have 8 x 1Gb LAN ports and the SAN has two controllers, so 2 x 4 ports.
the configuration for each server is:
3 ports aggregated to 3Gbps on our main network subnet 172.16.*.*
1 port connected to our other Hyper-V server on the subnet 172.15.*.*
4 ports separately connected to our SAN switch with different subnets for each port so..
this should give us failover capacity if one controller goes down, or if a NIC fails.
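
For anyone wanting to sanity-check a layout like this, here is a minimal Python sketch that confirms no two iSCSI ports on a host share a subnet. The port names and 10.10.x.x addresses are placeholders invented for illustration; the real subnets are not given in this post.

```python
import ipaddress
from itertools import combinations

# Placeholder addressing only: substitute the real iSCSI port addresses.
iscsi_ports = {
    "iscsi-intel-1": "10.10.1.11/24",
    "iscsi-intel-2": "10.10.2.11/24",
    "iscsi-broadcom-1": "10.10.3.11/24",
    "iscsi-broadcom-2": "10.10.4.11/24",
}

def overlapping_subnets(ports):
    """Return pairs of port names whose subnets overlap."""
    nets = {name: ipaddress.ip_interface(addr).network
            for name, addr in ports.items()}
    return [(a, b) for a, b in combinations(sorted(nets), 2)
            if nets[a].overlaps(nets[b])]

clashes = overlapping_subnets(iscsi_ports)
print(clashes if clashes else "each iSCSI port is in its own subnet")
```

An empty result means each port sits in its own subnet, which is what the one-subnet-per-port design described above relies on.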
By the way, both servers run Server 2008 R2 SP1 with failover clustering and the latest version of SCVMM.
The SAN is reporting everything as OK, all adapters are IPv4 with jumbo frames (9000) enabled.
Server adapters are a mix of Broadcom and Intel, and the servers are Dell R610s if that helps.
Oh, and the SAN switch is a D-Link 24-port gigabit managed switch (with jumbo enabled) and separate from the main network.
If anyone has any suggestions I would be very appreciative. If you need any information please let me know.
I just realised, this might be more suited to the virtualisation section, although...it is hardware.
Last edited by sacrej; 28th February 2011 at 12:35 PM.
Okay, because the servers have 2 x 4-port NICs (an Intel Gigabit ET Quad Port Server Adapter and a Broadcom BCM5709C NetXtreme II), we put two iSCSI ports on the Intel and two on the Broadcom, in the hope that if one card completely failed there would still be 2 ports serving iSCSI traffic on the server.
the following settings are enabled on the Broadcom adapters:
IPv4 Checksum Offload - tx/rx on
IPv4 Large Send Offload - on
Jumbo MTU - 9000
RSS queues - 8
priority and VLAN - enabled
Rcv Buffer - 750
Receive side scaling - on
TCP Connection offload - on
Transmit buffer - 1500
the following settings are enabled on the Intel adapters:
Enable PME - disabled
flow control - Rx/Tx
Gigabit master slave mode - auto detect
header data split - disabled
interrupt moderation - enabled
interrupt moderation rate - Adaptive
Ipv4 checksum offload - Rx/Tx
jumbo packets - 9014bytes
large send offload - on
duplex - auto negotiation
log link state event - on
max rss cpu's - 8
preferred NUMA node - default
priority and VLAN - enabled
rcv buffer - 250
receive side scaling - on
RSS queues - 1 queue
TCP checksum offload - rx/tx
transmit buffer - 512
UDP checksum offload - rx/tx
Virtual Machine queues - disabled
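
To make the differences between the two cards easier to spot, here is a small Python sketch that compares the settings both lists have in common. It assumes the first group of settings above belongs to the Broadcom adapters and the group from "Enable PME" onward belongs to the Intel adapters; the dictionary keys are just labels chosen for this example.

```python
# Values copied from the two settings lists above (assumed split:
# first group = Broadcom, group from "Enable PME" onward = Intel).
broadcom = {
    "jumbo_frame_bytes": 9000,
    "receive_buffers": 750,
    "transmit_buffers": 1500,
    "rss_queues": 8,
    "receive_side_scaling": "on",
    "large_send_offload": "on",
}
intel = {
    "jumbo_frame_bytes": 9014,
    "receive_buffers": 250,
    "transmit_buffers": 512,
    "rss_queues": 1,
    "receive_side_scaling": "on",
    "large_send_offload": "on",
}

def differing(a, b):
    """Return {setting: (a_value, b_value)} for shared settings that differ."""
    return {k: (a[k], b[k]) for k in sorted(a.keys() & b.keys()) if a[k] != b[k]}

for setting, (bcm, itl) in differing(broadcom, intel).items():
    print(f"{setting}: Broadcom={bcm} Intel={itl}")
```

A difference is not necessarily a fault, since the two vendors use different defaults and units, but it gives a short list of settings worth checking first.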
We use the Broadcom Advanced Control Suite 3 (BACS3) as well, where 2 Broadcom ports and one Intel port are bonded to make the primary LAN connection.
The failover link cable uses an Intel port,
which leaves 2 x Intel and 2 x Broadcom for the iSCSI.
According to BACS3 the Broadcom adapters have the following offload capabilities: TOE, LSO, CO, RSS
Intel adapters show: LSO, CO, RSS
And yes, all 4 NICs are showing traffic, although some more than others (reported on the SAN side too).
Also, all iSCSI cables are brand new 0.5m Cat6.
Thank you for your fast response.
Last edited by sacrej; 28th February 2011 at 02:38 PM.
Am I right in thinking that all 4 links are dropping at once, across both cards? If so, it would seem to rule out host hardware as an issue. That still leaves host software, the switch, or the SAN, unfortunately.
Are you using the Microsoft iSCSI connector on the Hyper-V hosts or a third party? Also, what model is the D-Link switch?
Last edited by AngryTechnician; 28th February 2011 at 03:25 PM.
Well, I'm not sure. In the SAN event log I'm getting 3 errors at 26 minutes past and 3 errors at 56 minutes past every hour (1 of these errors is for RAID controller 1 and 2 are for controller 0).
the switch is this
And yes, we are using the Microsoft connector, but it was set up via the Dell software.
The errors occur roughly every 30 minutes (give or take a few here and there),
which is strange because the clients report the errors at slightly later times (about 5 minutes later). One server reports 2 errors every 30 minutes; the other reports either 8 or 11 (it alternates).
One thing to add: only one server is generally dropping drives at the moment (named Hyperv-2),
i.e. "Cluster Shared Volume 'Volume6' ('Cluster Disk 6') is no longer available on this node because of 'STATUS_CONNECTION_DISCONNECTED(c000020c)'. All I/O will temporarily be queued until a path to the volume is reestablished."
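
One way to confirm whether the drops really follow a fixed 30-minute cycle is to bucket the event timestamps by minute of the hour. Below is a minimal Python sketch; the timestamps are made-up sample values shaped like the pattern described above, not real log entries, and the timestamp format is an assumption.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample data: replace with timestamps exported from the
# SAN event log or the Windows System log.
events = [
    "2011-02-28 09:26:03",
    "2011-02-28 09:56:10",
    "2011-02-28 10:26:05",
    "2011-02-28 10:56:08",
    "2011-02-28 11:26:02",
]

def minute_histogram(stamps, fmt="%Y-%m-%d %H:%M:%S"):
    """Count how many events fall on each minute of the hour."""
    return Counter(datetime.strptime(s, fmt).minute for s in stamps)

hist = minute_histogram(events)
for minute, count in sorted(hist.items()):
    print(f"minute {minute:02d}: {count} event(s)")
```

If nearly all events land on the same two minutes, a scheduled job or timer is a more likely trigger than a flaky cable or port, and running the same histogram on the host logs would show whether the 5-minute offset is constant.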
Just questions, sorry, no answers. Looking at your network settings:
Are you using jumbo frames? If so, why are they set to different MTUs (9000/9014)? All devices on the iSCSI network should have the same MTU (switch, iSCSI storage, Hyper-V hosts).
Also double check Flow Control, are your switches configured correctly?
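
On the 9000/9014 question, the two figures may describe the same MTU if the Intel driver counts the 14-byte Ethernet header in its "jumbo packets" value while the Broadcom driver does not. That is an assumption about the drivers, not something confirmed in this thread. A quick Python sketch of the arithmetic, including the largest ping payload that should pass unfragmented:

```python
ETHERNET_HEADER = 14  # bytes: destination MAC + source MAC + EtherType
IPV4_HEADER = 20      # bytes, without IP options
ICMP_HEADER = 8       # bytes

def ip_mtu(setting, includes_ethernet_header):
    """Convert a driver's jumbo setting to the IP-level MTU."""
    return setting - ETHERNET_HEADER if includes_ethernet_header else setting

def max_ping_payload(mtu):
    """Largest ICMP echo payload that fits in one unfragmented packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER

broadcom_mtu = ip_mtu(9000, includes_ethernet_header=False)
intel_mtu = ip_mtu(9014, includes_ethernet_header=True)  # assumption, see above

print("same IP MTU:", broadcom_mtu == intel_mtu)
print("max ping payload:", max_ping_payload(broadcom_mtu))
```

With that payload size, something like `ping -f -l 8972 <SAN port IP>` from each host (Windows `ping`, where `-f` sets "don't fragment" and `-l` sets the payload size) tests the whole path without changing any settings on the live system.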
I'm not sure why the Intel adapters are showing 9014. They were definitely set to 9000 originally, as I was the one that configured them all, but I can't just change this on the fly as it's a live system with about 11 VMs running.
The switch is set to support jumbo and flow control, that was manually set for all designated ports.
There are the usual system tasks, but nothing that fits in with the times of the problems we are getting.
If it helps, I can provide some kind of temporary remote access so you can have a look, as I realise it's quite a complicated problem to try and describe.