82573 xfers pause, no watchdog timeouts, DCGDIS ineffective (7.2-R)


by Royce Williams

We have servers with dual 82573 NICs that work well during low-throughput activity, but during high-volume activity, they pause shortly after transfers start and do not recover.  Other sessions to the system are not affected.

These systems are being repurposed, jumping from 6.3 to 7.2.  The same system and its kin do not exhibit the symptom under 6.3-RELEASE-p13.  The symptoms appear under a freebsd-update'd 7.2-RELEASE GENERIC kernel with no tuning.

Previously, we've been using DCGDIS.EXE (from Jack Vogel) for this symptom.  The first system to be repurposed accepts DCGDIS, reporting 'Updated' on the first run and 'update not needed' on subsequent runs, but with no relief.

Notably, there are no watchdog timeout errors - unlike our various Supermicro models still running FreeBSD 6.x.  All of our other 7.x Supermicro flavors had already received the flash update and haven't shown the symptom.

Details follow.

Kernel:

rand# uname -a
FreeBSD rand.acsalaska.net 7.2-RELEASE-p4 FreeBSD 7.2-RELEASE-p4 #0: Fri Oct  2 12:21:39 UTC 2009     root@...:/usr/obj/usr/src/sys/GENERIC  i386

sysctls:

rand# sysctl dev.em
dev.em.0.%desc: Intel(R) PRO/1000 Network Connection 6.9.6
dev.em.0.%driver: em
dev.em.0.%location: slot=0 function=0
dev.em.0.%pnpinfo: vendor=0x8086 device=0x108c subvendor=0x15d9 subdevice=0x108c class=0x020000
dev.em.0.%parent: pci13
dev.em.0.debug: -1
dev.em.0.stats: -1
dev.em.0.rx_int_delay: 0
dev.em.0.tx_int_delay: 66
dev.em.0.rx_abs_int_delay: 66
dev.em.0.tx_abs_int_delay: 66
dev.em.0.rx_processing_limit: 100
dev.em.1.%desc: Intel(R) PRO/1000 Network Connection 6.9.6
dev.em.1.%driver: em
dev.em.1.%location: slot=0 function=0
dev.em.1.%pnpinfo: vendor=0x8086 device=0x108c subvendor=0x15d9 subdevice=0x108c class=0x020000
dev.em.1.%parent: pci14
dev.em.1.debug: -1
dev.em.1.stats: -1
dev.em.1.rx_int_delay: 0
dev.em.1.tx_int_delay: 66
dev.em.1.rx_abs_int_delay: 66
dev.em.1.tx_abs_int_delay: 66
dev.em.1.rx_processing_limit: 100

kenv:

rand# kenv | grep smbios | egrep -v 'socket|serial|uuid|tag|0123456789'
smbios.bios.reldate="03/05/2008"
smbios.bios.vendor="Phoenix Technologies LTD"
smbios.bios.version="6.00"
smbios.chassis.maker="Supermicro"
smbios.planar.maker="Supermicro"
smbios.planar.product="PDSMi "
smbios.planar.version="PCB Version"
smbios.system.maker="Supermicro"
smbios.system.product="PDSMi"


The system is not yet production, so I can invasively abuse it if needed.  The other systems are in production under 6.3-RELEASE-p13 and can also be inspected.

Any pointers appreciated.

Royce


_______________________________________________
freebsd-stable@... mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscribe@..."

Re: 82573 xfers pause, no watchdog timeouts, DCGDIS ineffective (7.2-R)

by Jeremy Chadwick

On Thu, Nov 12, 2009 at 10:36:16AM -0900, Royce Williams wrote:
> We have servers with dual 82573 NICs that work well during low-throughput activity, but during high-volume activity, they pause shortly after transfers start and do not recover.  Other sessions to the system are not affected.

Please define "low-throughput" and "high-volume" if you could; it might
help folks determine where the threshold is for problems.

For what it's worth as a comparison base:

We use the following Supermicro SuperServers, and can confirm that no
such issues occur for us using RELENG_6 nor RELENG_7 on the following
hardware:

Supermicro SuperServer 5015B-MTB - amd64 - Intel 82573V + Intel 82573L
Supermicro SuperServer 5015M-T+B - amd64 - Intel 82573V + Intel 82573L
Supermicro SuperServer 5015M-T+B - amd64 - Intel 82573V + Intel 82573L
Supermicro SuperServer 5015M-T+B - i386  - Intel 82573V + Intel 82573L
Supermicro SuperServer 5015M-T+B - i386  - Intel 82573V + Intel 82573L

The 5015B-MTB system presently runs RELENG_8 -- no issues there either.

Relevant server configuration and network setup details:

- All machines use pf(4).
- All emX devices are configured for autoneg.
- All emX devices use RXCSUM, TXCSUM, and TSO4.
- We do not use polling.
- All machines use both NICs simultaneously at all times.
- All machines connected to an HP ProCurve 2626 switch (100mbit,
  full-duplex ports, all autoneg).
- We do not use Jumbo frames.
- No add-in cards (PCI, PCI-X, nor PCIe) are used in the systems.
- All of the systems had DCGDIS.EXE run on them; no EEPROM settings
  were changed, indicating the from-the-Intel-factory MANC register
  in question was set properly.

Relevant throughput details per box:

- em0 pushes ~600-1000 kbit/sec at all times.
- em1 pushes ~100-200 kbit/sec at all times.
- During nightly maintenance (backups), em1 pushes ~2-3 Mbit/sec
  for a variable amount of time.
- For a full level 0 backup (which I've done numerous times), em1
  pushes 60-70 Mbit/sec without issues.

I've compared your sysctl dev.em output to that of our 5015M-T+B systems
(which use the PDSMi+, not the PDSMi, but whatever), and ours is 100%
identical.

All of our 5015M-T+B systems are using BIOS 1.3, and the 5015B-MTB
system is using BIOS 1.30.

If you'd like, I can provide the exact BIOS settings we use on the
machines in question; they do deviate from the factory defaults a slight
bit, but none of the adjustments are "tweaks" for performance or
otherwise (just disabling things which we don't use, etc.).

--
| Jeremy Chadwick                                   jdc@... |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |


Re: 82573 xfers pause, no watchdog timeouts, DCGDIS ineffective (7.2-R)

by Jack Vogel

It is critically important on these systems that you get the latest BIOS on
them, so maybe that's the difference between you two.  I am going to be
putting out a new em driver to CURRENT soon; it might be an option to try
that as well.  It sounds like a hang, and a management/OS race in the driver
is a possibility.

Jack



Re: 82573 xfers pause, no watchdog timeouts, DCGDIS ineffective (7.2-R)

by Royce Williams-2

On Thu, Nov 12, 2009 at 11:47 AM, Jeremy Chadwick
<freebsd@...> wrote:
> Please define "low-throughput" and "high-volume" if you could; it might
> help folks determine where the threshold is for problems.

My definitions are pretty subjective/operational, but for what it's worth:

- "low" is interactive SSH, DNS lookups, and pings;
- "high" is a single unthrottled rsync session.
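
To put rough numbers on those definitions, one could sample an
interface's byte counter twice and convert the difference to kbit/sec.
A minimal sketch follows; rate_kbit is a hypothetical helper, and taking
the counters from the byte columns of `netstat -I em0 -b` is an
assumption about the setup, not something established in this thread:

```shell
#!/bin/sh
# rate_kbit BEFORE AFTER SECONDS
# Hypothetical helper: average rate in kbit/sec between two samples of an
# interface byte counter taken SECONDS apart (integer math, truncating).
rate_kbit() {
    before=$1
    after=$2
    seconds=$3
    # bytes -> bits (*8), bits -> kbit (/1000), then per second
    echo $(( (after - before) * 8 / 1000 / seconds ))
}
```

For example, `rate_kbit 0 1250000 10` prints 1000, i.e. roughly
1 Mbit/sec averaged over ten seconds.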

>> rand# sysctl dev.em
>> dev.em.0.%desc: Intel(R) PRO/1000 Network Connection 6.9.6

>> dev.em.0.%pnpinfo: vendor=0x8086 device=0x108c subvendor=0x15d9 subdevice=0x108c class=0x020000

>> kenv:
>>
>> rand# kenv | grep smbios | egrep -v 'socket|serial|uuid|tag|0123456789'
>> smbios.bios.reldate="03/05/2008"

> For what it's worth as a comparison base:
>
> We use the following Supermicro SuperServers, and can confirm that no
> such issues occur for us using RELENG_6 nor RELENG_7 on the following
> hardware:

[good cross-check list snipped]

The problem system is a 5015M-MF.  We are running 5015M-MT+ and
5015T-PR on RELENG_6 and 7, both without the symptom.

> Relevant server configuration and network setup details:
>
> - All machines use pf(4).
> - All emX devices are configured for autoneg.
> - All emX devices use RXCSUM, TXCSUM, and TSO4.
> - We do not use polling.
> - All machines use both NICs simultaneously at all times.
> - All machines connected to an HP ProCurve 2626 switch (100mbit,
>  full-duplex ports, all autoneg).
> - We do not use Jumbo frames.
> - No add-in cards (PCI, PCI-X, nor PCIe) are used in the systems.
> - All of the systems had DCGDIS.EXE run on them; no EEPROM settings
>  were changed, indicating the from-the-Intel-factory MANC register
>  in question was set properly.

No firewall is active on the problem system, and none of this back
have been DCGDIS-ified, but otherwise, our setup is identical.

> I've compared your sysctl dev.em output to that of our 5015M-T+B systems
> (which use the PDSMi+, not the PDSMi, but whatever), and ours is 100%
> identical.
>
> All of our 5015M-T+B systems are using BIOS 1.3, and the 5015B-MTB
> system is using BIOS 1.30.

The repurposed system is at 1.3 (03/05/2008) - flashed prior to
install. The production 6.3 systems are using 1.1 (or 1.1A; I would have
to reboot to check, but the date is 10/27/2005).

> If you'd like, I can provide the exact BIOS settings we use on the
> machines in question; they do deviate from the factory defaults a slight
> bit, but none of the adjustments are "tweaks" for performance or
> otherwise (just disabling things which we don't use, etc.).

We're running similarly as well.

I might be able to retire another system of this batch and install
7.2, but leave the BIOS update off, to see if it makes a difference.

Royce

Re: 82573 xfers pause, no watchdog timeouts, DCGDIS ineffective (7.2-R)

by Royce Williams-2

On Thu, Nov 12, 2009 at 2:18 PM, Royce Williams
<royce.williams@...> wrote:
> On Thu, Nov 12, 2009 at 11:47 AM, Jeremy Chadwick
>> - All machines connected to an HP ProCurve 2626 switch (100mbit,
>>  full-duplex ports, all autoneg).

> No firewall is active on the problem system, and none of this back
> have been DCGDIS-ified, but otherwise, our setup is identical.

Er, s/back/batch/g, and it's not a ProCurve. ;-)  But we are also
usually full-duplex and autoneg on both sides.

Based on new (embarrassing) information, I'll leave it to Jack to
decide whether or not he wants to pursue this further.

The problem box is sitting in my grotty mini-lab, with a subnet
partially serviced by a 10M hub.  Guess which Ethernet cable I picked
up.  Guess what happens when I move the system to a 100M/full
connection.

As my cow-orker put it, "You and the other four people on Earth using
that NIC on 10M hubs" can probably find workarounds.  My apologies for
the noise, though it's theoretically possible that the root cause
might still need addressing.
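
For anyone chasing a similar stall, a quick look at the negotiated media
would probably have caught this sooner.  A minimal sketch that
classifies `ifconfig` output read on stdin; link_status is a
hypothetical helper, and the sample media strings in the comments are
assumed to match what em(4) prints on 7.x:

```shell
#!/bin/sh
# link_status: read `ifconfig <if>` output on stdin and print one of
#   unknown | half-duplex | 10M | ok
# Hypothetical helper, not part of FreeBSD.
link_status() {
    # Keep only the first "media:" line, assumed to look like
    #   media: Ethernet autoselect (100baseTX <full-duplex>)
    #   media: Ethernet autoselect (10baseT/UTP <half-duplex>)
    media=$(grep 'media:' | head -n 1)
    case "$media" in
        "")            echo "unknown" ;;      # no media line found
        *half-duplex*) echo "half-duplex" ;;  # e.g. behind a hub
        *"(10baseT"*)  echo "10M" ;;          # 10 Mbit, full duplex
        *)             echo "ok" ;;
    esac
}
```

On a live system this would be run as `ifconfig em0 | link_status`
(the interface name is an assumption); anything other than "ok" is
worth a look before suspecting the driver.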

Jack, let me know if you want me to do any testing for you.  Or I can
always send you my hub. ;-)

Royce

Re: 82573 xfers pause, no watchdog timeouts, DCGDIS ineffective (7.2-R)

by Jack Vogel

LOL, glad the problem has been resolved, and no thanks, I do not need
to pursue this any further.

I also want to thank Jeremy for his help and data!!

Thanks guys and good evening,

Jack

