Synchronization bug, crash on multi-core CPUs

View: New views
5 Messages — Rating Filter:   Alert me  

Synchronization bug, crash on multi-core CPUs

by Enno Rehling :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

I've run into serious problems with our game crashing on fast
multicore-CPUs. This seems to be somewhat related to the report in
http://sourceforge.net/tracker/index.php?func=detail&aid=1565896&group_id=32783&atid=406494
although the proposed fix there doesn't work (and I don't see quite how
it could work). After a day looking at this and trying some things I'm
no wiser.

The crash happens in AbstractDevice::eventThread(), a function that
handles events sent from other threads. The actual crash is in the
assignment call of this code:

     // Process the events.
      while (!events.empty()) {
        EventPtr event = events.front();
        events.pop();
        processEvent(event.get());
      }

Of course, this all just happens to me when I have optimizations turned
on, and only on one machine that I have no good development environment
on, and only sporadically after 5-10 minutes of playing, so it's damn
hard to debug. And the mix of event handling, object lifetimes,
reference counting, threading and "smart" pointers that combines to
create this is pretty potent.

If anyone can help me out here, I'd be eternally grateful. The deadlines
are making whooshing sounds as they pass by.

Enno.
--
Chess is as elaborate a waste of human intelligence as you can find
outside an advertising agency -- Raymond Chandler


-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
_______________________________________________
Audiere-users mailing list
Audiere-users@...
https://lists.sourceforge.net/lists/listinfo/audiere-users

Re: Synchronization bug, crash on multi-core CPUs

by Chad Austin :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Hi Enno,

I've tried to reproduce the problem on my home dual Xeon HT and at
work on my Core 2 Duo, but I couldn't do it.  The test program is
test/race/race.cpp (just added in svn).  Do you have a test case?

Regarding our earlier conversation, I think a worthy thing to try is
replacing ++ and -- with InterlockedIncrement and
InterlockedDecrement, respectively, and replacing "int refcount" with
"volatile long refcount".  Although I can't imagine a scenario where
cross-CPU cache coherency would cause a crash.  If that doesn't work,
let's come up with another theory.

Chad

On 4/13/07, Enno Rehling <enno.rehling@...> wrote:

> I've run into serious problems with our game crashing on fast
> multicore-CPUs. This seems to be somewhat related to the report in
> http://sourceforge.net/tracker/index.php?func=detail&aid=1565896&group_id=32783&atid=406494
> although the proposed fix there doesn't work (and I don't see quite how
> it could work). After a day looking at this and trying some things I'm
> no wiser.
>
> The crash happens in AbstractDevice::eventThread(), a function that
> handles events sent from other threads. The actual crash is in the
> assignment call of this code:
>
>      // Process the events.
>       while (!events.empty()) {
>         EventPtr event = events.front();
>         events.pop();
>         processEvent(event.get());
>       }
>
> Of course, this all just happens to me when I have optimizations turned
> on, and only on one machine that I have no good development environment
> on, and only sporadically after 5-10 minutes of playing, so it's damn
> hard to debug. And the mix of event handling, object lifetimes,
> reference counting, threading and "smart" pointers that combines to
> create this is pretty potent.
>
> If anyone can help me out here, I'd be eternally grateful. The deadlines
> are making whooshing sounds as they pass by.
>
> Enno.
> --
> Chess is as elaborate a waste of human intelligence as you can find
> outside an advertising agency -- Raymond Chandler
>
>
> -------------------------------------------------------------------------
> Take Surveys. Earn Cash. Influence the Future of IT
> Join SourceForge.net's Techsay panel and you'll get the chance to share your
> opinions on IT & business topics through brief surveys-and earn cash
> http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
> _______________________________________________
> Audiere-users mailing list
> Audiere-users@...
> https://lists.sourceforge.net/lists/listinfo/audiere-users
>


--
Chad Austin
http://imvu.com/technology

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
_______________________________________________
Audiere-users mailing list
Audiere-users@...
https://lists.sourceforge.net/lists/listinfo/audiere-users

Re: Synchronization bug, crash on multi-core CPUs

by Enno Rehling :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Chad Austin wrote:
> Hi Enno,
>
> I've tried to reproduce the problem on my home dual Xeon HT and at
> work on my Core 2 Duo, but I couldn't do it.  The test program is
> test/race/race.cpp (just added in svn).  Do you have a test case?

I haven't got a test case, I'm afraid. I have an AMD 4200+ dual-core on
which it doesn't crash, but our PQA can reproduce it on an AMD 4400. The
publisher is running on dual-core Pentium D 2.8, and they get the crash as
well. I run XP64 and they don't but apart from maybe having different
scheduling strategies, that shouldn't make a difference in my opinion.

the good news is that it seems to have worked. I made thosee changes to the
reference counting implementation on Saturday and have tried to crash the
game all morning without an incident. We'll run this through some more QA,
but I'm excited and hopeful.

Patch is attached.

Enno.
--
"I tell the young security people, I say, 'you don’t understand. I am
nowhere near the threat I had hoped to become.'"
- Arlo Guthrie


Index: audiere.h
===================================================================
--- audiere.h (revision 672)
+++ audiere.h (working copy)
@@ -29,6 +29,10 @@
 #include <vector>
 #include <string>
 
+#ifdef WIN32
+#include <windows.h>
+#endif
+
 #ifdef _MSC_VER
 #pragma warning(disable : 4786)
 #endif
@@ -204,17 +208,31 @@
 
   public:
     void ADR_CALL ref() {
-      ++m_ref_count;
+#ifdef WIN32
+      InterlockedIncrement(&m_ref_count);
+#else
+ ++m_ref_count;
+#endif
     }
 
     void ADR_CALL unref() {
+#ifdef WIN32
+      if (InterlockedDecrement(&m_ref_count) == 0) {
+        delete this;
+      }
+#else
       if (--m_ref_count == 0) {
         delete this;
       }
-    }
+#endif
+ }
 
   private:
-    int m_ref_count;
+#ifdef WIN32
+    volatile LONG m_ref_count;
+#else
+    volatile int m_ref_count;
+#endif
   };
 
 

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
_______________________________________________
Audiere-users mailing list
Audiere-users@...
https://lists.sourceforge.net/lists/listinfo/audiere-users

Re: Synchronization bug, crash on multi-core CPUs

by Chad Austin :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Very nice.  I'll check that change into audiere svn.  Hope it works.

On 4/16/07, Enno Rehling <enno.rehling@...> wrote:

> Chad Austin wrote:
> > Hi Enno,
> >
> > I've tried to reproduce the problem on my home dual Xeon HT and at
> > work on my Core 2 Duo, but I couldn't do it.  The test program is
> > test/race/race.cpp (just added in svn).  Do you have a test case?
>
> I haven't got a test case, I'm afraid. I have an AMD 4200+ dual-core on
> which it doesn't crash, but our PQA can reproduce it on an AMD 4400. The
> publisher is running on dual-core Pentium D 2.8, and they get the crash as
> well. I run XP64 and they don't but apart from maybe having different
> scheduling strategies, that shouldn't make a difference in my opinion.
>
> the good news is that it seems to have worked. I made thosee changes to the
> reference counting implementation on Saturday and have tried to crash the
> game all morning without an incident. We'll run this through some more QA,
> but I'm excited and hopeful.
>
> Patch is attached.
>
> Enno.
> --
> "I tell the young security people, I say, 'you don't understand. I am
> nowhere near the threat I had hoped to become.'"
> - Arlo Guthrie
>
>
> Index: audiere.h
> ===================================================================
> --- audiere.h   (revision 672)
> +++ audiere.h   (working copy)
> @@ -29,6 +29,10 @@
>  #include <vector>
>  #include <string>
>
> +#ifdef WIN32
> +#include <windows.h>
> +#endif
> +
>  #ifdef _MSC_VER
>  #pragma warning(disable : 4786)
>  #endif
> @@ -204,17 +208,31 @@
>
>    public:
>      void ADR_CALL ref() {
> -      ++m_ref_count;
> +#ifdef WIN32
> +      InterlockedIncrement(&m_ref_count);
> +#else
> +               ++m_ref_count;
> +#endif
>      }
>
>      void ADR_CALL unref() {
> +#ifdef WIN32
> +      if (InterlockedDecrement(&m_ref_count) == 0) {
> +        delete this;
> +      }
> +#else
>        if (--m_ref_count == 0) {
>          delete this;
>        }
> -    }
> +#endif
> +       }
>
>    private:
> -    int m_ref_count;
> +#ifdef WIN32
> +    volatile LONG m_ref_count;
> +#else
> +    volatile int m_ref_count;
> +#endif
>    };
>
>
>
> -------------------------------------------------------------------------
> This SF.net email is sponsored by DB2 Express
> Download DB2 Express C - the FREE version of DB2 express and take
> control of your XML. No limits. Just data. Click to get it now.
> http://sourceforge.net/powerbar/db2/
> _______________________________________________
> Audiere-users mailing list
> Audiere-users@...
> https://lists.sourceforge.net/lists/listinfo/audiere-users
>
>


--
Chad Austin
http://imvu.com/technology

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
_______________________________________________
Audiere-users mailing list
Audiere-users@...
https://lists.sourceforge.net/lists/listinfo/audiere-users

Re: Synchronization bug, crash on multi-core CPUs

by Chad Austin :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

It's checked in, with some changes that export two new functions in
the Audiere API:

  /**
   * Atomically increments a counter.  Returns incremented value.
   */
  inline long AtomicIncrement(volatile long& var) {
    return hidden::AdrAtomicIncrement(var);
  }

  /**
   * Atomically decrements a counter.  Returns decremented value.
   */
  inline long AtomicDecrement(volatile long& var) {
    return hidden::AdrAtomicDecrement(var);
  }


On 4/16/07, Chad Austin <aegis@...> wrote:

> Very nice.  I'll check that change into audiere svn.  Hope it works.
>
> On 4/16/07, Enno Rehling <enno.rehling@...> wrote:
> > Chad Austin wrote:
> > > Hi Enno,
> > >
> > > I've tried to reproduce the problem on my home dual Xeon HT and at
> > > work on my Core 2 Duo, but I couldn't do it.  The test program is
> > > test/race/race.cpp (just added in svn).  Do you have a test case?
> >
> > I haven't got a test case, I'm afraid. I have an AMD 4200+ dual-core on
> > which it doesn't crash, but our PQA can reproduce it on an AMD 4400. The
> > publisher is running on dual-core Pentium D 2.8, and they get the crash as
> > well. I run XP64 and they don't but apart from maybe having different
> > scheduling strategies, that shouldn't make a difference in my opinion.
> >
> > the good news is that it seems to have worked. I made thosee changes to the
> > reference counting implementation on Saturday and have tried to crash the
> > game all morning without an incident. We'll run this through some more QA,
> > but I'm excited and hopeful.
> >
> > Patch is attached.
> >
> > Enno.
> > --
> > "I tell the young security people, I say, 'you don't understand. I am
> > nowhere near the threat I had hoped to become.'"
> > - Arlo Guthrie
> >
> >
> > Index: audiere.h
> > ===================================================================
> > --- audiere.h   (revision 672)
> > +++ audiere.h   (working copy)
> > @@ -29,6 +29,10 @@
> >  #include <vector>
> >  #include <string>
> >
> > +#ifdef WIN32
> > +#include <windows.h>
> > +#endif
> > +
> >  #ifdef _MSC_VER
> >  #pragma warning(disable : 4786)
> >  #endif
> > @@ -204,17 +208,31 @@
> >
> >    public:
> >      void ADR_CALL ref() {
> > -      ++m_ref_count;
> > +#ifdef WIN32
> > +      InterlockedIncrement(&m_ref_count);
> > +#else
> > +               ++m_ref_count;
> > +#endif
> >      }
> >
> >      void ADR_CALL unref() {
> > +#ifdef WIN32
> > +      if (InterlockedDecrement(&m_ref_count) == 0) {
> > +        delete this;
> > +      }
> > +#else
> >        if (--m_ref_count == 0) {
> >          delete this;
> >        }
> > -    }
> > +#endif
> > +       }
> >
> >    private:
> > -    int m_ref_count;
> > +#ifdef WIN32
> > +    volatile LONG m_ref_count;
> > +#else
> > +    volatile int m_ref_count;
> > +#endif
> >    };
> >
> >
> >
> > -------------------------------------------------------------------------
> > This SF.net email is sponsored by DB2 Express
> > Download DB2 Express C - the FREE version of DB2 express and take
> > control of your XML. No limits. Just data. Click to get it now.
> > http://sourceforge.net/powerbar/db2/
> > _______________________________________________
> > Audiere-users mailing list
> > Audiere-users@...
> > https://lists.sourceforge.net/lists/listinfo/audiere-users
> >
> >
>
>
> --
> Chad Austin
> http://imvu.com/technology
>


--
Chad Austin
http://imvu.com/technology

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
_______________________________________________
Audiere-users mailing list
Audiere-users@...
https://lists.sourceforge.net/lists/listinfo/audiere-users