Mac Forum / Programming / CodeWarrior / January 2004




Performance of Boost shared_ptr with threads

Martin Taylor - 23 Jan 2004 10:43 GMT
Hi

I am working on a cross-platform project which makes extensive use of
the Boost shared_ptr.  A while back, we made the application threaded
and found that the MacOS build took a big performance hit in areas of
the code where a lot of pointer sharing was going on.

It seems that the problem is with the boost lightweight mutex used with
shared_ptr (in boost/detail/lightweight_mutex.hpp) - many platforms have
an implementation that uses a spinlock, but with CodeWarrior on MacOS
the best we can do is use pthreads.

To show that this is not an insignificant problem, by the way, I made a
bit of test code:

  typedef boost::shared_ptr<int>   ItemRef;
  const int         numItems = 40000;
  vector<ItemRef>   items;
 
  items.reserve (numItems);
  for (int i = 0; i < numItems; ++i)
     items.insert(items.begin(), ItemRef(new int (i)));

Compiled and run with CodeWarrior 8 (mach-o executable) this takes 76
seconds on my machine.  
Compiled with gcc, boost uses a spinlock for the lightweight mutex and
the same code takes 27 seconds to execute. Quite a difference.
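
For anyone who wants to repeat the measurement, here is a minimal self-contained sketch of the same front-insertion test with a timer around it. It is not the code that produced the numbers above: std::shared_ptr stands in for boost::shared_ptr so that it builds without Boost, and the function names are made up for illustration. Note also that a C++11 or later compiler shifts the vector elements by moving them, which does not touch the reference count, so this version will not reproduce the mutex traffic that a 2004 toolchain generated by copying.

```cpp
#include <cassert>
#include <chrono>
#include <memory>
#include <vector>

// Assumption: std::shared_ptr stands in for boost::shared_ptr.
typedef std::shared_ptr<int> ItemRef;

// Builds the vector as in the test above: every new item is inserted
// at the front, so each insertion shifts all existing elements.
std::vector<ItemRef> build_front_inserted(int numItems)
{
    assert(numItems >= 0);
    std::vector<ItemRef> items;
    items.reserve(numItems);
    for (int i = 0; i < numItems; ++i)
        items.insert(items.begin(), ItemRef(new int(i)));
    return items;
}

// Returns the wall-clock time of one run, in milliseconds.
double time_front_insert_ms(int numItems)
{
    typedef std::chrono::steady_clock clock;
    clock::time_point start = clock::now();
    std::vector<ItemRef> items = build_front_inserted(numItems);
    clock::time_point stop = clock::now();
    assert(items.size() == static_cast<std::size_t>(numItems));
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling time_front_insert_ms(40000) matches the item count used in the post; a smaller count is enough to check that the code behaves.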

So finally I get to the point of this post!  The gcc lightweight mutex
in boost/detail/lwm_gcc.hpp uses a spinlock implemented in assembler in
/usr/include/gcc/darwin/3.3/c++/ppc-darwin/bits/atomicity.h.
If I could lift those assembler routines and compile them with CW,
shared pointers could be nice and fast again. The problem is that I know
next to nothing about asm code, and it seems CW cannot compile asm code
written for gcc.
Can anyone help?  If so, I would be happy to submit the code to boost
for the benefit of all.

Thanks,
Martin
MW Ron - 23 Jan 2004 15:12 GMT
In article
<martin.taylor-41ECD4.10433523012004@newstrial.btopenworld.com>,

Martin,

Howard is taking a skiing week, and he is the expert on this.  If you
can write me or write here again next Monday he'll be able to help.  I
believe there are some things we fixed with locking in CW 9 which may
have better performance too.

Ron

>I am working on a cross-platform project which makes extensive use of
>the Boost shared_ptr.  A while back, we made the application threaded
[quoted text clipped - 34 lines]
>Thanks,
>Martin

Signature

Metrowerks, maker of CodeWarrior   -  "Software Starts Here"  
Ron Liechty - MWRon@metrowerks.com - <http://www.metrowerks.com>

Howard Hinnant - 23 Jan 2004 21:09 GMT
In article
<martin.taylor-41ECD4.10433523012004@newstrial.btopenworld.com>,

> I am working on a cross-platform project which makes extensive use of
> the Boost shared_ptr.  A while back, we made the application threaded
[quoted text clipped - 31 lines]
> Can anyone help?  If so, I would be happy to submit the code to boost
> for the benefit of all.

I will take you up on that offer! :-)

I took a look at boost/detail/lwm_gcc.hpp and I'm not convinced that it
is reliable, at least the scoped_lock constructor:

       explicit scoped_lock(lightweight_mutex & m): m_(m)
       {
           while( !__exchange_and_add(&m_.a_, -1) )
           {
               // m_.a_ == -1 !!!
               __atomic_add(&m_.a_, 1);
               sched_yield();
           }
       }

The design of this mutex is that a value of 0 is locked, and other
values are unlocked.  Say that the mutex is locked by thread A.  Thread
B executes __exchange_and_add which returns 0 but does succeed in
decrementing the mutex to -1.  Thread B enters the loop to wait as it
should.  But thread B gets interrupted by Thread C before it can
increment the mutex back to its proper locked state (0).  Thread C
executes __exchange_and_add which decrements the mutex to -2 and returns
-1.  Thread C now believes that it has acquired the mutex.  Chaos
ensues.
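
The interleaving described above can be replayed by hand in a single thread, which makes the arithmetic easy to check. This is only an illustrative sketch: exchange_and_add here is a plain, non-atomic stand-in for the gcc primitive, the Replay struct and replay_race function are hypothetical names, and the starting value of 1 for "unlocked" is an assumption about lwm_gcc.hpp.

```cpp
#include <cassert>

// Stand-in for __exchange_and_add: adds d to v and returns the OLD value.
// Not atomic; the interleaving is fixed by the order of the calls below.
int exchange_and_add(int & v, int d)
{
    int old = v;
    v += d;
    return old;
}

struct Replay
{
    bool a_acquired;
    bool b_acquired;
    bool c_acquired;
    int  final_value;
};

Replay replay_race()
{
    int a = 1;   // assumption: 1 = unlocked, 0 = locked
    Replay r;

    // Thread A: old value 1 is nonzero, so A owns the mutex (1 -> 0).
    r.a_acquired = exchange_and_add(a, -1) != 0;
    assert(a == 0);   // the mutex is now in its locked state

    // Thread B: old value 0 means "locked", so B must wait (0 -> -1).
    r.b_acquired = exchange_and_add(a, -1) != 0;

    // Thread B is preempted here, before it can add 1 back.

    // Thread C: old value -1 is nonzero, so C also thinks it owns
    // the mutex (-1 -> -2).
    r.c_acquired = exchange_and_add(a, -1) != 0;

    r.final_value = a;
    return r;
}
```

In this replay both A and C end up believing they hold the lock, which is the failure the post describes.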

Therefore instead of writing CodeWarrior PPC versions of
__exchange_and_add and __atomic_add as you requested, I've instead
written a CodeWarrior PPC version of lightweight_mutex based on the
lwarx/stwcx. assembly statements.  The code posted below appears to
execute approximately 8 times faster than pthreads for an uncontested
lock/unlock cycle.

Please let us know if this works for you, and if not, perhaps we can fix
whatever went wrong.  I would also like to publicly thank Bob Campbell
of Metrowerks who helped me review the PPC assembly.

Hoping to see you over at boost soon! :-)

-Howard

Metrowerks

#include <sched.h>

namespace boost
{

namespace detail
{

class lightweight_mutex
{
private:

   volatile int a_;

   lightweight_mutex(lightweight_mutex const &);
   lightweight_mutex & operator=(lightweight_mutex const &);

public:

   lightweight_mutex(): a_(0)
   {
   }

   class scoped_lock;
   friend class scoped_lock;

   class scoped_lock
   {
   private:

       lightweight_mutex & m_;

       scoped_lock(scoped_lock const &);
       scoped_lock & operator=(scoped_lock const &);

   public:

       explicit scoped_lock(lightweight_mutex & m): m_(m)
       {
           register volatile int* p = &m_.a_;
           register int f;
           register int one = 1;
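           asm
           {
           loop:
               lwarx  f, 0, p      // load a_ and set a reservation on it
               cmpwi  f, 0         // 0 means unlocked
               bne-   yield        // already locked: give up the time slice
               stwcx. one, 0, p    // store 1 only if the reservation still holds
               beq+   done         // store succeeded: lock acquired
           }
           yield:
           sched_yield();
           goto loop;
           done: ;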
       }

       ~scoped_lock()
       {
           m_.a_ = 0;
       }
   };
};

}

}
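
For readers without a PPC toolchain, the same lock protocol (0 = unlocked, 1 = locked, yield while locked) can be sketched with std::atomic. This is not Howard's code and not anything shipped in Boost: the class name portable_lightweight_mutex and the try_lock/unlock helpers are hypothetical, and the acquire/release memory ordering is an addition of this sketch, since the assembly posted above contains no explicit barrier instructions.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

class portable_lightweight_mutex   // hypothetical name
{
    std::atomic<int> a_;   // 0 = unlocked, 1 = locked

    portable_lightweight_mutex(portable_lightweight_mutex const &);
    portable_lightweight_mutex & operator=(portable_lightweight_mutex const &);

public:

    portable_lightweight_mutex(): a_(0)
    {
    }

    // One attempt: swap 0 -> 1 atomically, fail if the value was not 0.
    bool try_lock()
    {
        int expected = 0;
        return a_.compare_exchange_strong(expected, 1,
                                          std::memory_order_acquire,
                                          std::memory_order_relaxed);
    }

    void unlock()
    {
        assert(a_.load(std::memory_order_relaxed) == 1);
        a_.store(0, std::memory_order_release);
    }

    class scoped_lock
    {
        portable_lightweight_mutex & m_;

        scoped_lock(scoped_lock const &);
        scoped_lock & operator=(scoped_lock const &);

    public:

        // Retry until the swap succeeds, yielding between attempts.
        explicit scoped_lock(portable_lightweight_mutex & m): m_(m)
        {
            while (!m_.try_lock())
                std::this_thread::yield();
        }

        ~scoped_lock()
        {
            m_.unlock();
        }
    };
};
```

The compare-and-swap plays the role of the lwarx/stwcx. pair, and std::this_thread::yield() plays the role of sched_yield().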
Martin Taylor - 26 Jan 2004 12:13 GMT
Hi Howard

Thanks for such a fast response, especially considering you had just got
back from holiday!
I did have a couple of problems building with the code you suggested
however, so I would like to suggest the following modification:

namespace boost
{

namespace detail
{

class lightweight_mutex
{
private:

   volatile int a_;

   lightweight_mutex(lightweight_mutex const &);
   lightweight_mutex & operator=(lightweight_mutex const &);

public:

   lightweight_mutex(): a_(0)
   {
   }

   class scoped_lock;
   friend class scoped_lock;

   class scoped_lock
   {
   private:

       lightweight_mutex & m_;

       scoped_lock(scoped_lock const &);
       scoped_lock & operator=(scoped_lock const &);

   public:

       explicit scoped_lock(lightweight_mutex & m);

       ~scoped_lock()
       {
           m_.a_ = 0;
       }
   };
};

inline  
lightweight_mutex::scoped_lock::scoped_lock(lightweight_mutex & m): m_(m)
{
  register volatile int *p = &m_.a_;
  register int f;
  register int one = 1;

  asm
  {
  loop:
     lwarx   f, 0, p
     cmpwi   f, 0
     bne-    yield
     stwcx.  one, 0, p
     beq+    done
     b       loop         // not sure if this is needed
  yield:
     stwcx.  f, 0, p
     b       loop
  }

  done: ;
}

} // namespace detail

} // namespace boost

I moved the body of the constructor out so that it would compile with
CW8.  Also for CFM Carbon apps (which ours is) sched_yield is not
readily available, so I changed the loop to just keep going until it
succeeds.
Would this be an acceptable alternative do you think?

Thanks
Martin

In article
<hinnant-1B4EA8.16090623012004@syrcnyrdrs-03-ge0.nyroc.rr.com>,

> In article
> <martin.taylor-41ECD4.10433523012004@newstrial.btopenworld.com>,
[quoted text clipped - 144 lines]
>
> }
Howard Hinnant - 26 Jan 2004 13:39 GMT
In article
<martin.taylor-0F302A.12130526012004@newstrial.btopenworld.com>,

> inline  
> lightweight_mutex::scoped_lock::scoped_lock(lightweight_mutex & m): m_(m)
[quoted text clipped - 29 lines]
> succeeds.
> Would this be an acceptable alternative do you think?

Hi Martin,

I suspect that on OS 10 the unconditional branch would be ok, but
perhaps not ideal.  If you run it on OS 9, I think it may be possible
that the lack of a yield to the OS could lead to an infinite loop.  Can
you substitute a call to MPYield() in for CFM?

inline
lightweight_mutex::scoped_lock::scoped_lock(lightweight_mutex & m)
   : m_(m)
{
   register volatile int* p = &m_.a_;
   register int f;
   register int one = 1;
   asm
   {
   loop:
       lwarx  f, 0, p
       cmpwi  f, 0
       bne-   yield
       stwcx. one, 0, p
       beq+   done
   }
   yield:
   MPYield();
   goto loop;
   done: ;
}

You may need to include <Multiprocessing.h> for that.

-Howard
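
The point about yielding can be illustrated with a small sketch in which the yield call is passed in as a parameter, so the same loop could be given sched_yield, MPYield, or a test stub. The acquire function and its yield-count return value are hypothetical and exist only for illustration; std::atomic is used in place of the PPC assembly. In the usage shown in the note below, the stub releases the lock itself, imitating a scheduler where the lock holder can only make progress once the waiter gives up the processor.

```cpp
#include <atomic>
#include <cassert>

// Spins until the lock word goes 0 -> 1, calling yield() after every
// failed attempt. Returns how many times it had to yield.
template <class Yield>
int acquire(std::atomic<int> & a, Yield yield)
{
    int yields = 0;
    int expected = 0;
    while (!a.compare_exchange_strong(expected, 1,
                                      std::memory_order_acquire,
                                      std::memory_order_relaxed))
    {
        yield();        // let the current holder run
        ++yields;
        expected = 0;   // compare_exchange overwrote it with the seen value
    }
    assert(a.load(std::memory_order_relaxed) == 1);   // we hold the lock
    return yields;
}
```

With a lock word that starts at 1 and a stub that stores 0 when called, acquire returns after a single yield. With a stub that does nothing, the same call would never return, which is the infinite loop the post warns about.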
 