|
View:
New views
4 Messages
—
Rating Filter:
Alert me
|
|
|
LLVM and coroutines/microthreadsI saw this was mentioned briefly last year, but there seemed to be
some confusion as to what coroutines entailed and the thread died out. This technique has an unfortunate number of names, but it does get a lot of use, including popular languages like Ruby. I'm currently working on a programming language called Minnow (http://www.minnow-lang.org). It's an actor-based language, where each actor has its own microthread. In the current implementation I'm manually saving and restoring the stack and live variables by hand to get this effect. After spending months wrestling with these manual continuations, I can safely say they are a total pain in the butt. The story doesn't improve much with 3rd party coroutine libraries, which have their own limitations (for instance, can you resume a microthread on a different OS thread than it was created?). The million dollar question is: can LLVM capture the live variables and stack into a buffer? Or, even just save off register variables to the stack, and give me a way to capture and restore that stack (and pc)? Thanks, Jonathan _______________________________________________ LLVM Developers mailing list LLVMdev@... http://llvm.cs.uiuc.edu http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev |
|
|
Re: LLVM and coroutines/microthreadsOn Thu, Apr 16, 2009 at 9:47 AM, Jonathan D. Turner
<jonathan.d.turner@...> wrote: > I saw this was mentioned briefly last year, but there seemed to be > some confusion as to what coroutines entailed and the thread died out. > This technique has an unfortunate number of names, but it does get a > lot of use, including popular languages like Ruby. > > I'm currently working on a programming language called Minnow > (http://www.minnow-lang.org). It's an actor-based language, where > each actor has its own microthread. In the current implementation I'm > manually saving and restoring the stack and live variables by hand to > get this effect. After spending months wrestling with these manual > continuations, I can safely say they are a total pain in the butt. > The story doesn't improve much with 3rd party coroutine libraries, > which have their own limitations (for instance, can you resume a > microthread on a different OS thread than it was created?). First, I will assume that you have read http://www.nondot.org/sabre/LLVMNotes/ExplicitlyManagedStackFrames.txt and if you have not, do so. I am also working on an Actor based language. I initially started with Erlang (too difficult to bind into C++ where I have some required libraries that I need). I then went to Stackless Python (almost stupid-easy to bind with, but the switching was too slow anytime there was a native frame in an Actors callstack (as it did what you are doing, copying the stack around, without a native frame on the callstack switching was pretty fast though, and since Python did not operate over multiple threads simultaneously I was forced to create multiple processes and link them as a mesh, ala Erlang, worked well, but slower then heck due to all of the serialization and such). I then created two implementations in C++, one was messy stack copying as you are currently doing, but it had a nice interface, and the second type was ugly, required a lot of scaffolding, but held all values that needed to be moved around on the heap instead of the stack, it was much faster, and although I would have no qualms with all the scaffolding, others did not even understand what was going on. Honestly, I would use Erlang, but I have a few restrictions. First, I need to bind with a rather large C++ only library, and binding C++ and Erlang (by creating an Erlang driver, or by creating an Erlang compatible C++ node) was a new level of horror (although I do foresee that there could be a really good C++ binding library made, with heavy use of templates to get rid of so much of the ugliness). Second, some of the things I need to use it in are larger apps that I have no control over, so I need to be able to embed it into an app (which Stackless Python was able to do beautifully, made a nice bit of reusable scaffolding code for that). If not for these two restrictions Erlang would be pretty nice, and it is still quite fast (although I would certainly not balk at something closer to C speed for linear code). And yes, I have to agree, ever C/C++ coroutine library sucked beyond all heck, either stack fiddling, or they use crap like Microsoft's micro-thread implementation when on Windows (whoever chose to use that is a moron, did you know that if you tell it that you only need, say 100k of stack space for a micro-thread, it still allocates and gives it memory equal to a system thread, meaning about 4 megs by default, so yea, that does not scale well at all unless you set a compile option to vastly reduce your system thread size as well, which cannot always be done...). Thus enters llvm. As C++ is generally called C with classes (although it is so much more), I started creating a language that could be described as C with Actors, first thing, remove every concept of a global state, so no global variables, etc, I want this to be a 'safe' language, as if someone really needs power they can write something in C/C++ to link it in. Over a few months I actually started replacing a bit of C syntax, so it does not really look purely like C anymore, but it is easier to write in my opinion. Basically I have two types of 'functions' in my language, there are normal C style functions, they use the stack, they are fast, that can be well optimized, etc, however they can only call other direct functions and they cannot suspend execution (also note, this language is cooperatively microthreaded, not pre-emptive, so it requires a bit more work on the programmers part, but I intend to minimize that with new language primitives, but this way it is a lot faster and has a great deal more control over its own execution). The other type is much more free, it can call others can that also suspend, as well as normal functions. When a normal function is called then it is called like any normal C function, uses stack, etc... The other function type (an actor function) when compiled becomes lots of little functions, they are broken up at every call to another actor function (and I intend to add a few more places to break it up, such as analyze loops and if they are heavy then break it up in the loop iterations too). The above linked document describes how I break it up well, with a few slight changes. An actor function also uses the stack, however if any variable has any uses past another actor function call then it is stuffed into a struct and passed to the actor function call as the first parameter. This made coding my front-end notable more difficult since I have to put it in SSA form directly, since the alloca to SSA pass does not work if you 'alloca' from some passed in memory (I really wish that could be fixed, I am thinking of copying that pass to a new name and altering it, would save me so much work). The Actor functions first (hidden) parameter is a pointer to some memory, a stack. When an actor function calls another actor function (meaning that this is the end of the function and the function will be continued at one of the split points), it 'pushes' an SSA variable (that has a use past this point) onto the passed in 'stack', and increments the pointer the size of the memory that the SSA variable took up, like a real stack and so forth. At the resume point it 'pops' all the memory out of the passed in stack back into SSA variables, which llvm can put in registers or put on the real stack. As stated, an Actor function can only be called by other Actor function, but they just make up a whole single callstack on the Actor. However, a normal function (as well as Actor functions) can 'new' a new Actor stack. Basically the Actor that will start the new microthread is passed to a scheduler as well as a constant integer that represents the maximum 'stack' space that the passed in Actor stack will require (reference: http://www.nondot.org/sabre/LLVMNotes/SizeOf-OffsetOf-VariableSizedStructs.txt), the scheduler allocates the necessary memory and calls the Actor with the newly created memory passed in (and some scaffolding, the beginning of the 'stack' also contains a pointer to exception handling structures and so forth, I handle exceptions very different in this language, as well as destructors, they are appended on to a list of function calls that will clean up memory, handle exceptions, etc... as necessary, since they all have to be contained in this allocated memory range instead of on the real stack, so my exception handling has some overhead, which is just a memory write of a size of 2*sizeof(void*) anytime a new destructor or exception handler needs to be registers, throwing is faster since that stack is not unwound or read, it just immediately jumps through the handlers, running destructors, until it gets something that wants to stop and handle it). As you can see, an Actor callstack will stay inside this actor until an explicit switch is asked for (and I may add some implicit ones in tight loops and such, still debating on how much I want hidden to the programmer, if I do I will add a keyword to be able to deactivate it for a given loop regardless). A switch can be a basic switch, or a message pass, both are handled differently then in Erlang I would imagine. A basic switch (no message pass) is simple, it simply asks the scheduler for one thing, a pointer to the top of the memory stack of the next microthread that needs to run, it then pops off a function pointer (an actor function that is split up, all the splits have the same signature, just taking a pointer to some untyped memory and another pointer to struct of a certain format, used for message passing described later, but in a switch case it is null), then serializes up the SSA values it needs to keeps itself for its continuation (a switch is also a split point as should be obvious) and gives that to the scheduler to schedule as it sees fit, then executes the popped function pointer passing the new top of its 'stack' to it (then returning so it is always a tail-call, this is why I asked on the list a while back whether indirect function calls are tail-call optimized, this design requires them to be so it is fast). A normal switch, if possible, should be minimized, only occurring on actor function call boundaries, if the programmer wishes to pause then they should handle a message while they are at it. To pass a message is similar to a switch, but it asks a different scheduler for a specific microthread that is actively waiting for a message, if it is not actively waiting (the receiving microthread is on a normal split that does not handle messages for example, or is running on another OS thread) then it gets rather heavy as it then queries the main world handler for if the microthread even exists (it could have returned for example), if it does not exist then the microthread message is deallocated (put back into reusable queue, I may add garbage collection to this part, works fine as is though, just have another actor query it once in a while to see what has not been used in a while and free that memory if so). If it does exist, but is just not waiting for a message then right now the message is given to a freshly created Actor (these ones are designed to be short lived, using almost no stack space, no exception handling and so forth, and there memory is always kept around so they can be reused as necessary) that just tries to keep sending the message every once in a while (these make up a 'fake' queue/mailbox, remember, unbounded nondetermanism the Actor model is, and not everything will need queues, so even queues are modeled as Actors, I do intend to overhaul this design though, thinking of creating another Actor per Actor that is created as necessary that will hold a queue internally to do dispatching, replacing the main actors link to itself so it then handles all messaging). Just like in Erlang it will be good to get all messages (regardless of matching) just to clear things out, although it would be quite possible for a queue Actor to send an exception onto the stack of its main Actor if there are too many messages in queue. Hopefully though there should be few messages queued as most messages sent will be instantly received without any queuing, I need to go over the design though to make sure it will scale well, and make changes as necessary. To get a message the Actor will serialize up its SSA variables and pass its stack to another schedule (that does not actually do any scheduling, it just handles message passing), then ask the main scheduler for the next thing to run and run it as in a normal switch, thus it is a normal switch, just that it will not continue until a message is passed instead of just after some arbitrary time. As stated earlier, when an Actor is created I pre-get all the stack that the function will require, which means all necessary exception handler space, destructor pointer space, and SSA values that need to be saved space, not only for the Actor function itself, but also all other Actor functions it will directly call as well. I also intend to add the ability to call Actor function pointers, and that will only take up sizeof(void*) on the stack since the called actor function will need to create its own extra space on top of that (when the address of an Actor function is taken I intend to wrap it in another function that will create the stack as necessary so I do not have to create two versions of that, possibly large, actor function). When you are waiting for a message in Erlang, you can also specify a timeout, I have not implemented that yet either, but I intend to so when a timeout is specified then another message match is implicitly added that will match a timeout message, and send off a message to a timer scheduler that will send out a message with a timeout of a certain ID that it will send back after a time-delay. If a message is received before the timeout then a cancel message will be sent out (although the timeout message could have been sent first, but not received yet, which is why the ID has to match when it receives it, else the Actor will discard the timeout message, since it was probably old). Do note, in the compiled code much of this handling is just function calls so there is no real overhead. I use no locking primitives for the things that are multi-threaded (such as the timer) with liberal use of instructions such as atomic compare-and-swap (my favorite, well, most used anyway, I also wrote that assembly directly for those in the C++ actor libraries I created long ago). I want every thread to be as busy as possible. Every actor may end up getting resumed on a different thread every time if it had such timing (luck of the draw). I also intend to implement mesh functionality, like in Erlang, and I intend to set that up with more lightweight actors that just act as proxies, so when a pid (in Erlang terms) of an Actor on a remote system is received then a local proxy Actor is setup that will handle the network passing from the local to the remote system between those Actors as the remote proxy actor converts the network messages back into normal messages and passes it to the real Actor. I intend for these proxy actors to be situation dependent, so if you have two programs on the same machine then they will use shared memory to pass between them, perhaps TCP/IP if someone wants a license-free networkable way, but my main network method I intend to create a RakNet binding since it is a great deal faster and easier to handle then TCP/IP with still full reliability, although I intend to make it quite simple to make bindings so a binding to a serial port would be easy to make, or whatever else anyone may need. If I feel up for it I may make a UDP binding with reliability built on since it should give faster speed then TCP/IP with reliability setup in such a way that it matches the system. I do realize that the stuffing of the live SSA values into my passed in stack is not perfectly efficient, but since LLVM supplies no ability to specify a different stack pointer to use for alloca instead of the system stack, I do not have much choice. So far the bit I have created works well enough, just have had no time to work on it the past 4 months due to final year in school + work taking up ever bit of free time I have (and much of my sleep time, I only average 3 hours a night, ugh...). I intend to continue here in about one month since I will have vastly more free time. Hopefully this design I am using you can take and use as you wish as well. Right now I do the function splitting in my own tree, but I have been throwing around the idea to make a module pass to do it at the LLVM level with some annotation information tossed in, would simplify my code quite a bit since LLVM holds user information and so forth. If I do end up making such a pass then it may become useful for your project as well if you choose to use a similar design as I. Although if you manage to create one before I get my free time back in a month, I would not complain. :) On Thu, Apr 16, 2009 at 9:47 AM, Jonathan D. Turner <jonathan.d.turner@...> wrote: > The million dollar question is: can LLVM capture the live variables > and stack into a buffer? Or, even just save off register variables to > the stack, and give me a way to capture and restore that stack (and > pc)? So yes, LLVM can capture the live variable information and stack into a buffer, but it currently has no functionality to do so, thus you have to do it yourself as I do. Although the above way is vastly faster then memcpy'ing entire stacks around (as we do in C++ itself to simulate full coroutines). Also, I apologize for the length, once I start explaining things I find it hard to stop, should have seen how I was going on about how sexpressions are better then xml yesterday at work (and no, we use neither, it was a philosophical argument)... - OvermindDL1 _______________________________________________ LLVM Developers mailing list LLVMdev@... http://llvm.cs.uiuc.edu http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev |
|
|
Re: LLVM and coroutines/microthreadsOn Thu, Apr 16, 2009 at 3:21 PM, OvermindDL1 <overminddl1@...> wrote:
> > First, I will assume that you have read > http://www.nondot.org/sabre/LLVMNotes/ExplicitlyManagedStackFrames.txt > and if you have not, do so. I hadn't. That's very similar to what I had tried early on, but found it was actually slower than managing my own stacks lazily (on continuation, the functions check a state variable for whether or not to freeze live variables. The state var is kept pretty warm). I haven't gone nearly as deeply as you have, though. It looks like the article is suggesting that you're keep a pointer and do a bunch of derefs on it in all cases. It just sounds inefficient, unless I'm missing something. > And yes, I have to agree, ever C/C++ coroutine library sucked beyond > all heck, either stack fiddling, or they use crap like Microsoft's Looks like pretty much all the options are going to require bookkeeping, hackery, or both. Even my stack copying suggestion has restrictions on stack size. > timeout, I have not implemented that yet either, but I intend to so > when a timeout is specified then another message match is implicitly > added that will match a timeout message, and send off a message to a > timer scheduler that will send out a message with a timeout of a > certain ID that it will send back after a time-delay. If a message is That's a handy idea. This technique reminds me of Orc, if you haven't seen it you should check it out: http://orc.csres.utexas.edu/ > I use no locking primitives for the things that are multi-threaded > (such as the timer) with liberal use of instructions such as atomic > compare-and-swap (my favorite, well, most used anyway, I also wrote Funny you should mention this. I've spent the better part of the last few weeks trying to get a CAS solution to be rock-solid. Then, it turns out I could make a locky solution that performed similarly to the CAS solution by using tiny (2-3 instructions) locked areas. If you look through Art of Multiprocessor Programming, you'll see that a lot of the solutions use multiple CAS's per action. As best I can tell, if you CAS more than twice per task, you're starting to get into the area of tiny locks performance-wise (at least on Intel platforms). > system is received then a local proxy Actor is setup that will handle > the network passing from the local to the remote system between those > Actors as the remote proxy actor converts the network messages back Sounds good. I've thought about something like this, but shelved the idea for the first release. > So yes, LLVM can capture the live variable information and stack into > a buffer, but it currently has no functionality to do so, thus you > have to do it yourself as I do. Although the above way is vastly > faster then memcpy'ing entire stacks around (as we do in C++ itself to > simulate full coroutines). >From what you're telling me, as far as LLVM is concerned there are ways to do it manually, you just have to pick one that suits the project. There aren't any automatic solutions, nor hooks to attach your own. I admit, I assumed that might be the case since there isn't any "standard" coroutine style. Jonathan _______________________________________________ LLVM Developers mailing list LLVMdev@... http://llvm.cs.uiuc.edu http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev |
|
|
Re: LLVM and coroutines/microthreadsOn Thu, Apr 16, 2009 at 2:44 PM, Jonathan D. Turner
<jonathan.d.turner@...> wrote: > It looks like the article is suggesting that you're keep a pointer and > do a bunch of derefs on it in all cases. It just sounds inefficient, > unless I'm missing something. It is just treating it identically to how the real system stack is treated. As stated, most of the code will still keep it in registers and so forth since the front-end outputs the SSA form directly. On Thu, Apr 16, 2009 at 2:44 PM, Jonathan D. Turner <jonathan.d.turner@...> wrote: > Looks like pretty much all the options are going to require > bookkeeping, hackery, or both. Even my stack copying suggestion has > restrictions on stack size. My method has no restriction, and I do have a decorator keyword for the actor function that always causes a new stack to be created at that point as well (good for if a function may use a lot of stack space, but be called rarely, like an alloca'd known-size array). The only real limit is system memory. On Thu, Apr 16, 2009 at 2:44 PM, Jonathan D. Turner <jonathan.d.turner@...> wrote: > That's a handy idea. This technique reminds me of Orc, if you haven't > seen it you should check it out: http://orc.csres.utexas.edu/ I had not, thanks for the link, will look it over shortly. On Thu, Apr 16, 2009 at 2:44 PM, Jonathan D. Turner <jonathan.d.turner@...> wrote: > Funny you should mention this. I've spent the better part of the last > few weeks trying to get a CAS solution to be rock-solid. Then, it > turns out I could make a locky solution that performed similarly to > the CAS solution by using tiny (2-3 instructions) locked areas. If > you look through Art of Multiprocessor Programming, you'll see that a > lot of the solutions use multiple CAS's per action. As best I can > tell, if you CAS more than twice per task, you're starting to get into > the area of tiny locks performance-wise (at least on Intel platforms). The only places that such things are ever needed is when something is pushed onto a queue. In this pattern no memory is nor can be aliased by any other Actor, so things stay very self-contained, so the only synchronization necessary is when pushing something onto a list, which should require only one atomic CAS. I have not yet ran into anything that needed more (although it is possible, the language is no where near finished yet, need some free time again for that) On Thu, Apr 16, 2009 at 2:44 PM, Jonathan D. Turner <jonathan.d.turner@...> wrote:> >From what you're telling me, as far as LLVM is concerned there are > ways to do it manually, you just have to pick one that suits the > project. There aren't any automatic solutions, nor hooks to attach > your own. I admit, I assumed that might be the case since there isn't > any "standard" coroutine style. The one thing I do wish so much is if you could specify your own pre-allocated memory for the alloca instructions, then the passes that raise alloca's to SSA and so forth would just work fine. _______________________________________________ LLVM Developers mailing list LLVMdev@... http://llvm.cs.uiuc.edu http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev |
| Free embeddable forum powered by Nabble | Forum Help |