Re: numerical Lua?
Again I want to emphasize that this is a really nice effort. I would just like to see some things improved, primarily in the area of performance. Unfortunately it's not a trivial exercise. You do a lot of type checking, and I don't think you are doing, or can do, lazy evaluation of vector or matrix expressions. (Maybe the Lua experts could weigh in here. Is there any way to do lazy evaluation inside Lua?)
My approach to this problem was not nearly as elegant as yours, but it was a heck of a lot faster.
The first thing I did was generate a C include file filled with macros that handle special-case calls to commonly used BLAS and LAPACK routines, such as matrix multiplies, triangular matrix back substitutions, etc. I also had a set of wrapper routines that worked directly off of my main matrix structure.
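To give a rough idea of what such a macro layer can look like, here is a minimal sketch. It is not my actual include file: the Matrix struct, MAT_AT, MAT_MUL, mat_new, mat_free and ref_dgemm are all names made up for illustration, and ref_dgemm is a plain-C stand-in for the library's dgemm so that the sketch compiles without a BLAS installed. Column-major storage is assumed, since that is what Fortran BLAS expects.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical column-major matrix structure (illustrative only). */
typedef struct {
    int rows, cols;
    int ld;          /* leading dimension: stride between columns */
    double *data;
} Matrix;

/* Element access, column-major. */
#define MAT_AT(m, i, j) ((m)->data[(size_t)(j) * (m)->ld + (i)])

/* Stand-in for a BLAS dgemm call without transposes:
   C = alpha*A*B + beta*C. A real build would call the library here. */
static void ref_dgemm(int m, int n, int k, double alpha,
                      const double *a, int lda,
                      const double *b, int ldb,
                      double beta, double *c, int ldc)
{
    for (int j = 0; j < n; j++) {
        for (int i = 0; i < m; i++) {
            double sum = 0.0;
            for (int p = 0; p < k; p++)
                sum += a[(size_t)p * lda + i] * b[(size_t)j * ldb + p];
            double *cij = &c[(size_t)j * ldc + i];
            /* As in BLAS, beta == 0 means C is not read on input. */
            *cij = (beta == 0.0) ? alpha * sum : alpha * sum + beta * *cij;
        }
    }
}

/* Special-case macro: C = A * B, hiding the scalar and stride arguments. */
#define MAT_MUL(C, A, B)                                          \
    do {                                                          \
        assert((A)->cols == (B)->rows);                           \
        assert((C)->rows == (A)->rows && (C)->cols == (B)->cols); \
        ref_dgemm((A)->rows, (B)->cols, (A)->cols, 1.0,           \
                  (A)->data, (A)->ld, (B)->data, (B)->ld,         \
                  0.0, (C)->data, (C)->ld);                       \
    } while (0)

/* Allocate a zero-filled rows x cols matrix. With a collector at the
   C level, these would be collector allocations and mat_free would
   not be needed. */
static Matrix *mat_new(int rows, int cols)
{
    Matrix *m = malloc(sizeof *m);
    assert(m != NULL);
    m->rows = rows;
    m->cols = cols;
    m->ld = rows;
    m->data = calloc((size_t)rows * cols, sizeof *m->data);
    assert(m->data != NULL);
    return m;
}

static void mat_free(Matrix *m)
{
    free(m->data);
    free(m);
}
```

With something like this in place, a product is just MAT_MUL(c, a, b) on three Matrix pointers, and the same pattern extends to the other special cases (triangular solves and so on) by swapping in the corresponding routine.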
If you use a garbage collector at the C level, it makes the C-level matrix operations a lot easier to do. Thus I modified Lua and the C code to use the Hans Boehm garbage collector (only two lines of code need to be changed to do this, as discussed elsewhere). Now I don't have to worry about memory leaks, and the macros make it fairly straightforward to implement any matrix math in C.
My philosophy was that anything that had to loop over matrix elements or loop over multiple matrix operations should be in C and should link to BLAS or to routines built on top of BLAS (e.g. LAPACK) whenever possible. Thus having a good C API is even more important than the Lua interface.
On the Lua side I want to be able to do things like view the matrices or submatrices, perhaps graph the results using gnuplot, control the C code, and pass Lua tables of parameters into the C code. Thus one wants to automate the interface to the extent possible (tolua and SWIG are great for this).
As a final point, on a Windows box there is a library interface issue. f2c.exe appends an underscore to the names of all Fortran routines. However, none of the BLAS libraries I've used on Windows do this. ATLAS leaves the Fortran names unchanged. AMD's ACML library does the same for its C stubs but capitalizes the names of the Fortran routines, so every routine has two entry points.
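One common way to paper over this is a name-mapping macro in the same include file, so the convention is picked once at build time. The following is only a sketch: F77_NAME, F77_STR, F77_UPPERCASE, F77_NO_UNDERSCORE, vec_dot and f77_ddot_symbol are names I made up, and the ddot body is a stand-in (positive increments only) so the example links without a BLAS. In real use that function would be just an extern declaration resolved by the library.

```c
/* Pick the symbol convention at build time:
     -DF77_UPPERCASE      uppercase names, no underscore
     -DF77_NO_UNDERSCORE  lowercase names, no underscore
     (default)            lowercase plus trailing underscore, as f2c does */
#if defined(F77_UPPERCASE)
#  define F77_NAME(lower, UPPER) UPPER
#elif defined(F77_NO_UNDERSCORE)
#  define F77_NAME(lower, UPPER) lower
#else
#  define F77_NAME(lower, UPPER) lower##_
#endif

/* Two-level stringification so the mapped name is expanded first. */
#define F77_STR_(x) #x
#define F77_STR(x) F77_STR_(x)

/* Stand-in with Fortran-style calling conventions (every argument
   passed by reference). Handles positive increments only. Replace
   with an extern declaration when linking against a real BLAS. */
double F77_NAME(ddot, DDOT)(const int *n, const double *x, const int *incx,
                            const double *y, const int *incy)
{
    double sum = 0.0;
    for (int i = 0; i < *n; i++)
        sum += x[i * *incx] * y[i * *incy];
    return sum;
}

/* C-side wrapper: callers never see the mangled name. */
static double vec_dot(int n, const double *x, const double *y)
{
    int one = 1;
    return F77_NAME(ddot, DDOT)(&n, x, &one, y, &one);
}

/* The symbol name the build will actually reference; handy when
   chasing unresolved externals at link time. */
static const char *f77_ddot_symbol(void)
{
    return F77_STR(F77_NAME(ddot, DDOT));
}
```

The rest of the C code then calls vec_dot (or the equivalent wrapper for each routine) and never mentions ddot_, ddot or DDOT directly, so switching BLAS libraries is a matter of changing one compiler flag.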
I might be able to help, but some more thought would have to go into how to address the performance issues in a reasonable manner.