MNesia Distribution Questions

Hi there folks,

Very quickly about me:
- I'm doing a Masters in Software Engineering
- I'm looking closely at Mnesia to involve it in my study

OK, so here's where I'm at. It's quite early on in my report, but I'm getting a feel for what could be a useful investigation. I am briefly discussing cloud computing, and then looking in more detail at distributed computation and data storage (along with fault tolerance, high availability etc.). My supervisor and I have agreed that, to illustrate the use of a distributed computation system, I am going to perform some large computation on a large dataset, probably using Pig (a dataflow language on top of Hadoop). I have been given the university cluster to deploy Hadoop.

(Bear with me...)

So, the big question for me right now is where to find a useful competitor (or rather, a solution with similar goals). The easy option would be to compare Pig to another Hadoop interface, e.g. Hive, but those results would be pretty uninteresting. So instead, I'm looking into the realm of distributed databases. As far as I can tell, the way in which Mnesia distributes the availability of data across nodes is similar to how Hadoop distributes data across nodes in HDFS (the Hadoop file system).

My issue here is my lack of understanding of how a data query computation is distributed over a network of Mnesia nodes. I have a good understanding of how this is achieved with Hadoop (if there are 10 datanodes, then each will get a tenth of the work), but is there such a thing as parallel query processing with Mnesia? Or is Mnesia just a way to very quickly replicate the availability of data?

I hope that you gurus can shed some light on this for me. I'm not aware of exactly how Mnesia would deal with a data query where the Mnesia network consists of, say, 10 nodes. Does a user query just one of the 10, or does a user query the network? I'm really trying to think of a fair and interesting way to compare the concept of a distributed database (Mnesia) against a distributed processing engine (Hadoop).

There are other things I want to delve into also. For instance, I really need to know more about the difference between CouchDB and Mnesia. So far, I can only establish that CouchDB is more useful for networks where nodes are likely to go offline at various times. (Not much knowledge!!)

If, however, comparing a distributed database engine against a distributed processing engine is a non-starter, let me know that too!

Many thanks, I would really appreciate some feedback.

Rob Stewart

Re: MNesia Distribution Questions

Hi Rob,

I guess the best I can do is point you to some useful resources:
- http://www.erlang.se/publications/mnesia_overview.pdf
- http://www.erlang.org/doc/apps/mnesia/Mnesia_chap1.html#1
- http://couchdb.apache.org/
- http://oreilly.com/catalog/9780596158163
- http://wiki.apache.org/couchdb/Frequently_asked_questions#why_no_mnesia

I advise you to take a quick look at these references and, if you are still in doubt, ask the mailing list. I hope this is useful.

Best regards,

Roberto Aloi
roberto.aloi@...
http://www.erlang-consulting.com

On Oct 21, 2009, at 3:18 PM, Rob Stewart wrote:
> Hi there folks,
> Very quickly about me:
> [...]

________________________________________________________________
erlang-questions mailing list. See http://www.erlang.org/faq.html
erlang-questions (at) erlang.org

Re: MNesia Distribution Questions

Hi Roberto,

Very useful links, thanks. From those, I really just have two telling questions that I have not, as yet, found a clear answer to.

1. An example: I have a complex query to apply to a dataset of, say, 10,000,000 rows in a Mnesia database. This database is replicated over 10 nodes in a network. Will the query be split for equal computation on each of the 10 nodes, or will the query be executed on a single Mnesia node, either chosen at random or selected?

2. How, in the most interesting way, could a distributed database like Mnesia be compared (in detail) to a distributed processing engine like Hadoop or Dryad? Taking the example from part 1, I know that Hadoop will map/reduce the job across the 10 nodes for parallel processing, and Pig, Hive or HBase provide a database-like interface to the Hadoop data. So does Mnesia share any common goals with Hadoop and the like?

Regards,

Rob Stewart

2009/10/21 Roberto Aloi <roberto.aloi@...>:
> Hi Rob,
>
> I guess the best I can do is pointing you to some useful resources:
> [...]

Re: MNesia Distribution Questions

Rob Stewart wrote:
> 1. An example: I have a complex query to apply to a dataset of, say,
> 10,000,000 rows in a Mnesia database. This database is replicated over 10
> nodes in a network. Will the query be split for equal computation on each of
> the 10 nodes, or will the query be executed on either one random or one
> selected Mnesia node?

If it's one homogeneous table that is simply replicated across all nodes, the query will be executed on the originating node only.

If the table is fragmented, a subset of fragments will be identified (if the whole key isn't bound, this will be all fragments), and the query will be executed on all of those fragments in parallel. The resulting sets will be merged, respecting the ordering semantics of the table (i.e. sorted if it's an ordered_set table, otherwise not).

BR,
Ulf W
--
Ulf Wiger
CTO, Erlang Training & Consulting Ltd
http://www.erlang-consulting.com

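To make the fragmented case concrete, here is a minimal sketch of a query whose key is unbound, so that every fragment has to be searched. It is only an illustration under stated assumptions: the module name `frag_query_sketch`, the `reading` record, the choice of 4 fragments, and the single-node `node_pool` are all made up for this example; on a real cluster the pool would list several nodes. The calls used (`mnesia:create_table/2` with `frag_properties`, and `mnesia:activity/4` with the `mnesia_frag` access module) are the standard Mnesia fragmentation interface.

```erlang
-module(frag_query_sketch).
-export([run/0]).

%% Hypothetical record, used only for this illustration.
-record(reading, {id, value}).

run() ->
    %% Start Mnesia with a RAM-only schema on the local node.
    ok = mnesia:start(),
    %% Create a table split into 4 fragments. On a real cluster the
    %% node_pool would list several nodes; here it is just node().
    {atomic, ok} = mnesia:create_table(reading,
        [{attributes, record_info(fields, reading)},
         {frag_properties, [{node_pool, [node()]},
                            {n_fragments, 4},
                            {n_ram_copies, 1}]}]),
    %% All access goes through the mnesia_frag access module, which
    %% hashes each key to the fragment that holds it.
    Write = fun() ->
        lists:foreach(
          fun(I) -> mnesia:write(#reading{id = I, value = I * I}) end,
          lists:seq(1, 100))
    end,
    ok = mnesia:activity(transaction, Write, [], mnesia_frag),
    %% The key ('$1') is unbound in this match specification, so the
    %% select cannot be narrowed down to a single fragment.
    MatchSpec = [{#reading{id = '$1', value = '$2'},
                  [{'>', '$2', 9000}],
                  ['$1']}],
    Query = fun() -> mnesia:select(reading, MatchSpec) end,
    Ids = mnesia:activity(transaction, Query, [], mnesia_frag),
    %% A set table has no defined result order, so sort for display.
    lists:sort(Ids).
```

Calling `frag_query_sketch:run()` once in a fresh Erlang shell returns the ids whose squared value exceeds 9000. Running the same `mnesia:select/2` outside a `mnesia_frag` activity would only look at the base fragment, which is why the access module matters here.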
Re: MNesia Distribution Questions

Mnesia has a 4 GB (or 2 GB?) limit per node. Is there a tutorial or doc about how to create a fragmented Mnesia DB, so that a very big DB can be cut into pieces and saved on many nodes?

Thanks.

On Thu, Oct 22, 2009 at 2:06 AM, Ulf Wiger <ulf.wiger@...> wrote:
> If the table is fragmented, a subset of fragments will be identified
> (if the whole key isn't bound, this will be all fragments), and the
> query will be executed on all fragments in parallel.
> [...]

Re: MNesia Distribution Questions

Ngoc Dao wrote:
> Mnesia has 4 GB (or 2 GB?) limit per node. Is there a tutorial or doc
> about how to create fragmented Mnesia DB, so that a very big DB can be
> cut into pieces and saved on many nodes?

Strictly speaking, Mnesia has no such limit.

- The 32-bit VM can only address 4 GB of RAM
- Normal Linux kernels give the application 3 GB to use. You can get past this using a largekernel.
- Dets can address 2 GB

This means that for 32-bit Erlang:

- The total amount of data residing in mnesia ram_copies or disc_copies on a given node cannot exceed 3 GB or 4 GB, depending on the OS.
- The size of a disc_only_copies table fragment* cannot exceed 2 GB.

For 64-bit Erlang, the restriction on disc_only_copies remains, but RAM tables can be much, much bigger. However, most data types will take up twice as much space in 64-bit as in 32-bit. Binaries are one notable exception.

* Tables can be fragmented using mnesia_frag. If not fragmented, the 'table fragment' will be the entire table.

BR,
Ulf W
--
Ulf Wiger
CTO, Erlang Training & Consulting Ltd
http://www.erlang-consulting.com

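As a rough illustration of the mnesia_frag remark above, the following sketch creates a fragmented table, adds one more fragment, and then reports how many records and how much memory each fragment holds. Everything specific in it is an assumption made for the example: the module name `frag_size_sketch`, the `blob` record, the fragment counts, the 64-byte payload, and the single-node pool. The memory figure for ram_copies tables is reported by Mnesia in machine words, so it is multiplied by `erlang:system_info(wordsize)`, which is 4 on a 32-bit VM and 8 on a 64-bit VM.

```erlang
-module(frag_size_sketch).
-export([run/0]).

%% Hypothetical record, used only for this illustration.
-record(blob, {key, payload}).

run() ->
    ok = mnesia:start(),
    %% Start with 2 fragments; a real deployment would put several
    %% nodes in node_pool so that the fragments are spread out.
    {atomic, ok} = mnesia:create_table(blob,
        [{attributes, record_info(fields, blob)},
         {frag_properties, [{node_pool, [node()]},
                            {n_fragments, 2},
                            {n_ram_copies, 1}]}]),
    %% Grow the table by one fragment, placed on the given node(s).
    {atomic, ok} = mnesia:change_table_frag(blob, {add_frag, [node()]}),
    Payload = binary:copy(<<0>>, 64),
    Fill = fun() ->
        lists:foreach(
          fun(K) -> mnesia:write(#blob{key = K, payload = Payload}) end,
          lists:seq(1, 1000))
    end,
    ok = mnesia:activity(transaction, Fill, [], mnesia_frag),
    %% Fragment-aware table_info items are only available when the
    %% call is made through the mnesia_frag access module.
    Info = fun(Item) -> mnesia:table_info(blob, Item) end,
    NFrags = mnesia:activity(async_dirty, Info, [n_fragments], mnesia_frag),
    Sizes  = mnesia:activity(async_dirty, Info, [frag_size], mnesia_frag),
    Memory = mnesia:activity(async_dirty, Info, [frag_memory], mnesia_frag),
    %% ram_copies memory is counted in words; convert to bytes.
    WordSize = erlang:system_info(wordsize),
    #{n_fragments => NFrags,
      records => lists:sum([N || {_Frag, N} <- Sizes]),
      bytes_per_fragment =>
          [{Frag, Words * WordSize} || {Frag, Words} <- Memory]}.
```

The per-fragment byte counts give a feel for how close any single fragment is to the limits listed above; for disc_only_copies fragments Mnesia reports bytes directly, so the word-size multiplication would not apply there.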