MNesia Distribution Questions

by thewonderer57

Hi there folks,
Very quickly about me:
- I'm doing a Masters in Software Engineering
- I'm looking closely at Mnesia to involve it in my study


OK, so here's where I'm at. It's quite early on in my report, but I'm
getting a feel for what could be a useful investigation. I am briefly
discussing cloud computing, and then looking in more detail at distributed
computation and data storage (along with fault tolerance, high availability
etc.). My supervisor and I have agreed that, to illustrate the use of a
distributed computation system, I am going to perform some large computation
on a large dataset, probably using Pig (a dataflow language on top of
Hadoop). I have been given the university cluster to deploy Hadoop.

(Bear with me....)

So... the big question for me right now is where to find a useful
competitor (or rather, a solution with similar goals). The easy option
would be to compare Pig to another Hadoop interface, e.g. Hive, but those
results would be pretty uninteresting. So instead, I'm looking into the
realm of distributed databases. As far as I can tell, the way in which
Mnesia distributes the availability of data across nodes is similar to how
Hadoop distributes data across the nodes of HDFS (the Hadoop file system).
My issue here is my lack of understanding of how a data query computation
is distributed over a network of Mnesia nodes. I have a good understanding
of how this is achieved with Hadoop (if there are 10 datanodes, then each
will get a tenth of the work), but is there such a thing as parallel query
processing with Mnesia? Or is Mnesia just a way to very quickly replicate
the availability of data?

I hope that you gurus can shed some light on this for me. I'm not sure
exactly how Mnesia would deal with a data query where the Mnesia network
consists of, say, 10 nodes. Does a user query just one of the 10, or does a
user query the network? I'm really trying to think of a fair and interesting
way to compare the concept of a distributed database (Mnesia) against a
distributed processing engine (Hadoop).

There are other things I want to delve into as well. For instance, I really
need to know more about the difference between CouchDB and Mnesia. So far, I
can only establish that CouchDB is more useful for networks where nodes are
likely to go offline at various times. (Not much knowledge!!)

If, however, comparing a distributed database engine against a distributed
processing engine is a non-starter, let me know that too!

Many thanks, I would really appreciate some feedback.


Rob Stewart

Re: MNesia Distribution Questions

by Roberto Aloi-2

Hi Rob,

I guess the best I can do is point you to some useful resources:

- http://www.erlang.se/publications/mnesia_overview.pdf
- http://www.erlang.org/doc/apps/mnesia/Mnesia_chap1.html#1
- http://couchdb.apache.org/
- http://oreilly.com/catalog/9780596158163
- http://wiki.apache.org/couchdb/Frequently_asked_questions#why_no_mnesia

I advise you to take a quick look at these references and, if you
are still in doubt, ask the mailing list.
Hope this is useful.

Best regards,

Roberto Aloi
roberto.aloi@...
http://www.erlang-consulting.com
---






________________________________________________________________
erlang-questions mailing list. See http://www.erlang.org/faq.html
erlang-questions (at) erlang.org


Re: MNesia Distribution Questions

by thewonderer57

Hi Roberto,

Very useful links, thanks.

From that, I really just have two questions that I have not, as yet,
found a clear answer to.

1. An example: I have a complex query to apply to a dataset of, say,
10,000,000 rows in a Mnesia database. This database is replicated over 10
nodes in a network. Will the query be split for equal computation across
the 10 nodes, or will the query be executed on just one random or one
selected Mnesia node?

2. How, in the most interesting way, could a distributed database like
Mnesia be compared (in detail) to a distributed processing engine like
Hadoop or Dryad? Taking the example from part 1, I know that Hadoop will
map/reduce the job across the 10 nodes for parallel processing, and that
Pig, Hive or HBase provide a database-like interface to the Hadoop data. So
does Mnesia share any common goals with Hadoop and the like?


Regards,


Rob Stewart





Re: MNesia Distribution Questions

by Ulf Wiger-3

Rob Stewart wrote:
>
> 1. An example: I have a complex query to apply to a dataset of, say,
> 10,000,000 rows in a Mnesia database. This database is replicated over 10
> nodes in a network. Will the query be split for equal computation across
> the 10 nodes, or will the query be executed on just one random or one
> selected Mnesia node?

If it's one homogeneous table that is simply replicated across all
nodes, the query will be executed on the originating node only.

If the table is fragmented, a subset of fragments will be identified
(if the whole key isn't bound, this will be all fragments), and the
query will be executed on all fragments in parallel. The resulting
sets will be merged, respecting the ordering semantics of the table
(i.e. sorted if it's an ordered_set table, otherwise not).
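
For anyone who wants to try the fragmented case, here is a minimal sketch,
not a definitive recipe. The module name, the `reading` record and the
choice of 10 RAM-only fragments are all made up for illustration, and it
assumes Mnesia is already running with a shared schema on every node in
the pool. The key point is that the query has to go through
mnesia:activity/4 with the mnesia_frag access module; a plain
mnesia:select/2 would only look at the base fragment.

```erlang
-module(frag_query_sketch).
-export([create/0, select_over/1]).

%% Hypothetical record, used only for this illustration.
-record(reading, {id, value}).

%% Create a table split into 10 fragments, spread over the nodes that
%% are currently connected. Assumes mnesia is started on all of them.
create() ->
    Nodes = [node() | nodes()],
    {atomic, ok} =
        mnesia:create_table(reading,
            [{attributes, record_info(fields, reading)},
             {frag_properties, [{node_pool, Nodes},
                                {n_fragments, 10},
                                {n_ram_copies, 1}]}]),
    ok.

%% Return all records whose value is greater than the numeric Threshold.
%% The key is unbound in the match spec, so every fragment is searched.
select_over(Threshold) ->
    MatchSpec = [{#reading{id = '_', value = '$1'},
                  [{'>', '$1', Threshold}],
                  ['$_']}],
    Query = fun() -> mnesia:select(reading, MatchSpec) end,
    %% mnesia_frag as the access module routes the call to the fragments.
    mnesia:activity(async_dirty, Query, [], mnesia_frag).
```

Writes need the same treatment: wrap mnesia:write/1 in mnesia:activity/4
with mnesia_frag so that each record is hashed to the right fragment.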

BR,
Ulf W
--
Ulf Wiger
CTO, Erlang Training & Consulting Ltd
http://www.erlang-consulting.com



Re: MNesia Distribution Questions

by Ngoc Dao

Mnesia has a 4 GB (or 2 GB?) limit per node. Is there a tutorial or doc
about how to create a fragmented Mnesia DB, so that a very big DB can be
cut into pieces and saved on many nodes?

Thanks.




Re: MNesia Distribution Questions

by Ulf Wiger-3

Ngoc Dao wrote:
> Mnesia has a 4 GB (or 2 GB?) limit per node. Is there a tutorial or doc
> about how to create a fragmented Mnesia DB, so that a very big DB can be
> cut into pieces and saved on many nodes?

Strictly speaking, Mnesia has no such limit.

- The 32-bit VM can only address 4 GB of RAM

- Normal Linux kernels give the application 3 GB to use.
   You can get past this using a largekernel.

- Dets can address 2GB


This means that for 32-bit Erlang:
- The total amount of data residing in mnesia
   ram_copies or disc_copies on a given node cannot
   exceed 3GB or 4GB, depending on the OS.

- The size of a disc_only_copies table fragment* cannot
   exceed 2GB.

For 64-bit Erlang, the restriction on disc_only_copies
remains, but RAM tables can be much, much bigger.
However, most data types will take up twice as much space
in 64-bit as in 32-bit. Binaries are one notable exception.

* Tables can be fragmented using mnesia_frag. If not fragmented,
the 'table fragment' will be the entire table.
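
As a rough sketch of how one might check these things on a running node
and grow a fragmented table (module and function names are invented for
illustration; the add_fragment/1 pattern follows the table fragmentation
chapter of the Mnesia User's Guide, and it assumes the table was created
with frag_properties):

```erlang
-module(frag_admin_sketch).
-export([word_bits/0, frag_info/2, add_fragment/1]).

%% 32 on a 32-bit emulator, 64 on a 64-bit one.
word_bits() ->
    erlang:system_info(wordsize) * 8.

%% Read a table_info item through mnesia_frag so that all fragments are
%% taken into account, e.g. Item = n_fragments, frag_size, frag_memory
%% or frag_dist.
frag_info(Tab, Item) ->
    mnesia:activity(sync_dirty,
                    fun() -> mnesia:table_info(Tab, Item) end,
                    [], mnesia_frag).

%% Add one fragment, letting the current fragment distribution decide
%% which node in the pool should host it.
add_fragment(Tab) ->
    Dist = frag_info(Tab, frag_dist),
    {atomic, ok} = mnesia:change_table_frag(Tab, {add_frag, Dist}),
    ok.
```

Adding a fragment rehashes some of the existing records into the new
fragment, so on a large table it is best done when the system is quiet.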

BR,
Ulf W
--
Ulf Wiger
CTO, Erlang Training & Consulting Ltd
http://www.erlang-consulting.com



Re: MNesia Distribution Questions

by wde

Hello,

I found an old post (from 3 years ago) with this comment:

"
Two major problems with dets as of today (and of yesterday)
1. 64 bit indexing - really easy to fix
2. repair time - this is a bit harder to fix, but one solution
could be along the following lines.
a) keep the index and the freelist in a different file,
b) have a file solely for the data. This file could be as easily
repaired as a disklog file. One could just chunk through it, one
term at a time, this way even large files will be fast to repair.
c) when an object is put on the freelist, it needs also to be
overwritten
"

What about the dets implementation today?

Do you consider these to still be the major problems of the dets
implementation?

More generally, concerning Mnesia (dets included), which parts of the code
could we work on to improve this application?
Thank you



 




wde
wde@...
24/10/2009