What if an XML file crosses the boundary of HDFS chunks?


by meili100


Does anybody have a similar issue? If you store XML files in HDFS, how can you make sure that a chunk read by a mapper does not contain partial data of an XML segment?

For example:

<title>
<book>book1</book>
<author>me</author>
..............what if this is the boundary of a chunk?...................
<year>2009</year>
<book>book2</book>
<author>me</author>
<year>2009</year>
<book>book3</book>
<author>me</author>
<year>2009</year>
</title>



     



Re: What if an XML file crosses the boundary of HDFS chunks?

by Jeff Zhang-4

Hi Steve,

When you want to read XML, you should provide a custom InputFormat that
extends FileInputFormat and override its isSplitable method so that the file
is not split. That means one XML file per mapper.


  // Returning false tells the framework never to split the file,
  // so a single mapper receives the whole XML document.
  protected boolean isSplitable(FileSystem fs, Path filename) {
    return false;
  }
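
To make the effect of that override concrete, here is a minimal sketch in plain Java of how split computation changes when a file is marked as not splitable. It is a simplified model, not Hadoop's actual code: the class SplitSketch and the method computeSplits are hypothetical names, and the real FileInputFormat applies additional rules (minimum split size, a slop factor for the last split) that are left out here.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of input split computation; NOT Hadoop's real implementation. */
class SplitSketch {

    /**
     * Returns the byte ranges handed to mappers as {start, length} pairs.
     * Assumption: a splitable file is cut exactly at block boundaries.
     */
    static List<long[]> computeSplits(long fileLength, long blockSize, boolean splitable) {
        List<long[]> splits = new ArrayList<>();
        if (!splitable || fileLength <= blockSize) {
            // Whole file in one split: one mapper sees every byte, in order.
            splits.add(new long[] {0, fileLength});
            return splits;
        }
        for (long start = 0; start < fileLength; start += blockSize) {
            // The last split may be shorter than a full block.
            splits.add(new long[] {start, Math.min(blockSize, fileLength - start)});
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        System.out.println("splitable:     " + computeSplits(200 * mb, 64 * mb, true).size() + " splits");
        System.out.println("not splitable: " + computeSplits(200 * mb, 64 * mb, false).size() + " split");
    }
}
```

The trade-off is that a large file is then processed by a single mapper, so parallelism within that file is lost and most of its blocks will probably be read over the network rather than locally.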



Best Regards,

Jeff Zhang



On Thu, Oct 29, 2009 at 12:32 PM, Steve Gao <steve.gao@...> wrote:

>
> Does anybody have a similar issue? If you store XML files in HDFS, how
> can you make sure that a chunk read by a mapper does not contain partial
> data of an XML segment?
>
> For example:
>
> <title>
> <book>book1</book>
> <author>me</author>
> ..............what if this is the boundary of a chunk?...................
> <year>2009</year>
> <book>book2</book>
> <author>me</author>
> <year>2009</year>
> <book>book3</book>
> <author>me</author>
> <year>2009</year>
> </title>

Re: What if an XML file crosses the boundary of HDFS chunks?

by Oliver Fischer-3

Hello Jeff,

Does that mean there is no programmatic way to define where a logical file
is split, independent of how its blocks are distributed in HDFS?
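
For illustration, one common way to decouple record boundaries from block boundaries is to let each split own only the records that start inside it, and to let the reader run past the end of its split to finish the last record. Hadoop's line-based reader treats line boundaries in a similar way. The following is a minimal sketch of that idea in plain Java, not a Hadoop RecordReader: the class RecordAlignedReader and the method readSplit are hypothetical names, it works on an in-memory String instead of a byte stream, and it assumes every record is wrapped in its own element (here a hypothetical <book>...</book>), which the sample in the first message does not have.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of record-aligned reading over arbitrary split offsets (hypothetical helper). */
class RecordAlignedReader {

    /**
     * Returns every record whose start tag begins inside [splitStart, splitEnd).
     * A record that begins inside the split is read to its end tag even when
     * that end tag lies beyond splitEnd; a record that began in an earlier
     * split is skipped, because the earlier split already returned it.
     */
    static List<String> readSplit(String data, int splitStart, int splitEnd,
                                  String startTag, String endTag) {
        List<String> records = new ArrayList<>();
        int pos = data.indexOf(startTag, splitStart);
        while (pos >= 0 && pos < splitEnd) {
            int close = data.indexOf(endTag, pos + startTag.length());
            if (close < 0) {
                break; // unterminated record: stop instead of emitting partial data
            }
            int recordEnd = close + endTag.length();
            records.add(data.substring(pos, recordEnd));
            pos = data.indexOf(startTag, recordEnd);
        }
        return records;
    }

    public static void main(String[] args) {
        String data = "<books>"
                + "<book><name>book1</name><year>2009</year></book>"
                + "<book><name>book2</name><year>2009</year></book>"
                + "</books>";
        int splitSize = 30; // deliberately cuts through the middle of records
        for (int start = 0; start < data.length(); start += splitSize) {
            int end = Math.min(start + splitSize, data.length());
            System.out.println("[" + start + "," + end + ") -> "
                    + readSplit(data, start, end, "<book>", "</book>"));
        }
    }
}
```

With this ownership rule each record is returned by exactly one split, whatever the split size. In a real job this logic would live in a custom RecordReader, which can read beyond its split because HDFS presents the file as one continuous stream.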

Regards

Oliver

Jeff Zhang wrote:

> Hi Steve,
>
> When you want to read XML, you should provide a custom InputFormat that
> extends FileInputFormat and override its isSplitable method so that the file
> is not split. That means one XML file per mapper.
>
>
>   protected boolean isSplitable(FileSystem fs, Path filename) {
>     return false;
>   }


--
Oliver B. Fischer, Schönhauser Allee 64, 10437 Berlin
Tel. +49 30 44793251, Mobil: +49 178 7903538
Mail: o.b.fischer@... Blog: http://www.swe-blog.net