revision comment too long in pages-articles.xml

revision comment too long in pages-articles.xml

by Bugzilla from jcsahnwaldt@gmail.com

Hi,

I tried using mwdumper (latest SVN revision 57818)
to import jawiki-20090927-pages-articles.xml [1]
into MySQL, but I got an error:

Data too long for column 'rev_comment'

The problem is that the XML file contains a revision
comment that is 257 bytes long, but the column
accepts at most 255 bytes.

At first I was stumped as to how this could happen,
but then I found that on the Wikipedia page the
comment ends with the byte 'e3', while in the
XML file it ends with the bytes 'ef bf bd'. See [2] for details.

I think the cause is something like this:

- Comments are truncated to 255 bytes when they
are stored.

- In this case, this means that a three-byte UTF-8
sequence is cut off after its first byte (hex value e3),
so the comment ends with an incomplete, and therefore
invalid, UTF-8 sequence.

- The dump process has to generate valid UTF-8
(otherwise, most XML parsers wouldn't accept
the file), so it replaces the invalid byte with
the 'replacement character' U+FFFD, which is
encoded in UTF-8 as the three bytes 'ef bf bd'.
See [3].

- One byte is thus replaced by three, so the comment
grows from 255 bytes to 257 bytes.
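
The sequence above can be reproduced in a few lines. This is only a
sketch in Python, not the actual MediaWiki or dump code: the function
names and the sample comment are made up for illustration, and it
assumes the column truncates at the byte level and the dump replaces
undecodable bytes with U+FFFD.

```python
def stored_bytes(comment: str, limit: int = 255) -> bytes:
    """Byte-level truncation, as assumed for the rev_comment column."""
    return comment.encode("utf-8")[:limit]


def dumped_bytes(stored: bytes) -> bytes:
    """Replace undecodable bytes with U+FFFD, then re-encode (assumed dump behaviour)."""
    return stored.decode("utf-8", errors="replace").encode("utf-8")


# Hypothetical comment: 254 ASCII bytes followed by one three-byte character
# (U+3042, encoded as e3 81 82), i.e. 257 bytes before truncation.
comment = "a" * 254 + "\u3042"

stored = stored_bytes(comment)
dumped = dumped_bytes(stored)

print(len(stored), stored[-1:].hex())  # 255 e3
print(len(dumped), dumped[-3:].hex())  # 257 efbfbd
```

The stored value is exactly 255 bytes and ends in the lone byte e3;
after replacement the same comment is 257 bytes long.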

How to fix this? I think MediaWiki should make sure
that a comment contains only valid UTF-8 sequences,
even when it is truncated. This may mean that it
has to be truncated to less than 255 bytes.

Alternatively, the dump process could drop invalid
UTF-8 sequences instead of replacing them.

Yet another fix: mwdumper should make sure
that a comment is at most 255 bytes long and
truncate it if necessary.
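
As a rough illustration of these options, here is a Python sketch
(not the PHP or Java that MediaWiki and mwdumper actually use; both
function names are hypothetical). The first function truncates to a
byte limit without splitting a character, the second drops invalid
sequences instead of replacing them.

```python
def truncate_utf8(text: str, max_bytes: int = 255) -> str:
    """Truncate so the UTF-8 encoding fits in max_bytes without splitting a character."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # The input was valid, so only the tail can be incomplete after the cut;
    # errors="ignore" drops that partial trailing sequence.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


def drop_invalid(stored: bytes) -> str:
    """Decode already-stored bytes, silently dropping undecodable sequences."""
    return stored.decode("utf-8", errors="ignore")


comment = "a" * 254 + "\u3042"  # 257 bytes in UTF-8
print(len(truncate_utf8(comment).encode("utf-8")))                       # 254
print(len(drop_invalid(comment.encode("utf-8")[:255]).encode("utf-8")))  # 254
```

In both cases the result is shorter than 255 bytes, which matches the
remark that the comment may have to be truncated to less than the limit.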

More details can be found at [2].

Bye,
Christopher

[1] http://download.wikimedia.org/jawiki/20090927/jawiki-20090927-pages-articles.xml.bz2
[2] http://en.wikipedia.org/wiki/User:Chrisahn/CommentTooLong
[3] http://www.utf8-chartable.de/unicode-utf8-table.pl?start=65520

_______________________________________________
Wikitech-l mailing list
Wikitech-l@...
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: revision comment too long in pages-articles.xml

by Aryeh Gregor

On Fri, Oct 16, 2009 at 3:25 PM, Jona Christopher Sahnwaldt
<jcsahnwaldt@...> wrote:

> How to fix this? I think MediaWiki should make sure
> that a comment contains only valid UTF-8 sequences,
> even when it is truncated. This may mean that it
> has to be truncated to less than 255 bytes.
>
> Alternatively, the dump process could drop invalid
> UTF-8 sequences instead of replacing them.
>
> Yet another fix: mwdumper should make sure
> that a comment is at most 255 bytes long and
> truncate it if necessary.

The silent truncation thing is ridiculous on a lot of levels:

* It creates invalid UTF-8, if MySQL isn't using a utf8 collation
(which has its own problems, and Wikimedia doesn't use it).
* How many characters can be stored depends on how many bytes per
character the relevant language's writing system happens to take in
UTF-8, so Chinese/Arabic/Hebrew/Greek/Russian/etc. users get <150
characters.  (Unless you use the utf8 collation, which has its own
problems, and Wikimedia doesn't use it.)
* It will cause a fatal error if MySQL is in strict mode.
* It makes it difficult to impossible for a specific wiki to decide to
allow longer edit summaries.  (Personally, I find 255 characters is
often too short.  I think Citizendium is hacked to allow more.)
* The limit in MySQL is counted in bytes (unless you use the utf8
collation, etc.), but HTML maxlength is counted in characters, so we
have no way to effectively limit things client-side without
JavaScript.  Currently we fake it by setting a maxlength of 200
characters, and hoping that that winds up being less than 255 bytes.
That leaves enough breathing room so languages like French don't
overrun, but I assume speakers of
Chinese/Arabic/Hebrew/Greek/Russian/etc. languages are just resigned
to the fact that their edit summaries get unpredictably truncated.
Also, it's unnecessarily small for English, where 255 characters would
usually fit -- in fact enwiki has a Gadget that hacks this up, and
I've sometimes edited up the maxlength manually when I found I wanted
a little more space.
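
To put numbers on the bytes-versus-characters mismatch, here is a small
Python sketch. The sample characters are arbitrary picks, one per
script, and the names are made up for illustration; the 255 and 200
figures are the ones mentioned above.

```python
COLUMN_LIMIT_BYTES = 255  # byte limit of the comment column
MAXLENGTH_CHARS = 200     # HTML maxlength, counted in characters

# One representative character per script (illustrative choices only).
SAMPLES = {
    "English": "a",       # 1 byte in UTF-8
    "Greek": "\u03b1",    # alpha, 2 bytes
    "Russian": "\u044f",  # ya, 2 bytes
    "Hebrew": "\u05e9",   # shin, 2 bytes
    "Arabic": "\u0633",   # seen, 2 bytes
    "Chinese": "\u4e2d",  # zhong, 3 bytes
}


def chars_that_fit(ch: str, limit: int = COLUMN_LIMIT_BYTES) -> int:
    """How many copies of ch fit into limit bytes of UTF-8."""
    return limit // len(ch.encode("utf-8"))


for script, ch in SAMPLES.items():
    needed = len((ch * MAXLENGTH_CHARS).encode("utf-8"))
    print(f"{script}: {chars_that_fit(ch)} chars fit, "
          f"{MAXLENGTH_CHARS} chars need {needed} bytes")
```

With these samples, 255 characters fit for English, 127 for the
two-byte scripts and 85 for Chinese, while a full 200-character
summary needs 400 or 600 bytes in the latter cases.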

The correct fix is just to make the field TEXT/BLOB so the length
limit is enforced purely in the application.  The same goes for
log_comment, where last I checked we weren't even doing the
maxlength=200 hack.  Are there any objections to finally doing this?
What's the procedure these days for schema changes?  I'd check in
something right now, in fact, except that I have to go in like ten
minutes.
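
For illustration only, an application-side limit could look like the
Python sketch below. MAX_COMMENT_CHARS and clip_comment are
hypothetical names, not MediaWiki settings or functions, and the
column is assumed to be TEXT/BLOB so that it does no truncation of
its own.

```python
MAX_COMMENT_CHARS = 255  # hypothetical per-wiki setting, not a real MediaWiki option


def clip_comment(comment: str, max_chars: int = MAX_COMMENT_CHARS) -> str:
    """Clip by code points; the database column is assumed not to truncate."""
    return comment[:max_chars]


for ch in ("a", "\u4e2d"):
    clipped = clip_comment(ch * 300)
    print(len(clipped), len(clipped.encode("utf-8")))
# 255 255
# 255 765
```

Every script gets the same number of characters, and the result is
always valid UTF-8. Note that Python slices by code point, so a
combining sequence could still be split.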

(erm, end rant)

Re: revision comment too long in pages-articles.xml

by Bugzilla from jcsahnwaldt@gmail.com

On Fri, Oct 16, 2009 at 21:25, Jona Christopher Sahnwaldt
<jcsahnwaldt@...> wrote:
> Yet another fix: mwdumper should make sure
> that a comment is at most 255 bytes long and
> truncate it if necessary.

I implemented this fix / hack and checked it in at

http://dbpedia.svn.sourceforge.net/viewvc/dbpedia?view=rev&revision=1771

It seems to fix the problem for me. Feel free to
copy that code back to MediaWiki if you want.

Thanks for mwdumper, by the way! Nice little
piece of code. Quite clean and modular.


Christopher

Re: revision comment too long in pages-articles.xml

by mike.lifeguard

Aryeh Gregor wrote:
> The correct fix is just to make the field TEXT/BLOB so the length
> limit is enforced purely in the application.  The same goes for
> log_comment, where last I checked we weren't even doing the
> maxlength=200 hack.  Are there any objections to finally doing this?
> What's the procedure these days for schema changes?  I'd check in
> something right now, in fact, except that I have to go in like ten
> minutes.

There's long been a desire to lengthen the edit summary[1] (and log
reasons too) - why not do that while the patient is open on the
operating table?

-Mike

[1] https://bugzilla.wikimedia.org/show_bug.cgi?id=4714

Re: revision comment too long in pages-articles.xml

by Bugzilla from jcsahnwaldt@gmail.com

Oops, I just realized that this is a known bug:

https://bugzilla.wikimedia.org/show_bug.cgi?id=13721

Should have checked there first.

Anyway, allowing longer comments sounds like a good idea.

Re: revision comment too long in pages-articles.xml

by Aryeh Gregor

On Fri, Oct 16, 2009 at 7:08 PM, Mike.lifeguard
<mike.lifeguard@...> wrote:
> There's long been a desire to lengthen the edit summary[1] (and log
> reasons too) - why not do that while the patient is open on the
> operating table?

Yes, of course.  That would be trivial with the schema update in
place, and impossible without.
