Re: CapitalizationStandardEnglish

by Pavan Chander

I agree that having GC pop a warning onto the screen for each possible candidate isn't the right way to go; it would quickly become extremely annoying and would eventually be dismissed without getting the proper attention. I have two possible ideas that could be implemented instead.

1) We have artist- and release-level data quality; maybe something similar could be used to deal with track (or any entity) capitalization?

2) This may be hackish (and might add to server/ws load?), but what if GC checked the folksonomy tags attached to whatever entity was being edited? Certain tags could be treated as "error codes": GC would check for any tag with a "gc_" prefix, and then look up an internal (or external) list of tags and their definitions.

For example, the track "Come On Eileen" might be assigned the tag "gc_on" or "gc_prep", and only after noticing that the track being edited has that tag would GC display a warning explaining why the editor shouldn't change the capitalization.

This would prevent the "boy who cried wolf" problem, not add anything complicated to the UI, and I'm hoping that because it's not an obvious/noticeable editing mechanism, it wouldn't be abused by newish editors. 
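A minimal sketch of how that tag check might look, written in Python purely for illustration (it is not Guess Case code). The "gc_" prefix and the tag names "gc_on" and "gc_prep" are from the idea above; the function name, the warning texts, and the handling of unknown tags are all assumptions.

```python
# Hypothetical tag definitions. Only the "gc_" prefix and the two tag names
# come from the proposal; the warning messages are invented for illustration.
GC_TAG_DEFINITIONS = {
    "gc_on": '"On" is not used as a preposition in this title; keep it capitalized.',
    "gc_prep": "A preposition-like word is used differently here; check before lowercasing.",
}

GC_PREFIX = "gc_"


def gc_warnings(tags):
    """Return warning messages for any gc_-prefixed tags on an entity."""
    warnings = []
    for tag in tags:
        name = tag.strip().lower()
        if not name.startswith(GC_PREFIX):
            continue  # ordinary folksonomy tag, not meant for Guess Case
        # An unknown gc_ tag still yields a generic warning instead of silence.
        warnings.append(
            GC_TAG_DEFINITIONS.get(name, f"Unrecognized Guess Case tag: {name}")
        )
    return warnings
```

With this sketch, an entity tagged only "pop" and "80s" produces no warnings, while one that also carries "gc_on" produces exactly one, so a warning appears only where somebody has deliberately flagged the title.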

For a short term fix, the release data quality could be raised to high. I believe capitalization changes don't get pushed through as autoedits in that case, and by putting them to a vote (requiring 4 edits this time FWIW) the edits/editors would have a chance of being caught in the act and then being canceled, instead of being reverted a month down the road.



Pavan Chander // navap


On Tue, Apr 21, 2009 at 12:39 AM, Brian Schweitzer <brian.brianschweitzer@...> wrote:
These are very problematic. 

The old Guess Case basically looked at these, shrugged, and made them all lowercase.

The new Guess Case tries a little harder; it tries to identify adjectives that would otherwise be prepositions by looking for what I call in the code "sentence/fragment ending punctuation", namely ,.:;/ and so on, as well as end of line. A word is pretty unlikely to be a preposition if it matches "word\s[,\.:;$]"; i.e., in "Come On", "on" is not being used as a preposition.
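The heuristic just described can be sketched roughly as follows. This is a Python illustration, not the actual Guess Case code: the word list is an abbreviated assumption, and the function name is hypothetical. Since "$" inside a character class would match a literal dollar sign, the sketch expresses "punctuation or end of line" as an alternation instead.

```python
import re

# Assumed, abbreviated list of preposition-like words; a real list is longer.
PREPOSITION_LIKE = ("on", "in", "up", "off", "out", "over", "down")

# A candidate word followed by optional whitespace and then either
# fragment-ending punctuation or the end of the string.
FRAGMENT_END = re.compile(
    r"\b(" + "|".join(PREPOSITION_LIKE) + r")\b(?=\s*(?:[,.:;/]|$))",
    re.IGNORECASE,
)


def capitalize_fragment_final(title):
    """Capitalize preposition-like words that end a sentence or fragment."""
    return FRAGMENT_END.sub(lambda m: m.group(1).capitalize(), title)
```

Under these assumptions "Come on" becomes "Come On" and "Come on, Eileen" becomes "Come On, Eileen", but "Come on Eileen" is left untouched, because nothing after "on" marks the end of a fragment.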

The "Come On Eileen" and other such cases are much more difficult; the only real way I could come up with to identify these would need a word list of every possible partner word for those prepositional phrases, to differentiate "Come on Down Eileen" from "Come On Eileen". Now, I think creating such a list, or using it if we actually managed to create it, is pretty much an impossibly large task. If anyone can think of other rules to try to match some more of these, great... but I don't think GC can do much more to try and be intelligent without very large dictionary lists, which would also slow the code down a bit. As for noting them, that would essentially require a heads-up notice any time any of the words appeared, and the list of words is long enough, and the words common enough, that such notices would be generated so frequently that I fear users would begin to ignore them entirely. (The boy who cried wolf type of problem...)

Brian


On Mon, Apr 20, 2009 at 2:56 PM, Christopher Key <cjk32@...> wrote:
Chris B wrote:
> my personal favourite is "Come On Eileen" vs "Come on Eileen" - two
> rather different meanings :) i've just done a search on that and
> lo-and-behold there's plenty of the latter. i suppose the best
> solution would be to politely warn the editor who added/changed that
> release by adding a note to their edit, and then change it back.
> adding a track annotation about it would probably help, as well.
>
Thanks, that certainly does illustrate the problem well, as well as
pointing out the importance of correct stress in spoken English!

Having a dig through the database, there are quite a few that were
incorrectly capitalised until recently.  There are also multiple
references to [1], which points out that the same problem exists for
similar titles.  It's a difficult problem, and I guess that the best
Guess Case could do would be to warn users when dealing with potentially
ambiguous titles.

For the original track, I'll revert to 'The In Set' (I'm assuming that
this is correct), and add a note to the annotation.  That way, if it
gets 'corrected' again, it may also stand some chance of being reverted
back.

Regards,

Chris

_______________________________________________
MusicBrainz-users mailing list
MusicBrainz-users@...
http://lists.musicbrainz.org/mailman/listinfo/musicbrainz-users

