[jira] Commented: (LANG-505) Rewrite StringEscapeUtils

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/LANG-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12716850#action_12716850 ]

Henri Yandell commented on LANG-505:
------------------------------------

Worked on this some more. Passing in CharSequences works well, with a codepoint version to ease coding (there is currently a char version too, but it may be pointless, as codepoints feel like chars). I dropped the idea of first querying whether an escape will happen and then making a second invocation to do the escape. It's not needed as yet.

escapeJava looks like:

{code:java}
    public static void escapeJava(String input, Writer out) throws IOException {
        AggregateEscaper escapers = new AggregateEscaper(
            new LookupTranslator(
                      new String[][] {
                            {"\"", "\\\""},
                            {"\\", "\\\\"}
                      }),
            new EscapeLowAsciiAsUnicode(),
            new EscapeNonAsciiAsUnicode()
        );
         
        escape(input, escapers, out);
    }
{code}

i.e. despite the API change, it is much the same as before. I went ahead and stopped the Escaper from extending FilterWriter, which is why the out variable is no longer repeated everywhere.
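To make the composition in escapeJava concrete, here is a minimal sketch of how an aggregate of escapers might behave. It assumes a first-match-wins rule (the first delegate that consumes anything handles that position), which the comment does not spell out. `AggregateSketch`, `Escaper`, `QUOTES`, `LOW_ASCII` and `escapeAt` are hypothetical names for illustration, not the classes from the patch.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

class AggregateSketch {
    // Hypothetical stand-in for the escaper contract: return the number of
    // code points consumed at the given index, or 0 for "not handled".
    interface Escaper {
        int escape(CharSequence input, int index, Writer out) throws IOException;
    }

    // Assumption: delegates are tried in order and the first one that consumes wins.
    static Escaper aggregate(Escaper... delegates) {
        return (input, index, out) -> {
            for (Escaper e : delegates) {
                int consumed = e.escape(input, index, out);
                if (consumed != 0) {
                    return consumed;
                }
            }
            return 0;
        };
    }

    // Puts a backslash in front of double quotes and backslashes.
    static final Escaper QUOTES = (input, index, out) -> {
        char c = input.charAt(index);
        if (c == '"' || c == '\\') {
            out.write('\\');
            out.write(c);
            return 1;
        }
        return 0;
    };

    // Writes control characters below 0x20 as a backslash, a 'u' and four hex digits.
    static final Escaper LOW_ASCII = (input, index, out) -> {
        char c = input.charAt(index);
        if (c < 0x20) {
            out.write(String.format("\\u%04X", (int) c));
            return 1;
        }
        return 0;
    };

    // Helper for experimenting: runs one escaper at one index and
    // returns "<consumed>:<output>".
    static String escapeAt(Escaper e, CharSequence input, int index) {
        StringWriter out = new StringWriter();
        try {
            int consumed = e.escape(input, index, out);
            return consumed + ":" + out;
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }
}
```

With `aggregate(QUOTES, LOW_ASCII)`, a tab at the given index is skipped by the first delegate and handled by the second, while a plain letter falls through both and reports 0 consumed, leaving it to the caller to copy it through.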

The core algorithm itself now looks like:

{code:java}
    public static void escape(CharSequence input, CharSequenceEscaper escaper, Writer out) throws IOException {
        if (out == null) {
            throw new IllegalArgumentException("The Writer must not be null");
        }
        if (input == null) {
            return;
        }
        if (escaper == null) {
            throw new IllegalArgumentException("The CharSequenceEscaper must not be null. " +
                                               "Use NullEscaper if you expected this to mean a no-operation");
        }
        // i is a char index into input; it advances by the char width of each codepoint
        int len = input.length();
        int i = 0;
        while (i < len) {

            // consumed is the number of codepoints consumed
            int consumed = escaper.escape(input, i, out);

            if (consumed == 0) {
                // no escaper handled this codepoint, so copy it through unchanged
                char[] chars = Character.toChars(Character.codePointAt(input, i));
                out.write(chars);
                i += chars.length;
            } else {
                // contract with escapers is that they have to understand codepoints,
                // so skip each consumed codepoint, surrogate pairs included
                for (int j = 0; j < consumed; j++) {
                    i += Character.charCount(Character.codePointAt(input, i));
                }
            }
        }
    }
{code}

This should be able to implement unescape as well, since it is now a simple general-purpose text translation API that can handle n->m changes in characters, i.e. a plugin can choose to consume 3 characters and turn them into 5, and vice versa.
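As an illustration of the n->m point, the sketch below runs the same kind of driver loop in both directions: one escaper turns 1 character into 4, another turns 4 back into 1. It is a sketch under stated assumptions, not the patch itself: `TranslateSketch`, `Escaper` and `Lookup` are hypothetical names, the driver tracks a char index and advances it by each code point's width, and it returns a String rather than taking a Writer.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

class TranslateSketch {
    // Hypothetical stand-in for the escaper contract: return the number of
    // code points consumed at the given char index, or 0 for "not handled".
    interface Escaper {
        int escape(CharSequence input, int index, Writer out) throws IOException;
    }

    // Replaces one fixed string with another, so n input chars become m output chars.
    static class Lookup implements Escaper {
        private final String from;
        private final String to;

        Lookup(String from, String to) {
            this.from = from;
            this.to = to;
        }

        @Override
        public int escape(CharSequence input, int index, Writer out) throws IOException {
            int end = index + from.length();
            if (end > input.length() || !from.contentEquals(input.subSequence(index, end))) {
                return 0;
            }
            out.write(to);
            return from.codePointCount(0, from.length());
        }
    }

    // Driver loop: i is a char index, advanced by the width of each code point.
    static String translate(CharSequence input, Escaper escaper) {
        StringWriter out = new StringWriter();
        try {
            int i = 0;
            while (i < input.length()) {
                int consumed = escaper.escape(input, i, out);
                if (consumed == 0) {
                    // Not handled: copy the code point through unchanged.
                    char[] chars = Character.toChars(Character.codePointAt(input, i));
                    out.write(chars);
                    i += chars.length;
                } else {
                    // Skip every consumed code point, surrogate pairs included.
                    for (int j = 0; j < consumed; j++) {
                        i += Character.charCount(Character.codePointAt(input, i));
                    }
                }
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(translate("a<b", new Lookup("<", "&lt;")));    // prints a&lt;b
        System.out.println(translate("a&lt;b", new Lookup("&lt;", "<"))); // prints a<b
    }
}
```

The driver never needs to know which direction it is running in; escaping and unescaping differ only in the lookup table handed to it.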

> Rewrite StringEscapeUtils
> -------------------------
>
>                 Key: LANG-505
>                 URL: https://issues.apache.org/jira/browse/LANG-505
>             Project: Commons Lang
>          Issue Type: Task
>            Reporter: Henri Yandell
>             Fix For: 3.0
>
>
> I think StringEscapeUtils needs a strong rewrite. For each escape method (and unescape) there tend to be three or four types of escaping happening. So not being able to define which set of three or four apply is a pain point (and cause of bug reports due to different desired features).
> We should be offering basic functionality, but also allowing people to say "escape(Escapers.BASIC_XML, Escapers.LOW_UNICODE, Escapers.HIGH_UNICODE)".
> Also should delete escapeSql; it's a bad one imo. Dangerous in that it will lead people to not use PreparedStatement and given it only escapes ', it's not much use. Especially as different dialects escape that in different ways.
> Opening this ticket for discussion.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
