JavaMiscellaneous: java.util.regex


java.util.regex

General remarks

String sells its soul to the regex package (although it is far from the most general solution) by having methods that implicitly use it (without mentioning it in the method names, and without any advantage)

PatternSyntaxException unchecked? (cf. ParseException, MalformedURLException, URISyntaxException)

Pattern.split - unnessary and confusing role of limit, why not an extra boolean "removeTrailing", also no way just to iterate (instead of having all the parts at once). Why can the Pattern split (and internally use a Matcher), but only the Matcher replace? (The replace methods are also unrelated to the matcher and destroy its state).

Why is the matcher obtained from the pattern (and not constructed directly)?

Who matches (what)?:

Matcher formalized

(Eiffelesque syntax)

public class Matcher -- silly name?
{
-- silly return values (itself, StringBuffer argument) not shown

     pattern: Pattern


     reset alias reset(input)

     reset(newInput: CharSequence)
         ensures !hasMatched, appendPosition == 0
         ensures input == newInput

--  groups

     start
         requires hasMatched
     end
         requires hasMatched

     group: String -- no better name? 'matched'
         requires hasMatched
         ensures result.equals(input.subSequence(start, end).toString)

     groupCount: Int -- property of the pattern alone

     start(group: Int)
         requires hasMatched
         requires 0 <= group <= groupCount

     end(group: Int)
         requires hasMatched
         requires 0 <= group <= groupCount

     group(group: Int): String/null
         requires hasMatched
         requires 0 <= group <= groupCount
         ensures start(i) == -1 => result == null
                 start(i) != -1 => result.equals(input.subSequence(start(group), end(group)).toString)




-- command-query-separation-violation match methods:

     -- the two 'find' methods are no more similar than the all four to each other.
     -- find(end) is not equivalent to 'find' (it resets appendPosition) 
     --
     -- Are there differences between
     --    find(0) && start == 0              and    lookingAt?   -- no!
     --    lookingAt && end == input.length   and    matches?   
     -- (assuming the left also unsets hasMatched) -- yes, the former may a different match
     -- that doesn't match the whole input
     -- so 'matches' adds to the expressibility, 'lookingAt' not (only optimizes because only
     -- matches at the beginning have to be considered.)
     -- Of course, one can also reach either by recompiling the pattern with an extra
     -- "^(?:" at the beginning and ")" or ")$" at the end.
     -- 

     matches: Boolean
         ensures result == hasMatched, result => start == 0 && end == input.length
         -- implementation ensures appendPosition == 0 (one would hardly use it)

     lookingAt: Boolean -- method name even worse than the others
         ensures result == hasMatched, result => start == 0 
         -- implementation ensures appendPosition == 0

     find: Boolean
         ensures result == hasMatched
         ensures old hasMatched => start >= old end -- more (not match the empty string twice)
         ensures appendPosition == old appendPostion -- implicitly

     find(from: Int): Boolean
         requires: 0 <= from <= input.length 
         ensures result == hasMatched, result => start >= from
         ensures appendPosition == 0 -- why not 'from'?


-- replacement by appending

     appendReplacement(into: StringBuffer, with: String)
         requires hasMatched
         requires validReplacement(with)
         -- ensures first appended characters are input.subSequence(appendPosition, start).toString
         ensures appendPosition == end

     appendTail(into: StringBuffer)
      -- does not update appendPosition?

-- there should be a feature to set the append position (at least to the end of the current match)

-- state-destructing (CQS-violating) replacement features (simple application of appendReplacement, appendTail)

     replaceFirst(with: String): String
         requires validReplacement(with)
         ensures nothing

     replaceAll(with: String): String
         requires validReplacement(with)
         ensures nothing
     


     invariant
         groupCount >= 0
         0 <= appendPosition < input.length
         hasMatched =>
             input.length >= end >= start >= 0
             start == start(0), end == end(0)
             -- proper ranges:
             forall 0 < i <= groupCount:
                 end(i) >= start(i)
                 input.length >= end(i) >= start(i) >= 0 || (start(i) == -1 && end(i) == -1)
             -- properly nested:
             forall 0 < i < groupCount:
                 start(i + 1) != -1 && start(i) != -1 => -- because of this, incomplete
                     start(i + 1) >= start(i) -- each group is started by a bracket, which follow other
                     start(i + 1) < end(i) => end(i + 1) <= end(i) -- if started before the end, must also close before it

     sneaky
         input: CharSequence  -- no way to get the input back?
         hasMatched: Boolean -- by lack of CQS, introduces sneaky preconditions
         appendPosition: Int -- useful because it occurs in postconditions and changes the behaviour
         validReplacement(s: String): Boolean
}

Missing features

Change of pattern (while keeping appendPosition/groups/end) (#4526572)

All matches (#4888230), also, why can't the Matcher also split (iteratively)?

Pattern.pattern should be named 'expression' or ' regex' (#4633112) - it's uncritical to change it!

Pattern.compile should have a functional name ("compiled", if nothing better)

Methods that take a String should take a CharSequence, not a String


(C) 2001-2009 Christian Kaufhold (java@chka.de)