[H-GEN] REGEX pattern required

Wed Dec 8 00:05:35 EST 2004

On Wed, 2004-12-08 at 13:29 +1000, Andrae Muys wrote:
> [ Humbug *General* list - semi-serious discussions about Humbug and     ]
> [ Unix-related topics. Posts from non-subscribed addresses will vanish. ]
> 
> Gary Curtis wrote:
> > What I want is a pattern to extract the RIGHT-MOST set
> > of numeric digits in the following test strings. I plan to use
> > the PHP function reg_split (or preg_split), and hope to get
> > a three element array.   array[1] will contain all text to the left
> > of the digits (perhaps null),  array[2] will contain the digits, and
> > array[3] will contain all text to the right of the digits (or null).
> > You may assume there is always at least one digit somewhere
> > in the string and NO spaces.
> > Examples:
> > 123                      null  123  null
> > 123abc                null  123  abc
> > jk234                    jk  234  null
> > abc34fgh             abc   34   fgh
> > a9bc456              a9bc   456   null
> > K+v5678aa=x       K+v    5678   aa=x
> > aa00001-123b     aa00001-   123   b
>  
> No memory required, so yes a regex can do this.  (If one part of the 
> pattern is defined in terms of another part of the pattern, you can't use 
> a regex.  The canonical example is matched parentheses --- the number 
> of closing parens is defined wrt. the number of preceding opening parens.)
> 
> What you want is (all non-numeric text)(numeric text)(text)
> 
> This becomes ([^0-9]*)([0-9]*)(.*)  escape and substitute appropriate 
> character class labels to suit.

This is an interesting exchange for several reasons. Firstly, we see
that an experienced regex user has gotten the answer wrong: the pattern
above captures the left-most run of digits rather than the right-most.
This might imply a misreading of the requirements, or an eagerness to
get the first post in... but it still reads like a red mark against
regexes. People do find them hard to read and write. Is there a solution
to this problem? Probably not in the general case. Simplifications of
the regex model for filename matching and the like do seem to solve most
problems users face, but every so often you do need to pull out the big
guns.
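
To see the failure concretely, here is a minimal Python sketch (my
choice of language, not the PHP the original poster asked about) that
applies the quoted pattern to two of the test strings. The helper name
split_leftmost is made up for illustration, and the ^...$ anchors are my
addition so that the whole string is consumed.

```python
import re

# The pattern from the quoted reply, anchored to cover the whole string.
LEFTMOST = re.compile(r"^([^0-9]*)([0-9]*)(.*)$")

def split_leftmost(s):
    """Split s with the quoted pattern; returns (left, digits, right)."""
    return LEFTMOST.match(s).groups()

# Only one run of digits: the result happens to be what was asked for.
print(split_leftmost("jk234"))    # ('jk', '234', '')

# Two runs of digits: the first run wins, not the right-most one.
print(split_leftmost("a9bc456"))  # ('a', '9', 'bc456')
```

The second result should have been a9bc, 456 and an empty tail.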

On the other side of things we can look at how we can fix this
problem. Several implementation-specific solutions have already been
suggested. A general one would be:
1) reverse(str)
2) run regex already described
3) reverse again! :)
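
The three steps above could look something like this in Python. It is
only a sketch: the function name split_rightmost_digits is my own
invention, and it relies on the original poster's guarantee that every
input contains at least one digit.

```python
import re

# Step 2 uses the pattern from the quoted reply, unchanged.
PATTERN = re.compile(r"^([^0-9]*)([0-9]*)(.*)$")

def split_rightmost_digits(s):
    """Return (left, digits, right) around the right-most run of digits."""
    # 1) reverse the string
    m = PATTERN.match(s[::-1])
    # 2) the groups now come back in right-to-left order, each one reversed
    tail, digits, head = m.groups()
    # 3) reverse each piece again and restore left-to-right order
    return head[::-1], digits[::-1], tail[::-1]

print(split_rightmost_digits("a9bc456"))       # ('a9bc', '456', '')
print(split_rightmost_digits("aa00001-123b"))  # ('aa00001-', '123', 'b')
```

Note that the group order flips along with the string, so the first
group of the match is the text to the right of the digits.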

This shows that our current regexes have a left-to-right bias. Perhaps
the bias could instead be encoded into the regex itself, with a
(hypothetical) flag:
(.*)([0-9]*)([^0-9]*)/right-bias

An implementation could essentially reverse the string, reverse the
regex, then run the existing code paths before restoring the original
string direction.
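
As a rough illustration of that idea, the Python sketch below fakes a
right-bias flag for one very restricted case: a pattern written as a
plain sequence of groups, where each group reads the same in either
direction (a single character class with a star, say). The function
right_biased_groups and its list-of-groups argument are hypothetical; a
real implementation would have to reverse arbitrary regex syntax.

```python
import re

def right_biased_groups(group_patterns, s):
    """Match s against the groups, letting the right-most groups grab first.

    group_patterns is the regex as it would be written left to right,
    one string per group. Each group must be direction-neutral.
    """
    # Reverse the regex: same groups, opposite order.
    reversed_regex = "^" + "".join("(%s)" % p for p in reversed(group_patterns)) + "$"
    # Reverse the string and run the ordinary left-to-right engine.
    m = re.match(reversed_regex, s[::-1])
    if m is None:
        return None
    # Normalise: restore the group order and the direction of each piece.
    return tuple(g[::-1] for g in reversed(m.groups()))

print(right_biased_groups([".*", "[0-9]*", "[^0-9]*"], "aa00001-123b"))
# ('aa00001-', '123', 'b')
```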

Another approach is to try and stem the greed with more and more
complicated regular expressions:
sed 's=^\([0-9][0-9]*\)\([^0-9]*\)$=\1 \2='
# ^ matches number-at-start
# (at least one number)(non-numbers to the end)
sed 's=^\(.*[^0-9]\)\([0-9][0-9]*\)\([^0-9]*\)$=\1 \2 \3='
# ^ matches general case
# (anything leading up to a non-number)(at least one number)
# (non-numbers to the end)
sed 's=^\(.*[^0-9]\)\?\([0-9][0-9]*\)\([^0-9]*\)$=\1 \2 \3='
# ^ matches every case (I believe... the ? is a bit of an extension, of
# course, and your backslashes may vary... :)
# (optionally, anything leading up to a non-number)(at least one
# number)(non-numbers to the end)
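
For anyone who would rather check the last pattern outside of sed, here
is my attempt at the same expression in Python's syntax, run against
the seven test strings from the original post. The name
split_with_greedy_prefix is made up, and the translation of the
backslashed groups into plain parentheses is my own, so treat it as a
sketch rather than a proof.

```python
import re

# (optionally, anything up to a non-number)(at least one number)(non-numbers)
PATTERN = re.compile(r"^(.*[^0-9])?([0-9]+)([^0-9]*)$")

def split_with_greedy_prefix(s):
    """Return (left, digits, right), or None if s contains no digits."""
    m = PATTERN.match(s)
    # groups("") substitutes "" when the optional prefix group did not match.
    return m.groups("") if m else None

EXPECTED = {
    "123": ("", "123", ""),
    "123abc": ("", "123", "abc"),
    "jk234": ("jk", "234", ""),
    "abc34fgh": ("abc", "34", "fgh"),
    "a9bc456": ("a9bc", "456", ""),
    "K+v5678aa=x": ("K+v", "5678", "aa=x"),
    "aa00001-123b": ("aa00001-", "123", "b"),
}

for text, want in EXPECTED.items():
    got = split_with_greedy_prefix(text)
    print(text, got, "ok" if got == want else "MISMATCH")
```

The trick is that the third group may only contain non-digits, so the
greedy prefix is forced to stop just before the last run of digits.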

It's a pity you can't adjust the left-right bias, isn't it :)

-- 
Benjamin Carlyle <benjamincarlyle at optusnet.com.au>