I want a regular expression to match a number surrounded by a single pair of parentheses, e.g., it would match something that looks like this:
(1)
But it should not match the (1) inside this:
((1))
Originally I tried this:
([^\(])\(([0-9]+)\)([^\)])
But it failed to match singly-parenthesised numerals at the very beginning or very end of a string. So blah blah (1) did not return a match, even though it very clearly contains (1). This is because the regular expression above requires a character that isn't an opening parenthesis before the match and a character that isn't a closing parenthesis after it, when at the beginning or end of a string there is no character to find.
Then I tried this:
([^\(]?)\(([0-9]+)\)([^\)]?)
This successfully matched (1) but also matched the (1) inside ((1)), because it just ignored the surrounding parentheses in the regular expression. So this one was too broad for my needs.
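Both attempts can be reproduced in a few lines. This is only a sketch of the behaviour described above; the variable names are made up for illustration.

```javascript
// First attempt: requires an actual character on both sides of "(n)".
var strictAttempt = /([^\(])\(([0-9]+)\)([^\)])/;
// Second attempt: the neighbouring characters are optional.
var looseAttempt = /([^\(]?)\(([0-9]+)\)([^\)]?)/;

// Nothing follows the ")" here, so the trailing [^\)] has nothing to match.
var strictAtEnd = strictAttempt.test("blah blah (1)");
// With a character on each side, the first attempt does match.
var strictInMiddle = strictAttempt.test("blah (1) blah");
// The optional classes match nothing, so the inner "(1)" is found anyway.
var looseOnDouble = looseAttempt.exec("((1))");

console.log(strictAtEnd);       // false
console.log(strictInMiddle);    // true
console.log(looseOnDouble[0]);  // "(1)"
console.log(looseOnDouble[2]);  // "1"
```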
I'll keep experimenting and will post a solution here if I find one, but any help would be much appreciated. Any ideas?
PLEASE NOTE: I am using JavaScript. Some regular expression features are not included in JavaScript.
UPDATE:
I did not explicitly note that, when matching, capturing the number inside the parentheses is important. (I hope that won't adversely affect the solutions given so far below, apart from making them trickier to read!) However, the whole of (1) should be replaced as a result, so matching both parentheses is important too.
All the thought-provoking responses led me to draw up a bunch of desired results for different situations. Hopefully, this makes it clearer what the aim of the expression should be.
(1) ==> match '(1)' and capture '1'
((1)) ==> no match
(((1))) ==> no match
(1) (2) ==> match '(1)' and '(2)' and capture '1' and '2'
(1) ((2)) ==> match '(1)' and capture '1'
((1) (2)) ==> match '(1)' and '(2)' and capture '1' and '2'
(1)(2) ==> match '(1)' and '(2)' and capture '1' and '2' [ideally] OR no match
(1)((2)) ==> match '(1)' and capture '1' [ideally] OR no match
((1)(2)) ==> match '(1)' and '(2)' and capture '1' and '2' [ideally] OR no match
For these last three, I say 'ideally' because there is leniency. The first result is the preferred one but, if that isn't possible, I can live with there being no match at all. I realise this is something of a challenge (maybe even impossible, within JavaScript's RegExp limitations) but that's why I'm putting the question to this expert forum.
catdogcat conundrum. And I didn't ask originally about lookarounds; I acknowledged upfront this was a JavaScript question, as I knew lookbehinds don't feature in JS. But I accept that maybe this question is too similar to the other to keep mine open. – guypursey, Jul 4, 2013 at 14:04
match in JavaScript's specific sense when crafting the title. – guypursey, Jul 4, 2013 at 14:10
This problem is likely impossible to solve in a robust fashion with regular expressions alone, because this is not a regular grammar: balancing parentheses basically moves it up Chomsky's language complexity hierarchy. So to robustly solve this problem, you actually have to write a parser and create an expression tree. While this may sound daunting, it's really not that bad. Here's the complete solution:
// parse our little parentheses-based language; this will result in an expression
// object that contains the text of the expression, and any children (subexpressions)
// that represent balanced parentheses groups. because the expression objects contain
// start indexes for each balanced parentheses group, you can do fast substitution in the
// original input string if desired
function parse(s) {
    var expr = {text:s, children:[]}; // root expression; also stores current context
    for( var i=0; i<s.length; i++ ) {
        switch( s[i] ) {
            case '(':
                // start of a subexpression; create subexpression and change context
                var subexpr = {parent: expr, start_idx: i, children:[]};
                expr.children.push(subexpr);
                expr = subexpr;
                break;
            case ')':
                // end of a subexpression; fill out subexpression details and change context
                if( !expr.parent ) throw new Error( 'Unmatched group!' );
                expr.text = s.substring( expr.start_idx, i + 1 );
                expr = expr.parent;
                break;
        }
    }
    return expr;
}
// a "valid tag" is (n) where the parent is not ((n));
function getValidTags(expr,tags) {
    // at the beginning of recursion, tags may not be defined
    if( tags===undefined ) tags = [];
    // if the parent is ((n)), this is not a valid tag so we can just kill the recursion
    if( expr.parent && expr.parent.text.match(/^\(\(\d+\)\)$/) ) return tags;
    // since we've already handled the ((n)) case, all we have to do is see if this is an (n) tag
    if( expr.text.match(/^\(\d+\)$/) ) tags.push( expr );
    // recurse into children; each call pushes into the shared tags array
    expr.children.forEach(function(c){ getValidTags(c,tags); });
    return tags;
}
You can see this solution in action here: http://jsfiddle/SK5ee/3/
Without knowing your application, or all the details of what you're trying to do, this solution may or may not be overkill for you. However, its advantage is that you can make your solution pretty much arbitrarily sophisticated. For example, you may want to be able to "escape" parentheses in your input, thereby taking them out of the normal parenthesis-balancing equation. Or you might want to ignore parentheses inside quotation marks or the like. With this solution, you simply have to extend the parser to cover these situations, and the solution can be made even more robust. If you stick with some clever regex-based solution, you might find yourself up against a wall if you need to extend your syntax to cover these types of enhancements.
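As a rough illustration of the substitution that the start indexes make possible, here is a compact stack-based variant of the same idea. It is not the code from the fiddle: replaceSingleTags and its replacer callback are hypothetical names, and unlike the parser above it silently ignores groups that are never closed.

```javascript
// Hypothetical helper: replaces every "(n)" unless the group directly
// enclosing it is exactly "((n))".
function replaceSingleTags(input, replacer) {
  var open = [];    // start indexes of groups that are still open
  var endOf = {};   // start index -> end index, for groups already closed
  var groups = [];  // closed groups, in the order they were closed
  for (var i = 0; i < input.length; i++) {
    if (input[i] === '(') {
      open.push(i);
    } else if (input[i] === ')') {
      if (open.length === 0) throw new Error('Unmatched group!');
      var start = open.pop();
      var parentStart = open.length ? open[open.length - 1] : -1;
      endOf[start] = i;
      groups.push({ start: start, end: i, parentStart: parentStart });
    }
  }
  var out = '';
  var cursor = 0;
  groups.forEach(function (g) {
    var m = /^\((\d+)\)$/.exec(input.slice(g.start, g.end + 1));
    if (!m) return; // not an (n) group
    // The parent is "((n))" only if it opens right before and closes right after.
    var wrapped = g.parentStart === g.start - 1 &&
                  endOf[g.parentStart] === g.end + 1;
    if (wrapped) return;
    out += input.slice(cursor, g.start) + replacer(m[1]);
    cursor = g.end + 1;
  });
  return out + input.slice(cursor);
}

var bracketed = replaceSingleTags("(1) ((2)) (3)", function (n) {
  return "[" + n + "]";
});
console.log(bracketed); // "[1] ((2)) [3]"
```

Because only number-only groups are replaced and those cannot contain other groups, they are visited from left to right, so the output can be built in a single pass.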
If my understanding is correct, you want to get the numbers that are inside single parentheses but you want to exclude numbers inside double parentheses. I'm going to further assume you just want an ordered list of those numbers. Based on that, this is what you're looking for:
a) "(1)(2)((3))" => [1,2]
b) " (5) ((7)) (8) " => [5,8]
What's not clear is what happens when parentheses aren't balanced, or when there's more than just numbers inside the parentheses. There's no support for balanced matching in JavaScript regular expressions, so the following cases will cause problems:
"((3) (2)" => [2] (probably we want [3,2]???)
"((3) (2) (4) (5))" => [2,4] (probably we want [3,2,4,5]???)
What's clear from those last two examples is that the whole thing hinges on determining whether there are one or two parentheses before a number, not on when the parentheses group is closed. If these examples need to be handled, you will have to construct a tree of parenthesis groups and go from there. That's a harder problem, which I'm not going to address here.
So, that leaves us with two problems: how do we handle matches that are butted up against one another, as in (1)(2), and how do we handle matches that start at the beginning of a string, as in (1)blah blah?
We'll ignore the second problem for now to focus on the harder of the two.
Obviously, if we don't care if the parenthesis is closed, we can get what we want this way:
" (1)(2)((3)) ".match(/[^(]\(\d+/g) => [" (1", ")(2"]
So far so good, but this could yield results we don't want:
" (1: a thing (2)(3)((4)) ".match(/[^(]\(\d+/g) => [" (1", " (2", ")(3"]
So we clearly want to check for the closing parenthesis, which works for this:
" (1) (2) ((3)) ".match(/[^(]\(\d+\)/g) => [" (1)", " (2)"]
But fails when matches are butted up against one another:
" (1)(2)((3)) ".match(/[^(]\(\d+\)/g) => [" (1)"]
What we need, then, is to match that closing parenthesis but not consume it. That's the whole idea behind "lookahead" matches (sometimes called "zero-width assertions"). The idea is you make sure it's there, but you don't include it as part of the match, so it doesn't prevent the character from being included in future matches. In JavaScript, lookahead matches are specified with the (?=subexpression) syntax:
" (1)(2)((3)) ".match(/[^(]\(\d+(?=\))/g) => [" (1", ")(2"]
Okay, so that solves that problem! On to the easier problem of what to do about matches that occur at the beginning/end of the string. Really, all we have to do is use alternation to say "match something that's not an opening parenthesis OR the beginning of the string", etc.:
"(1)(2)((3))".match(/(^|[^(])\(\d+(?=\))/g) => ["(1", ")(2"]
Another, "sneakier" way to do it is to just pad your input string to avoid the problem altogether:
s = "(1)(2)((3))"; // our original input
(" " + s + " ").match(/[^(]\(\d+(?=\))/g) => [" (1", ")(2"]
That way we don't have to fuss with alternation.
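The progression above can be checked side by side. This is a small sketch using the same patterns as the walkthrough; only the variable names are new.

```javascript
var padded = " (1)(2)((3)) ";
var unpadded = "(1)(2)((3))";

// Consuming the ")" leaves nothing for the [^(] in front of "(2)".
var consuming = padded.match(/[^(]\(\d+\)/g);
// A lookahead checks for ")" without consuming it, so ")(2" can still match.
var lookahead = padded.match(/[^(]\(\d+(?=\))/g);
// Alternation with ^ handles a tag at the very start of an unpadded string.
var anchored = unpadded.match(/(^|[^(])\(\d+(?=\))/g);
// Padding the input makes the alternation unnecessary.
var viaPadding = (" " + unpadded + " ").match(/[^(]\(\d+(?=\))/g);

console.log(consuming);   // [" (1)"]
console.log(lookahead);   // [" (1", ")(2"]
console.log(anchored);    // ["(1", ")(2"]
console.log(viaPadding);  // [" (1", ")(2"]
```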
Okay, this has been a crazy long answer, but I'm going to wrap it up with how to clean up our output. Clearly, we don't want those strings with all the extra match garbage: we just want the numbers. There are lots of ways to accomplish this, but here are my favorites:
// if your JavaScript implementation supports Array.prototype.map():
" (1)(2)((3)) ".match( /[^(]\(\d+(?=\))/g )
    .map(function(m){return m.match(/\d+/)[0];})
// and if not:
var matches = " (1)(2)((3)) ".match( /[^(]\(\d+(?=\))/g );
for( var i=0; i<matches.length; i++ ) {
    matches[i] = matches[i].match(/\d+/)[0];
}
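Another option, sketched here as an assumption rather than as part of the answer, is to put a capture group around the digits and collect it with RegExp.prototype.exec in a loop. That avoids the second match per result and also copes with inputs that contain no tags, where String.prototype.match would return null. The function name extractNumbers is hypothetical.

```javascript
// Collects the digits of every "(n)" that is not directly preceded by "(".
// Same pattern as above, with ^ alternation and a capture group added.
function extractNumbers(s) {
  var re = /(?:^|[^(])\((\d+)(?=\))/g;
  var numbers = [];
  var m;
  // With the g flag, exec resumes from re.lastIndex on each call.
  while ((m = re.exec(s)) !== null) {
    numbers.push(m[1]);
  }
  return numbers;
}

console.log(extractNumbers("(1)(2)((3))"));     // ["1", "2"]
console.log(extractNumbers(" (5) ((7)) (8) ")); // ["5", "8"]
console.log(extractNumbers("no tags here"));    // []
```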
After the OP updated the question with some input samples and expected output, I was able to craft some regexes to satisfy all of the sample input. Like so many regex solutions, the answer is often multiple regexes, instead of a single giant one.
NOTE: while this solution works for all the OP's sample inputs, there are all kinds of cases in which it will fail. See the parser-based solution for a complete, waterproof approach.
Basically this solution involves first matching for things that (sort of) look like parentheses groups:
/\(+.+?\)+/g
Once you get all of those, you check to see if they're either invalid tags (((n)), (((n))), etc.), or good ones:
if( s.match(/\(\(\d+\)\)/) ) return null;
return s.match(/\(\d+\)/);
You can see this solution working for all the OP's sample input here:
http://jsfiddle/Cb5aG/
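Put together, the two steps might look like the following. This is a sketch, not the code from the fiddle; findSingleTags is a hypothetical name, and it returns the captured numbers rather than the match objects.

```javascript
// Sketch of the two-step approach described above.
function findSingleTags(input) {
  // Step 1: grab anything that roughly looks like a parentheses group.
  var candidates = input.match(/\(+.+?\)+/g) || [];
  var numbers = [];
  candidates.forEach(function (candidate) {
    // Step 2a: discard candidates that contain a doubly wrapped number.
    if (/\(\(\d+\)\)/.test(candidate)) return;
    // Step 2b: keep the number of a singly wrapped tag, if there is one.
    var tag = candidate.match(/\((\d+)\)/);
    if (tag) numbers.push(tag[1]);
  });
  return numbers;
}

console.log(findSingleTags("(1) ((2))")); // ["1"]
console.log(findSingleTags("((1)(2))"));  // ["1", "2"]
```

For "((1)(2))" the first step yields the candidates "((1)" and "(2))", neither of which contains a complete ((n)), so both numbers survive the second step.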
Answer to your edit
So you want to replace! That means your problem is practically equivalent to this one. That also makes things a lot easier. What we do is:
- match ((number)) and ignore it
- match (number) and replace it
The first option will automatically be given precedence (because it starts further to the left, if both apply), so that option will swallow up all unwanted occurrences:
"input".replace(/([(][(]\d+[)][)])|[(]\d+[)]/g, function(match, $1) {
    if ($1)
        return $1;
    else
        return do_whatever_you_want_with(match);
});
So we have two cases: match ((number)) and capture into group 1, or match (number) and let group 1 be undefined.
The replacement is done via a callback, which takes the entire match as the first argument and the first capture group as the second (here $1). Then we check whether $1 was used: if so, we simply return it, hence replacing nothing. If not, we can do whatever we want with match (which will be (number)). Of course, you can also capture the number only into another variable $2 and use that if it's more convenient.
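Here is a minimal sketch of that variant with the number captured in a second group. The names replaceSingle and transform are placeholders for whatever replacement logic is needed.

```javascript
// Replaces each "(n)" with transform(n), leaving every "((n))" untouched.
function replaceSingle(input, transform) {
  var re = /([(][(]\d+[)][)])|[(](\d+)[)]/g;
  return input.replace(re, function (match, doubled, number) {
    // Group 1 is only defined when the ((number)) alternative matched.
    return doubled ? doubled : transform(number, match);
  });
}

var result = replaceSingle("(1) ((2)) (3)", function (n) {
  return "[" + n + "]";
});
console.log(result); // "[1] ((2)) [3]"
```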
Original answer, regarding matching alone:
What you would need are lookarounds, but JavaScript does not support lookbehinds. I've explained some more elaborate workarounds here. But since your lookbehind is only for a single character, checking for the beginning of the string or a different character is sufficient. This leads to
/(?:^|[^(])[(](\d+)[)](?:[^)]|$)/
There is another problem though: matches cannot overlap! In (1)(2), the engine matches (1)( (because the [^)] includes a character in the match). Hence, (2) cannot be matched, because that would overlap with the previous match.
So we remove it from the first match, by putting everything after the digit into a lookahead:
/(?:^|[^(])[(](\d+)(?=[)](?:[^)]|$))/
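To collect every capture with this pattern, one possibility is to add the g flag and call exec in a loop. The following is a sketch under that assumption; captureSingles is a hypothetical name.

```javascript
// Returns the digits of each "(n)" that is neither preceded by "("
// nor followed by ")".
function captureSingles(s) {
  var re = /(?:^|[^(])[(](\d+)(?=[)](?:[^)]|$))/g;
  var found = [];
  var m;
  while ((m = re.exec(s)) !== null) {
    found.push(m[1]);
  }
  return found;
}

console.log(captureSingles("(1)(2) ((3))")); // ["1", "2"]
console.log(captureSingles("((1) abc)"));    // []
```

The closing parenthesis sits inside the lookahead, so it stays available as the leading [^(] character of the next match.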
Note however, that this solution also rules out digits that have a double parenthesis on only one side: for instance neither ((1) abc) nor (abc (2)) nor ((1) (2)) would yield a match. If this is not what you are looking for, you need to put the two cases (preceding and trailing parenthesis) in an alternation. To make this easier, it helps to pull the lookahead in front of the digits:
/(?:^|[^(]|(?=[(]\d+[)](?:[^)]|$)))[(](\d+)/
Confusing, I know. But JavaScript's regex flavor is very limited, after all.
Here it is with a negative lookahead after the opening parenthesis and another after the closing one:
\((?!\()(\d+)\)(?!\))
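A quick way to see what this pattern does on a few of the question's sample inputs is sketched below; the function name is made up for illustration.

```javascript
// Applies the pattern above globally and returns the captured numbers.
function matchWithLookaheads(s) {
  var re = /\((?!\()(\d+)\)(?!\))/g;
  var found = [];
  var m;
  while ((m = re.exec(s)) !== null) {
    found.push(m[1]);
  }
  return found;
}

console.log(matchWithLookaheads("(1) ((2))")); // ["1"]
console.log(matchWithLookaheads("(1)(2)"));    // ["1", "2"]
console.log(matchWithLookaheads("((1))"));     // []
console.log(matchWithLookaheads("((1) (2))")); // ["1"]
```

Note the last case: the (2) is followed by a closing parenthesis and is therefore rejected, whereas the question's table asks for both numbers there.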
is this what you want?
"(1)(2)((3))".match(/(\({1}\d+\){1})/g) // === ["(1)", "(2)", "(3)"]
looks like what you want, and seems a bit simpler than other methods, but maybe i'm missing something...
EDIT: missed a req, thought it was too easy...
well, there's a limitation in the js regexp that will make this a bear to code, so i would do something slightly different that gets the desired results:
"(1)(2)((3))".match(/(\({1,}\d+\){1,})/g)
    .filter(/./.test, /^\(\d+\)$/) // == ["(1)", "(2)"]