I have some identifiers that appear at the end of some file names and can vary in length. Each one is either 8 or 12 characters long and is separated from the rest of the name by a delimiter. Any other length is invalid.
I would like to keep the pattern as simple as possible but I don't think there's a mechanism (in standard regular expression syntax) to do multiple lengths without repeating myself.
This will not work for me since it allows lengths of 9-11 which are invalid:
-[A-Za-z0-9]{8,12}$
I could do this but I don't like that I have to repeat the character groups:
-(?:[A-Za-z0-9]{8}|[A-Za-z0-9]{12})$
It gets a little unruly when there are more lengths I need to support:
-(?:[A-Za-z0-9]{8}|[A-Za-z0-9]{12}|[A-Za-z0-9]{16}|[A-Za-z0-9]{20}|[A-Za-z0-9]{24}|[A-Za-z0-9]{28}|[A-Za-z0-9]{32})$
Are there any other more concise ways to do this or is this the best I can do?
I will accept anything that works for my case, but it would be great if there were an option that works for arbitrary lengths.
My idea is similar to that of blhsing in that I would suggest checking the length up front. However, I would suggest a positive definition of the possible lengths. Just for illustration I use lengths 8, 12, and 14 so that they are not all multiples of 4.
My regex attempt would be:
-(?=(?:.{8}|.{12}|.{14})$)[A-Za-z0-9]+$
See a demo on regex101. The input was taken from Hao Wu's demo.
Explanation:
-
: Anchor the pattern to a literal -.
(?=(?: ... )$)
: Look ahead and check for the allowed string lengths between - and the end of the line.
.{8}|.{12}|.{14}
: In this case 8, 12, or 14.
[A-Za-z0-9]+$
: Finally, assert the string's composition up to the end of the line.
The reason I bothered to add an additional answer is that in a programming language like Python you can now generate the pattern from a list of allowed lengths, like so:
import re
strings=[
"some file name-ASDFghjk",
"some file name-ASDFghjk12",
"some file name-ASDFghjk1234",
"some file name-ASDFghjk123456",
"some file name-ASDFghjk12345678"
]
allowed_len=[8,12,14]
# Concatenate the allowed lengths to ".{a}|.{b}|...".
joined_len="|".join(".{"+str(n)+"}" for n in allowed_len)
# Use the concatenation in the regex pattern to "outsource" this step.
# The remaining pattern can easily be maintained here now.
pat=re.compile(rf"-(?=(?:{joined_len})$)[A-Za-z0-9]+$")
# Validate output.
[re.search(pat,s) for s in strings]
In general you can avoid spelling out the same character set multiple times by covering the full range of repetition counts with the quantifier {8,12} while excluding the invalid range {9,11} with a negative lookahead pattern like this:
-(?!.{9,11}$)[A-Za-z0-9]{8,12}$
Obviously, if you have multiple valid repetition counts you'll have to exclude each invalid range in between with its own negative lookahead, but at least you still avoid repeatedly spelling out the same character set.
@HaoWu's suggestion of using a subroutine would otherwise be the best option if your regex engine supports it.
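For more than two lengths, the negative lookaheads can be generated rather than typed by hand. Below is a minimal Python sketch of that idea; the helper name build_pattern and its parameters are my own illustration, not something from the answer above, and it assumes the delimiter is a literal - as in the question.

```python
import re

def build_pattern(allowed_lengths, charset="[A-Za-z0-9]"):
    """Return a pattern string that spans min..max of the allowed lengths
    and adds one negative lookahead per gap of disallowed lengths.
    (Hypothetical helper, for illustration only.)"""
    lengths = sorted(set(allowed_lengths))
    lookaheads = []
    # Walk neighbouring allowed lengths and exclude whatever lies between them.
    for a, b in zip(lengths, lengths[1:]):
        if b - a > 1:
            lo, hi = a + 1, b - 1
            quantifier = f"{{{lo}}}" if lo == hi else f"{{{lo},{hi}}}"
            lookaheads.append(f"(?!.{quantifier}$)")
    prefix = "".join(lookaheads)
    return f"-{prefix}{charset}{{{lengths[0]},{lengths[-1]}}}$"

pattern_text = build_pattern([8, 12])
pattern = re.compile(pattern_text)

names = [
    "some file name-ASDFghjk",      # 8 characters after the delimiter
    "some file name-ASDFghjk12",    # 10 characters
    "some file name-ASDFghjk1234",  # 12 characters
]
matched = [bool(pattern.search(name)) for name in names]
print(pattern_text)
print(matched)
```

With [8, 12] this should rebuild the same pattern as shown above; passing something like [8, 12, 14] adds a second lookahead for the single excluded length 13.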
Thought I'd add an answer to complement the working answers you already have. PCRE(2) does support a container (not sure what to name it otherwise) called (?(DEFINE)...) to pre-define patterns that can be re-used throughout the rest of your regular expression. This way you create a somewhat modular pattern. In your case it may be over-engineering a solution, but I thought I'd chuck in the option:
(?(DEFINE)(?<PW>[a-zA-Z0-9]))^.*-(?:(?&PW){8}|(?&PW){12})$
See an online demo
(?(DEFINE)(?<PW>[a-zA-Z0-9]))
- The construct at the start of the pattern that holds the named sub-pattern for later use. I have called it 'PW' for now;
^.*-(?:(?&PW){8}|(?&PW){12})$
- Rather self-explanatory. You can spot the uses of the previously defined sub-pattern named 'PW'.
Why use this? When a pattern becomes long and tedious, this is a nice way to improve readability and maintainability. Btw, the DEFINE construct can hold multiple subroutines like so: (?(DEFINE)(?<x>123)(?<y>456)); could be handy :)
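Python's built-in re module does not support (?(DEFINE)...) or (?&name), so the pattern above will not compile there. A similar kind of reuse can be approximated by keeping the fragment in a variable and building the pattern from it. This is a substitute technique (plain string composition), not the DEFINE construct itself; the names PW and ALLOWED_LENGTHS are only illustrative.

```python
import re

# The reusable fragment, playing the role of (?<PW>...) in the PCRE version.
PW = "[a-zA-Z0-9]"
# Assumed list of valid identifier lengths, taken from the question.
ALLOWED_LENGTHS = (8, 12)

# Build "PW{8}|PW{12}" from the single fragment definition.
alternatives = "|".join(f"{PW}{{{n}}}" for n in ALLOWED_LENGTHS)
pattern = re.compile(rf"^.*-(?:{alternatives})$")

print(pattern.pattern)
print(bool(pattern.search("some file name-ASDFghjk1234")))
print(bool(pattern.search("some file name-ASDFghjk12")))
```

The character set is still repeated in the compiled pattern, but it is written only once in the source, which is the maintainability benefit the DEFINE approach is after.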
What about ^(([a-z0-9]{4}){2,8})$
since you show in the last example having to support some different multiples of 4, 8 to 32. I used Notepad++ to check my results, hence the other changes in the expression.
Obviously, it only works in the situation you described: all multiples of 4 in the range of 8 to 32.
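A quick way to confirm which lengths this expression accepts is to try it against strings of every length in a range. The sketch below does that in Python rather than Notepad++, using a test string made only of the letter a (an arbitrary choice for illustration).

```python
import re

# The expression from this answer: groups of 4 characters, repeated 2 to 8 times.
pattern = re.compile(r"^(([a-z0-9]{4}){2,8})$")

# Collect every length from 1 to 40 for which a string of that length matches.
accepted = [n for n in range(1, 41) if pattern.search("a" * n)]
print(accepted)  # [8, 12, 16, 20, 24, 28, 32]
```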
It seems you already have the correct regex; you could probably shorten it:
-(?:[A-Za-z0-9]{4}){2,8}$
Can't think of anything else.
Details are in this link.
some file name-ASDFghjk1234 (extension omitted) Renaming them with PowerRename (part of PowerToys) and have Boost extensions available, though I'm not sure if that is really relevant here. – Jeff Mercado Commented Feb 21 at 0:50
\1 instead of ?1 but it didn't seem to work and I was unaware of that option. It definitely works for the engine I'm using it in so I'd consider putting that up as an answer. – Jeff Mercado Commented Feb 21 at 6:15
(?R)) is a PCRE-like regex feature, but I checked the link you have attached and it claims it's using the ECMAScript regex engine, which should not support it, so I didn't add it as an answer. Also, (?1) works but \1 does not because \1 is a back-reference (captured substring) but (?1) is a subroutine (the pattern itself). – Hao Wu Commented Feb 21 at 7:05