Diffstat (limited to 'roms/edk2/MdeModulePkg/Universal/RegularExpressionDxe/oniguruma/doc/RE')
-rw-r--r-- | roms/edk2/MdeModulePkg/Universal/RegularExpressionDxe/oniguruma/doc/RE | 578 |
1 files changed, 578 insertions, 0 deletions
diff --git a/roms/edk2/MdeModulePkg/Universal/RegularExpressionDxe/oniguruma/doc/RE b/roms/edk2/MdeModulePkg/Universal/RegularExpressionDxe/oniguruma/doc/RE
new file mode 100644
index 000000000..4561698a7
--- /dev/null
+++ b/roms/edk2/MdeModulePkg/Universal/RegularExpressionDxe/oniguruma/doc/RE
@@ -0,0 +1,578 @@
+Oniguruma Regular Expressions Version 6.9.5    2020/01/28
+
+syntax: ONIG_SYNTAX_ONIGURUMA (default)
+
+
+1. Syntax elements
+
+  \       escape (enable or disable meta character)
+  |       alternation
+  (...)   group
+  [...]   character class
+
+
+2. Characters
+
+  \t              horizontal tab         (0x09)
+  \v              vertical tab           (0x0B)
+  \n              newline (line feed)    (0x0A)
+  \r              carriage return        (0x0D)
+  \b              backspace              (0x08)
+  \f              form feed              (0x0C)
+  \a              bell                   (0x07)
+  \e              escape                 (0x1B)
+  \nnn            octal char             (encoded byte value)
+  \o{17777777777} wide octal char        (character code point value)
+  \uHHHH          wide hexadecimal char  (character code point value)
+  \xHH            hexadecimal char       (encoded byte value)
+  \x{7HHHHHHH}    wide hexadecimal char  (character code point value)
+  \cx             control char           (character code point value)
+  \C-x            control char           (character code point value)
+  \M-x            meta  (x|0x80)         (character code point value)
+  \M-\C-x         meta control char      (character code point value)
+
+ (* \b as backspace is effective in character class only)
+
+
+3. Character types
+
+  .        any character (except newline)
+
+  \w       word character
+
+           Not Unicode:
+             alphanumeric, "_" and multibyte char.
+
+           Unicode:
+             General_Category -- (Letter|Mark|Number|Connector_Punctuation)
+
+  \W       non-word char
+
+  \s       whitespace char
+
+           Not Unicode:
+             \t, \n, \v, \f, \r, \x20
+
+           Unicode case:
+             U+0009, U+000A, U+000B, U+000C, U+000D, U+0085(NEL),
+             General_Category -- Line_Separator
+                              -- Paragraph_Separator
+                              -- Space_Separator
+
+  \S       non-whitespace char
+
+  \d       decimal digit char
+
+           Unicode: General_Category -- Decimal_Number
+
+  \D       non-decimal-digit char
+
+  \h       hexadecimal digit char   [0-9a-fA-F]
+
+  \H       non-hexdigit char
+
+  \R       general newline  (* can't be used in character-class)
+           "\r\n" or \n,\v,\f,\r  (* but doesn't backtrack from \r\n to \r)
+
+           Unicode case:
+             "\r\n" or \n,\v,\f,\r or U+0085, U+2028, U+2029
+
+  \N       negative newline  (?-m:.)
+
+  \O       true anychar      (?m:.)    (* original function)
+
+  \X       Text Segment    \X === (?>\O(?:\Y\O)*)
+
+           The meaning of this operator changes depending on the setting of
+           the option (?y{..}).
+
+           \X doesn't check whether matching start position is boundary or not.
+           Please write as \y\X if you want to ensure it.
+
+           [Extended Grapheme Cluster mode] (default)
+             Unicode case:
+               See [Unicode Standard Annex #29: http://unicode.org/reports/tr29/]
+
+             Not Unicode case:  \X === (?>\r\n|\O)
+
+           [Word mode]
+             Currently, this mode is supported in Unicode only.
+             See [Unicode Standard Annex #29: http://unicode.org/reports/tr29/]
+
+
+  Character Property
+
+    * \p{property-name}
+    * \p{^property-name}    (negative)
+    * \P{property-name}     (negative)
+
+    property-name:
+
+     + works on all encodings
+       Alnum, Alpha, Blank, Cntrl, Digit, Graph, Lower,
+       Print, Punct, Space, Upper, XDigit, Word, ASCII
+
+     + works on EUC_JP, Shift_JIS
+       Hiragana, Katakana
+
+     + works on UTF8, UTF16, UTF32
+       See doc/UNICODE_PROPERTIES.
+
+
+
+4. Quantifier
+
+  greedy
+
+    ?       1 or 0 times
+    *       0 or more times
+    +       1 or more times
+    {n,m}   (n <= m)  at least n but no more than m times
+    {n,}    at least n times
+    {,n}    at least 0 but no more than n times ({0,n})
+    {n}     n times
+
+
+  reluctant
+
+    ??      0 or 1 times
+    *?      0 or more times
+    +?      1 or more times
+    {n,m}?  (n <= m)  at least n but not more than m times
+    {n,}?   at least n times
+    {,n}?   at least 0 but not more than n times (== {0,n}?)
+
+    {n}? is reluctant operator in ONIG_SYNTAX_JAVA and ONIG_SYNTAX_PERL only.
+    (In that case, it doesn't make sense to write so.)
+    In default syntax, /a{n}?/ === /(?:a{n})?/
+
+
+  possessive (greedy and does not backtrack once match)
+
+    ?+      1 or 0 times
+    *+      0 or more times
+    ++      1 or more times
+    {n,m}   (n > m)  at least m but not more than n times
+
+    {n,m}+, {n,}+, {n}+ are possessive operators in ONIG_SYNTAX_JAVA and
+    ONIG_SYNTAX_PERL only.
+
+    ex. /a*+/ === /(?>a*)/
+
+
+5. Anchors
+
+  ^       beginning of the line
+  $       end of the line
+  \b      word boundary
+  \B      non-word boundary
+
+  \A      beginning of string
+  \Z      end of string, or before newline at the end
+  \z      end of string
+  \G      where the current search attempt begins
+  \K      keep (keep start position of the result string)
+
+
+  \y      Text Segment boundary
+  \Y      Text Segment non-boundary
+
+          The meaning of these operators(\y, \Y) changes depending on the setting
+          of the option (?y{..}).
+
+          [Extended Grapheme Cluster mode] (default)
+            Unicode case:
+              See [Unicode Standard Annex #29: http://unicode.org/reports/tr29/]
+
+            Not Unicode:
+              All positions except between \r and \n.
+
+          [Word mode]
+            Currently, this mode is supported in Unicode only.
+            See [Unicode Standard Annex #29: http://unicode.org/reports/tr29/]
+
+
+
+6. Character class
+
+  ^...    negative class (lowest precedence)
+  x-y     range from x to y
+  [...]   set (character class in character class)
+  ..&&..  intersection (low precedence, only higher than ^)
+
+    ex. [a-w&&[^c-g]z] ==> ([a-w] AND ([^c-g] OR z)) ==> [abh-w]
+
+  * If you want to use '[', '-', or ']' as a normal character
+    in character class, you should escape them with '\'.
+
+
+  POSIX bracket ([:xxxxx:], negate [:^xxxxx:])
+
+    Not Unicode Case:
+
+      alnum    alphabet or digit char
+      alpha    alphabet
+      ascii    code value: [0 - 127]
+      blank    \t, \x20
+      cntrl
+      digit    0-9
+      graph    include all of multibyte encoded characters
+      lower
+      print    include all of multibyte encoded characters
+      punct
+      space    \t, \n, \v, \f, \r, \x20
+      upper
+      xdigit   0-9, a-f, A-F
+      word     alphanumeric, "_" and multibyte characters
+
+
+    Unicode Case:
+
+      alnum    Letter | Mark | Decimal_Number
+      alpha    Letter | Mark
+      ascii    0000 - 007F
+      blank    Space_Separator | 0009
+      cntrl    Control | Format | Unassigned | Private_Use | Surrogate
+      digit    Decimal_Number
+      graph    [[:^space:]] && ^Control && ^Unassigned && ^Surrogate
+      lower    Lowercase_Letter
+      print    [[:graph:]] | [[:space:]]
+      punct    Connector_Punctuation | Dash_Punctuation | Close_Punctuation |
+               Final_Punctuation | Initial_Punctuation | Other_Punctuation |
+               Open_Punctuation
+      space    Space_Separator | Line_Separator | Paragraph_Separator |
+               U+0009 | U+000A | U+000B | U+000C | U+000D | U+0085
+      upper    Uppercase_Letter
+      xdigit   U+0030 - U+0039 | U+0041 - U+0046 | U+0061 - U+0066
+               (0-9, a-f, A-F)
+      word     Letter | Mark | Decimal_Number | Connector_Punctuation
+
+
+
+7. Extended groups
+
+  (?#...)            comment
+
+  (?imxWDSPy-imxWDSP:subexp)  option on/off for subexp
+
+                            i: ignore case
+                            m: multi-line (dot (.) also matches newline)
+                            x: extended form
+                            W: ASCII only word (\w, \p{Word}, [[:word:]])
+                               ASCII only word bound (\b)
+                            D: ASCII only digit (\d, \p{Digit}, [[:digit:]])
+                            S: ASCII only space (\s, \p{Space}, [[:space:]])
+                            P: ASCII only POSIX properties (includes W,D,S)
+                               (alnum, alpha, blank, cntrl, digit, graph,
+                                lower, print, punct, space, upper, xdigit, word)
+
+                            y{?}: Text Segment mode
+                               This option changes the meaning of \X, \y, \Y.
+                               Currently, this option is supported in Unicode only.
+
+                               y{g}: Extended Grapheme Cluster mode (default)
+                               y{w}: Word mode
+                               See [Unicode Standard Annex #29]
+
+  (?imxWDSPy-imxWDSP)  isolated option
+
+                      * It makes a group to the next ')' or end of the pattern.
+                        /ab(?i)c|def|gh/ == /ab(?i:c|def|gh)/
+
+
+  (?:subexp)         non-capturing group
+  (subexp)           capturing group
+
+  (?=subexp)         look-ahead
+  (?!subexp)         negative look-ahead
+
+  (?<=subexp)        look-behind
+  (?<!subexp)        negative look-behind
+
+                     * Cannot use Absent stopper (?~|expr) and Range clear
+                       (?~|) operators in look-behind and negative look-behind.
+
+                     * In look-behind and negative look-behind, support for
+                       ignore-case option is limited. Only supports conversion
+                       between single characters. (Does not support conversion
+                       of multiple characters in Unicode)
+
+  (?>subexp)         atomic group
+                     no backtracks in subexp.
+
+  (?<name>subexp), (?'name'subexp)
+                     define named group
+                     (Each character of the name must be a word character.)
+
+                     Not only a name but a number is assigned like a capturing
+                     group.
+
+                     Assigning the same name to two or more subexps is allowed.
+
+
+  <Callouts>
+
+  * Callouts of contents
+  (?{...contents...})           callout in progress
+  (?{...contents...}D)          D is a direction flag char
+                                D = 'X': in progress and retraction
+                                    '<': in retraction only
+                                    '>': in progress only
+  (?{...contents...}[tag])      tag assigned
+  (?{...contents...}[tag]D)
+
+                                * Escape characters have no effects in contents.
+                                * contents is not allowed to start with '{'.
+
+  (?{{{...contents...}}})       n times continuations '}' in contents is allowed in
+                                (n+1) times continuations {{{...}}}.
+
+    Allowed tag string characters: _ A-Z a-z 0-9 (* first character: _ A-Z a-z)
+
+
+  * Callouts of name
+  (*name)
+  (*name{args...})         with args
+  (*name[tag])             tag assigned
+  (*name[tag]{args...})
+
+    Allowed name string characters: _ A-Z a-z 0-9 (* first character: _ A-Z a-z)
+    Allowed tag  string characters: _ A-Z a-z 0-9 (* first character: _ A-Z a-z)
+
+
+  <Absent functions>
+
+  (?~absent)         Absent repeater    (* proposed by Tanaka Akira)
+                     This works like .* (more precisely \O*), but it is
+                     limited by the range that does not include the string
+                     match with <absent>.
+                     This is a written abbreviation of (?~|(?:absent)|\O*).
+                     \O* is used as a repeater.
+
+  (?~|absent|exp)    Absent expression  (* original)
+                     This works like "exp", but it is limited by the range
+                     that does not include the string match with <absent>.
+
+                     ex. (?~|345|\d*)  "12345678"  ==> "12", "1", ""
+
+  (?~|absent)        Absent stopper (* original)
+                     After passed this operator, string right range is limited
+                     at the point that does not include the string match with
+                     <absent>.
+
+  (?~|)              Range clear
+                     Clear the effects caused by Absent stoppers.
+
+                     * Nested Absent functions are not supported and the behavior
+                       is undefined.
+
+
+  <if-then-else>
+
+  (?(condition_exp)then_exp|else_exp)    if-then-else
+  (?(condition_exp)then_exp)             if-then
+
+                     condition_exp can be a backreference number/name or a normal
+                     regular expression.
+                     When condition_exp is a backreference number/name, both then_exp and
+                     else_exp can be omitted.
+                     Then it works as a backreference validity checker.
+
+  [ Backreference validity checker ]   (* original)
+
+    (?(n)), (?(-n)), (?(+n)), (?(n+level)) ...
+    (?(<n>)), (?('-n')), (?(<+n>)) ...
+    (?(<name>)), (?('name')), (?(<name+level>)) ...
+
+
+
+8. Backreferences
+
+  When we say "backreference a group," it actually means, "re-match the same
+  text matched by the subexp in that group."
+
+  \n  \k<n>     \k'n'     (n >= 1) backreference the nth group in the regexp
+      \k<-n>    \k'-n'    (n >= 1) backreference the nth group counting
+                          backwards from the referring position
+      \k<+n>    \k'+n'    (n >= 1) backreference the nth group counting
+                          forwards from the referring position
+      \k<name>  \k'name'  backreference a group with the specified name
+
+  When backreferencing with a name that is assigned to more than one group,
+  the last group with the name is checked first, if not matched then the
+  previous one with the name, and so on, until there is a match.
+
+  * Backreference by number is forbidden if any named group is defined and
+    ONIG_OPTION_CAPTURE_GROUP is not set.
+
+
+  backreference with recursion level
+
+    (n >= 1, level >= 0)
+
+    \k<n+level> \k'n+level'
+    \k<n-level> \k'n-level'
+
+    \k<name+level> \k'name+level'
+    \k<name-level> \k'name-level'
+
+    Destine a group on the recursion level relative to the referring position.
+
+    ex 1.
+
+      /\A(?<a>|.|(?:(?<b>.)\g<a>\k<b>))\z/.match("reee")
+      /\A(?<a>|.|(?:(?<b>.)\g<a>\k<b+0>))\z/.match("reer")
+
+      \k<b+0> refers to the (?<b>.) on the same recursion level with it.
+
+    ex 2.
+
+      r = Regexp.compile(<<'__REGEXP__'.strip, Regexp::EXTENDED)
+      (?<element> \g<stag> \g<content>* \g<etag> ){0}
+      (?<stag> < \g<name> \s* > ){0}
+      (?<name> [a-zA-Z_:]+ ){0}
+      (?<content> [^<&]+ (\g<element> | [^<&]+)* ){0}
+      (?<etag> </ \k<name+1> >){0}
+      \g<element>
+      __REGEXP__
+
+      p r.match("<foo>f<bar>bbb</bar>f</foo>").captures
+
+
+9. Subexp calls ("Tanaka Akira special")   (* original function)
+
+  When we say "call a group," it actually means, "re-execute the subexp in
+  that group."
+
+  \g<n>     \g'n'     (n >= 1) call the nth group
+  \g<0>     \g'0'     call zero (call the total regexp)
+  \g<-n>    \g'-n'    (n >= 1) call the nth group counting backwards from
+                      the calling position
+  \g<+n>    \g'+n'    (n >= 1) call the nth group counting forwards from
+                      the calling position
+  \g<name>  \g'name'  call the group with the specified name
+
+  * Left-most recursive calls are not allowed.
+
+    ex. (?<name>a|\g<name>b)    => error
+        (?<name>a|b\g<name>c)   => OK
+
+  * Calls with a name that is assigned to more than one group are not
+    allowed.
+
+  * Call by number is forbidden if any named group is defined and
+    ONIG_OPTION_CAPTURE_GROUP is not set.
+
+  * The option status of the called group is always effective.
+
+    ex. /(?-i:\g<name>)(?i:(?<name>a)){0}/.match("A")
+
+
+10. Captured group
+
+  Behavior of an unnamed group (...) changes with the following conditions.
+  (But named group is not changed.)
+
+  case 1. /.../     (named group is not used, no option)
+
+     (...) is treated as a capturing group.
+
+  case 2. /.../g    (named group is not used, 'g' option)
+
+     (...) is treated as a non-capturing group (?:...).
+
+  case 3. /..(?<name>..)../   (named group is used, no option)
+
+     (...) is treated as a non-capturing group.
+     numbered-backref/call is not allowed.
+
+  case 4. /..(?<name>..)../G  (named group is used, 'G' option)
+
+     (...) is treated as a capturing group.
+     numbered-backref/call is allowed.
+
+  where
+    g: ONIG_OPTION_DONT_CAPTURE_GROUP
+    G: ONIG_OPTION_CAPTURE_GROUP
+
+  ('g' and 'G' options are argued in ruby-dev ML)
+
+
+
+-----------------------------
+A-1. Syntax-dependent options
+
+   + ONIG_SYNTAX_ONIGURUMA
+     (?m): dot (.) also matches newline
+
+   + ONIG_SYNTAX_PERL and ONIG_SYNTAX_JAVA
+     (?s): dot (.) also matches newline
+     (?m): ^ matches after newline, $ matches before newline
+
+
+A-2. Original extensions
+
+   + hexadecimal digit char type  \h, \H
+   + true anychar                 \O
+   + text segment boundary        \y, \Y
+   + backreference validity checker   (?(...))
+   + named group                  (?<name>...), (?'name'...)
+   + named backref                \k<name>
+   + subexp call                  \g<name>, \g<group-num>
+   + absent expression            (?~|...|...)
+   + absent stopper               (?~|...)
+
+
+A-3. Missing features compared with perl 5.8.0
+
+   + \N{name}
+   + \l,\u,\L,\U,\C
+   + (??{code})
+
+   * \Q...\E
+     This is effective on ONIG_SYNTAX_PERL and ONIG_SYNTAX_JAVA.
+
+
+A-4. Differences with Japanized GNU regex(version 0.12) of Ruby 1.8
+
+   + add character property (\p{property}, \P{property})
+   + add hexadecimal digit char type (\h, \H)
+   + add look-behind
+     (?<=fixed-width-pattern), (?<!fixed-width-pattern)
+   + add possessive quantifier. ?+, *+, ++
+   + add operations in character class. [], &&
+     ('[' must be escaped as a usual char in character class.)
+   + add named group and subexp call.
+   + octal or hexadecimal number sequence can be treated as
+     a multibyte code char in character class if multibyte encoding
+     is specified.
+     (ex. [\xa1\xa2], [\xa1\xa7-\xa4\xa1])
+   + allow the range of single byte char and multibyte char in character
+     class.
+     ex. /[a-<<any EUC-JP character>>]/ in EUC-JP encoding.
+   + effect range of isolated option is to next ')'.
+     ex. (?:(?i)a|b) is interpreted as (?:(?i:a|b)), not (?:(?i:a)|b).
+   + isolated option is not transparent to previous pattern.
+     ex. a(?i)* is a syntax error pattern.
+   + allowed unpaired left brace as a normal character.
+     ex. /{/, /({)/, /a{2,3/ etc...
+   + negative POSIX bracket [:^xxxx:] is supported.
+   + POSIX bracket [:ascii:] is added.
+   + repeat of look-ahead is not allowed.
+     ex. /(?=a)*/, /(?!b){5}/
+   + Ignore case option is effective to escape sequence.
+     ex. /\x61/i =~ "A"
+   + In the range quantifier, the number of the minimum is optional.
+     /a{,n}/ == /a{0,n}/
+     The omission of both minimum and maximum values is not allowed.
+     /a{,}/
+   + /{n}?/ is not a reluctant quantifier.
+     /a{n}?/ == /(?:a{n})?/
+   + invalid back reference is checked and raises error.
+     /\1/, /(a)\2/
+   + Zero-width match in an infinite loop stops the repeat,
+     then changes of the capture group status are checked as stop condition.
+     /(?:()|())*\1\2/ =~ ""
+     /(?:\1a|())*/ =~ "a"
+
+// END
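
The sketch below is not part of the patch. It exercises a few constructs the added file documents (possessive quantifiers, \h, \K, class intersection and subexp calls), written in Ruby because the file's own examples use Ruby regexp literals. Ruby's engine is Onigmo, a fork of Oniguruma, so the sketch assumes these constructs behave the same in both engines; the variable names and sample strings are illustrative only.

```ruby
# Sketch only: Ruby's Regexp engine (Onigmo) is a fork of Oniguruma, so this
# assumes the constructs below behave the same way in both engines.

# Section 4: a possessive quantifier never gives characters back.
greedy     = /\Aa*a\z/.match?("aaa")      # a* backtracks one "a"  => true
possessive = /\Aa*+a\z/.match?("aaa")     # a*+ keeps all three    => false
atomic     = /\A(?>a*)a\z/.match?("aaa")  # /a*+/ === /(?>a*)/     => false

# Section 3: \h matches a hexadecimal digit, so "x" is skipped.
hex_digits = "0x1F".scan(/\h/)            # => ["0", "1", "F"]

# Section 5: \K drops everything matched so far from the result string.
kept = "foobar"[/foo\Kbar/]               # => "bar"

# Section 6: && intersects character classes; this class equals [abh-w].
intersect = "abcghwz".scan(/[a-w&&[^c-g]z]/)  # => ["a", "b", "h", "w"]

# Section 9: \g<name> re-executes a group, which allows recursion.
# The call is not left-most because "\(" precedes it.
balanced = /\A(?<paren>\((?:[^()]|\g<paren>)*\))\z/
nested_ok  = balanced.match?("(a(b)c)")   # => true
nested_bad = balanced.match?("(a(b c")    # => false

p greedy, possessive, atomic, hex_digits, kept, intersect, nested_ok, nested_bad
```

The anchored patterns (\A ... \z) make the difference between greedy and possessive matching visible: without the anchors, a search could still succeed by starting at a different position.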