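Regular expressions are a powerful pattern-matching tool supported by C, Python, and many other languages, and the newer, fashionable Swift is no exception. By the end of this tutorial you should have a feel for how much power regular expressions give the programmer.
This tutorial first introduces the matching patterns available in Swift, with a range of examples; it then covers NSRegularExpression, the class Apple provides for working with them; and it closes with a more complex example that ties everything together. Besides regular expressions, it also touches on error handling, closures, and reading and writing files. If you spot omissions or mistakes, corrections are welcome.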
Part One: Swift Regular Expressions
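The idea behind regular expressions is simple: given a pattern (of type String), you check whether a subject String satisfies that pattern, and if it does, you can retrieve the matching part.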
For example, apple is a pattern. It matches Strings such as apple tree and I love apples., and in both cases the matched text is apple.
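Beyond literal text, regular expressions use special symbols that stand for unspecified values. For example, d.g matches dog, dig, dag, and similar Strings, which is where regular expressions get their power.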
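The two patterns above can be tried directly. The following is a minimal sketch, assuming Foundation is available; `allMatches(of:in:)` is a hypothetical helper written for this illustration, not part of any Apple API.

```swift
import Foundation

// Hypothetical helper (not from the tutorial): returns every substring of
// `text` that `pattern` matches, in order of appearance.
func allMatches(of pattern: String, in text: String) -> [String] {
    // An invalid pattern simply yields no matches in this sketch;
    // real code may prefer to propagate the thrown error instead.
    guard let regex = try? NSRegularExpression(pattern: pattern, options: []) else {
        return []
    }
    // NSRange covering the whole string, measured in UTF-16 code units.
    let fullRange = NSRange(text.startIndex..., in: text)
    return regex.matches(in: text, options: [], range: fullRange).compactMap { match in
        // Convert the NSRange back into a Swift String range.
        Range(match.range, in: text).map { String(text[$0]) }
    }
}

// A literal pattern matches itself wherever it occurs.
print(allMatches(of: "apple", in: "apple tree"))
print(allMatches(of: "apple", in: "I love apples."))

// The dot stands for any single character.
print(allMatches(of: "d.g", in: "dog dig dag"))
```

The helper returns an array rather than a Bool so that the same function answers both questions: whether the pattern matched, and which text it matched.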
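Patterns follow their own set of rules. The rules are largely shared across languages, with minor variations from one language to another. A pattern consists of ordinary characters (for example, the letters a to z) and special characters, called metacharacters. The table below lists the character expressions among Swift's metacharacters, taken from the official documentation.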
Character expression | Description | Notes |
---|---|---|
\a | Match a BELL, \u0007 | |
\A | Match at the beginning of the input. Differs from ^ in that \A will not match after a new line within the input. | Always matches the start of the input. Unlike ^, it is not affected by the anchorsMatchLines option. |
\b, outside of a [Set] | Match if the current position is a word boundary. Boundaries occur at the transitions between word (\w) and non-word (\W) characters, with combining marks ignored. | Combining marks are ignored when locating word boundaries. |
\b, within a [Set] | Match a BACKSPACE, \u0008. | Backspace |
\B | Match if the current position is not a word boundary. | |
\cX | Match a control-X character | |
\d | Match any character with the Unicode General Category of Nd (Number, Decimal Digit.) | Matches digits, including the various digit forms defined in Unicode. |
\D | Match any character that is not a decimal digit. | |
\e | Match an ESCAPE, \u001B. | |
\E | Terminates a \Q ... \E quoted sequence. | |
\f | Match a FORM FEED, \u000C. | Form feed |
\G | Match if the current position is at the end of the previous match. | |
\n | Match a LINE FEED, \u000A. | Line feed |
\N{UNICODE CHARACTER NAME} | Match the named character. | |
\p{UNICODE PROPERTY NAME} | Match any character with the specified Unicode Property. | The full list of Unicode properties is defined by the Unicode standard. |
\P{UNICODE PROPERTY NAME} | Match any character not having the specified Unicode Property. | |
\Q | Quotes all following characters until \E. | |
\r | Match a CARRIAGE RETURN, \u000D. | Carriage return |
\s | Match a white space character. White space is defined as [\t\n\f\r\p{Z}]. | \p{Z} covers Unicode line separators, paragraph separators, spaces, and so on. |
\S | Match a non-white space character. | |
\t | Match a HORIZONTAL TABULATION, \u0009. | Horizontal tab |
\uhhhh | Match the character with the hex value hhhh. | |
\Uhhhhhhhh | Match the character with the hex value hhhhhhhh. Exactly eight hex digits must be provided, even though the largest Unicode code point is \U0010ffff. | The full 32-bit value must be written out (eight hex digits). |
\w | Match a word character. Word characters are [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}]. | |
\W | Match a non-word character. | |
\x{hhhh} | Match the character with hex value hhhh. From one to six hex digits may be supplied. | |
\xhh | Match the character with two digit hex value hh. | |
\X | Match a Grapheme Cluster. | Grapheme cluster |
\Z | Match if the current position is at the end of input, but before the final line terminator, if one exists. | |
\z | Match if the current position is at the end of input. | |
\n | Back Reference. Match whatever the nth capturing group matched. n must be a number ≥ 1 and ≤ total number of capture groups in the pattern. | Here n is a number, not the letter n: it selects which capture group is referenced. |
\0ooo | Match an Octal character. ooo is from one to three octal digits. 0377 is the largest allowed Octal character. The leading zero is required; it distinguishes Octal constants from back references. | |
[pattern] | Match any one character from the pattern. | Square brackets match exactly one of the enclosed characters. |
. | Match any character. | Matches line terminators only when the dotMatchesLineSeparators option is set. |
^ | Match at the beginning of a line. | |
$ | Match at the end of a line. | |
\ | Quotes the following character. Characters that must be quoted to be treated as literals are * ? + [ ( ) { } ^ $ | \ . / |
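A few of the character expressions above, and the two options mentioned in the notes, can be checked with a short script. This is a minimal sketch: `matchedStrings(_:in:options:)` is a hypothetical helper introduced here, and the sample strings are made up for illustration. The patterns use Swift 5 raw string literals (`#"..."#`) so that backslashes do not need to be doubled.

```swift
import Foundation

// Hypothetical helper for illustration: the substrings of `text` matched by `pattern`.
func matchedStrings(_ pattern: String,
                    in text: String,
                    options: NSRegularExpression.Options = []) -> [String] {
    guard let regex = try? NSRegularExpression(pattern: pattern, options: options) else {
        return []  // invalid pattern: treated as "no matches" in this sketch
    }
    let fullRange = NSRange(text.startIndex..., in: text)
    return regex.matches(in: text, options: [], range: fullRange).compactMap { match in
        Range(match.range, in: text).map { String(text[$0]) }
    }
}

// \d+ : one or more decimal digits.
print(matchedStrings(#"\d+"#, in: "Room 42, floor 7"))

// \b : word boundary, so "cat" inside "concat" or "cats" is skipped.
print(matchedStrings(#"\bcat\b"#, in: "cat concat cats cat."))

// ^ follows the anchorsMatchLines option; \A always means start of input.
let twoLines = "one\ntwo"
print(matchedStrings(#"^\w+"#, in: twoLines, options: [.anchorsMatchLines]))
print(matchedStrings(#"\A\w+"#, in: twoLines, options: [.anchorsMatchLines]))

// . matches a line feed only with dotMatchesLineSeparators.
print(matchedStrings("a.c", in: "a\nc").count)
print(matchedStrings("a.c", in: "a\nc", options: [.dotMatchesLineSeparators]).count)
```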
The next table lists the operators among Swift's metacharacters.
Operator | Description | Notes |
---|---|---|
| | Alternation. A|B matches either A or B. | |
* | Match 0 or more times. Match as many times as possible. | |
+ | Match 1 or more times. Match as many times as possible. | |
? | Match zero or one times. Prefer one. | |
{n} | Match exactly n times. | |
{n,} | Match at least n times. Match as many times as possible. | |
{n,m} | Match between n and m times. Match as many times as possible, but not more than m. | |
*? | Match 0 or more times. Match as few times as possible. | |
+? | Match 1 or more times. Match as few times as possible. | |
?? | Match zero or one times. Prefer zero. | |
{n}? | Match exactly n times. | |
{n,}? | Match at least n times, but no more than required for an overall pattern match. | |
{n,m}? | Match between n and m times. Match as few times as possible, but not less than n. | |
*+ | Match 0 or more times. Match as many times as possible when first encountered, do not retry with fewer even if overall match fails (Possessive Match). | |
++ | Match 1 or more times. Possessive match. | |
?+ | Match zero or one times. Possessive match. | |
{n}+ | Match exactly n times. | |
{n,}+ | Match at least n times. Possessive Match. | |
{n,m}+ | Match between n and m times. Possessive Match. | |
(...) | Capturing parentheses. Range of input that matched the parenthesized subexpression is available after the match. | |
(?:...) | Non-capturing parentheses. Groups the included pattern, but does not provide capturing of matching text. Somewhat more efficient than capturing parentheses. | |
(?>...) | Atomic-match parentheses. First match of the parenthesized subexpression is the only one tried; if it does not lead to an overall pattern match, back up the search for a match to a position before the "(?>" | |
(?# ... ) | Free-format comment (?# comment ). | |
(?= ... ) | Look-ahead assertion. True if the parenthesized pattern matches at the current input position, but does not advance the input position. | |
(?! ... ) | Negative look-ahead assertion. True if the parenthesized pattern does not match at the current input position. Does not advance the input position. | |
(?<= ... ) | Look-behind assertion. True if the parenthesized pattern matches text preceding the current input position, with the last character of the match being the input character just before the current position. Does not alter the input position. The length of possible strings matched by the look-behind pattern must not be unbounded (no * or + operators.) | |
(?<! ... ) | Negative Look-behind assertion. True if the parenthesized pattern does not match text preceding the current input position, with the last character of the match being the input character just before the current position. Does not alter the input position. The length of possible strings matched by the look-behind pattern must not be unbounded (no * or + operators.) | |
(?ismwx-ismwx:... ) | Flag settings. Evaluate the parenthesized expression with the specified flags enabled or -disabled. The flags are defined in Flag Options. | |
(?ismwx-ismwx) | Flag settings. Change the flag settings. Changes apply to the portion of the pattern following the setting. For example, (?i) changes to a case insensitive match. The flags are defined in Flag Options. | |
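Some of these operators are easier to understand side by side. The sketch below contrasts greedy and lazy quantifiers, reads capture groups, and tries a look-ahead and an inline flag. Both helpers are hypothetical names introduced for this illustration, and the sample strings are invented.

```swift
import Foundation

// Hypothetical helper: the substrings of `text` matched by `pattern`.
func matchedStrings(_ pattern: String, in text: String) -> [String] {
    guard let regex = try? NSRegularExpression(pattern: pattern, options: []) else { return [] }
    let fullRange = NSRange(text.startIndex..., in: text)
    return regex.matches(in: text, options: [], range: fullRange).compactMap { match in
        Range(match.range, in: text).map { String(text[$0]) }
    }
}

// Hypothetical helper: the capture groups (1...n) of the first match, or [] if there is none.
func firstCaptures(_ pattern: String, in text: String) -> [String] {
    let fullRange = NSRange(text.startIndex..., in: text)
    guard let regex = try? NSRegularExpression(pattern: pattern, options: []),
          let match = regex.firstMatch(in: text, options: [], range: fullRange)
    else { return [] }
    // Range 0 is the whole match; groups start at index 1.
    return (1..<match.numberOfRanges).compactMap { index in
        Range(match.range(at: index), in: text).map { String(text[$0]) }
    }
}

let html = "<b>bold</b>"
print(matchedStrings("<.+>", in: html))    // greedy: takes as much as possible
print(matchedStrings("<.+?>", in: html))   // lazy: takes as little as possible

// (...) captures; each group is read back with range(at:).
print(firstCaptures(#"(\d{4})-(\d{2})"#, in: "Released 2024-05."))

// (?= ... ) checks what follows without consuming it.
print(matchedStrings(#"\w+(?=!)"#, in: "hi! there"))

// (?i) switches to case-insensitive matching for the rest of the pattern.
print(matchedStrings("(?i)swift", in: "Swift SWIFT swift"))
```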
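If you would rather not wrestle with English documentation, the regular expression tutorial on 菜鳥教程 (Runoob) is a good place to start. To learn Swift regular expressions properly, though, you will still need to consult the official documentation.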
Part Two: The NSRegularExpression Class
An example shows it best. Start with the following String:
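import Foundation

let sentence = "I'd like to follow my fellow to the fallow to see a hallow harrow."
do {
    // [a-z] means this letter can be any one of a through z
    let regex = try NSRegularExpression(pattern: "f[a-z]llow", options: [])
    // NSRange is measured in UTF-16 code units, so build it from the String's indices
    let fullRange = NSRange(sentence.startIndex..., in: sentence)
    // matches is an array of NSTextCheckingResult
    let matches = regex.matches(in: sentence, options: [], range: fullRange)
    print("\(matches.count) matches.")
} catch {
    // Thrown when the pattern cannot be compiled
    print(error.localizedDescription)
}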
Output:
3 matches.
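The do/catch around the initializer matters because NSRegularExpression throws when a pattern cannot be compiled. A minimal sketch of that failure path follows; `isValidPattern(_:)` is a hypothetical helper, and the broken pattern (a character class with no closing bracket) is chosen only for illustration.

```swift
import Foundation

// Hypothetical helper: reports whether `pattern` compiles as a regular expression.
func isValidPattern(_ pattern: String) -> Bool {
    do {
        _ = try NSRegularExpression(pattern: pattern, options: [])
        return true
    } catch {
        // The exact wording of the message depends on the platform.
        print("Cannot compile \"\(pattern)\": \(error.localizedDescription)")
        return false
    }
}

print(isValidPattern("f[a-z]llow"))  // a well-formed pattern
print(isValidPattern("f[a-z"))       // the character class is never closed
```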
How do you get the actual matched strings out of matches? Read the range property of each NSTextCheckingResult and apply that range to the original sentence.
...
let matches = ...
print(...)
for (i, match) in matches.enumerated() {
    let substring = (sentence as NSString).substring(with: match.range)
    print("\(i) is " + substring + ".")
}
...
Output:
3 matches.
0 is follow.
1 is fellow.
2 is fallow.
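One detail is easy to miss when moving between String and NSRange: an NSRange counts UTF-16 code units, while String.count counts characters. The two agree for plain ASCII text like the sentence above, but not once the text contains characters such as emoji. The sketch below uses a made-up sentence and a hypothetical helper, `matchedStrings(_:in:range:)`, to show the difference, and it converts each result with Range(_:in:) instead of bridging to NSString.

```swift
import Foundation

// Hypothetical helper: substrings of `text` matched by `pattern` inside `range`.
func matchedStrings(_ pattern: String, in text: String, range: NSRange) -> [String] {
    guard let regex = try? NSRegularExpression(pattern: pattern, options: []) else { return [] }
    return regex.matches(in: text, options: [], range: range).compactMap { match in
        // Range(_:in:) maps the UTF-16 based NSRange back onto String indices.
        Range(match.range, in: text).map { String(text[$0]) }
    }
}

// Each emoji is one Character but two UTF-16 code units.
let emojiSentence = "🍏🍎 I follow my fellow."

// Built from the character count: too short, so the end of the string is not searched.
let characterRange = NSRange(location: 0, length: emojiSentence.count)
// Built from the String's indices: covers the whole string.
let utf16Range = NSRange(emojiSentence.startIndex..., in: emojiSentence)

print(matchedStrings("f[a-z]llow", in: emojiSentence, range: characterRange))
print(matchedStrings("f[a-z]llow", in: emojiSentence, range: utf16Range))
```

Using `sentence.utf16.count` as the length gives the same range as the index-based initializer.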
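You can also walk through the matches with a closure:
// Handle each match directly as it is found
// (the range length is in UTF-16 code units, hence utf16.count rather than count)
regex.enumerateMatches(in: sentence, options: [], range: NSRange(location: 0, length: sentence.utf16.count), using: { result, _, _ in
    guard let result = result else { return }
    let substring = (sentence as NSString).substring(with: result.range)
    print(substring)
})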
Output:
follow
fellow
fallow
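The third closure parameter, ignored above, is a pointer to a stop flag. Setting it ends the enumeration early, which is one reason to prefer the closure form when you only need the first few matches. A minimal sketch, reusing the sentence and pattern from this section; the variable name `firstTwo` and the limit of two are arbitrary choices for illustration.

```swift
import Foundation

let sentence = "I'd like to follow my fellow to the fallow to see a hallow harrow."
var firstTwo: [String] = []

if let regex = try? NSRegularExpression(pattern: "f[a-z]llow", options: []) {
    let fullRange = NSRange(sentence.startIndex..., in: sentence)
    regex.enumerateMatches(in: sentence, options: [], range: fullRange) { result, _, stop in
        guard let result = result,
              let range = Range(result.range, in: sentence) else { return }
        firstTwo.append(String(sentence[range]))
        // Stop enumerating once two matches have been collected.
        if firstTwo.count == 2 {
            stop.pointee = true
        }
    }
}

print(firstTwo)
```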