Vim Find Character Continue Past Newline

Regular Expressions

This chapter will discuss regular expressions (regexp) and related features in detail. As discussed in earlier chapters:

  • /searchpattern search for the given pattern in the forward direction
  • ?searchpattern search for the given pattern in the backward direction
  • :range s/searchpattern/replacestring/flags search and replace
    • :s is short for :substitute command
    • the delimiter after replacestring is optional if you are not using flags

Documentation links:

  • :h usr_27.txt — search commands and patterns
  • :h pattern-searches — reference manual for Patterns and search commands
  • :h :substitute — reference manual for :substitute command

info Recall that you need to add a / prefix for built-in help on regular expressions, :h /^ for example.

Flags

  • g replace all occurrences within a matching line
    • by default, only the first matching portion will be replaced
  • c ask for confirmation before each replacement
  • i ignore case for searchpattern
  • I don't ignore case for searchpattern

These flags are applicable for the substitute command but not / or ? searches. Flags can also be combined, for example:

  • s/cat/Dog/gi replace every occurrence of cat with Dog
    • Case is ignored, so Cat, cAt, CAT, etc will also be replaced
    • Note that i doesn't affect the case of the replacement string

info See :h s_flags for a complete list of flags and more details about them.
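The flags can be tried out safely in a scratch buffer. The following is a minimal sketch rather than a recipe: the sample text is made up for illustration, and it assumes default settings (in particular, 'gdefault' is off). Save it to a file and run :source on it.

```vim
" Scratch-buffer sketch; the sample lines are hypothetical.
new
call setline(1, ['cat cat Cat', 'cat cat Cat', 'cat cat Cat'])
" No flags: only the first match on the line is replaced.
1s/cat/Dog/
" g with I: every match is replaced, and case is respected.
2s/cat/Dog/gI
" g with i: every match is replaced, and case is ignored.
3s/cat/Dog/gi
" Expected: ['Dog cat Cat', 'Dog Dog Cat', 'Dog Dog Dog']
echo getline(1, '$')
bwipeout!
```

The c flag is left out here because it prompts interactively for each match.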

Anchors

By default, regexp will match anywhere in the text. You can use line and word anchors to specify additional restrictions regarding the position of matches. These restrictions are made possible by assigning special meaning to certain characters and escape sequences. The characters with special meaning are known as metacharacters in regular expressions parlance. To match those characters literally, you need to escape them with a \ (discussed in the Escaping metacharacters section later in this chapter).

  • ^ restricts the match to the start-of-line
    • ^This matches This is a sample but not Do This
  • $ restricts the match to the end-of-line
    • )$ matches apple (5) but not def greeting():
  • ^$ match empty line
  • \<pattern restricts the match to the start of a word
    • word characters include alphabets, digits and underscore
    • \<his matches his or to-his or history but not this or _hist
  • pattern\> restricts the match to the end of a word
    • his\> matches his or to-his or this but not history or _hist
  • \<pattern\> restricts the match between start of a word and end of a word
    • \<his\> matches his or to-his but not this or history or _hist

info End-of-line can be \r (carriage return), \n (newline) or \r\n depending on your system and fileformat setting.

info See :h pattern-atoms for more details.
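The anchor examples above can be checked with the =~# operator, which tests a string against a pattern with case always respected and evaluates to 1 (match) or 0 (no match). This is only a sketch, and it assumes the default 'iskeyword' setting, where - is not a word character.

```vim
" Line anchors.
" Expected: [1, 0]
echo ['This is a sample' =~# '^This', 'Do This' =~# '^This']
" Expected: [1, 0]
echo ['apple (5)' =~# ')$', 'def greeting():' =~# ')$']

" Word anchors (results depend on 'iskeyword').
" Expected: [1, 1, 0, 0]
echo ['his' =~# '\<his\>', 'to-his' =~# '\<his\>', 'this' =~# '\<his\>', '_hist' =~# '\<his']
```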

Dot Metacharacter

  • . match any single character other than end-of-line
    • c.t matches cat or cot or c2t or c^t or c.t or c;t but not cant or act or sit
  • \_. match any single character, including end-of-line

info As seen above, matching the end-of-line character requires special attention. For that reason, examples and descriptions in this chapter assume you are operating line wise unless otherwise mentioned. You'll later see how \_ is used in many more places to include end-of-line in the matches.
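The difference between . and \_. only shows up with real buffer lines, so this sketch uses a scratch buffer and the built-in search() function, which returns the line number where a match starts or 0 if there is none. The two sample lines are made up for illustration.

```vim
" Scratch-buffer sketch; the sample lines are hypothetical.
new
call setline(1, ['cat', 'tail'])
" . stays within a line, so nothing can match across the line break.
" Expected: 0
echo search('t.t', 'cnW')
" \_. also matches the end-of-line between the two lines.
" Expected: 1
echo search('t\_.t', 'cnW')
bwipeout!
```

The flags passed to search() accept a match at the cursor (c), leave the cursor in place (n) and disable wrapping around the end of the buffer (W).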

Greedy Quantifiers

Quantifiers can be applied to literal characters, dot metacharacter, groups, backreferences and character classes. Basic examples are shown below, more will be discussed in the sections to follow.

  • * match zero or more times
    • abc* matches ab or abc or abccc or abcccccc but not bc
    • Error.*valid matches Error: invalid input but not valid Error
    • s/a.*b/X/ replaces table bottle bus with tXus since a.*b matches from the first a to the last b
  • \+ match one or more times
    • abc\+ matches abc or abccc but not ab or bc
  • \? match zero or one times
    • \= can also be used, helpful if you are searching backwards with the ? command
    • abc\? matches ab or abc. This will match abccc or abcccccc as well, but only the abc portion
    • s/abc\?/X/ replaces abcc with Xc
  • \{m,n} match m to n times (inclusive)
    • ab\{1,4}c matches abc or abbc or xabbbcz but not ac or abbbbbc
    • if you are familiar with BRE, you can also use \{m,n\} (ending brace is escaped)
  • \{m,} match at least m times
    • ab\{3,}c matches xabbbcz or abbbbbc but not ac or abc or abbc
  • \{,n} match up to n times (including 0 times)
    • ab\{,2}c matches abc or ac or abbc but not xabbbcz or abbbbbc
  • \{n} match exactly n times
    • ab\{3}c matches xabbbcz but not abbc or abbbbbc

Greedy quantifiers will consume as much as possible, provided the overall pattern is also matched. That's how the Error.*valid example worked. If .* had consumed everything after Error, there wouldn't be any more characters to try to match valid. How the regexp engine handles matching a varying number of characters depends on the implementation details (backtracking, NFA, etc).

info See :h pattern-overview for more details.

info If you are familiar with other regular expression flavors like Perl, Python, etc, you may be surprised by the use of \ in the above examples. If you use the \v very magic modifier (discussed later in this chapter), the \ won't be needed.
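The greedy quantifier examples can be reproduced without touching a buffer. This sketch uses the built-in substitute() and matchstr() functions as stand-ins for :s and / respectively. They accept the same pattern syntax, and substitute() takes 'g' as its only flag. The sample strings come from the lists above.

```vim
" .* grabs as much as it can: from the first a to the last b.
" Expected: tXus
echo substitute('table bottle bus', 'a.*b', 'X', '')
" \? matches the trailing c at most once.
" Expected: Xc
echo substitute('abcc', 'abc\?', 'X', '')
" .* stops early enough for the rest of the pattern to match.
" Expected: Error: invalid
echo matchstr('Error: invalid input', 'Error.*valid')
" Between one and four b characters are allowed.
" Expected: [1, 0]
echo ['xabbbcz' =~# 'ab\{1,4}c', 'abbbbbc' =~# 'ab\{1,4}c']
```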

Non-greedy Quantifiers

Non-greedy quantifiers match as minimally as possible, provided the overall pattern is also matched.

  • \{-} match zero or more times as minimally as possible
    • s/t.\{-}a/X/g replaces that is quite a fabricated tale with XX fabricaXle
      • the matching portions are tha, t is quite a and ted ta
    • s/t.*a/X/g replaces that is quite a fabricated tale with Xle since * is greedy
  • \{-m,n} match m to n times as minimally as possible
    • m or n can be left out as seen in the Greedy Quantifiers section
    • s/.\{-2,5}/X/ replaces 123456789 with X3456789 (here . matched 2 times)
    • s/.\{-2,5}6/X/ replaces 123456789 with X789 (here . matched 5 times to satisfy overall pattern)

info See :h pattern-overview and stackoverflow: non-greedy matching for more details.
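To compare greedy and non-greedy behavior side by side, here is a short sketch using the built-in substitute() function in place of :s, with the sample strings from the list above.

```vim
" Non-greedy: each match ends at the nearest a.
" Expected: XX fabricaXle
echo substitute('that is quite a fabricated tale', 't.\{-}a', 'X', 'g')
" Greedy: a single match runs up to the last a.
" Expected: Xle
echo substitute('that is quite a fabricated tale', 't.*a', 'X', 'g')
" The minimum of two characters is enough here.
" Expected: X3456789
echo substitute('123456789', '.\{-2,5}', 'X', '')
" Five characters are needed before 6 can match.
" Expected: X789
echo substitute('123456789', '.\{-2,5}6', 'X', '')
```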

Character Classes

To create a custom placeholder for a limited set of characters, you can enclose them inside [] metacharacters. Character classes have their own versions of metacharacters and provide special predefined sets for common use cases.

  • [aeiou] match any lowercase vowel character
  • [^aeiou] match any character other than lowercase vowels
  • [a-d] match any of a or b or c or d
    • the range metacharacter - can be applied between any two characters
  • \a match any alphabet character [a-zA-Z]
  • \A match other than alphabets [^a-zA-Z]
  • \l match lowercase alphabets [a-z]
  • \L match other than lowercase alphabets [^a-z]
  • \u match uppercase alphabets [A-Z]
  • \U match other than uppercase alphabets [^A-Z]
  • \d match any digit character [0-9]
  • \D match other than digits [^0-9]
  • \o match any octal character [0-7]
  • \O match other than octals [^0-7]
  • \x match any hexadecimal character [0-9a-fA-F]
  • \X match other than hexadecimals [^0-9a-fA-F]
  • \h match alphabets and underscore [a-zA-Z_]
  • \H match other than alphabets and underscore [^a-zA-Z_]
  • \w match any word character (alphabets, digits, underscore) [a-zA-Z0-9_]
    • this definition is the same as seen earlier with word boundaries
  • \W match other than word characters [^a-zA-Z0-9_]
  • \s match space and tab characters [ \t]
  • \S match other than space and tab characters [^ \t]

Here are some examples with character classes:

  • c[ou]t matches cot or cut
  • \<[ot][on]\> matches oo or on or to or tn as whole words only
  • ^[on]\{2,}$ matches no or non or noon or on etc as whole lines only
  • s/"[^"]\+"/X/g replaces "mango" and "(guava)" with X and X
  • s/\d\+/-/g replaces Sample123string777numbers with Sample-string-numbers
  • s/\<0*[1-9]\d\{2,}\>/X/g replaces 0501 035 26 98234 with X 035 26 X (matches numbers >=100 with optional leading zeros)
  • s/\W\+/ /g replaces load2;err_msg--\ant with load2 err_msg ant

info To include the end-of-line character, use \_ instead of \ for any of the above escape sequences. For example, \_s will help you match across lines. Similarly, use \_[] for bracketed classes.

warning The above escape sequences do not have special meaning within bracketed classes. For example, [\d\s] will only match \ or d or s. You can use named character sets in such scenarios. For example, use [[:digit:][:blank:]] to match digits or space or tab characters. See :h :alnum: for the full list and more details.

info The predefined sets are also better in terms of performance compared to bracketed versions. And there are more such sets than the ones discussed above. See :h character-classes for more details.
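A few of the character class examples, sketched with the built-in substitute() function and the =~# operator. The string 42 in the last check is a made-up sample, and the \< and \> anchors assume the default 'iskeyword' setting.

```vim
" Expected: Sample-string-numbers
echo substitute('Sample123string777numbers', '\d\+', '-', 'g')
" Numbers >= 100, with optional leading zeros.
" Expected: X 035 26 X
echo substitute('0501 035 26 98234', '\<0*[1-9]\d\{2,}\>', 'X', 'g')
" Expected: load2 err_msg ant
echo substitute('load2;err_msg--\ant', '\W\+', ' ', 'g')
" \d is not a digit class inside [], but [:digit:] is.
" Expected: [0, 1]
echo ['42' =~# '[\d\s]', '42' =~# '[[:digit:][:blank:]]']
```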

Alternation and Grouping

Alternation helps you to match multiple terms and they can have their own anchors as well (since each alternative is a regexp pattern). Often, there are some common things among the regular expression alternatives. In such cases, you can group them using a pair of parentheses metacharacters. Similar to a(b+c)d = abd+acd in maths, you get a(b|c)d = abd|acd in regular expressions.

  • \| match either of the specified patterns
    • min\|max matches min or max
    • one\|two\|three matches one or two or three
    • \<par\>\|er$ matches whole word par or a line ending with er
  • \(pattern\) group a pattern to apply quantifiers, create a terser regexp by taking out common elements, etc
    • a\(123\|456\)b is equivalent to a123b\|a456b
    • hand\(y\|ful\) matches handy or handful
    • hand\(y\|ful\)\? matches hand or handy or handful
    • \(to\)\+ matches to or toto or tototo and so on
    • re\(leas\|ceiv\)\?ed matches reed or released or received

There are some tricky situations when using alternation. Say, you want to match are or spared. Which one should get precedence? The bigger word spared or the substring are inside it or based on something else? The alternative which matches earliest in the input gets precedence, irrespective of the order of the alternatives.

  • s/are\|spared/X/g replaces rare spared area with rX X Xa
    • s/spared\|are/X/g will also give the same results

In case of matches starting from the same location, for example spa and spared, the leftmost alternative gets precedence. Sort by longest term first if you don't want shorter terms to take precedence.

  • s/spa\|spared/**/g replaces spared spare with **red **re
  • s/spared\|spa/**/g replaces spared spare with ** **re
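The precedence rules can be verified with a small sketch that uses the built-in substitute() function as a stand-in for :s, with the same sample strings as above.

```vim
" The earliest match in the input wins, whatever the order of alternatives.
" Expected: rX X Xa
echo substitute('rare spared area', 'are\|spared', 'X', 'g')
" Expected: rX X Xa
echo substitute('rare spared area', 'spared\|are', 'X', 'g')

" Same starting position: the leftmost alternative wins.
" Expected: **red **re
echo substitute('spared spare', 'spa\|spared', '**', 'g')
" Expected: ** **re
echo substitute('spared spare', 'spared\|spa', '**', 'g')
```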

Backreference

The groupings seen in the previous section are also known as capture groups. The string captured by these groups can be referred to later using the backreference \N, where N is the capture group you want. Backreferences can be used in both search and replacement sections.

  • \(pattern\) capture group for later use via backreferences
  • \%(pattern\) non-capturing group
  • leftmost group is 1, second leftmost group is 2 and so on (maximum 9 groups)
  • \1 backreference to the first capture group
  • \2 backreference to the second capture group
  • \9 backreference to the ninth capture group
  • & or \0 backreference to the entire matched portion

Here are some examples:

  • \(\a\)\1 matches two consecutive repeated alphabets like ee, TT, pp and so on
    • recall that \a refers to [a-zA-Z]
  • \(\a\)\1\+ matches two or more consecutive repeated alphabets like ee, ttttt, PPPPPPPP and so on
  • s/\d\+/(&)/g replaces 52 apples 31 mangoes with (52) apples (31) mangoes (surround digits with parentheses)
  • s/\(\w\+\),\(\w\+\)/\2,\1/g replaces good,bad 42,24 with bad,good 24,42 (swap words separated by comma)
  • s/\(_\)\?_/\1/g replaces _foo_ __123__ _baz_ with foo _123_ baz (matches one or two underscores, deletes one underscore)
  • s/\(\d\+\)\%(abc\)\+\(\d\+\)/\2:\1/ replaces 12abcabcabc24 with 24:12 (matches digits separated by one or more abc sequences, swaps the numbers with : as the separator)
    • note the use of non-capturing group for abc since it isn't needed later
    • s/\(\d\+\)\(abc\)\+\(\d\+\)/\3:\1/ does the same if only capturing groups are used
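Here is a sketch of the replacement-side backreferences, using the built-in substitute() function in place of :s. It treats & and \1 to \9 in the replacement string the same way.

```vim
" & is the entire matched portion.
" Expected: (52) apples (31) mangoes
echo substitute('52 apples 31 mangoes', '\d\+', '(&)', 'g')
" Swap the two captured words.
" Expected: bad,good 24,42
echo substitute('good,bad 42,24', '\(\w\+\),\(\w\+\)', '\2,\1', 'g')
" The non-capturing group is skipped when numbering.
" Expected: 24:12
echo substitute('12abcabcabc24', '\(\d\+\)\%(abc\)\+\(\d\+\)', '\2:\1', '')
" A backreference inside the pattern: a repeated letter.
" Expected: ee
echo matchstr('feed', '\(\a\)\1')
```

The string feed in the last check is a made-up sample.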

Referring to text matched by a capture group with a quantifier will give only the last match, not the entire match. Use a capture group around the grouping and quantifier together to get the entire matching portion. In such cases, the inner grouping is an ideal candidate for a non-capturing group.

  • s/a \(\d\{3}\)\+/b (\1)/ replaces a 123456789 with b (789)
    • a 4839235 will be replaced with b (923)5
  • s/a \(\%(\d\{3}\)\+\)/b (\1)/ replaces a 123456789 with b (123456789)
    • a 4839235 will be replaced with b (483923)5
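The two variants can be compared directly. This sketch uses the built-in substitute() function in place of :s with the sample strings from above.

```vim
" Quantified capture group: only the last repetition is kept.
" Expected: b (789)
echo substitute('a 123456789', 'a \(\d\{3}\)\+', 'b (\1)', '')
" Expected: b (923)5
echo substitute('a 4839235', 'a \(\d\{3}\)\+', 'b (\1)', '')

" Outer capture group around a non-capturing repetition: everything is kept.
" Expected: b (123456789)
echo substitute('a 123456789', 'a \(\%(\d\{3}\)\+\)', 'b (\1)', '')
" Expected: b (483923)5
echo substitute('a 4839235', 'a \(\%(\d\{3}\)\+\)', 'b (\1)', '')
```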

Lookarounds

Lookarounds help to create custom anchors and add conditions within the searchpattern. These assertions are also known as zero-width patterns because they add restrictions similar to anchors and are not part of the matched portions.

info Vim's lookaround syntax differs from the syntax usually found in programming languages like Perl, Python and JavaScript. The syntax starting with \@ is always added as a suffix to the pattern atom used in the assertion. For example, (?!\d) and (?<=pat.*) in other languages are specified as \d\@! and \(pat.*\)\@<= respectively in Vim.

  • \@! negative lookahead assertion
    • ice\d\@! matches ice as long as it is not immediately followed by a digit character, for example ice or iced! or icet5 or ice.123 but not ice42 or ice123
    • s/ice\d\@!/X/g replaces iceiceice2 with XXice2
    • s/par\(.*\<par\>\)\@!/X/g replaces par with X as long as whole word par is not present later in the line, for example parse and par and sparse is converted to parse and X and sXse
    • at\(\(go\)\@!.\)*par matches cat,dog,parrot but not cat,god,parrot (i.e. match at followed by par as long as go isn't present in between, this is an example of negating a grouping)
  • \@<! negative lookbehind assertion
    • _\@<!ice matches ice as long as it is not immediately preceded by a _ character, for example ice or _(ice) or 42ice but not _ice
    • \(cat.*\)\@<!dog matches dog as long as cat is not present earlier in the line, for example fox,parrot,dog,cat but not fox,cat,dog,parrot
  • \@= positive lookahead assertion
    • ice\d\@= matches ice as long as it is immediately followed by a digit character, for example ice42 or ice123 but not ice or iced! or icet5 or ice.123
    • s/ice\d\@=/X/g replaces ice ice_2 ice2 iced with ice ice_2 X2 iced
  • \@<= positive lookbehind assertion
    • _\@<=ice matches ice as long as it is immediately preceded by a _ character, for example _ice or (_ice) but not ice or _(ice) or 42ice

info You can also specify the number of bytes to search for lookbehind patterns, which can significantly speed up the matching process. You have to specify the number between the @ and < characters. For example, _\@1<=ice will look back only one byte before ice for matching purposes, and \(cat.*\)\@10<!dog will look back only ten bytes before dog to check the given assertion.
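The lookaround examples can be sketched with the built-in substitute() function and the =~# operator, which evaluates to 1 for a match and 0 otherwise. The sample strings are the ones used in the list above, and the default 'iskeyword' setting is assumed for \< and \>.

```vim
" Negative lookahead.
" Expected: XXice2
echo substitute('iceiceice2', 'ice\d\@!', 'X', 'g')
" Expected: parse and X and sXse
echo substitute('parse and par and sparse', 'par\(.*\<par\>\)\@!', 'X', 'g')

" Positive lookahead: the digit is checked but not consumed.
" Expected: ice ice_2 X2 iced
echo substitute('ice ice_2 ice2 iced', 'ice\d\@=', 'X', 'g')

" Negative lookbehind.
" Expected: [1, 0]
echo ['fox,parrot,dog,cat' =~# '\(cat.*\)\@<!dog', 'fox,cat,dog,parrot' =~# '\(cat.*\)\@<!dog']

" Positive lookbehind.
" Expected: [1, 0]
echo ['(_ice)' =~# '_\@<=ice', '_(ice)' =~# '_\@<=ice']
```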

Atomic Grouping

As discussed earlier, both greedy and non-greedy quantifiers will try to satisfy the overall pattern by varying the amount of characters matched by the quantifiers. You can use atomic grouping if you do not want a specific sub-pattern to ever give back characters it has already matched. Similar to lookarounds, you need to use \@> as a suffix, for example \(pattern\)\@>.

  • s/\(0*\)\@>\d\{3,\}/(&)/g replaces only numbers >= 100 irrespective of any number of leading zeros, for example 0501 035 154 is converted to (0501) 035 (154)
    • \(0*\)\@> matches the 0 character zero or more times, but it will not give up this portion to satisfy overall pattern
    • s/0*\d\{3,\}/(&)/g replaces 0501 035 154 with (0501) (035) (154) (here 035 is matched because 0* will match zero times to satisfy the overall pattern)

info Some regexp engines provide this feature as possessive quantifiers.
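A short sketch contrasting the atomic group with the plain quantifier, using the built-in substitute() function in place of :s and the sample string from above.

```vim
" Atomic: the leading zeros are never given back, so 035 is skipped.
" Expected: (0501) 035 (154)
echo substitute('0501 035 154', '\(0*\)\@>\d\{3,\}', '(&)', 'g')
" Plain 0* can match zero times, letting \d\{3,} take 035.
" Expected: (0501) (035) (154)
echo substitute('0501 035 154', '0*\d\{3,\}', '(&)', 'g')
```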

Set start and end of the match

Some of the positive lookbehind and lookahead usage can be replaced with \zs and \ze respectively.

  • \zs set the start of the match (portion before \zs won't be part of the match)
    • s/\<\w\zs\w*\W*//g replaces sea eat car rat eel tea with secret
    • same as s/\(\<\w\)\@<=\w*\W*//g or s/\(\<\w\)\w*\W*/\1/g
  • \ze set the end of the match (portion after \ze won't be part of the match)
    • s/ice\ze\d/X/g replaces ice ice_2 ice2 iced with ice ice_2 X2 iced
    • same as s/ice\d\@=/X/g or s/ice\(\d\)/X\1/g

info As per :h \zs and :h \ze, these "Can be used multiple times, the last one encountered in a matching branch is used."
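The \zs and \ze examples, sketched with the built-in substitute() function in place of :s and the sample strings from the list above.

```vim
" Keep the first letter of each word, delete the rest.
" Expected: secret
echo substitute('sea eat car rat eel tea', '\<\w\zs\w*\W*', '', 'g')
" The digit after \ze must be present but is not replaced.
" Expected: ice ice_2 X2 iced
echo substitute('ice ice_2 ice2 iced', 'ice\ze\d', 'X', 'g')
" The capture group equivalent of the previous line.
" Expected: ice ice_2 X2 iced
echo substitute('ice ice_2 ice2 iced', 'ice\(\d\)', 'X\1', 'g')
```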

Magic modifiers

These escape sequences change certain aspects of the syntax and behavior of the search pattern that comes after such a modifier. You can use multiple such modifiers as needed for particular sections of the pattern.

Magic and nomagic

  • \m magic mode (this is the default setting)
  • \M nomagic mode
    • ., * and ~ are no longer metacharacters (compared to magic mode)
    • \., \* and \~ will make them behave as metacharacters
    • ^ and $ would still behave as metacharacters
    • \Ma.b matches only a.b
    • \Ma\.b matches a.b as well as a=b or a<b or acb etc
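A minimal sketch of the nomagic behavior, using the =~# operator (1 for a match, 0 otherwise).

```vim
" After \M, a plain . is a literal dot.
" Expected: [1, 0]
echo ['a.b' =~# '\Ma.b', 'a=b' =~# '\Ma.b']
" After \M, \. matches any character.
" Expected: [1, 1]
echo ['a.b' =~# '\Ma\.b', 'a=b' =~# '\Ma\.b']
```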

Very magic

The default syntax of Vim regexp has only a few metacharacters like ., *, ^ and so on. If you are familiar with regexp usage in programming languages such as Perl, Python and JavaScript, you can use \v to get a similar syntax in Vim. This will allow the use of more metacharacters such as (), {}, +, ? and so on without having to prefix them with a \ metacharacter. From :h magic documentation:

Use of \v means that after it, all ASCII characters except 0-9, a-z, A-Z and _ have special meaning

  • \v<his> matches his or to-his but not this or history or _hist
  • a<b.*\v<end> matches c=a<b #end but not c=a<b #bend (the < before \v is literal, while the < and > after it are word anchors)
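Very magic patterns can be sketched the same way with the =~# operator and the built-in matchstr() function. The sample strings in the last two checks are made up for illustration, and the default 'iskeyword' setting is assumed.

```vim
" < and > act as word anchors after \v.
" Expected: [1, 1, 0]
echo ['his' =~# '\v<his>', 'to-his' =~# '\v<his>', 'history' =~# '\v<his>']
" Grouping, alternation and ? need no backslashes after \v.
" Expected: handful
echo matchstr('a handful', '\vhand(y|ful)?')
" The < before \v is still a literal character.
" Expected: [1, 0]
echo ['c=a<b #end' =~# 'a<b.*\v<end>', 'c=a<b #bend' =~# 'a<b.*\v<end>']
```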
