Vim Find Character Continue Past Newline
Regular Expressions
This chapter will discuss regular expressions (regexp) and related features in detail. As discussed in earlier chapters:
- `/searchpattern` search the given pattern in the forward direction
- `?searchpattern` search the given pattern in the backward direction
- `:range s/searchpattern/replacestring/flags` search and replace
    - `:s` is short for the `:substitute` command
    - the delimiter after `replacestring` is optional if you are not using flags
Documentation links:
- :h usr_27.txt — search commands and patterns
- :h pattern-searches — reference manual for Patterns and search commands
- :h :substitute — reference manual for the :substitute command
Recall that you need to add a `/` prefix for built-in help on regular expressions, :h /^ for example.
Flags
- `g` replace all occurrences within a matching line
    - by default, only the first matching portion will be replaced
- `c` ask for confirmation before each replacement
- `i` ignore case for `searchpattern`
- `I` don't ignore case for `searchpattern`
These flags are applicable for the substitute command but not `/` or `?` searches. Flags can also be combined, for example:
- `s/cat/Dog/gi` replace every occurrence of `cat` with `Dog`
    - Case is ignored, so `Cat`, `cAt`, `CAT`, etc will also be replaced
    - Note that `i` doesn't affect the case of the replacement string
See :h s_flags for a complete list of flags and more details about them.
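The combined `gi` flags can be tried out in a throwaway buffer. This is a minimal sketch in Vim script; the sample line is made up for illustration, and `:new` is used only so that no existing file is touched.

```vim
" Open a scratch window so that no existing buffer is modified
new
call setline(1, 'cat Cat CAT concat')
" g replaces every match on the line, i ignores case in the search pattern
s/cat/Dog/gi
echo getline(1)
" Dog Dog Dog conDog
bwipeout!
```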
Anchors
By default, regexp will match anywhere in the text. You can use line and word anchors to specify additional restrictions regarding the position of matches. These restrictions are made possible by assigning special meaning to certain characters and escape sequences. The characters with special meaning are known as metacharacters in regular expressions parlance. In case you need to match those characters literally, you need to escape them with a `\` (discussed in the Escaping metacharacters section later in this chapter).
- `^` restricts the match to the start-of-line
    - `^This` matches `This is a sample` but not `Do This`
- `$` restricts the match to the end-of-line
    - `)$` matches `apple (5)` but not `def greeting():`
- `^$` match empty line
- `\<pattern` restricts the match to the start of a word
    - word characters include alphabets, digits and underscore
    - `\<his` matches `his` or `to-his` or `history` but not `this` or `_hist`
- `pattern\>` restricts the match to the end of a word
    - `his\>` matches `his` or `to-his` or `this` but not `history` or `_hist`
- `\<pattern\>` restricts the match between start of a word and end of a word
    - `\<his\>` matches `his` or `to-his` but not `this` or `history` or `_hist`
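The anchor behavior listed above can be checked without a buffer. This is a small sketch using Vim script's `=~#` operator, which tests a string against a pattern while always respecting case; it evaluates to `1` for a match and `0` otherwise.

```vim
" Line anchor: the pattern must match at the start of the string
echo 'This is a sample' =~# '^This'
" 1
echo 'Do This' =~# '^This'
" 0
" Word anchors: his must be a whole word
echo 'to-his' =~# '\<his\>'
" 1
echo 'this' =~# '\<his\>'
" 0
echo '_hist' =~# '\<his'
" 0
```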
End-of-line can be `\r` (carriage return), `\n` (newline) or `\r\n` depending on your system and `fileformat` setting.
See :h pattern-atoms for more details.
- `.` match any single character other than end-of-line
    - `c.t` matches `cat` or `cot` or `c2t` or `c^t` or `c.t` or `c;t` but not `cant` or `act` or `sit`
- `\_.` match any single character, including end-of-line
As seen above, matching the end-of-line character requires special attention, which is why examples and descriptions in this chapter will assume you are operating line wise unless otherwise mentioned. You'll later see how `\_` is used in many more places to include end-of-line in the matches.
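To see the difference between `.` and `\_.` across a line break, here is a minimal sketch that works on a two-line scratch buffer. The buffer contents are made up for illustration; the `e` flag only suppresses the error message when nothing matches.

```vim
new
call setline(1, ['hi', 'there'])
" . cannot cross the line break, so nothing changes
%s/hi.there/X/e
echo getline(1, '$')
" ['hi', 'there']
" \_. also matches the end-of-line, so both lines are replaced as one match
%s/hi\_.there/X/e
echo getline(1, '$')
" ['X']
bwipeout!
```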
Greedy Quantifiers
Quantifiers can be applied to literal characters, dot metacharacter, groups, backreferences and character classes. Basic examples are shown below, more will be discussed in the sections to follow.
- `*` match zero or more times
    - `abc*` matches `ab` or `abc` or `abccc` or `abcccccc` but not `bc`
    - `Error.*valid` matches `Error: invalid input` but not `valid Error`
    - `s/a.*b/X/` replaces `table bottle bus` with `tXus` since `a.*b` matches from the first `a` to the last `b`
- `\+` match one or more times
    - `abc\+` matches `abc` or `abccc` but not `ab` or `bc`
- `\?` match zero or one times
    - `\=` can also be used, helpful if you are searching backwards with the `?` command
    - `abc\?` matches `ab` or `abc`. This will match `abccc` or `abcccccc` as well, but only the `abc` portion
    - `s/abc\?/X/` replaces `abcc` with `Xc`
- `\{m,n}` match `m` to `n` times (inclusive)
    - `ab\{1,4}c` matches `abc` or `abbc` or `xabbbcz` but not `ac` or `abbbbbc`
    - if you are familiar with BRE, you can also use `\{m,n\}` (ending brace is escaped)
- `\{m,}` match at least `m` times
    - `ab\{3,}c` matches `xabbbcz` or `abbbbbc` but not `ac` or `abc` or `abbc`
- `\{,n}` match up to `n` times (including `0` times)
    - `ab\{,2}c` matches `abc` or `ac` or `abbc` but not `xabbbcz` or `abbbbbc`
- `\{n}` match exactly `n` times
    - `ab\{3}c` matches `xabbbcz` but not `abbc` or `abbbbbc`
Greedy quantifiers will consume as much as possible, provided the overall pattern is also matched. That's how the `Error.*valid` example worked. If `.*` had consumed everything after `Error`, there wouldn't be any more characters to try to match `valid`. How the regexp engine handles matching a varying number of characters depends on the implementation details (backtracking, NFA, etc).
See :h pattern-overview for more details.
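The greedy examples from this section can be reproduced with Vim script's built-in `substitute()` and `matchstr()` functions. This is a sketch rather than the only way to test them; both functions always treat the pattern as if `magic` is set, which matches the default syntax used in this chapter.

```vim
" .* stretches from the first a to the last b
echo substitute('table bottle bus', 'a.*b', 'X', '')
" tXus
" \? takes one c when it is available
echo substitute('abcc', 'abc\?', 'X', '')
" Xc
" .* gives back characters so that valid can still match
echo matchstr('Error: invalid input', 'Error.*valid')
" Error: invalid
" \{1,4} matches between one and four b characters
echo matchstr('xabbbcz', 'ab\{1,4}c')
" abbbc
```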
If you are familiar with other regular expression flavors like Perl, Python, etc, you'd be surprised by the use of `\` in the above examples. If you use the `\v` very magic modifier (discussed later in this chapter), the `\` won't be needed.
Non-greedy Quantifiers
Non-greedy quantifiers match as minimally as possible, provided the overall pattern is also matched.
- `\{-}` match zero or more times as minimally as possible
    - `s/t.\{-}a/X/g` replaces `that is quite a fabricated tale` with `XX fabricaXle`
        - the matching portions are `tha`, `t is quite a` and `ted ta`
    - `s/t.*a/X/g` replaces `that is quite a fabricated tale` with `Xle` since `*` is greedy
- `\{-m,n}` match `m` to `n` times as minimally as possible
    - `m` or `n` can be left out as seen in the Greedy Quantifiers section
    - `s/.\{-2,5}/X/` replaces `123456789` with `X3456789` (here `.` matched 2 times)
    - `s/.\{-2,5}6/X/` replaces `123456789` with `X789` (here `.` matched 5 times to satisfy overall pattern)
See :h pattern-overview and stackoverflow: non-greedy matching for more details.
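A side-by-side run makes the greedy versus non-greedy difference easier to see. This sketch uses Vim script's `substitute()` function with the sample strings from the list above; the variable name `text` is only for illustration.

```vim
let text = 'that is quite a fabricated tale'
" Non-greedy: each match stops at the nearest a
echo substitute(text, 't.\{-}a', 'X', 'g')
" XX fabricaXle
" Greedy: a single match runs up to the last a
echo substitute(text, 't.*a', 'X', 'g')
" Xle
" The minimum of 2 is enough here
echo substitute('123456789', '.\{-2,5}', 'X', '')
" X3456789
" Here . has to match 5 times so that 6 can follow
echo substitute('123456789', '.\{-2,5}6', 'X', '')
" X789
```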
Character Classes
To create a custom placeholder for a limited set of characters, you can enclose them inside `[]` metacharacters. Character classes have their own versions of metacharacters and provide special predefined sets for common use cases.
- `[aeiou]` match any lowercase vowel character
- `[^aeiou]` match any character other than lowercase vowels
- `[a-d]` match any of `a` or `b` or `c` or `d`
    - the range metacharacter `-` can be applied between any two characters
- `\a` match any alphabet character `[a-zA-Z]`
- `\A` match other than alphabets `[^a-zA-Z]`
- `\l` match lowercase alphabets `[a-z]`
- `\L` match other than lowercase alphabets `[^a-z]`
- `\u` match uppercase alphabets `[A-Z]`
- `\U` match other than uppercase alphabets `[^A-Z]`
- `\d` match any digit character `[0-9]`
- `\D` match other than digits `[^0-9]`
- `\o` match any octal character `[0-7]`
- `\O` match other than octals `[^0-7]`
- `\x` match any hexadecimal character `[0-9a-fA-F]`
- `\X` match other than hexadecimals `[^0-9a-fA-F]`
- `\h` match alphabets and underscore `[a-zA-Z_]`
- `\H` match other than alphabets and underscore `[^a-zA-Z_]`
- `\w` match any word character (alphabets, digits, underscore) `[a-zA-Z0-9_]`
    - this definition is the same as seen earlier with word boundaries
- `\W` match other than word characters `[^a-zA-Z0-9_]`
- `\s` match space and tab characters `[ \t]`
- `\S` match other than space and tab characters `[^ \t]`
Here are some examples with character classes:
- `c[ou]t` matches `cot` or `cut`
- `\<[ot][on]\>` matches `oo` or `on` or `to` or `tn` as whole words only
- `^[on]\{2,}$` matches `no` or `non` or `noon` or `on` etc as whole lines only
- `s/"[^"]\+"/X/g` replaces `"mango" and "(guava)"` with `X and X`
- `s/\d\+/-/g` replaces `Sample123string777numbers` with `Sample-string-numbers`
- `s/\<0*[1-9]\d\{2,}\>/X/g` replaces `0501 035 26 98234` with `X 035 26 X` (matches numbers >=100 with optional leading zeros)
- `s/\W\+/ /g` replaces `load2;err_msg--\ant` with `load2 err_msg ant`
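The substitution examples above can be verified with Vim script's `substitute()` function. This is a sketch using the same patterns and sample strings, with `g` passed as the flag argument.

```vim
" Negated bracket class: everything between a pair of double quotes
echo substitute('"mango" and "(guava)"', '"[^"]\+"', 'X', 'g')
" X and X
" Predefined digit class
echo substitute('Sample123string777numbers', '\d\+', '-', 'g')
" Sample-string-numbers
" Whole numbers >= 100, leading zeros allowed
echo substitute('0501 035 26 98234', '\<0*[1-9]\d\{2,}\>', 'X', 'g')
" X 035 26 X
" Runs of non-word characters collapsed to a single space
echo substitute('load2;err_msg--\ant', '\W\+', ' ', 'g')
" load2 err_msg ant
```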
To include the end-of-line character, use `\_` instead of `\` for any of the above escape sequences. For example, `\_s` will help you match across lines. Similarly, use `\_[]` for bracketed classes.
The above escape sequences do not have special meaning within bracketed classes. For example, `[\d\s]` will only match `\` or `d` or `s`. You can use named character sets in such scenarios. For example, `[[:digit:][:blank:]]` to match digits or space or tab characters. See :h :alnum: for full list and more details.
The predefined sets are also better in terms of performance compared to bracketed versions. And there are more such sets than the ones discussed above. See :h character-classes for more details.
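As a quick check of named character sets inside a bracketed class, here is a small sketch using `substitute()`. The sample string is made up for illustration.

```vim
" Delete every digit, space and tab character
echo substitute('a1 b22 c', '[[:digit:][:blank:]]', '', 'g')
" abc
```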
Alternation and Grouping
Alternation helps you to match multiple terms and they can have their own anchors as well (since each alternative is a regexp pattern). Often, there are some common things among the regular expression alternatives. In such cases, you can group them using a pair of parentheses metacharacters. Similar to `a(b+c)d = abd+acd` in maths, you get `a(b|c)d = abd|acd` in regular expressions.
- `\|` match either of the specified patterns
    - `min\|max` matches `min` or `max`
    - `one\|two\|three` matches `one` or `two` or `three`
    - `\<par\>\|er$` matches whole word `par` or a line ending with `er`
- `\(pattern\)` group a pattern to apply quantifiers, create a terser regexp by taking out common elements, etc
    - `a\(123\|456\)b` is equivalent to `a123b\|a456b`
    - `hand\(y\|ful\)` matches `handy` or `handful`
    - `hand\(y\|ful\)\?` matches `hand` or `handy` or `handful`
    - `\(to\)\+` matches `to` or `toto` or `tototo` and so on
    - `re\(leas\|ceiv\)\?ed` matches `reed` or `released` or `received`
There are some tricky situations when using alternation. Say, you want to match `are` or `spared`: which one should get precedence? The bigger word `spared`, or the substring `are` inside it, or something else? The alternative which matches earliest in the input gets precedence, irrespective of the order of the alternatives.
- `s/are\|spared/X/g` replaces `rare spared area` with `rX X Xa`
    - `s/spared\|are/X/g` will also give the same results
In case of matches starting from the same location, for example `spa` and `spared`, the leftmost alternative gets precedence. Sort by longest term first if you don't want shorter terms to take precedence.
- `s/spa\|spared/**/g` replaces `spared spare` with `**red **re`
- `s/spared\|spa/**/g` replaces `spared spare` with `** **re`
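Both precedence rules can be observed with `substitute()`. This sketch reuses the sample strings from the lists above.

```vim
" Earliest match in the input wins, whatever the order of alternatives
echo substitute('rare spared area', 'are\|spared', 'X', 'g')
" rX X Xa
echo substitute('rare spared area', 'spared\|are', 'X', 'g')
" rX X Xa
" Same starting location: the leftmost alternative wins
echo substitute('spared spare', 'spa\|spared', '**', 'g')
" **red **re
echo substitute('spared spare', 'spared\|spa', '**', 'g')
" ** **re
```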
Backreference
The groupings seen in the previous section are also known as capture groups. The string captured by these groups can be referred to later using the backreference `\N` where `N` is the capture group you want. Backreferences can be used in both search and replacement sections.
- `\(pattern\)` capture group for later use via backreferences
- `\%(pattern\)` non-capturing group
- leftmost group is `1`, second leftmost group is `2` and so on (maximum `9` groups)
- `\1` backreference to the first capture group
- `\2` backreference to the second capture group
- `\9` backreference to the ninth capture group
- `&` or `\0` backreference to the entire matched portion
Here are some examples:
- `\(\a\)\1` matches two consecutive repeated alphabets like `ee`, `TT`, `pp` and so on
    - recall that `\a` refers to `[a-zA-Z]`
- `\(\a\)\1\+` matches two or more consecutive repeated alphabets like `ee`, `ttttt`, `PPPPPPPP` and so on
- `s/\d\+/(&)/g` replaces `52 apples 31 mangoes` with `(52) apples (31) mangoes` (surround digits with parentheses)
- `s/\(\w\+\),\(\w\+\)/\2,\1/g` replaces `good,bad 42,24` with `bad,good 24,42` (swap words separated by comma)
- `s/\(_\)\?_/\1/g` replaces `_foo_ __123__ _baz_` with `foo _123_ baz` (matches one or two underscores, deletes one underscore)
- `s/\(\d\+\)\%(abc\)\+\(\d\+\)/\2:\1/` replaces `12abcabcabc24` with `24:12` (matches digits separated by one or more `abc` sequences, swaps the numbers with `:` as the separator)
    - note the use of non-capturing group for `abc` since it isn't needed later
    - `s/\(\d\+\)\(abc\)\+\(\d\+\)/\3:\1/` does the same if only capturing groups are used
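Backreferences in the replacement section work the same way in Vim script's `substitute()` function, so the examples above can be run as follows (a sketch with the same sample strings).

```vim
" & stands for the entire matched portion
echo substitute('52 apples 31 mangoes', '\d\+', '(&)', 'g')
" (52) apples (31) mangoes
" \1 and \2 refer to the first and second capture groups
echo substitute('good,bad 42,24', '\(\w\+\),\(\w\+\)', '\2,\1', 'g')
" bad,good 24,42
" The non-capturing group is skipped when numbering
echo substitute('12abcabcabc24', '\(\d\+\)\%(abc\)\+\(\d\+\)', '\2:\1', '')
" 24:12
```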
Referring to text matched by a capture group with a quantifier will give only the last match, not the entire match. Use a capture group around the grouping and quantifier together to get the entire matching portion. In such cases, the inner grouping is an ideal candidate for a non-capturing group.
- `s/a \(\d\{3}\)\+/b (\1)/` replaces `a 123456789` with `b (789)`
    - `a 4839235` will be replaced with `b (923)5`
- `s/a \(\%(\d\{3}\)\+\)/b (\1)/` replaces `a 123456789` with `b (123456789)`
    - `a 4839235` will be replaced with `b (483923)5`
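The effect of quantifying a capture group can be compared directly with `substitute()`. This sketch uses the first sample string from the list above.

```vim
" Quantified capture group: \1 holds only the last repetition
echo substitute('a 123456789', 'a \(\d\{3}\)\+', 'b (\1)', '')
" b (789)
" Outer capture group around a non-capturing group: \1 holds everything
echo substitute('a 123456789', 'a \(\%(\d\{3}\)\+\)', 'b (\1)', '')
" b (123456789)
```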
Lookarounds
Lookarounds help to create custom anchors and add conditions within the `searchpattern`. These assertions are also known as zero-width patterns because they add restrictions similar to anchors and are not part of the matched portions.
Vim's syntax is different from that usually found in programming languages like Perl, Python and JavaScript. The syntax starting with `\@` is always added as a suffix to the pattern atom used in the assertion. For example, `(?!\d)` and `(?<=pat.*)` in other languages are specified as `\d\@!` and `\(pat.*\)\@<=` respectively in Vim.
- `\@!` negative lookahead assertion
    - `ice\d\@!` matches `ice` as long as it is not immediately followed by a digit character, for example `ice` or `iced!` or `icet5` or `ice.123` but not `ice42` or `ice123`
    - `s/ice\d\@!/X/g` replaces `iceiceice2` with `XXice2`
    - `s/par\(.*\<par\>\)\@!/X/g` replaces `par` with `X` as long as whole word `par` is not present later in the line, for example `parse and par and sparse` is converted to `parse and X and sXse`
    - `at\(\(go\)\@!.\)*par` matches `cat,dog,parrot` but not `cat,god,parrot` (i.e. match `at` followed by `par` as long as `go` isn't present in between, this is an example of negating a grouping)
- `\@<!` negative lookbehind assertion
    - `_\@<!ice` matches `ice` as long as it is not immediately preceded by a `_` character, for example `ice` or `_(ice)` or `42ice` but not `_ice`
    - `\(cat.*\)\@<!dog` matches `dog` as long as `cat` is not present earlier in the line, for example `fox,parrot,dog,cat` but not `fox,cat,dog,parrot`
- `\@=` positive lookahead assertion
    - `ice\d\@=` matches `ice` as long as it is immediately followed by a digit character, for example `ice42` or `ice123` but not `ice` or `iced!` or `icet5` or `ice.123`
    - `s/ice\d\@=/X/g` replaces `ice ice_2 ice2 iced` with `ice ice_2 X2 iced`
- `\@<=` positive lookbehind assertion
    - `_\@<=ice` matches `ice` as long as it is immediately preceded by a `_` character, for example `_ice` or `(_ice)` but not `ice` or `_(ice)` or `42ice`
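The four assertion types can be exercised with `substitute()`. This is a sketch: the first three calls reuse sample strings from the list above, while the input for the lookbehind call is put together from the individual examples for illustration.

```vim
" Negative lookahead: ice not followed by a digit
echo substitute('iceiceice2', 'ice\d\@!', 'X', 'g')
" XXice2
" Negative lookahead on a group: no whole word par later in the line
echo substitute('parse and par and sparse', 'par\(.*\<par\>\)\@!', 'X', 'g')
" parse and X and sXse
" Positive lookahead: the digit is checked but not replaced
echo substitute('ice ice_2 ice2 iced', 'ice\d\@=', 'X', 'g')
" ice ice_2 X2 iced
" Positive lookbehind: the underscore is checked but not replaced
echo substitute('_ice (_ice) 42ice', '_\@<=ice', 'X', 'g')
" _X (_X) 42ice
```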
You can also specify the number of bytes to look back for lookbehind patterns, which can significantly speed up the matching process. The number has to be specified between the `@` and `<` characters. For example, `_\@1<=ice` will look back only one byte before `ice` for matching purposes. `\(cat.*\)\@10<!dog` will look back only ten bytes before `dog` to check the given assertion.
Atomic Grouping
As discussed earlier, both greedy and non-greedy quantifiers will try to satisfy the overall pattern by varying the number of characters matched by the quantifiers. You can use atomic grouping if you do not want a specific sub-pattern to ever give back characters it has already matched. Similar to lookarounds, you need to use `\@>` as a suffix, for example `\(pattern\)\@>`.
- `s/\(0*\)\@>\d\{3,\}/(&)/g` replaces only numbers >= 100 irrespective of any number of leading zeros, for example `0501 035 154` is converted to `(0501) 035 (154)`
    - `\(0*\)\@>` matches the `0` character zero or more times, but it will not give up this portion to satisfy the overall pattern
    - `s/0*\d\{3,\}/(&)/g` replaces `0501 035 154` with `(0501) (035) (154)` (here `035` is matched because `0*` will match zero times to satisfy the overall pattern)
Some regexp engines provide this feature as possessive quantifiers.
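The two substitutions above can be compared with `substitute()`. This is a sketch using the same sample string.

```vim
" Atomic group: the leading zeros are never given back
echo substitute('0501 035 154', '\(0*\)\@>\d\{3,\}', '(&)', 'g')
" (0501) 035 (154)
" Plain 0* can match zero times, which lets \d\{3,\} take 035
echo substitute('0501 035 154', '0*\d\{3,\}', '(&)', 'g')
" (0501) (035) (154)
```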
Set start and end of the match
Some of the positive lookbehind and lookahead usage can be replaced with `\zs` and `\ze` respectively.
- `\zs` set the start of the match (portion before `\zs` won't be part of the match)
    - `s/\<\w\zs\w*\W*//g` replaces `sea eat car rat eel tea` with `secret`
    - same as `s/\(\<\w\)\@<=\w*\W*//g` or `s/\(\<\w\)\w*\W*/\1/g`
- `\ze` set the end of the match (portion after `\ze` won't be part of the match)
    - `s/ice\ze\d/X/g` replaces `ice ice_2 ice2 iced` with `ice ice_2 X2 iced`
    - same as `s/ice\d\@=/X/g` or `s/ice\(\d\)/X\1/g`
As per :h \zs and :h \ze, these "Can be used multiple times, the last one encountered in a matching branch is used."
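Here is a sketch of both atoms using `substitute()`, with the sample strings from the list above.

```vim
" \zs: the first letter of each word is required but not replaced
echo substitute('sea eat car rat eel tea', '\<\w\zs\w*\W*', '', 'g')
" secret
" \ze: the digit is required but not replaced
echo substitute('ice ice_2 ice2 iced', 'ice\ze\d', 'X', 'g')
" ice ice_2 X2 iced
```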
Magic modifiers
These escape sequences change certain aspects of the syntax and behavior of the search pattern that comes after such a modifier. You can use multiple such modifiers as needed for particular sections of the pattern.
Magic and nomagic
- `\m` magic mode (this is the default setting)
- `\M` nomagic mode
    - `.`, `*` and `~` are no longer metacharacters (compared to magic mode)
    - `\.`, `\*` and `\~` will make them behave as metacharacters
    - `^` and `$` would still behave as metacharacters
    - `\Ma.b` matches only `a.b`
    - `\Ma\.b` matches `a.b` as well as `a=b` or `a<b` or `acb` etc
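The nomagic behavior of the dot can be checked with the `=~#` operator, which evaluates to `1` for a match and `0` otherwise. This is a minimal sketch using strings from the list above.

```vim
" After \M the dot is literal
echo 'a.b' =~# '\Ma.b'
" 1
echo 'a=b' =~# '\Ma.b'
" 0
" After \M the escaped dot is the metacharacter
echo 'a=b' =~# '\Ma\.b'
" 1
```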
Very magic
The default syntax of Vim regexp has only a few metacharacters like `.`, `*`, `^` and so on. If you are familiar with regexp usage in programming languages such as Perl, Python and JavaScript, you can use `\v` to get a similar syntax in Vim. This will allow the use of more metacharacters such as `()`, `{}`, `+`, `?` and so on without having to prefix them with a `\` metacharacter. From :h magic documentation:
"Use of `\v` means that after it, all ASCII characters except `0`-`9`, `a`-`z`, `A`-`Z` and `_` have special meaning"
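- `\v<his>` matches `his` or `to-his` but not `this` or `history` or `_hist`
- `a<b.*\v<end>` matches text starting with a literal `a<b` (as in `c=a<b`) and ending at a later whole word `end`, since the `<` before `\v` is matched literally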
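To compare the very magic syntax with the default one, here is a sketch that rewrites two earlier substitutions with `\v`. The strings used with `=~#` are made up for illustration.

```vim
" Same as \<0*[1-9]\d\{2,}\> in the default syntax
echo substitute('0501 035 26 98234', '\v<0*[1-9]\d{2,}>', 'X', 'g')
" X 035 26 X
" Same as \(\w\+\),\(\w\+\) in the default syntax
echo substitute('good,bad 42,24', '\v(\w+),(\w+)', '\2,\1', 'g')
" bad,good 24,42
" The < before \v is literal, the < and > after it are word anchors
echo 'if a<b then end' =~# 'a<b.*\v<end>'
" 1
echo 'if a<b then bend' =~# 'a<b.*\v<end>'
" 0
```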