We briefly mentioned Regular Expressions during previous lessons on strings. At the core, if you recall, a "Regex" is just a fancy matching function. It allows you to specify a highly customizable match type and then it returns everything which matches the criteria. They are also used across many different programming languages and other implementations so it's a good tool to have basic familiarity with.
Regular Expressions are quite useful when you're pulling giant strings of data in from files, which is why we'll take another look at them here. They can be particularly useful to help scrub and process that data.
In this lesson, we'll look at the structure of a Regex so you will feel comfortable with the basics. You typically only need to build one regex for a single use case, so it's fine to just remember the high-level stuff and then custom build any regexes you need using a tool like Rubular.
Don't feel like you need to memorize all the different ways to put together a Regex. Just get the basics in your head and know where to find more information.
A regex uses a special syntax to build the search pattern for your target string. It has special characters that denote different types of matches. The expression is enclosed within slashes (`/`) and, in Ruby, can be used with the `=~` operator to kick off a match:
    # Returns the index of the character that begins the first match
    /niblets/ =~ "I like chicken niblets because niblets are awesome"
    #=> 15
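It can help to see what `=~` gives back when there is no match at all. Here's a minimal sketch; the `first_index` helper is a made-up name for illustration, not something built into Ruby:

```ruby
# Hypothetical helper: returns the index where the pattern first matches,
# or nil when the pattern does not match at all.
def first_index(pattern, text)
  pattern =~ text
end

sentence = "I like chicken niblets because niblets are awesome"

p first_index(/niblets/, sentence)  #=> 15
p first_index(/waffles/, sentence)  #=> nil

# Because nil is falsy, `=~` works nicely inside a conditional
puts "Found some niblets!" if sentence =~ /niblets/
```

Since a failed match is `nil` rather than an error, you can use `=~` directly as a yes/no test.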
There are all kinds of different functions which take regular expressions and do various things with them. A common (and basic) one is `match`, which returns your match wrapped in an instance of the highly versatile `MatchData` class.
Think of that MatchData instance as a boring, old search result with superpowers added on -- you can ask which character was matched, what the previous or next words were, and more that you'll see in a sec:
    # Grab a MatchData object using `match`
    > regex = /niblets/
    > string = "I like chicken niblets because niblets are awesome"
    > our_match = string.match(regex)
    #=> #<MatchData "niblets">

    # Play with it a bit using some MatchData methods
    > our_match.pre_match
    #=> "I like chicken "
    > our_match.post_match
    #=> " because niblets are awesome"
    > our_match.to_s
    #=> "niblets"
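A `MatchData` object can also tell you where the match sits inside the string. The sketch below gathers a few of those details into a hash; `match_details` and its hash keys are hypothetical names chosen for this example:

```ruby
# Hypothetical helper: describe where a pattern matched inside a string
def match_details(pattern, text)
  m = text.match(pattern)
  return nil if m.nil?   # `match` returns nil when nothing matches

  {
    matched: m[0],        # the matched text, same as m.to_s
    start:   m.begin(0),  # index of the first matched character
    finish:  m.end(0),    # index just past the last matched character
    before:  m.pre_match,
    after:   m.post_match
  }
end

sentence = "I like chicken niblets because niblets are awesome"
details  = match_details(/niblets/, sentence)

p details[:start]    #=> 15
p details[:finish]   #=> 22
p details[:before]   #=> "I like chicken "

p match_details(/waffles/, sentence)  #=> nil
```

The `nil` check matters: calling `pre_match` or `to_s` on a failed match would otherwise operate on `nil` instead of a `MatchData`.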
Regexes generally match based on the position and/or type of character. They are extremely flexible and you can build one for just about any situation you could possibly imagine. The basics are covered by Rubular in its reference section.
Matching gets more powerful, though, because you can actually tell the regex to start at the matched character and then grab all the characters matching additional criteria denoted by parentheses.
These "capture groups" are available inside the MatchData object under the `captures` method:
    # Capture all characters to the left and right
    # of the match (line breaks added for clarity)
    > string = "I like chicken niblets because niblets are awesome"
    > regex = /(.*)niblets(.*)/
    > our_match = string.match(regex)
    #=> #<MatchData "I like chicken niblets because niblets are awesome"
    #     1:"I like chicken niblets because "
    #     2:" are awesome">

    # View the captures
    # Note that the captures surround the LAST "niblets":
    # the first `(.*)` is greedy and grabs as much as it can.
    > our_match.captures
    #=> ["I like chicken niblets because ", " are awesome"]

    # Or, if you want to, just treat the `MatchData`
    # like an array where index 0 is the whole match
    # and each following index is a capture
    > our_match[1]
    #=> "I like chicken niblets because "
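Capture groups earn their keep when you want to pull specific pieces out of a string. A small sketch, assuming a made-up `key=value` format and a hypothetical `parse_setting` helper:

```ruby
# Hypothetical helper: split a "key=value" setting into its two parts
def parse_setting(line)
  m = line.match(/(\w+)=(\w+)/)
  m && m.captures   # nil if the line doesn't look like key=value
end

p parse_setting("color=blue")   #=> ["color", "blue"]
p parse_setting("no equals")    #=> nil

# The same captures are available by index on the MatchData object
m = "color=blue".match(/(\w+)=(\w+)/)
p m[0]   #=> "color=blue"  (the whole match)
p m[1]   #=> "color"       (first capture)
p m[2]   #=> "blue"        (second capture)
```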
For example, imagine a syntax highlighter that looks for comments (which start with `#`) and wraps them in a `<span>` with some special formatting. You'd want to look for every instance of your special chain of characters and then return everything from that match to the end of the line to be wrapped in an element. This is done by "capturing" the match.
In the JS library that we're currently using on this site for syntax highlighting, Ruby comments are detected using a regex that looks like:
    # A comment is:
    # 1. Everything between a `#` with a space after it
    #    and the end of the line
    # 2. Everything between a newline starting with `#`
    #    and the end of the line
    /(# [^\r\n]*(\r?\n|$)|(\r|\n|^)#[^\r\n]*(\r?\n|$))/
Don't go crazy trying to parse that, but it's an example of a regex that captures everything between the match (of the `#` in the right situations) and the end of the line. It's usually best to work from known examples in these cases and modify as necessary.
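To make the idea concrete, here's a rough sketch of wrapping a comment in a `<span>`. It uses a much simpler pattern than the library's regex above, and the `COMMENT_PATTERN` constant, the `wrap_comment` helper and the `comment` CSS class are all made up for illustration:

```ruby
# Simplified, assumed pattern: a `#` followed by the rest of the line.
# It ignores edge cases such as a `#` inside a string literal.
COMMENT_PATTERN = /(#.*)$/

# Hypothetical helper: wrap the comment part of a line in a <span>
def wrap_comment(line)
  m = line.match(COMMENT_PATTERN)
  return line if m.nil?   # no comment on this line, leave it alone

  "#{m.pre_match}<span class=\"comment\">#{m[1]}</span>#{m.post_match}"
end

puts wrap_comment("x = 1 # set x")
#=> x = 1 <span class="comment"># set x</span>

puts wrap_comment("x = 1")
#=> x = 1
```

The capture group holds the comment text itself, while `pre_match` and `post_match` supply whatever sits around it.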
While `match` returns only the first match it finds, `scan` will return all of them in array form:
    # Return all matches for the given expression
    # in array form (so not a `MatchData` object)
    > regex = /niblets/
    #=> /niblets/
    > s = string.scan(regex)
    #=> ["niblets", "niblets"]

    # Scan with capture groups returns an array of captures
    # for each match. Here the greedy `(.*)` swallows the first
    # "niblets", so the whole string is a single match
    > regex = /(.*)niblets(.*)/
    #=> /(.*)niblets(.*)/
    > s = string.scan(regex)
    #=> [["I like chicken niblets because ", " are awesome"]]
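When the pattern can match more than once, `scan` with capture groups gives you one inner array per match. A short sketch, again assuming a made-up `name=value` format and a hypothetical `all_pairs` helper:

```ruby
# Hypothetical helper: pull every "name=value" pair out of a string
def all_pairs(text)
  text.scan(/(\w+)=(\w+)/)
end

p all_pairs("a=1, b=2, c=3")
#=> [["a", "1"], ["b", "2"], ["c", "3"]]

# Without capture groups you get a flat array of matched strings
p "a=1, b=2, c=3".scan(/\w+=\w+/)
#=> ["a=1", "b=2", "c=3"]

# No matches at all gives an empty array rather than nil
p all_pairs("nothing here")
#=> []
```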
Another common use of a Regex is a simple find/replace scenario, which uses the `gsub` method:
> "Dude, where's my car?".gsub(/my/,"your") #=> "Dude, where's your car?"
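A few variations on find/replace are worth knowing about. This is just a sketch; the `swap_names` helper is a hypothetical name for the example:

```ruby
phrase = "my dog ate my homework"

# `sub` replaces only the first match, `gsub` replaces all of them
p phrase.sub(/my/, "your")    #=> "your dog ate my homework"
p phrase.gsub(/my/, "your")   #=> "your dog ate your homework"

# Hypothetical helper: capture groups can be reused in the
# replacement string as \1, \2, and so on
def swap_names(name)
  name.gsub(/(\w+), (\w+)/, '\2 \1')
end

p swap_names("Doe, John")     #=> "John Doe"

# A block receives each match and returns its replacement
p phrase.gsub(/\w+/) { |word| word.capitalize }
#=> "My Dog Ate My Homework"
```

Note the single quotes around `'\2 \1'`; in a double-quoted string you would need to write `"\\2 \\1"` instead.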
Greedy versus lazy matching is a concept that comes up frequently in other areas (e.g. algorithms). Should a particular Regex quantifier (e.g. `*`) return the longest match of the thing it's quantifying or the shortest?
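Greedy is the default behavior: the matcher consumes as many instances of its quantified token as possible. So, basically, the longest match. Lazy is the opposite: the matcher consumes as few as possible, which you ask for by adding a `?` after the quantifier (e.g. `*?`).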
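The difference is easiest to see side by side. In this sketch the `GREEDY_TAG` and `LAZY_TAG` constants are made-up names; the lazy version just adds a `?` after the `*`:

```ruby
GREEDY_TAG = /<.*>/    # grab as much as possible
LAZY_TAG   = /<.*?>/   # grab as little as possible

html = "<b>bold</b> and <i>italic</i>"

# Greedy: runs from the first `<` all the way to the last `>`
p html.match(GREEDY_TAG).to_s   #=> "<b>bold</b> and <i>italic</i>"

# Lazy: stops at the first `>` it reaches
p html.match(LAZY_TAG).to_s     #=> "<b>"

# The same idea applied to the niblets example: a lazy first group
# stops at the FIRST "niblets" instead of the last one
string = "I like chicken niblets because niblets are awesome"
p string.match(/(.*?)niblets(.*)/).captures
#=> ["I like chicken ", " because niblets are awesome"]
```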
The important bits of code from this lesson
    # Return MatchData object for the first match
    "string".match(/.*/)

    # Return the index where the first match starts
    /.*/ =~ "string"

    # Capturing data from or around a match
    /(.*)testing(.*)/ =~ "need more testing in here"

    # Return an array of matches
    "string".scan(/foo/)

    # Substitute the matches for something else
    "foostring".gsub(/foo/,"bar")
Regexes are something that you only get familiar with by practicing a lot. Once you've seen their power, it's tempting to use them everywhere (it'll fade). For our purposes, we'd just like to get you comfortable enough that you can hack your way through them as necessary in the future. Don't feel like you need to memorize them! That's a waste of valuable brain space right now.