Regular Expressions

A brief introduction to this quirky but enormously useful syntax for searching strings.

We briefly mentioned Regular Expressions during previous lessons on strings. At its core, if you recall, a "Regex" is just a fancy matching function: you specify a highly customizable pattern and it returns whatever matches that pattern. Regexes are also used across many different programming languages and tools, so it's good to have basic familiarity with them.

Regular Expressions are quite useful when you're pulling giant strings of data in from files, which is why we'll take another look at them here. They can be particularly useful to help scrub and process that data.

In this lesson, we'll look at the structure of a Regex so you will feel comfortable with the basics. You typically only need to build one regex for a single use case, so it's fine to just remember the high-level stuff and then custom-build any regexes you need using a tool like Rubular.

Don't feel like you need to memorize all the different ways to put together a Regex. Just get the basics in your head and know where to find more information.

The Basics

A regex uses a special syntax to build the search pattern for your target string. It has special characters that denote different types of matches. The expression is enclosed within slashes / and, in Ruby, can be used with the =~ operator to kick off a match:

# Returns the index of the character where the first match begins
/niblets/ =~ "I like chicken niblets because niblets are awesome"
#=> 15
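
When nothing matches, =~ gives back nil instead of an index, which makes it handy in conditionals. Here's a minimal sketch of that behavior (the "nuggets" pattern and the variable names are just made up for illustration):

```ruby
sentence = "I like chicken niblets because niblets are awesome"

# Index of the first match, or nil if the pattern never matches
hit  = /niblets/ =~ sentence
miss = /nuggets/ =~ sentence

# nil is falsy, so the result can drive an `if` directly
verdict = if sentence =~ /niblets/
            "found niblets"
          else
            "no niblets"
          end

puts hit.inspect   # 15
puts miss.inspect  # nil
puts verdict       # found niblets
```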

There are all kinds of different functions which take regular expressions and do various things with them. A common (and basic) one is match, which returns your match wrapped in an instance of the highly versatile MatchData class.

Think of that MatchData instance as a boring, old search result with superpowers added on -- you can ask what text was matched, what came before or after it, and more that you'll see in a sec:

# Grab a MatchData object using `match`
regex = /niblets/ 
string = "I like chicken niblets because niblets are awesome"
our_match = string.match(regex)
#=> #<MatchData "niblets">

# Play with it a bit using some MatchData methods
> our_match.pre_match
#=> "I like chicken " 
> our_match.post_match
#=> " because niblets are awesome" 
> our_match.to_s
#=> "niblets" 
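
A MatchData object can also tell you where the match sits inside the string, and it's worth knowing that match hands back nil when it finds nothing. A small sketch, assuming the same sample string (the "nuggets" pattern is invented to force a miss):

```ruby
string = "I like chicken niblets because niblets are awesome"
our_match = string.match(/niblets/)

# Where the whole match (group 0) starts and ends; the end index is exclusive
start_index = our_match.begin(0)
end_index   = our_match.end(0)

# `match` returns nil when there is no match, so guard before calling methods
no_match = string.match(/nuggets/)
summary  = no_match ? no_match.to_s : "nothing found"

puts start_index  # 15
puts end_index    # 22
puts summary      # nothing found
```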

Regex Syntax

Regexes generally match based on the position and/or type of character. They are extremely flexible and you can build one for just about any situation you could possibly imagine. The basics are covered by Rubular in their reference section:

  • [abc] A single character of: a, b, or c
  • [^abc] Any single character except: a, b, or c
  • [a-z] Any single character in the range a-z
  • [a-zA-Z] Any single character in the range a-z or A-Z
  • ^ Start of line
  • $ End of line
  • \A Start of string
  • \z End of string
  • . Any single character
  • \s Any whitespace character
  • \S Any non-whitespace character
  • \d Any digit
  • \D Any non-digit
  • \w Any word character (letter, number, underscore)
  • \W Any non-word character
  • \b Any word boundary
  • (...) Capture everything enclosed
  • (a|b) a or b
  • a? Zero or one of a
  • a* Zero or more of a
  • a+ One or more of a
  • a{3} Exactly 3 of a
  • a{3,} 3 or more of a
  • a{3,6} Between 3 and 6 of a
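
To make a few of the reference entries above concrete, here's a rough sketch that tries them out with match and =~. The sample strings are invented for illustration:

```ruby
# \d{3} means exactly three digits in a row
area_code = "Call 555-0199 today".match(/\d{3}/).to_s

# [a-z]+ means one or more lowercase letters
first_word = "hello World".match(/[a-z]+/).to_s

# ^ and $ anchor to a line; \A and \z anchor to the whole string
two_lines     = "first\nsecond"
line_anchor   = two_lines =~ /^second$/
string_anchor = two_lines =~ /\Asecond\z/

# ? makes the preceding character optional; (a|b) picks between alternatives
spelling = "The color red".match(/colou?r/).to_s
pet      = "I have a dog".match(/(cat|dog)/).to_s

puts area_code             # 555
puts first_word            # hello
puts line_anchor.inspect   # 6
puts string_anchor.inspect # nil
puts spelling              # color
puts pet                   # dog
```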

Returning "Capture Groups"

Matching gets more powerful, though, because you can wrap parts of your pattern in parentheses () and the regex will hold on to whatever text each parenthesized part matched.

These "capture groups" are available inside the MatchData object under the captures array:

# Capture all characters to the left and right 
# of the match (line breaks added for clarity)
> string = "I like chicken niblets because niblets are awesome"
> regex = /(.*)niblets(.*)/
> our_match = string.match(regex)
#=> #<MatchData "I like chicken niblets because 
    niblets are awesome" 1:"I like chicken niblets 
    because " 2:" are awesome"> 

# View the captures
# Note that the first capture runs all the way to the second "niblets":
#   `match` returns only one match, and the greedy `(.*)` grabs as much as it can.
> our_match.captures
#=> ["I like chicken niblets because ", " are awesome"] 

# Or, if you want to, just treat the `MatchData`
# like an array where index 0 is the entire match
# and each following index is a capture (1 is the first)
> our_match[1]
#=> "I like chicken niblets because "
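
Capture groups really earn their keep when you want specific pieces out of a larger string. As a quick sketch, here's one way you might pull the parts of a date out of a line of text (the log line and variable names are hypothetical):

```ruby
log_line = "Backup finished on 2024-03-15 without errors"

# Three groups: four digits, two digits, two digits, separated by dashes
date_match = log_line.match(/(\d{4})-(\d{2})-(\d{2})/)

whole_date       = date_match[0]        # index 0 is the entire match
year, month, day = date_match.captures  # one entry per pair of parentheses

puts whole_date  # 2024-03-15
puts year        # 2024
puts month       # 03
puts day         # 15
```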

Why is this useful? Let's say you're building a syntax highlighter in JavaScript (remember, regexes are in other languages too) which detects a certain sequence of characters (e.g. after a #) and wraps them in a <span> with some special formatting. You'd want to look for every instance of your special chain of characters and then return everything from that match to the end of the line to be wrapped in an element. This is done by "capturing" the match.

In the JS library that we're currently using on this site for syntax highlighting, Ruby comments are detected using a regex that looks like:

# A comment is:
# 1.  Everything between a `#` with a space after it
#     and the end of the line
# 2.  Everything between a newline starting with `#`
#     and the end of the line
/(# [^\r\n]*(\r?\n|$)|(\r|\n|^)#[^\r\n]*(\r?\n|$))/

Don't go crazy trying to parse that. It's an example of a regex that captures everything from the match (the # in the right situations) to the end of the line. It's usually best to work from known examples in these cases and modify as necessary.
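
To get a feel for the idea without the full pattern, here's a heavily simplified sketch in Ruby. It only handles the "# with a space after it" case, and the "comment" class name is an assumption for illustration, not what the actual library uses:

```ruby
line = 'total = 0 # running total'

# Capture from "# " up to (but not including) the end of the line
comment = line.match(/(# [^\r\n]*)/)

# Rebuild the line with the captured comment wrapped in a <span>
highlighted = comment.pre_match +
              "<span class=\"comment\">#{comment[1]}</span>" +
              comment.post_match

puts highlighted
# total = 0 <span class="comment"># running total</span>
```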

Scanning for Multiple Matches

Where match returns only the first match, scan will return all of them in array form:

# Return all matches for the given expression
# in array form (so not a `MatchData` object)
> regex = /niblets/
#=> /niblets/ 
> s = string.scan(regex)
#=> ["niblets", "niblets"] 

# With capture groups, scan returns one array of captures per match.
#   Here the greedy `(.*)` swallows the whole string, so there is only one match.
> regex = /(.*)niblets(.*)/
#=> /(.*)niblets(.*)/ 
> s = string.scan(regex)
#=> [["I like chicken niblets because ", " are awesome"]] 
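
When the pattern doesn't gobble up the whole string, scan with capture groups gives you one inner array per match. A small sketch, using a made-up inventory string:

```ruby
inventory = "apples: 3, pears: 12, plums: 7"

# Without groups: a flat array of every match
numbers = inventory.scan(/\d+/)

# With groups: one array of captures for each match
pairs = inventory.scan(/(\w+): (\d+)/)

p numbers  # ["3", "12", "7"]
p pairs    # [["apples", "3"], ["pears", "12"], ["plums", "7"]]
```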

Gsub to Replace Instances

Another common use of a Regex is a simple find/replace scenario, which uses the gsub method:

> "Dude, where's my car?".gsub(/my/,"your")
#=> "Dude, where's your car?" 
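
A few variations are worth a quick look: gsub replaces every match while sub replaces only the first, the replacement string can reuse capture groups with \1, \2 and so on, and gsub also accepts a block. A rough sketch with invented strings:

```ruby
chant = "niblets, niblets, niblets"

# gsub replaces every match; sub replaces only the first one
all_replaced   = chant.gsub(/niblets/, "nuggets")
first_replaced = chant.sub(/niblets/, "nuggets")

# \1 and \2 refer back to the capture groups (note the single quotes)
swapped = "Doe, Jane".gsub(/(\w+), (\w+)/, '\2 \1')

# With a block, each match is passed in and the block's result replaces it
doubled = "3 apples and 10 pears".gsub(/\d+/) { |n| (n.to_i * 2).to_s }

puts all_replaced    # nuggets, nuggets, nuggets
puts first_replaced  # nuggets, niblets, niblets
puts swapped         # Jane Doe
puts doubled         # 6 apples and 20 pears
```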

Greedy vs Lazy Matching

This concept comes up frequently in other areas (e.g. algorithms). Should a particular Regex quantifier (e.g. + or *) return the longest match of the thing it's quantifying or the shortest?

Greedy is the default behavior: the matcher grabs as many instances of its quantified token as possible. So, basically, the longest match. Lazy is the opposite and grabs as few as possible; you ask for it by adding a ? after the quantifier (e.g. +? or *?).

Read more about regex quantifiers here, specifically the "Greedy Trap" section. You can also learn more from regular-expressions.info under "Watch Out for The Greediness!"
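
To see greedy and lazy quantifiers side by side in Ruby, here's a minimal sketch. The HTML snippet is made up, and the last example revisits the niblets string from earlier in the lesson:

```ruby
html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ runs to the LAST ">" it can reach
greedy = html.match(/<.+>/).to_s

# Lazy: .+? stops at the FIRST ">" it can reach
lazy = html.match(/<.+?>/).to_s

# Lazy matching plus scan picks out each tag separately
all_tags = html.scan(/<.+?>/)

# The niblets example again, with a lazy first group
string = "I like chicken niblets because niblets are awesome"
lazy_captures = string.match(/(.*?)niblets(.*)/).captures

puts greedy      # <b>bold</b> and <i>italic</i>
puts lazy        # <b>
p all_tags       # ["<b>", "</b>", "<i>", "</i>"]
p lazy_captures  # ["I like chicken ", " because niblets are awesome"]
```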

Code Review

The important bits of code from this lesson

# Return a MatchData object for the first match (or nil)
"string".match(/.*/)
# Return the index where the first match begins (or nil)
/.*/ =~ "string"

# Capturing data from or around a match
/(.*)testing(.*)/ =~ "need more testing in here"

# Return an array of matches
"string".scan(/foo/)

# Substitute the matches for something else
"foostring".gsub(/foo/,"bar")

Wrapping Up

Regexes are something that you only get familiar with by practicing a lot. Once you've seen their power, it's tempting to use them everywhere (it'll fade). For our purposes, we'd just like to get you comfortable enough that you can hack your way through them as necessary in the future. Don't feel like you need to memorize them! That's a waste of valuable brain space right now.



