Jason Baldridge is an associate professor of computational linguistics at the University of Texas at Austin. He has been actively involved in open source software development for over 14 years (including founding OpenNLP), and regularly codes in Scala, Java, Python, R and Perl.

First steps in Scala for beginning programmers, Part 5

12.29.2011

This is part 5 of tutorials for first-time programmers getting into Scala. Other posts are on this blog, and you can get links to those and other resources on the links page of the Computational Linguistics course I’m creating these for.

This post is the first of two about regular expressions (regexes), which are essential for a wide range of programming tasks, and for computational linguistics tasks in particular. This tutorial explains how to use them with Scala, assuming that the reader is already familiar with regular expression syntax. It shows how to create regular expressions in Scala and use them with Scala’s powerful pattern matching capabilities, in particular for variable assignment and cases in match expressions.

Creating regular expressions

Scala provides a very simple way to create regexes: just define a regex as a string and then call the r method on it. The following defines a regular expression that characterizes the string language a^mb^n (one or more a’s followed by one or more b’s, where the number of b’s need not equal the number of a’s).

scala> val AmBn = "a+b+".r
AmBn: scala.util.matching.Regex = a+b+

To use meta-characters, like \s, \w, and \d, you must either escape the backslashes or use triple-quoted strings, which are often referred to as raw strings. The following are two equivalent ways to write a regex that matches a sequence of word characters followed by a sequence of digits.

scala> val WordDigit1 = "\\w+\\d+".r
WordDigit1: scala.util.matching.Regex = \w+\d+

scala> val WordDigit2 = """\w+\d+""".r
WordDigit2: scala.util.matching.Regex = \w+\d+

Whether escaping or using raw strings is preferable depends on the context. For example, with the above, I’d go with the raw string. However, for using a regex to split a string on whitespace characters, escaping is somewhat preferable.

scala> val adder = "We're as similar as two dissimilar things in a pod.\n\t-Blackadder"
adder: java.lang.String =
We're as similar as two dissimilar things in a pod.
-Blackadder

scala> adder.split("\\s+")
res2: Array[java.lang.String] = Array(We're, as, similar, as, two, dissimilar, things, in, a, pod., -Blackadder)

scala> adder.split("""\s+""")
res3: Array[java.lang.String] = Array(We're, as, similar, as, two, dissimilar, things, in, a, pod., -Blackadder)
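
As a quick check of how the two styles relate, here is a small sketch that can be pasted into the REPL (the variable names are just for illustration and are not used elsewhere in this tutorial). The escaped and triple-quoted spellings of a regex give exactly the same string, but an escape like \t is handled at different stages: in a regular string Scala turns it into a tab before the regex engine sees it, while in a triple-quoted string the backslash survives and the regex engine interprets it.

```scala
// Two spellings of the same regex source string.
val escaped = "\\w+\\d+"
val raw = """\w+\d+"""
println(escaped == raw)      // true

// In a regular string, Scala itself converts \t into a single tab character.
val tabEscaped = "a\tb"
// In a triple-quoted string, the backslash and the t stay as two characters.
val tabRaw = """a\tb"""
println(tabEscaped.length)   // 3
println(tabRaw.length)       // 4

// Used as regexes, both still match the string "a<tab>b": the first contains
// a literal tab, the second contains the regex escape \t.
val input = "a\tb"
val bothMatch =
  tabEscaped.r.pattern.matcher(input).matches() &&
  tabRaw.r.pattern.matcher(input).matches()
println(bothMatch)           // true
```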

A note on naming: the convention in Scala is to use variable names with the first letter uppercased for Regex objects. This makes them consistent with the use of pattern matching in match statements, as shown below.

Matching with regexes

We saw above that using the r method on a String returns a value that is a Regex object (more on the scala.util.matching part below). How do you actually do useful things with these Regex objects? There are a number of ways. The prettiest, and perhaps most common for the non-computational linguist, is to use them in tandem with Scala’s standard pattern matching capabilities. Let’s consider the task of parsing names and turning them into useful data structures that we can do various useful things with.

scala> val Name = """(Mr|Mrs|Ms)\. ([A-Z][a-z]+) ([A-Z][a-z]+)""".r
Name: scala.util.matching.Regex = (Mr|Mrs|Ms)\. ([A-Z][a-z]+) ([A-Z][a-z]+)

scala> val Name(title, first, last) = "Mr. James Stevens"
title: String = Mr
first: String = James
last: String = Stevens

scala> val Name(title, first, last) = "Ms. Sally Kenton"
title: String = Ms
first: String = Sally
last: String = Kenton

Notice the similarity with pattern matching on types like Array and List.

scala> val Array(title, first, last) = "Mr. James Stevens".split(" ")
title: java.lang.String = Mr.
first: java.lang.String = James
last: java.lang.String = Stevens

scala> val List(title, first, last) = "Mr. James Stevens".split(" ").toList
title: java.lang.String = Mr.
first: java.lang.String = James
last: java.lang.String = Stevens

Notice that here the “.” was kept as part of the title, whereas the regex left it out. A more substantive difference with the regular expression is that it only accepts strings with the right form and will reject others, unlike simple splitting and matching to Array.

scala> val Array(title, first, last) = "221B Baker Street".split(" ")
title: java.lang.String = 221B
first: java.lang.String = Baker
last: java.lang.String = Street

scala> val Name(title, first, last) = "221B Baker Street"
scala.MatchError: 221B Baker Street (of class java.lang.String)
at .<init>(<console>:12)
at .<clinit>(<console>)
at .<init>(<console>:11)
at .<clinit>(<console>)
at $export(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:592)
at scala.tools.nsc.interpreter.IMain$Request$$anonfun$10.apply(IMain.scala:828)
at scala.tools.nsc.interpreter.Line$$anonfun$1.apply$mcV$sp(Line.scala:43)
at scala.tools.nsc.io.package$$anon$2.run(package.scala:31)
at java.lang.Thread.run(Thread.java:680)

That’s a lot of complaining, but in practice you will generally be in one of three situations: (a) you are absolutely sure that your strings are in the correct format, (b) you check for such exceptions, or (c) you use the regex as one option of many in a match expression.
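
To make situation (c) concrete, here is a minimal sketch. The helper name parseName and the choice of Option as the return type are my own assumptions for illustration; the point is only that a second, catch-all case keeps a MatchError from being thrown when the string does not have the right form.

```scala
val Name = """(Mr|Mrs|Ms)\. ([A-Z][a-z]+) ([A-Z][a-z]+)""".r

// Hypothetical helper: Some(...) when the whole string matches Name,
// None for anything else (instead of a MatchError).
def parseName(s: String): Option[(String, String, String)] = s match {
  case Name(title, first, last) => Some((title, first, last))
  case _ => None
}

println(parseName("Mr. James Stevens"))   // Some((Mr,James,Stevens))
println(parseName("221B Baker Street"))   // None
```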

For now, let’s assume the input is appropriate. This means we can easily convert a list of names as strings into a list of tuples using map and a match expression.

scala> val names = List("Mr. James Stevens", "Ms. Sally Kenton", "Mrs. Jane Doe", "Mr. John Doe", "Mr. James Smith")
names: List[java.lang.String] = List(Mr. James Stevens, Ms. Sally Kenton, Mrs. Jane Doe, Mr. John Doe, Mr. James Smith)

scala> names.map(x => x match { case Name(title, first, last) => (title, first, last) })
res11: List[(String, String, String)] = List((Mr,James,Stevens), (Ms,Sally,Kenton), (Mrs,Jane,Doe), (Mr,John,Doe), (Mr,James,Smith))
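
If some of the strings might not be names, one possible variation is to use collect instead of map. This is only a sketch (the list mixed is made up for the example): collect applies the case to the elements it matches and silently drops the rest, so no MatchError is raised.

```scala
val Name = """(Mr|Mrs|Ms)\. ([A-Z][a-z]+) ([A-Z][a-z]+)""".r

// A made-up list in which one element is not a name.
val mixed = List("Mr. James Stevens", "221B Baker Street", "Ms. Sally Kenton")

// collect keeps only the elements for which the case is defined.
val parsed = mixed.collect { case Name(title, first, last) => (title, first, last) }

println(parsed)   // List((Mr,James,Stevens), (Ms,Sally,Kenton))
```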

Note the crucial use of groups in the Name regex: the number of groups equals the number of variables being initialized in the match. The first group is needed for the alternatives Mr, Mrs, and Ms. Without the other groups, we get an error. (From here on, I’ll shorten the MatchError output.)

scala> val NameOneGroup = """(Mr|Mrs|Ms)\. [A-Z][a-z]+ [A-Z][a-z]+""".r
NameOneGroup: scala.util.matching.Regex = (Mr|Mrs|Ms)\. [A-Z][a-z]+ [A-Z][a-z]+

scala> val NameOneGroup(title, first, last) = "Mr. James Stevens"
scala.MatchError: Mr. James Stevens (of class java.lang.String)

Of course, we can still match to the first group.

scala> val NameOneGroup(title) = "Mr. James Stevens"
title: String = Mr

What if we go in the other direction, creating more groups so that we can, for example, share the “M” in the various titles? Here’s an attempt.

scala> val NameShareM = """(M(r|rs|s))\. ([A-Z][a-z]+) ([A-Z][a-z]+)""".r
NameShareM: scala.util.matching.Regex = (M(r|rs|s))\. ([A-Z][a-z]+) ([A-Z][a-z]+)

scala> val NameShareM(title, first, last) = "Mr. James Stevens"
scala.MatchError: Mr. James Stevens (of class java.lang.String)

What happened is that a new group was created, so there are now four groups to match.

scala> val NameShareM(title, titleEnding, first, last) = "Mr. James Stevens"
title: String = Mr
titleEnding: String = r
first: String = James
last: String = Stevens

scala> val NameShareM(title, titleEnding, first, last) = "Mrs. Sally Kenton"
title: String = Mrs
titleEnding: String = rs
first: String = Sally
last: String = Kenton

So, nested groups are captured as well. To stop the (r|rs|s) part from creating a match group while still being able to use it to group alternatives in a disjunction, begin the group with the ?: operator, which makes it a non-capturing group.

scala> val NameShareMThreeGroups = """(M(?:r|rs|s))\. ([A-Z][a-z]+) ([A-Z][a-z]+)""".r
NameShareMThreeGroups: scala.util.matching.Regex = (M(?:r|rs|s))\. ([A-Z][a-z]+) ([A-Z][a-z]+)

scala> val NameShareMThreeGroups(title, first, last) = "Mr. James Stevens"
title: String = Mr
first: String = James
last: String = Stevens

In this case, sharing the M hasn’t saved anything over (Mr|Mrs|Ms), but there are plenty of situations where non-capturing groups are quite useful.
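
When a MatchError shows up unexpectedly, it can help to check how many capturing groups a regex actually has, since that is the number of variables the pattern must supply. One way to do this, sketched below, is to go through the underlying java.util.regex.Pattern that every Regex object exposes as pattern. The helper name groupCount is my own and not part of the Regex API.

```scala
import scala.util.matching.Regex

val Name = """(Mr|Mrs|Ms)\. ([A-Z][a-z]+) ([A-Z][a-z]+)""".r
val NameShareM = """(M(r|rs|s))\. ([A-Z][a-z]+) ([A-Z][a-z]+)""".r
val NameShareMThreeGroups = """(M(?:r|rs|s))\. ([A-Z][a-z]+) ([A-Z][a-z]+)""".r

// Hypothetical helper: asks the underlying Java matcher how many capturing
// groups the pattern defines (non-capturing (?:...) groups are not counted).
def groupCount(regex: Regex): Int = regex.pattern.matcher("").groupCount

println(groupCount(Name))                    // 3
println(groupCount(NameShareM))              // 4
println(groupCount(NameShareMThreeGroups))   // 3
```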

We can also use regex backreferences. Say we want to match names in which the first and last names differ only in their initial letter, like “Mr. John Bohn”, “Mr. Joe Doe”, and “Mrs. Jill Hill”.

scala> val RhymeName = """(Mr|Mrs|Ms)\. ([A-Z])([a-z]+) ([A-Z])\3""".r
RhymeName: scala.util.matching.Regex = (Mr|Mrs|Ms)\. ([A-Z])([a-z]+) ([A-Z])\3

scala> val RhymeName(title, firstInitial, firstRest, lastInitial) = "Mr. John Bohn"
title: String = Mr
firstInitial: String = J
firstRest: String = ohn
lastInitial: String = B

Then we could piece things together to get the names we wanted.

scala> val first = firstInitial+firstRest
first: java.lang.String = John

scala> val last = lastInitial+firstRest
last: java.lang.String = Bohn

But we can do better by using a nested group and just throwing its match result away with the underscore _.

scala> val RhymeName2 = """(Mr|Mrs|Ms)\. ([A-Z]([a-z]+)) ([A-Z]\3)""".r
RhymeName2: scala.util.matching.Regex = (Mr|Mrs|Ms)\. ([A-Z]([a-z]+)) ([A-Z]\3)

scala> val RhymeName2(title, first, _, last) = "Mr. John Bohn"
title: String = Mr
first: String = John
last: String = Bohn

Note: we can’t use the ?: operator with ([a-z]+) to stop the match because we need exactly that string to match with the \3 later.
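
When only a yes/no answer is needed, the captured values can all be ignored. The sketch below uses the sequence wildcard _* to do that in a match expression. The function name isRhymeName is made up for this example.

```scala
val RhymeName2 = """(Mr|Mrs|Ms)\. ([A-Z]([a-z]+)) ([A-Z]\3)""".r

// Hypothetical helper: true when the whole string matches RhymeName2.
// The _* pattern accepts however many groups the regex has and discards them.
def isRhymeName(s: String): Boolean = s match {
  case RhymeName2(_*) => true
  case _ => false
}

println(isRhymeName("Mr. John Bohn"))       // true
println(isRhymeName("Mrs. Jill Hill"))      // true
println(isRhymeName("Mr. James Stevens"))   // false
```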

Using regexes for assignment via pattern matching requires a full string match.

scala> val Name(title, first, last) = "Mr. James Stevens"
title: String = Mr
first: String = James
last: String = Stevens

scala> val Name(title, first, last) = "Mr. James Stevens walked to the door."
scala.MatchError: Mr. James Stevens walked to the door. (of class java.lang.String)

This is a crucial aspect of using them in match expressions. Consider an application that needs to be able to parse telephone numbers in different formats, like (123)555-5555 and 123-555-5555. Here are regexes for these two patterns and their use to parse these numbers.

scala> val Phone1 = """\((\d{3})\)\s*(\d{3})-(\d{4})""".r
Phone1: scala.util.matching.Regex = \((\d{3})\)\s*(\d{3})-(\d{4})

scala> val Phone2 = """(\d{3})-(\d{3})-(\d{4})""".r
Phone2: scala.util.matching.Regex = (\d{3})-(\d{3})-(\d{4})

scala> val Phone1(area, first3, last4) = "(123) 555-5555"
area: String = 123
first3: String = 555
last4: String = 5555

scala> val Phone2(area, first3, last4) = "123-555-5555"
area: String = 123
first3: String = 555
last4: String = 5555

We could of course use a single regular expression, but we’ll go with these two so that they can be used as separate case statements in a match expression that is part of a function that takes a string representation of a phone number and returns a tuple of three strings (thus normalizing the numbers).

def normalizePhoneNumber (number: String) = number match {
  case Phone1(x,y,z) => (x,y,z)
  case Phone2(x,y,z) => (x,y,z)
}

The action being taken for each match is just to package the separate values up in a Tuple3 — more interesting things could be done if one were looking for country codes, dealing with multiple countries, etc. The point here is to see how the regular expressions are used for the cases to capture values and assign them to local variables, each time appropriate for the form of the string that is brought in. (We’ll see in a later tutorial how to protect such a method from inputs that are not phone numbers and such.)

Now that we have that function, we can easily apply it to a list of strings representing phone numbers and filter out just those in a specific area, for example.

scala> val numbers = List("(123) 555-5555", "123-555-5555", "(321) 555-0000")
numbers: List[java.lang.String] = List((123) 555-5555, 123-555-5555, (321) 555-0000)

scala> numbers.map(normalizePhoneNumber)
res16: List[(String, String, String)] = List((123,555,5555), (123,555,5555), (321,555,0000))

scala> numbers.map(normalizePhoneNumber).filter(n => n._1=="123")
res17: List[(String, String, String)] = List((123,555,5555), (123,555,5555))
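
As mentioned above, a single regular expression could also cover both phone formats. Here is one possible sketch, with the caveat that it is more permissive than Phone1 and Phone2 taken together: because the parentheses are each optional, it would also accept oddly formed strings such as "(123-555-5555". The names PhoneAny and normalizeWithOneRegex are my own.

```scala
// Optional "(", three digits, optional ")", then any run of dashes or
// whitespace, three digits, a dash, and four digits.
val PhoneAny = """\(?(\d{3})\)?[-\s]*(\d{3})-(\d{4})""".r

// Hypothetical single-case counterpart of normalizePhoneNumber.
def normalizeWithOneRegex(number: String) = number match {
  case PhoneAny(x, y, z) => (x, y, z)
}

val numbers = List("(123) 555-5555", "123-555-5555", "(321) 555-0000")
println(numbers.map(normalizeWithOneRegex))
// List((123,555,5555), (123,555,5555), (321,555,0000))
```

Keeping two separate regexes, as in the main text, is stricter and makes each accepted format explicit, which is one reason to prefer separate cases.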

Building Regexes from Strings

Sometimes one wants to build up a regex from smaller component parts, for example, defining what a noun phrase is and then searching for a sequence of noun phrases. To do this, we first must see the longer form of creating a regex.

scala> val AmBn = new scala.util.matching.Regex("a+b+")
AmBn: scala.util.matching.Regex = a+b+

This is the first time in these tutorials that we are explicitly creating an object using the reserved word new. We’ll be covering objects in more detail later, but what you need to know now is that Scala has a great deal of functionality that is not available by default. Mostly, we’ve been working with things like Strings, Ints, Doubles, Lists, and so on, and for the most part it has appeared to you as though they are “just” Strings, Ints, Doubles, and Lists. However, those short names are abbreviations. Their fully qualified names are:

  • java.lang.String
  • scala.Int
  • scala.Double
  • scala.List

And, in the case of the last one, scala.List is a type that is actually backed by a concrete implementation in scala.collection.immutable.List. So, when you just see “List”, Scala is actually hiding some detail; most importantly, it makes it possible to use extremely common types with very little fuss.
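
A small sketch can confirm this. The type annotations below spell out the full names, and the values are then assigned to variables declared with the short names, which only compiles because both spellings refer to the same types. The variable names are arbitrary.

```scala
// Values declared with fully qualified type names.
val s: java.lang.String = "hello"
val n: scala.Int = 3
val d: scala.Double = 2.5
val xs: scala.collection.immutable.List[Int] = List(1, 2, 3)

// The same values assigned to variables declared with the short names.
val s2: String = s
val n2: Int = n
val d2: Double = d
val xs2: List[Int] = xs

println(xs2)   // List(1, 2, 3)
```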

What scala.util.matching.Regex is telling you is that the Regex class is part of the scala.util.matching package (and that scala.util.matching is a subpackage of scala.util, which itself is a subpackage of the scala package). Fortunately, you don’t need to type out scala.util.matching every time you want to use Regex: just use an import statement, and then use Regex without the extra package specification.

scala> import scala.util.matching.Regex
import scala.util.matching.Regex

scala> val AmBn = new Regex("a+b+")
AmBn: scala.util.matching.Regex = a+b+

The other thing to explain is the new part. Again, we’ll cover this in more detail later, but for now think about it the following way. The Regex class is like a factory for producing regex objects, and the way you request (order) one of those objects is to say “new Regex(…)”, where the … indicates the string that should be used to define the properties of that object. You’ve actually been doing this quite a lot already when creating Lists, Ints, and Doubles, but again, for those core types, Scala has provided special syntax to simplify their creation and use.

Okay, but why would one want to use new Regex("a+b+") when "a+b+".r can be used to do the same? One motivation is building a regex up from several String variables rather than writing it as a single string literal (though, as the update further down notes, .r can do this too). As an example, say you want a regex that matches strings of the form “the/a dog/cat/mouse/bird chased/ate the/a dog/cat/mouse/bird” such as “the dog chased the cat” and “a cat chased the bird.” The following might be the first attempt.

scala> val Transitive = "(a|the) (dog|cat|mouse|bird) (chased|ate) (a|the) (dog|cat|mouse|bird)".r
Transitive: scala.util.matching.Regex = (a|the) (dog|cat|mouse|bird) (chased|ate) (a|the) (dog|cat|mouse|bird)

This works, but we can also build it without repeating the same expression twice by using a variable that contains a String defining a regular expression (but which is not a Regex object itself) and building the regex with that.

scala> val nounPhrase = "(a|the) (dog|cat|mouse|bird)"
nounPhrase: java.lang.String = (a|the) (dog|cat|mouse|bird)

scala> val Transitive = new Regex(nounPhrase + " (chased|ate) " + nounPhrase)
Transitive: scala.util.matching.Regex = (a|the) (dog|cat|mouse|bird) (chased|ate) (a|the) (dog|cat|mouse|bird)

UPDATE: Actually, you can do this with .r rather than new Regex(…).

scala> val Transitive = (nounPhrase + " (chased|ate) " + nounPhrase).r
Transitive: scala.util.matching.Regex = (a|the) (dog|cat|mouse|bird) (chased|ate) (a|the) (dog|cat|mouse|bird)
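
One thing to keep in mind when composing a regex this way is that every group in every component counts: nounPhrase contributes two groups each time it is used, so Transitive has five groups in total. The sketch below uses it in a match expression, ignoring the two determiner groups with underscores. The function name describe and its output format are made up for the example.

```scala
val nounPhrase = "(a|the) (dog|cat|mouse|bird)"
val Transitive = (nounPhrase + " (chased|ate) " + nounPhrase).r

// Hypothetical helper: pulls out the subject, verb, and object nouns,
// skipping the determiners (groups 1 and 4).
def describe(sentence: String): String = sentence match {
  case Transitive(_, subject, verb, _, obj) => subject + " -" + verb + "-> " + obj
  case _ => "no match"
}

println(describe("the dog chased the cat"))   // dog -chased-> cat
println(describe("a cat ate the bird"))       // cat -ate-> bird
println(describe("the dog barked"))           // no match
```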

The next tutorial will show how to use the scala.util.matching package API to do more extensive matching with regular expressions, such as finding multiple matches and performing substitutions.

 

From http://bcomposes.wordpress.com/2011/09/04/first-steps-in-scala-for-beginning-programmers-part-5/

Published at DZone with permission of Jason Baldridge, author and DZone MVB.


Comments

Afandi Merathi replied on Sun, 2012/03/18 - 7:53am

I don’t understand the reasoning behind the last example: you can use String variables just fine without needing to directly create a Regex object.
