Word Break in Clojure and Ruby, and Laziness in Ruby

TL;DR

I implemented "word break" in Clojure and Ruby and yammered on about how laziness can lead to better code with higher reuse. Lazy lists in Ruby are easy to do, but the laziness performance tax may be too high to bear (with the implementation used in this post). That said, laziness is still a valid tool when used appropriately.

A gist of all the code in this post can be found here.

Word Break

Daniel Tunkelang recently posted one of his favorite interview questions called "word break". From his blog, this is the description of the problem:

Given an input string and a dictionary of words,
segment the input string into a space-separated
sequence of dictionary words if possible. For
example, if the input string is "applepie" and
dictionary contains a standard set of English words,
then we would return the string "apple pie" as output.

Clojure version

I was curious how a Clojure version would compare to the posted Java version, so I implemented the following recursive backtracking solution:

(ns word-break.core
  (:use midje.sweet
        [clojure.contrib.core :only [-?> -?>>]]))

(defn word-break [input dict]
  (if (contains? dict input)
    input
    (let [length (count input)
          candidates (map #(list (subs input 0 %) %) (range length))
          prefixes (filter (fn [[c _]] (contains? dict c)) candidates)]
      (first (for [[prefix i] prefixes
                   :let [suffix (-?> (subs input i length) (word-break dict))]
                   :when suffix]
               (str prefix " " suffix))))))

;; tests
(facts
  (word-break "applepie" #{"apple" "pie"}) => "apple pie"
  (word-break "anappleaday" #{"an" "apple" "a" "day"}) => "an apple a day"
  (word-break "aaaab" #{"a" "aa" "aaa" "aaaa"}) => nil)

I could have gone golfing with this but I decided to leave the intermediate bindings for clarity. The noteworthy difference between this implementation and the original Java version is the use of lazy lists (sequences in Clojure). Rather than looping and explicitly terminating with a return statement, this version specifies all valid breaks (lazily) and simply returns the first one. Since the entire list is never fully realized, it is equivalent in runtime complexity, O(2^n), to the provided Java version.

For completeness here is a memoized version:

;; inspired by http://stackoverflow.com/questions/3906831/how-do-i-generate-memoized-recursive-functions-in-clojure
(defmacro memoize-fn
  "Produces a memoized anonymous function that can recursively call itself."
  [fn-name & fn-args]
  `(with-local-vars
      [~fn-name (memoize
                (fn ~@fn-args))]
     (.bindRoot ~fn-name @~fn-name)
     @~fn-name))

(defn word-break-with-memoization [input dict]
  (let [word-break
        (memoize-fn word-break [input]
                    (if (contains? dict input)
                      input
                      (let [length (count input)
                            candidates (map #(list (subs input 0 %) %) (range length))
                            prefixes (filter (fn [[c _]] (contains? dict c)) candidates)]
                        (first (for [[prefix i] prefixes
                                     :let [suffix (-?> (subs input i length) word-break)]
                                     :when suffix]
                                 (str prefix " " suffix))))))]
    (word-break input)))

;; tests are the same as above

As an interesting side note, the memoization could also have been implemented using the Y Combinator. Matt Might has a great article on this exact use case.

Ruby version

I next implemented the algorithm in Ruby (1.9.2-p290) to see how it would compare:

module WordBreak

  class << self
    def segmentize(string, dictionary)
      return string if dictionary.include?(string)
      string.size.times do |i|
        prefix = string[0..i]
        next unless dictionary.include?(prefix)
        remaining_segment = segmentize(string[(i+1)..-1], dictionary)
        return [prefix, remaining_segment].join(" ") if remaining_segment
      end
      nil
    end

  end
end
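
For a quick sanity check, here are the same cases as the Clojure tests (not in the gist, but easy to try in irb):

require 'set'

dictionary = Set.new(%w[an apple a day pie])
WordBreak.segmentize("applepie", dictionary)              # => "apple pie"
WordBreak.segmentize("anappleaday", dictionary)           # => "an apple a day"
WordBreak.segmentize("aaaab", Set.new(%w[a aa aaa aaaa])) # => nil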

I would consider the above to be idiomatic Ruby and I don't find it hard to follow. That said, it does seem more complex than the Clojure version. This imperative version has more conditionals, terminates early with an explicit return, and is less modular. By "less modular" I mean it doesn't take advantage of the core Enumerable methods like #map and #select (filter in other languages). For example, it uses the next keyword with a conditional to reimplement what #select already does.
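
For instance, here is that pattern side by side with its #select equivalent (a contrived illustration, not code from the solution):

evens = []
(0...10).each do |i|
  next unless i.even? # imperative: skip non-matches by hand
  evens << i
end
evens == (0...10).select(&:even?) # => true; #select expresses the filter directly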

In short, I am in complete agreement with John Hughes that laziness lends itself to more modular and clear code. If you haven't already, I highly suggest reading his paper on the subject, "Why Functional Programming Matters" [PDF].

Again, for completeness here is the memoized version:

module WordBreak
  class << self
    def memoize(fn)
      cache = {}
      -> *args { cache.include?(args) ? cache[args] : cache[args] = fn.call(*args) }
    end

    def segmentize_with_memoization(string, dictionary)
      segmentize = memoize( -> string {
        return string if dictionary.include?(string)
        string.size.times do |i|
          prefix = string[0..i]
          next unless dictionary.include?(prefix)
          remaining_segment = segmentize.call(string[(i+1)..-1])
          return [prefix, remaining_segment].join(" ") if remaining_segment
        end
        nil
      })
      segmentize.call(string)
    end
  end
end
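
The memoize helper is generic, by the way; it will cache any lambda that can reach itself through its binding (a hypothetical usage, not part of the word-break code):

fib = WordBreak.memoize(-> n { n < 2 ? n : fib.call(n - 1) + fib.call(n - 2) })
fib.call(40) # => 102334155, with every subproblem computed only once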

Being Lazy in Ruby

There are a number of ways one could implement lazy lists in Ruby. (There are also quite a few libraries... Hamster is one of the more interesting ones as it also implements a number of immutable persistent data structures.) One of the easier ways, however, is to use the built-in Enumerator class (Enumerable::Enumerator in Ruby 1.8.x). For example, one could implement an iterate method, as found in many functional languages, to generate infinite lists:

class Enumerator
  def self.iterate(initial_value, &iterate_fn)
    Enumerator.new do |yielder|
      yielder.yield initial_value
      current_value = initial_value
      loop do
        current_value = iterate_fn.call(current_value)
        yielder.yield current_value
      end
    end
  end
end

doubles = Enumerator.iterate(1) { |v| v * 2}
doubles.take(4) # => [1, 2, 4, 8]

Enumerator is a great starting place, but you still need a way to call the regular Enumerable methods you'd want (e.g. #map, #take). Adding these methods is easy as well, but we don't even need to do that since the fantastic facets library has already done this for us with the Enumerable#defer method and Denumerable class.
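
To give a feel for what "adding these methods" entails, here is a minimal sketch of a lazy #map built directly on Enumerator (hypothetical code; facets' Denumerable covers the full Enumerable API properly):

class Enumerator
  def lazy_map(&fn)
    source = self
    Enumerator.new do |yielder|
      # pull items from the source only as they are demanded downstream
      source.each { |item| yielder.yield fn.call(item) }
    end
  end
end

doubles = Enumerator.iterate(1) { |v| v * 2 }
doubles.lazy_map { |v| v + 1 }.take(4) # => [2, 3, 5, 9]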

Using facets' #defer, here is a lazy version of "word break" that follows the same strategy as the Clojure version:

require 'set'
require 'rubygems'
require 'facets/enumerator'
require 'facets/enumerable/defer'

# This monkey patch allows for destructuring of the generated/yielded items...
# TODO: open a pull request with Facets with this change...
class Denumerator
  def to_ary
    self.next.to_ary
  end
end

module WordBreak
  class << self
    def lazy_segmentize(string, dictionary)
      return string if dictionary.include?(string)
      (0...(string.size)).defer.  # defer starts the laziness
        map { |i| [string[0..i], i]}.
        select { |(prefix,i)| dictionary.include?(prefix) }.
        map{ |(prefix,i)|
          remaining_segment = lazy_segmentize(string[(i+1)..-1], dictionary)
          [prefix, remaining_segment].join(" ") if remaining_segment }.
        reject {|r| r.nil? }.
        first
    end
  end
end
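
It behaves the same as the eager version (a quick check, mirroring the earlier examples):

require 'set'
WordBreak.lazy_segmentize("applepie", Set.new(%w[apple pie])) # => "apple pie"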

As with the Clojure version, this lazy Ruby version builds on top of all the common functions/methods that you normally use on collections. The #times loop has been replaced with #map, the next keyword with #select, and the early return statement with #first. While most Rubyists would say the above is non-idiomatic, they would all understand each individual chained component since they use them every day.

The memoized version, no surprises:

    def lazy_segmentize_with_memoization(string, dictionary)
      segmentize = memoize(-> string {
        return string if dictionary.include?(string)
        (0...(string.size)).defer.
          map { |i| [string[0..i], i] }.
          select { |(prefix, i)| dictionary.include?(prefix) }.
          map { |(prefix, i)|
            remaining_segment = segmentize.call(string[(i+1)..-1])
            [prefix, remaining_segment].join(" ") if remaining_segment }.
          reject { |r| r.nil? }.
          first
      })
      segmentize.call(string)
    end

The cost of being lazy in Ruby

JRUBY UPDATE With the help of Jesse Cooke and serendipity we discovered an issue with JRuby in 1.9 mode. In 1.9 mode the lazy version performs much slower than in 1.8 mode. I plan on rerunning the benchmarks in 1.8. For now, please disregard all but the last (lazy_benchmark.rb) JRuby benchmarks. We let the JRuby people know but I haven't heard back whether or not this is a known issue.

ALTERNATE LIBRARY UPDATE Since publishing this post I've been directed to a new library called Enumerating. The author claims it is more efficient than Denumerable and appears to have some benchmarks to prove it. I may do a follow-up post with this library to compare times.

Without running any benchmarks it is obvious that the imperative version of this algorithm will be faster, as it is doing less work. For each lazy operation in the chain a new Denumerable is created along with its associated block that needs to be called when a new item is demanded. Additionally, while the Enumerable core methods are written in C (or Java for JRuby), the Denumerable counterpart methods are written in Ruby. The call stack for the lazy version also grows larger since each intermediate lazy operation is a call to a block. In short, being lazy in Ruby is not a performance optimization but rather a code modularity optimization. Depending on the use case a lazy version will have significant advantages over a more imperative version, while in other cases the requirements of the problem will demand the most efficient solution. The advantages of the lazy approach are generally compounded as you build up a set of highly reusable functions/methods. (Again, see John Hughes' paper [PDF].)
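
To make the layering concrete, here is a stripped-down sketch of the shape of the problem (hypothetical code, not facets' actual implementation):

# Each lazy operation wraps its source in another object, so demanding
# a single item costs one extra block invocation per layer in the chain.
class LazyLayer
  def initialize(source, &fn)
    @source, @fn = source, fn
  end

  def first
    @fn.call(@source.first) # one more block call on the way down
  end
end

chain = LazyLayer.new(LazyLayer.new([1, 2, 3]) { |i| i + 1 }) { |i| i * 2 }
chain.first # => 4, after crossing two layers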

I was still morbidly curious what the performance differences were, so I ran some benchmarks on random input:

require 'benchmark'
require 'rubygems'
require 'facets/random'

dictionary = File.open("/usr/share/dict/words") do |f|
  f.enum_for(:each_line).defer.map{|l| l.strip}.take(300).to_a # see the laziness?
end

def dictionary.random_words(num_words)
  Random.srand(1) # make it deterministic
  self.rand_subset(num_words, false).join("")
end


warmup = 2 # The additional warmups are for HotSpot's benefit
(warmup + 1).times do
  Benchmark.bmbm do |x|
    [10, 50, 100, 200].each do |num_words|
      input = dictionary.random_words(num_words)
      x.report("non-lazy w/memo; num_words = #{num_words}, n = #{input.size}") { WordBreak.segmentize_with_memoization(input, dictionary)}
      x.report("lazy w/memo; num_words = #{num_words}, n = #{input.size}") { WordBreak.lazy_segmentize_with_memoization(input, dictionary)}
    end
  end
end
$ rvm use ruby-1.9.2-p290 && ruby word_break.rb
...
                                              user     system total        real
non-lazy w/memo; num_words = 10, n = 87     0.010000   0.000000 0.010000 (0.014502)
lazy w/memo; num_words = 10, n = 87         0.020000   0.000000 0.020000 (0.023510)
non-lazy w/memo; num_words = 50, n = 428    0.350000   0.000000 0.350000 (0.350354)
lazy w/memo; num_words = 50, n = 428        0.510000   0.000000 0.510000 (0.509684)
non-lazy w/memo; num_words = 100, n = 846   1.400000   0.010000 1.410000 (1.409881)
lazy w/memo; num_words = 100, n = 846       1.730000   0.010000 1.740000 (1.738787)
non-lazy w/memo; num_words = 200, n = 1657  5.380000   0.050000 5.430000 (5.427849)
lazy w/memo; num_words = 200, n = 1657      5.770000   0.040000 5.810000 (5.807468)

The same with JRuby (granted, not ideal JVM benchmarking):

$ rvm use jruby-1.6.3 && ruby --server --1.9 word_break.rb
...
                                                 user     system total        real
non-lazy w/memo; num_words = 10, n = 93      0.005000   0.000000 0.005000 (  0.005000)
lazy w/memo; num_words = 10, n = 93          0.041000   0.000000 0.041000 (  0.042000)
non-lazy w/memo; num_words = 50, n = 395     0.120000   0.000000 0.120000 (  0.120000)
lazy w/memo; num_words = 50, n = 395         0.588000   0.000000 0.588000 (  0.588000)
non-lazy w/memo; num_words = 100, n = 785    0.492000   0.000000 0.492000 (  0.492000)
lazy w/memo; num_words = 100, n = 785        2.069000   0.000000 2.069000 (  2.069000)
non-lazy w/memo; num_words = 200, n = 1605   1.976000   0.000000 1.976000 (  1.976000)
lazy w/memo; num_words = 200, n = 1605       8.715000   0.000000 8.715000 (  8.715000)

Since this input was random I don't want to read too much into these numbers. Based on how it was created, though, my guess is that not a lot of backtracking was needed (if any), so these are close to ideal inputs for the algorithm. It is interesting to note that JRuby is quite a bit faster than MRI in the non-lazy version but much slower in the lazy version. Since the lazy version requires a lot of additional block invocations, my guess is that block invocation is the slow spot. I didn't profile to verify, but I'd be curious if someone does or if a JRuby expert could shed some light on the discrepancy. (See the update above.)

Next I created some benchmarks with worst-case inputs, meaning the entire tree would need to be searched (w/memoization however):

require 'benchmark'

def worst_case(n)
  ["a" * n + "b",
   Enumerator.iterate("a"){|i| i + "a"}.take(n-1).to_a] # see what I did here? ;)
end

warmup = 2 # The additional warmups are for HotSpot's benefit
(warmup + 1).times do
  Benchmark.bmbm do |x|
    [10, 50, 100, 200, 300, 400].each do |n|
      input, dictionary = worst_case(n)
      x.report("non-lazy w/memo; n = #{n}") { WordBreak.segmentize_with_memoization(input, dictionary)}
      x.report("lazy w/memo; n = #{n}") { WordBreak.lazy_segmentize_with_memoization(input, dictionary)}
    end
  end
end

The results:

$ rvm use ruby-1.9.2-p290 && ruby word_break.rb
                            user     system      total        real
non-lazy w/memo; n = 10   0.000000   0.000000   0.000000 (0.000345)
lazy w/memo; n = 10       0.000000   0.000000   0.000000 (0.000827)
non-lazy w/memo; n = 50   0.010000   0.000000   0.010000 (0.006957)
lazy w/memo; n = 50       0.020000   0.000000   0.020000 (0.025863)
non-lazy w/memo; n = 100  0.040000   0.000000   0.040000 (0.041895)
lazy w/memo; n = 100      0.100000   0.010000   0.110000 (0.094248)
non-lazy w/memo; n = 200  0.200000   0.000000   0.200000 (0.204303)
lazy w/memo; n = 200      0.260000   0.000000   0.260000 (0.266870)

The difference between the two seems to shrink as n increases, which I believe is a result of the memoization being used more. Unfortunately, with n > 219 the call stack blew up on the lazy version. JRuby numbers (I was able to take n higher with JRuby):

$ rvm use jruby-1.6.3 && jruby --server --1.9 word_break.rb
...
                               user     system      total real
non-lazy w/memo; n = 10    0.001000   0.000000   0.001000 ( 0.001000)
lazy w/memo; n = 10        0.014000   0.000000   0.014000 ( 0.014000)
non-lazy w/memo; n = 50    0.005000   0.000000   0.005000 ( 0.005000)
lazy w/memo; n = 50        0.109000   0.000000   0.109000 ( 0.109000)
non-lazy w/memo; n = 100   0.018000   0.000000   0.018000 ( 0.019000)
lazy w/memo; n = 100       0.326000   0.000000   0.326000 ( 0.326000)
non-lazy w/memo; n = 200   0.113000   0.000000   0.113000 ( 0.113000)
lazy w/memo; n = 200       1.350000   0.000000   1.350000 ( 1.349000)
non-lazy w/memo; n = 300   0.245000   0.000000   0.245000 ( 0.245000)
lazy w/memo; n = 300       4.459000   0.000000   4.459000 ( 4.459000)
non-lazy w/memo; n = 400   0.524000   0.000000   0.524000 ( 0.524000)
lazy w/memo; n = 400       9.576000   0.000000   9.576000 ( 9.576000)

With the worst-case inputs JRuby does worse in both versions and dramatically worse with the lazy version. My best explanation is that JRuby doesn't perform as well with recursion and many block invocations. Again, I didn't take the time to dig deeper and would welcome a more informed explanation. (Again, see the update above.) (For the curious, the Clojure version was able to perform n=200 in ~0.033 secs and n=400 in ~0.14 secs.)

The benchmarks above are interesting but they are trying to test too many things. I was curious what the cost of the lazy operations was in comparison to their Enumerable counterparts. The world doesn't have enough flawed microbenchmarks, so I created this one:

require 'benchmark'
require 'rubygems'

require 'facets/enumerator'
require 'facets/enumerable/defer'

warmup = 2 # The additional warmups are for HotSpot's benefit
(warmup + 1).times do
  Benchmark.bmbm do |x|
    x.report("non-lazy one op")  { (1..100000).map{|i| i + 1} }
    x.report("lazy one op") { (1..100000).defer.map{|i| i + 1}.to_a }
    x.report("non-lazy two ops")  { (1..100000).map{|i| i + 1}.map{|i| i + 1} }
    x.report("lazy two ops") { (1..100000).defer.map{|i| i + 1}.map{|i| i + 1}.to_a }
    x.report("non-lazy three ops")  { (1..100000).map{|i| i + 1}.map{|i| i + 1}.map{|i| i + 1} }
    x.report("lazy three ops") { (1..100000).defer.map{|i| i + 1}.map{|i| i + 1}.map{|i| i + 1}.to_a }
  end
end

Again, keep in mind that Enumerable's methods are in C/Java while Denumerable's are in Ruby. I ran the benchmark in 1.8.7, 1.9.2, and JRuby:

$ rvm use ruby-1.8.7-p248 && ruby lazy_benchmark.rb
                      user     system      total        real
non-lazy one op     0.030000   0.010000   0.040000 (  0.030534)
lazy one op         0.790000   0.000000   0.790000 (  0.793148)
non-lazy two ops    0.090000   0.000000   0.090000 (  0.089110)
lazy two ops        1.080000   0.000000   1.080000 (  1.087114)
non-lazy three ops  0.110000   0.010000   0.120000 (  0.109865)
lazy three ops      1.440000   0.000000   1.440000 (  1.441663)


$ rvm use ruby-1.9.2-p290 && ruby lazy_benchmark.rb
                      user     system      total        real
non-lazy one op     0.010000   0.000000   0.010000 (  0.011690)
lazy one op         0.100000   0.010000   0.110000 (  0.103982)
non-lazy two ops    0.020000   0.000000   0.020000 (  0.021354)
lazy two ops        0.160000   0.000000   0.160000 (  0.159124)
non-lazy three ops  0.030000   0.000000   0.030000 (  0.033827)
lazy three ops      0.240000   0.000000   0.240000 (  0.240330)

Looking at these comparisons of 1.8.7 and 1.9.2 reminded me of this tweet from a coworker:

"ruby 1.9.2 would be dead last in the performance tests if it weren't for ruby 1.8.7! ruby... making ruby look good"
— @bmidgley, via Twitter (http://twitter.com/bmidgley/status/101702746717175808)

Here are the JRuby results:

$ rvm use jruby-1.6.3 && jruby --1.8 --server lazy_benchmark.rb
                         user     system      total        real
non-lazy one op      0.010000   0.000000   0.010000 (  0.010000)
lazy one op          0.093000   0.000000   0.093000 (  0.093000)
non-lazy two ops     0.016000   0.000000   0.016000 (  0.017000)
lazy two ops         0.134000   0.000000   0.134000 (  0.134000)
non-lazy three ops   0.024000   0.000000   0.024000 (  0.024000)
lazy three ops       0.196000   0.000000   0.196000 (  0.196000)

Looks like JRuby barely wins this microbenchmark. I'll leave the Rubinius benchmark as an exercise for the reader since RVM failed to install rbx-1.2.4 for me. The differences between the lazy and non-lazy versions in MRI 1.9.2 and JRuby 1.6.3 at least seem to be uniform in this case. Given that, I feel these numbers are a better indication of the overhead incurred when using Denumerable in Ruby. Again, the trade-off is for modularity, and each use case is different. The performance hit may be worth it at times while other times it may not even be an option.

I'd love to hear about lazy-list Ruby libraries with better performance so if you have some benchmarks please send them my way. UPDATE: Someone did :)

