Cagdas Basaraner is a software engineer graduated from Hacettepe University Computer Engineering department and Information Systems master program (Turkey). He has 5 years of professional experience, and is working on information systems with JEE web technologies. He is also a former developer of information systems with Microsoft .NET technologies and Command & Control (C4I) systems with Java technologies. Cagdas is a DZone MVB and is not an employee of DZone and has posted 23 posts at DZone. You can read more from them at their website. View Full User Profile

A Regular Expression HashMap Implementation in Java

04.10.2012
| 9264 views |
  • submit to reddit

Below is an implementation of a Regular Expression HashMap. It works with key-value pairs which the key is a regular expression. It compiles the key (regular expression) while adding (i.e. putting), so there is no compile time while getting. Once getting an element, you don't give regular expression; you give any possible value of a regular expression. 

As a result, this behaviour provides to map numerous values of a regular expression into the same value. The class does not depend to any external libraries, uses only default java.util. So, it will be used simply when a behaviour like that is required.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.regex.Pattern;
/**
 * This class is an extended version of Java HashMap
 * and includes pattern-value lists which are used to
 * evaluate regular expression values. If given item
 * is a regular expression, it is saved in regexp lists.
 * If requested item matches with a regular expression,
 * its value is get from regexp lists.
 *
 * @author cb
 *
 * @param <K> : Key of the map item.
 * @param <V> : Value of the map item.
 */
public class RegExHashMap<K, V> extends HashMap<K, V> {
    // list of regular expression patterns
    private ArrayList<Pattern> regExPatterns = new ArrayList<Pattern>();
    // list of regular expression values which match patterns
    private ArrayList<V> regExValues = new ArrayList<V>();
    /**
     * Compile regular expression and add it to the regexp list as key.
     */
    @Override
    public V put(K key, V value) {
       
        regExPatterns.add(Pattern.compile(key.toString()));
        regExValues.add(value);
        return value;
    }
    /**
     * If requested value matches with a regular expression,
     * returns it from regexp lists.
     */
    @Override
    public V get(Object key) {
        CharSequence cs = new String(key.toString());
       
        for (int i = 0; i < regExPatterns.size(); i++) {
            if (regExPatterns.get(i).matcher(cs).matches()) {
               
                return regExValues.get(i);
            }
        }
        return super.get(key);
    }
}

 

 
Published at DZone with permission of Cagdas Basaraner, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)

Comments

Kaan yy replied on Wed, 2012/04/11 - 1:16am

Since your are extending hashmap; list of regExPatterns  and regExValues  are not needed. why don't you call superior put with the compiled key and value. this would also simplify the matching get.

in fact this is a must. currently you are discarding map interface with this implemantation. what if i call remove method...

Oliver Plohmann replied on Wed, 2012/04/11 - 2:30am

Hi Cagdas,

 nice idea! What do you exactly mean with compiling a pattern? The get method iterates over all patterns till a matching one is found. So the get is not really logarithmically in time. Also made me think how to improve this, but this seems not easy ...

Lance Semmens replied on Wed, 2012/04/11 - 2:38am

This is a terrible implementation of a map.

1. You call it RegExHashMap but there is no hashing involved. Using the word hash should mean it has performance of O(1) but your's has performance of O(N).

2. Your put method does not call super.put() so all of the other methods on map (remove, size, putAll, values, entrySet etc) don't work. You would be better to implement java.util.Map and throw UnsupportedOperationException() for all other methods than what you are doing here.

3. Why not RegExHashMap<V> extends HashMap<CharSequence, V>?

4. new String(key.toString()). Please explain what new String achieves (string is immutable)

5. Your two lists are not needed if extending HashMap

6. Your implementation allows multiple entries with the same key

7. Your implementation throws NPE for null keys

I could go on and on... 

Lance Semmens replied on Wed, 2012/04/11 - 2:59am in response to: Oliver Plohmann

He means that Pattern.compile(key) is called once per key and that the resulting Pattern's are re-used.

Fabrizio Giudici replied on Wed, 2012/04/11 - 5:23am

Given that we're commenting regex maps, I've been using another home-made since a few time. It works for my specific projects that needs it, but I never validated it against the Map contract. I'd like to get some feedback just while you're still hot about the topic ;-). It is not optimized with pattern compilation, but it does super.put() and has a shortcut for gets whose arg is not a regexp (which in my use cases happens 80% of the time). My impression is that designing an implementation that is optimized for a lot of generic cases *and* it complies with Map contract is a hard task. 

 

public class RegexTreeMap<Type> extends TreeMap<String, Type>
  {
    @Nonnull
    public static String escape (final @Nonnull String string)
      {
        final StringBuilder builder = new StringBuilder();

        for (int i = 0; i < string.length(); i++)
          {
            final char c = string.charAt(i);

            if ("[\\^$.|?*+()".contains("" + c))
              {
                builder.append('\\');
              }

            builder.append(c);
          }

        return builder.toString();
      }

    @Override
    public Type put (final @Nonnull String string, final Type value)
      {
        return super.put(Pattern.quote(string), value);
      }

    public Type putRegex (final @Nonnull String regex, final Type value)
      {
        return super.put(regex, value);
      }

    @Override
    public Type get (final @Nonnull Object value)
      {
        final String stringValue = (String)value;
        Type result = super.get(Pattern.quote(stringValue)); // first try a direct match that is fast
        int matchLength = 0;

        if (result == null) // otherwise returns the longest match
          {
            for (final Entry<String, Type> entry : super.entrySet())
              {
                final String regex = entry.getKey();

                if (stringValue.matches(regex) && (regex.length() > matchLength))
                  {
                    result = entry.getValue();
                    matchLength = regex.length();
                  }
              }
          }

        return result;
      }
  }

 

Liam Knox replied on Wed, 2012/04/11 - 8:43am

1. Why do this?

2. You shoud delegate and never inhertit ( not flexible )

Also what are you trying to do?

Map is a great 'private' classs.  As a public return is a whore. Never expose it 

Andrew Gilmartin replied on Wed, 2012/04/11 - 3:34pm

Lance Semmens is right that this is a terrible implementation. The O(N) performance is the worst aspect. If you want to do this right you need to build a pattern that is the merger of ALL the keys so that one check of the merged pattern will tell you if your map contains the key and where is its location (this bit gets tricky as there might be multiple matches). Implementing is not trival but it is approachable and especially so when you can build off of an existing general purpose pattern matching framework. Unfortuantly, I do not know of one.

Lance Semmens replied on Thu, 2012/04/12 - 4:41am in response to: Fabrizio Giudici

Fabrizio Giudici: Your escape method is not needed. Java regexes treat anything wrapped in \Q and \E as a literal.

Eg "\\Q+-:(?*\\E".matches("+-:(?*") == true

 

 

Lance Semmens replied on Thu, 2012/04/12 - 5:06am in response to: Andrew Gilmartin

Andrew Gilmartin: I think that combining all of the regexes into a single regex (or'red together) will just end up with the same O(N) performance but in a single regex rather than looping muiltiple regexes.

It's pretty hard to get rid of the O(N) performance for this very general RegexMap. To improve performance, I think you would need to think about your incoming data and then try to group your regexes in buckets. You could then use a hash lookup for the bucket.

For example, you might use the first three characters to lookup the bucket.

Another example is for a URL, you might choose to split the input URL's on "/" and have a tree hierarchy of regexes that only apply to a single level at a time (instead of multiple regexes that work on the entire URL). I'm sure that SpringMVC has a solution to this, I wonder how they have done it?

Reto Bachmann-Gmür replied on Mon, 2012/12/17 - 5:12pm

More as a proof of concept than as a working alternative to the proposed code I've implemented a RegexMap where all key-regexes are converted to a single deterministic finite automaton and the various accept-states are mapped to the map-values. This seems to be what Andrew Gilmartin is proposing. Ideally this should allow an access performance proportional to the length of the string to be matched and independent of the number of keys in the map.

The basic idea is that the first regex is converted to a DFA and with every new key that is added the DFA is modified to accept the strings that are matched by the new regex as well. The exit states are mapped to the values.

The code is here: https://github.com/retobg/regexmap. It currently only supports a very limited subset of the regex syntax and it certainly has bugs.

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.