Tuesday, January 24, 2012

IRC message Regex

So I'm in the process of making an IRC bot, and one of the problems I always seem to have is parsing the message. If you don't know what parsing means, the simplest explanation is breaking the message down into some form of object that organizes its data.

Normally I would split the message, check the various pieces, and try to organize them from there. However, while searching recently I found an easier way. Keep in mind that this is client-side parsing. I found the original from Caleb Delnay, so original credit goes there, but I wanted to expand on it and adapt it to the subtle differences in regex syntax across other languages.

*** Further improvements were added by someone else, whom I found through referrals in my stats thanks to a lovely credit to this post. Check out more info at mybuddymichael.com. I'll try to integrate the changes into my post once I can take some time to give them a good look over and try them out a bit (hopefully they will help with a new bot I'm currently working on in Haskell).

The original .NET-compatible regex is:
^(:(?<prefix>\S+) )?(?<command>\S+)( (?!:)(?<params>.+?))?( :(?<trail>.+))?$

Python (place in an r"" string so you won't need to escape backslashes), Perl, PHP, and AS3:
^(:(?P<prefix>\S+) )?(?P<command>\S+)( (?!:)(?P<params>.+?))?( :(?P<trail>.+))?$

Java (versions before 7 don't support named groups; use groups 2, 3, 5, and 7):
^(:(\\S+) )?(\\S+)( (?!:)(.+?))?( :(.+))?$

Java (7 and up supports named groups; I have not tried this yet):
^(:(?<prefix>\\S+) )?(?<command>\\S+)( (?!:)(?<params>.+?))?( :(?<trail>.+))?$

JavaScript (no named groups; use groups 2, 3, 5, and 7; this is a regex literal, so it does not need to be in a string):
/^(:(\S+) )?(\S+)( (?!:)(.+?))?( :(.+))?$/

The basic premise is the assumption that messages are formatted along the lines of :<prefix> <command> <params> :<trailing>, where every part except the command is optional. If you know a better way to do any of them, or know how to do it in languages I left out, let me know. As for the regex methods and how to work with the result, that is up to you; I am just supplying the pattern, and it is up to you to use it correctly.

Since regex can be complicated, hopefully this saves everyone some time. Figuring out the methods needed to use it shouldn't be too hard.
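For anyone who wants a starting point, here is a minimal sketch of using the Python variant of the pattern with the standard `re` module. The function name `parse_irc_line`, the choice to split `params` into a list, and the sample message are my own assumptions for illustration, not part of the original regex.

```python
import re

# The Python variant of the pattern from this post, kept in a raw string.
IRC_LINE = re.compile(
    r"^(:(?P<prefix>\S+) )?(?P<command>\S+)( (?!:)(?P<params>.+?))?( :(?P<trail>.+))?$"
)


def parse_irc_line(line):
    """Return a dict of the four named groups, or None if the line does not match.

    Hypothetical helper: the name and return shape are illustrative only.
    """
    # Drop the CR LF that terminates every IRC message before matching.
    match = IRC_LINE.match(line.rstrip("\r\n"))
    if match is None:
        return None
    parts = match.groupdict()
    # params comes back as one string (or None); split it into a list.
    parts["params"] = parts["params"].split() if parts["params"] else []
    return parts


msg = parse_irc_line(":nick!user@host PRIVMSG #channel :hello there\r\n")
print(msg["prefix"], msg["command"], msg["params"], msg["trail"])
```

Groups that are absent from a message (for example the prefix in a bare `PING :server`) come back as `None`, so check for that before using them.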



***Edit:
After a comment about the RFC, I played around with trying to make the regex follow that specification and came up with a partially working version. Given the complexity, my lack of knowledge, and the lack of benefit, I will only post the one edit I made and hopefully not bother with this again. While regex is good for something quick and dirty, string methods seem to be more practical.

^(:(?P<prefix>\S+) )?(?P<command>\S+)( (?!:)(?P<params>\S{14} (:)?|.+ :?))?((?P<trail>.+))?$

The params section will end up with either a trailing space or a space and colon. That's the best I could do, and the last I'll do of this.
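For comparison, this is roughly what the string-method route could look like in Python. It is only a sketch under my reading of RFC 2812 (at most 14 middle parameters, after which the colon before the trailing parameter is optional); the name `split_irc_line` and the constant `MAX_MIDDLE` are made up for this example. Unlike the regex, it returns the trailing text as the last item of the params list.

```python
MAX_MIDDLE = 14  # assumed from RFC 2812: at most 14 "middle" params


def split_irc_line(line):
    """Split one raw IRC line into (prefix, command, params) using str methods.

    Hypothetical helper for illustration; prefix is None when absent.
    """
    line = line.rstrip("\r\n")
    prefix = None
    if line.startswith(":"):
        # Everything between the leading colon and the first space is the prefix.
        prefix, _, line = line[1:].partition(" ")
    command, _, rest = line.lstrip(" ").partition(" ")
    params = []
    rest = rest.lstrip(" ")
    while rest:
        if rest.startswith(":"):
            # A leading colon marks the trailing param, which may contain spaces.
            params.append(rest[1:])
            break
        if len(params) == MAX_MIDDLE:
            # After 14 middle params the colon is optional; take the rest as-is.
            params.append(rest)
            break
        middle, _, rest = rest.partition(" ")
        params.append(middle)
        rest = rest.lstrip(" ")
    return prefix, command, params


print(split_irc_line(":nick!user@host PRIVMSG #channel :hello there"))
```

The loop peels off one space-separated parameter at a time, which is why the 14-parameter case is easy to handle here and awkward to express in a single pattern.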

5 comments:

  1. Writing a framework in Node.js, I have a working function, but I'll switch to the regex if it is faster.

    Replies
    1. Regex speed depends on the specific regex engine, so verifying which is faster would require either knowing the backend or doing some benchmarking. As a general rule (at least according to most people I talk with), regex has a lot of overhead and should be avoided where possible. I use it because it's easy to implement and takes a lot less thinking on my part.

      Also consider that string methods are not free either: they are generally faster than regex, but parsing a message with them can accumulate overhead from the many separate function calls involved.

      So, to make a choice, I'd say benchmark it. Also if you could, maybe send me back the results so anyone else who sees this can know. Sorry there is no immediate clear answer, but there's a lot to consider and I don't have enough background knowledge to feel safe in giving you a definite yes or no.

  2. I did something similar some years ago. Following from the RFC grammar (snippet below) I created the following expression:

    #<message> ::= [':' <prefix> <SPACE> ] <command> <params> <crlf>
    #<prefix> ::= <servername> | <nick> [ '!' <user> ] [ '@' <host> ]
    #<command> ::= <letter> { <letter> } | <number> <number> <number>
    #<SPACE> ::= ' ' { ' ' }
    #<params> ::= <SPACE> [ ':' <trailing> | <middle> <params> ]
    #
    #<middle> ::= <Any *non-empty* sequence of octets not including SPACE
    # or NUL or CR or LF, the first of which may not be ':'>
    #<trailing> ::= <Any, possibly *empty*, sequence of octets not including
    # NUL or CR or LF>
    #
    #<crlf> ::= CR LF

    # Compile the parsing expression (raw strings keep the regex backslashes intact)
    import re

    expr_space = r"\s+"
    expr_prefix = r"(?P<prefix>[^ ]+)"
    expr_command = r"(?P<command>([A-Z]+|[0-9]+))"
    expr_middle = r"(?P<middle>(?!( :)|:).+?(?=(?: :)|(?:\s*$)))"
    expr_trailing = r"(?P<trailing>.*)"
    expr_params = "((" + expr_space + ":" + expr_trailing + ")|(" + expr_space + expr_middle + "))+"
    expr_message = "^(:" + expr_prefix + expr_space + ")?" + expr_command + expr_params + "$"
    # Stored on the parser object; this line runs inside one of its methods
    self.ircpattern = re.compile(expr_message)

    It's not actually 100% the IRC grammar but I have never found an IRCd that used the bit of the grammar that breaks it.

    Strictly speaking, from a purely mathematical point of view, when you have a formal grammar such as this, there is no regular expression that can correctly parse it. You're better off building a pushdown automaton, using a stack and parsing the grammar as outlined in the RFC.

  3. It's not entirely correct, however.

    According to the grammar specified in RFC 2812 for parameters, if the number of parameters is exactly 14, the colon before the trail is optional.

    Replies
    1. Thank you for the info. I will have to alter it sometime to take account of that.
