Register - Login
Views: 99803124
Main - Memberlist - Active users - Calendar - Wiki - IRC Chat - Online users
Ranks - Rules/FAQ - Stats - Latest Posts - Color Chart - Smilies
05-03-22 07:17:19 AM
Jul - Computers and Technology - Fucking regexes, how do they work?! New poll - New thread - New reply
Next newer thread | Next older thread
Cirvante
1340
Feel the wrath of eternal damnation please! I would appreciate that very much thank you!
Level: 74


Posts: 1233/1342
EXP: 3616426
For next: 37118

Since: 07-10-07


Since last post: 8.4 years
Last activity: 5 days

Posted on 03-07-11 03:46:38 AM (last edited by Cirvante at 03-07-11 12:56 AM) Link | Quote
So we've been asked to code up a simple lexical analyzer using Java's Scanner class and regular expressions (no JFlex or any other tools allowed) that recognizes the language composed of:

• The single-character tokens: + - * / = ; , ( ) > <
• The multi-character operators: >= <= -->
• The keywords: if def else fi while
• Identifiers (a letter followed by 0 or more letters and digits).
• Integers (one or more digits).

The following characters are ignored:

• All whitespace (blanks, tabs, newlines, and a few others). (\\s ftw)
• All comments, which begin with ‘#’ and proceed to the next end of line. (Pattern.COMMENTS takes care of this nicely)

After a while of kludging stuff up together (nevermind the fact that I've never worked with regexes before in my life, I came up with this:


import java.util.Scanner;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LexicalMain {

/**
* @param args
*/
public static void main(String[] args) {

Scanner lexicalInput = new Scanner (System.in)
Pattern grammar = Pattern.compile("[a-zA-Z][a-zA-Z0-9]*?|[0-9]*?|[-\+\*/=;,\(\)><]|(\s*?)", Pattern.COMMENTS)
//the back-slashes are actually double-escaped, despite the board eating one of them :/

Matcher m = grammar.matcher("23")
System.out.println(m.matches()) //should print true
m = grammar.matcher("2+3")
System.out.println(m.matches()) //should print true
m = grammar.matcher("89mitsu")
System.out.println(m.matches()) //should print true

while (lexicalInput.hasNext(grammar)) {
String input = lexicalInput.next(grammar)
System.out.println("Current token is: " + input)

//incomplete for now
if (input.matches("[a-zA-Z][a-zA-Z0-9]*"))
System.out.println("IDENTIFIER")
else if (input.matches("[0-9]*"))
System.out.println("INT_LIT")
else if (input.matches("[-\+\*/=;,\(\)><]"))
System.out.println("SC_OPERAND")
else System.out.println("\nError: Token is not in grammar")
}

System.out.println("\nError: Token is not in grammar")

}

}




Output:

true
false
false



The problem here lies in that Scanner tokenizes everything separated by whitespace/newline, and only applies the pattern to the token itself - which is why entering 2 + 3 correctly returns INT_LIT IDENTIFIER INT_LIT while entering 2+3 simply terminates. Same thing happens when entering identifiers: mitsu89 returns IDENTIFIER, but 89mitsu terminates instead of reporting INT_LIT IDENTIFIER like it's supposed to. I tried grabbing the entire line first as a string and then splitting it according to the regex, but it ends up splitting the string into its individual characters instead of logical groupings of identifiers, literals and operands, and still dies horribly in the above cases. My question is, how do I make it so that when reading, say,


mitsu89if(2+3>=8;))lol76bh;a-->8f6


it gets tokenized like this:


mitsu89 if ( 2 + 3 >= 8 ; ) ) lol76 bh ; a --> 8 f6


which should eventually return something akin to


IDENTIFIER KEYWORD_IF LEFT_PAREN INT_LIT PLUS_SIGN INT_LIT GREATER_EQUAL INT_LIT SEMICOLON RIGHT_PAREN RIGHT_PAREN IDENTIFIER IDENTIFIER SEMICOLON IDENTIFIER ASSIGN INT_LIT IDENTIFIER


?

____________________


board2 - Jul - OC ReMix
Skreeny
Member
I have a custom title.
Level: 54


Posts: 583/636
EXP: 1172541
For next: 61329

Since: 09-15-07


Since last post: 9.3 years
Last activity: 1.2 years

Posted on 03-07-11 04:52:28 AM Link | Quote
Can't really help with the other problems, but I see a couple issues with the regex.


[a-zA-Z][a-zA-Z0-9]*?
*? will attempt to pick up the minimal number of matches, which, given that it's at the end of the expression, will be 0 (run it on "mitsu89", it'll give you "m"); should probably just be *. Possibly add a "if|def|fi|" at the start, just to get those conditions out of the way, although it'll prevent you from having an identifier with those terms in it.


[0-9]*?
(\s*?)
Ignoring the ?, the * should probably be a + ("one or more matches"), unless you want it to match the null string.

I'm not exactly an expert on this stuff, so I may be a bit off.
Lyskar
12210
-The Chaos within trumps the Chaos without-
Level: 192


Posts: 8506/12211
EXP: 99321214
For next: 552357

Since: 07-03-07

From: 52-2-88-7

Since last post: 7.4 years
Last activity: 7.3 years

Posted on 03-07-11 05:29:42 AM Link | Quote
Stats
Time/Date
03-06-11 11:29:42 PM
Posts
8506
Days Here
1342
Level
135
Metal_Man88's Post
It is vaguely ironic you post about this as I am using Regexes to program a vaguely Watson-esque program for my independent study, and also that I have had a year or two of experience with them.

You'll want a text editor that allows search and replace with Regex, like Madedit. Then you can test all your regexes before you put them in the program.

Second, regexes are maddening and will melt your brain. IF this happens, it is a good sign, because they are horribly confusing at times.

Next, with all of those different patterns, I'd suggest you split it up into patterns made for dividing and conquering the whole. Now if I sat here for 30 minutes to an hour, I could just generate the regex myself, but that'd be cheating.

Skreeny is more or less correct. I wish I wasn't busy myself with a project of regex nature, otherwise I could give more help.

____________________

Eisnaught - SSQ² - Mobius Roleplay - SSS
Lyskar
12210
-The Chaos within trumps the Chaos without-
Level: 192


Posts: 8507/12211
EXP: 99321214
For next: 552357

Since: 07-03-07

From: 52-2-88-7

Since last post: 7.4 years
Last activity: 7.3 years

Posted on 03-07-11 05:46:23 AM Link | Quote
Stats
Time/Date
03-06-11 11:46:23 PM
Posts
8507
Days Here
1342
Level
135
Metal_Man88's Post
Also, read this, it's where most of my knowledge came from. Yes, it's applicable to java, C++, PHP, etc, within reason.

____________________

Eisnaught - SSQ² - Mobius Roleplay - SSS
Lyskar
12210
-The Chaos within trumps the Chaos without-
Level: 192


Posts: 8508/12211
EXP: 99321214
For next: 552357

Since: 07-03-07

From: 52-2-88-7

Since last post: 7.4 years
Last activity: 7.3 years

Posted on 03-07-11 06:23:15 AM Link | Quote
Stats
Time/Date
03-07-11 12:23:15 AM
Posts
8508
Days Here
1342
Level
135
Metal_Man88's Post
I looked at it deeper after I finished my work for the night. I don't know how scanner behaves, but I can definitely get the string you mentioned to morph into the one below that you want.

The way I'd work it is to make a regex for each case of the symbols you mentioned above (which is what I did), then manually run it over a test string and tweak them so they don't conflict (most importantly--the single symbol ones with > and < and - and = conflict with the double symbol ones, so you have to make exceptions/special regex things so they don't put too many spaces) and then use that to create the spaced out version, THEN tokenize it, THEN use the Scanner to assign INT_LIT and such.

The Scanner is just kind of getting in your way, is the way I see it, when you need to break it up yourself first. The Scanner will just come across the first space you make and explode rather than process the whole thing together. I may be looking at this wrong, as I haven't used Scanner myself (although if I reaaally wanted to dig into this, I could try compiling your program and trying to fix it myself) but I suggest focusing on breaking it up with Regex, then giving it to Scanner so it can use the nice and easy pattern to figure out the capital letter thingies.

____________________

Eisnaught - SSQ² - Mobius Roleplay - SSS
Next newer thread | Next older thread
Jul - Computers and Technology - Fucking regexes, how do they work?! New poll - New thread - New reply


Rusted Logic

Acmlmboard - commit 47be4dc [2021-08-23]
©2000-2022 Acmlm, Xkeeper, Kaito Sinclaire, et al.

28 database queries.
Query execution time: 0.085484 seconds
Script execution time: 0.022415 seconds
Total render time: 0.107899 seconds