Lex and Yacc Tutorial
* Lex
* Regular Expressions in Lex
* Lex Program Structure
* Section I: Global C Definitions
* Section II: Lex Rules for Matching Patterns
* Section III: C Code
* Advanced Lex
* Lex Variables
* Lex Functions
* Yacc
* References
Lex stands for Lexical Analyzer. Yacc stands for Yet Another Compiler Compiler. In this tutorial, Lex is discussed first, followed by Yacc.
Lex matches regular expressions and returns tokens. Regular expressions are written in a particular syntax (discussed in the next section), and a matched regular expression yields a token. When input is provided to Lex, either as a file or as text typed at the terminal, it attempts to match the text against the regular expressions defined. Input is consumed one character (or symbol) at a time, and reading continues until a pattern is matched. If Lex can match a pattern, it returns the associated token. If, on the other hand, no regular expression can be matched, further processing stops and an error message is displayed.
Lex is tightly coupled to C. A .lex file is passed through the lex utility, which produces output file(s) in C. These files are compiled to produce an executable version of the lexical analyzer. (Lex files have the extension .lex.)
A regular expression is a pattern description written in a metalanguage – a language used to describe patterns of interest. An expression is made up of symbols. There are normal symbols, such as characters and numbers, and in addition there are symbols that have special meaning in Lex. The various symbols and their meanings are given in the following table.
Character | Meaning
A-Z, 0-9, a-z | Characters and numbers that form part of the pattern.
. | Matches any character except \n.
- | Used to denote a range. Ex: A-Z implies all characters from A to Z.
[ ] | A character class. Matches any character in the brackets. If the first character is ^ it indicates a negation pattern. Ex: [abC] matches a, b, or C.
* | Matches zero or more occurrences of the preceding pattern.
+ | Matches one or more occurrences of the preceding pattern.
? | Matches zero or one occurrence of the preceding pattern.
$ | Matches the end of a line as the last character of the pattern.
{ } | Indicates how many times a pattern may occur. Ex: A{1,3} implies one to three occurrences of 'A'.
\ | Used to escape metacharacters, i.e. to remove the special meaning of the characters defined in this table.
^ | Matches the beginning of a line as the first character of the pattern. Also used for negation inside a character class.
| (pipe) | Logical OR between expressions.
"<some symbols>" | Characters within the quotes are taken literally; metacharacters lose their special meaning.
/ | Look ahead. Matches the preceding pattern only if followed by the succeeding expression. Ex: A0/1 matches A0 only if the input is A01.
( ) | Groups a series of regular expressions.
Some examples of regular expressions and their meanings are given in the following table.
Regular Expression | Meaning
joke[rs] | Matches either jokes or joker.
A{1,2}shis+ | Matches Ashis, AAshis, Ashiss, AAshiss, and so on.
(A[b-e])+ | Matches one or more occurrences of A followed by any single character from b to e.
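To see how such an expression is used inside a Lex specification, here is a minimal sketch that pairs one of the patterns above with a C action. The printed message is arbitrary, and the fragment assumes it is processed with lex and linked against the Lex library, which supplies a default main() and yywrap().
%%
joke[rs]    { printf("matched joker or jokes\n"); }
.|\n        { /* skip everything else */ }
%%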
Tokens in Lex are declared like variable names in C. Every token has an associated expression; examples of tokens and their expressions are given below. Let us try to build a word counting program – the subsequent examples all work toward solving this problem. This example shows how the tokens for it are declared.
Token | Associated Expression | Meaning
number | ([0-9])+ | One or more occurrences of a digit
chars | [A-Za-z] | Any single letter
blank | " " | A blank space
word | (chars)+ | One or more occurrences of chars
variable | (chars)+(number)*(chars)*(number)* | One or more chars, optionally followed by numbers and further chars and numbers (an identifier-like pattern)
A Lex program is divided into three sections. The first section has global C definitions. The second section has the patterns, together with the C code to be executed when they are encountered. The last section has supplemental C functions; main(), for example, would typically be found in this section. These sections are delimited by %%. Now let us build the word counting Lex program and see how the various sections are composed.
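Before filling in the word counting details, the following sketch shows how the three sections fit together; the comments are descriptive placeholders rather than a compilable specification.
%{
    /* Section I: global C declarations (#includes, variables) */
%}
    /* named Lex pattern definitions also go here */
%%
    /* Section II: rules of the form   pattern   { C action } */
%%
    /* Section III: supplemental C code, e.g. main() and yywrap() */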
Section I: Global C Definitions
In this section we can add C variable declarations. For our word counting program, we declare an integer variable here that will hold the number of words counted by the program. Lex token declarations are also placed in this section.
%{
int wordCount = 0;
%}
chars      [A-Za-z\'\`\.\"]
numbers    ([0-9])+
delim      [" "\n\t]
whitespace {delim}+
words      {chars}+
%%
The double percent sign implies the end of this section and beginning of section II.
Section II: Lex Rules for Matching Patterns
This section contains the Lex rules describing the tokens to be matched. The action to be taken when a token is matched is written in C. Taking our word counting program forward, the rules for it are defined below.
{words} { wordCount++; /* increase the word count by one*/ }
{whitespace} { /* do nothing*/ }
{numbers} { /* one may want to add some processing here*/ }
%%
Section III: C Code
This section contains C function definitions as required. It may also contain the main function. One important point to note is that this section must define the yywrap() function. Lex provides a set of functions and variables to the user, and yywrap is one of them. Typically, yywrap() is defined as shown in the example below. More information is available in the section Advanced Lex.
int main(void)
{
    yylex();   /* start the analysis */
    printf("No of words: %d\n", wordCount);
    return 0;
}
int yywrap()
{
    return 1;
}
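Putting the three sections together, the complete word counting specification might look like the sketch below; the file name wordcount.lex and the build and run commands in the leading comment are illustrative, not mandated by Lex.
%{
/* wordcount.lex -- count the words read from standard input.
   A typical build and run (names are illustrative):
       lex wordcount.lex
       cc lex.yy.c -o wordcount
       ./wordcount < somefile.txt                            */
int wordCount = 0;
%}
chars      [A-Za-z\'\`\.\"]
numbers    ([0-9])+
delim      [" "\n\t]
whitespace {delim}+
words      {chars}+
%%
{words}      { wordCount++; /* increase the word count by one */ }
{whitespace} { /* do nothing */ }
{numbers}    { /* one may want to add some processing here */ }
%%
int main(void)
{
    yylex();   /* start the analysis */
    printf("No of words: %d\n", wordCount);
    return 0;
}
int yywrap()
{
    return 1;
}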
Advanced Lex
In the preceding sections, basic Lex has been discussed; that discussion should help the reader write simple lexical analysis programs. Often, however, more functionality is needed, along with a deeper understanding of the functionality Lex provides. These advanced topics are grouped in the following sections.
Lex has several variables and functions that provide different kinds of information and can be used to build programs that perform complex tasks. Some of these variables and functions, and their uses, are listed below.
Lex Variables
Variable | Description
yyin | Of type FILE*. Points to the current file being parsed by the lexer.
yyout | Of type FILE*. Points to the location where the output of the lexer will be written. By default, both yyin and yyout point to standard input and output.
yytext | The text of the matched pattern is stored in this variable (char*).
yyleng | Gives the length of the matched pattern.
yylineno | Provides the current line number. (May or may not be supported by the lexer.)
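As a small illustration, the rule below (a sketch meant for the rules section of the word counting example, where {words} is already defined) uses yytext and yyleng to report each match:
{words}   { printf("matched \"%s\" (%d characters long)\n", yytext, yyleng); }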
Lex Functions
yylex(): The function that starts the analysis. It is automatically generated by Lex.
yywrap(): Called when the end of a file (or of the input) is encountered. If this function returns 1, parsing stops. It can therefore be used to parse multiple files: code written in section III can make the yyin file pointer (see Lex Variables) point to a different file until all the files have been parsed, and then return 1 to indicate the end of parsing (a sketch follows this list).
yyless(int n): Pushes back all but the first n characters of the token just read.
yymore(): Tells the lexer to append the next matched token to the current token.
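A sketch of this multiple file strategy is given below. The file names are purely illustrative, and the code would replace the simple yywrap() shown earlier in section III of the specification.
/* Illustrative only: cycle yyin through a fixed list of files.
   yyin initially points to standard input (or whatever main() set up). */
char *inputFiles[] = { "first.txt", "second.txt", "third.txt" };
int   nextFile     = 0;

int yywrap()
{
    if (nextFile < (int)(sizeof(inputFiles) / sizeof(inputFiles[0]))) {
        yyin = fopen(inputFiles[nextFile++], "r");
        if (yyin != NULL)
            return 0;   /* 0 tells the lexer to continue scanning from the new yyin */
    }
    return 1;           /* no more files: stop parsing */
}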
There are several Lex implementations available today; Lex is a standard utility on Unix systems. The GNU version of Lex is called Flex, and an object-oriented extension is called Flex++. On Windows systems, MKS markets MKS Lex. This completes the discussion of Lex; the next topic is Yacc.