lex(1)
NAME
lex - Generates programs for lexical tasks
SYNOPSIS
lex [-ct] [-n | -v] [file...]
[Tru64 UNIX] The following syntax applies when the CMD_ENV environment
variable is set to svr4:
lex [-crt] [-n | -v] [-V] [-Qy | -Qn] [file...]
STANDARDS
Interfaces documented on this reference page conform to industry stan‐
dards as follows:
lex: XPG4, XPG4-UNIX
Refer to the standards(5) reference page for more information about
industry standards and associated tags.
OPTIONS
-c          Writes C code to the file lex.yy.c. This is the default.
-n          Suppresses the statistics summary. When you set your own table
            sizes for the finite state machine, lex automatically produces
            this summary if you do not select this flag.
-r          [Tru64 UNIX] Writes RATFOR code to the file lex.yy.r. (There is
            no RATFOR compiler for Tru64 UNIX.)
-t          Writes to standard output instead of writing to a file.
-v          Provides a summary of the generated finite state machine
            statistics.
-V          [Tru64 UNIX] Outputs lex version number to standard error.
            Requires the environment variable CMD_ENV to be set to svr4.
-Qy | -Qn   [Tru64 UNIX] Determines whether the lex version number is
            written to the output file. The -Qn option does not do so and
            is the default. Requires the environment variable CMD_ENV to be
            set to svr4.
DESCRIPTION
The lex command uses the rules and actions contained in file to gener‐
ate a program, lex.yy.c, which can be compiled with the cc command.
That program can then receive input, break the input into the logical
pieces defined by the rules in file, and run program fragments con‐
tained in the actions in file.
The generated program is a C Language function called yylex(). The lex
command stores yylex() in a file named lex.yy.c. You can use yylex()
alone to recognize simple, 1-word input, or you can use it with other C
Language programs to perform more difficult input analysis functions.
For example, you can use lex to generate a program that tokenizes an
input stream before sending it to a parser program generated by the
yacc command.
The yylex() function analyzes the input stream using a program struc‐
ture called a finite state machine. This structure allows the program
to exist in only one state (or condition) at a time. A finite number
of states are allowed. The rules in file determine how the program
moves from one state to another in response to the input that the pro‐
gram receives.
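As a rough illustration of the finite state machine idea, the following C function walks its input one character at a time and is in exactly one state at any moment. It is a hand-written sketch, not the table-driven code that lex generates; the state names and the function is_signed_integer are hypothetical and chosen only for this example, which accepts an optional sign followed by one or more digits.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical states for a tiny recognizer of [+-]?[0-9]+ */
enum state { ST_START, ST_SIGN, ST_DIGITS, ST_ERROR };

/* Return 1 if text is an optionally signed decimal integer, else 0. */
int is_signed_integer(const char *text)
{
    enum state st = ST_START;

    assert(text != NULL);
    for (; *text != '\0'; text++) {
        unsigned char c = (unsigned char)*text;

        switch (st) {
        case ST_START:              /* nothing read yet */
            if (c == '+' || c == '-')
                st = ST_SIGN;
            else if (isdigit(c))
                st = ST_DIGITS;
            else
                st = ST_ERROR;
            break;
        case ST_SIGN:               /* a sign must be followed by a digit */
        case ST_DIGITS:             /* digits may be followed by more digits */
            st = isdigit(c) ? ST_DIGITS : ST_ERROR;
            break;
        case ST_ERROR:              /* no way back from the error state */
            return 0;
        }
    }
    return st == ST_DIGITS;         /* accept only if we ended on a digit */
}
```

Each case label plays the role of one state, and each assignment to st is a transition selected by the current input character.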
The lex command reads its skeleton finite state machine from the file
/usr/ccs/lib/ncpform or /usr/ccs/lib/ncform. Use the environment vari‐
able LEXER to specify another location for lex to read from.
If you do not specify a file, lex reads standard input. It treats mul‐
tiple files as a single file.
Input File Format
The input file can contain three sections: definitions, rules, and
user subroutines. Each section must be separated from the others by a
line containing only the delimiter, %%. The format is as follows:
    definitions
    %%
    rules
    %%
    user_subroutines
The purpose and format of each of these sections are described under
the headings that follow.
Definitions Section
If you want to use variables in rules, you must define them in the
definitions section. The variables make up the left column, and their
definitions make up the right column. For example, to define D as a
numerical digit, enter:
    D    [0-9]
You can use a defined variable in the rules section by enclosing the
variable name in braces, {D}.
In the definitions section, you can set either of the following two
mutually exclusive declarations:
%array      Declare the type of yytext to be a null-terminated character
            array.
%pointer    Declare the type of yytext to be a pointer to a
            null-terminated character string. Use of the %pointer
            definition selects the /usr/ccs/lib/ncpform skeleton.
In the definitions section, you can also set table sizes for the
resulting finite state machine. The default sizes are large enough for
small programs. You may want to set larger sizes for more complex
programs:
%p number   Number of positions is number (default 5000)
%n number   Number of states is number (default 2500)
%e number   Number of parse tree nodes is number (default 2000)
%a number   Number of transitions is number (default 5000)
%k number   Number of packed character classes is number (default 2000)
%o number   Number of output slots is number (default 5000)
If extended characters appear in regular expression strings, you may
need to reset the output array size with the %o parameter (possibly to
array sizes in the range 10,000 to 20,000). This reset reflects the
much larger number of extended characters relative to the number of
ASCII characters.
Rules Section
The rules section is required, and it must be preceded by the %% delim‐
iter, even if you do not have a definitions section. The lex command
does not recognize rules without the delimiter.
In this section, the left column contains the pattern to be recognized
in an input file to yylex(). The right column contains the C program
fragment executed when that pattern is recognized.
Patterns can include extended characters with one exception: extended
characters may not appear in range specifications within character
class expressions surrounded by brackets.
The columns are separated by a tab. For example, to search files for
the word LEAD and replace it with GOLD, perform the following steps:
Create a file called transmute.l containing the lines:
    %%
    (LEAD)    printf("GOLD");
Then issue the following commands to the shell:
    lex transmute.l
    cc -o transmute lex.yy.c -ll
You can test the resulting program with the command:
    transmute <transmute.l
This command echoes the contents of transmute.l, with the occurrences
of LEAD changed to GOLD.
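The behavior of the transmute program can be approximated by hand in C, which may help show why unmatched text is echoed unchanged. The sketch below is not the code lex produces; the function name transmute and its buffer convention are assumptions made for this example.

```c
#include <assert.h>
#include <string.h>

/* Copy in to out, replacing each "LEAD" with "GOLD".
 * out must have room for strlen(in) + 1 bytes; both words are four
 * characters long, so the output is the same length as the input. */
void transmute(const char *in, char *out)
{
    assert(in != NULL && out != NULL);
    while (*in != '\0') {
        if (strncmp(in, "LEAD", 4) == 0) {
            memcpy(out, "GOLD", 4);   /* the rule's action */
            in += 4;
            out += 4;
        } else {
            *out++ = *in++;           /* no rule matched: echo the character */
        }
    }
    *out = '\0';
}
```

Called with the input "LEAD into LEADEN", the function fills the output buffer with "GOLD into GOLDEN".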
Each pattern may have a corresponding action, that is, a fragment of C
source code to execute when the pattern is matched. Each statement
must end with a ; (semicolon). If you use more than one statement in
an action, you must enclose all of them in {} (braces). A second delim‐
iter, %%, must follow the rules section if you have a user subroutine
section.
When yylex() matches a string in the input stream, it copies the
matched text to an external character array, yytext, before it executes
any actions in the rules section.
You can use the following operators to form patterns that you want to
match:
x, y      Matches the characters written.
[ ]       Matches any one character in the enclosed range ([.-.]) or the
          enclosed list ([...]). [abcx-z] matches a,b,c,x,y, or z.
" "       Matches the enclosed character or string even if it is an
          operator. "$" prevents lex from interpreting the $ character as
          an operator.
\         Acts the same as double quotes. \$ prevents lex from
          interpreting the $ character as an operator.
*         Matches zero or more occurrences of the single-character regular
          expression immediately preceding it. x* matches zero or more
          repeated literal characters x.
+         Matches one or more occurrences of the single-character regular
          expression immediately preceding it.
?         Matches either zero or one occurrence of the single-character
          regular expression immediately preceding it.
^         Matches the character only at the beginning of a line. ^x
          matches an x at the beginning of a line.
[^]       Matches any character except for the characters following the
          ^. [^xyz] matches any character but x, y, or z.
.         Matches any character except the newline character.
$         Matches the end of a line.
|         Matches either of two characters. x|y matches either x or y.
/         Matches one extended regular expression (ERE) only when followed
          by a second ERE. It reads only the first token into yytext.
          Given the regular expression a*b/cc and the input aaabcc, yytext
          would contain the string aaab on this match.
( )       Matches the pattern in the ( ) (parentheses). This is used for
          grouping. It reads the whole pattern into yytext. A group in
          parentheses can be used in place of any single character in any
          other pattern. (xyz123) matches the pattern xyz123 and reads the
          whole string into yytext.
{ }       Matches the character as defined in the definitions section. If
          D is defined as numeric digits, {D} matches all numeric digits.
{m,n}     Matches m-to-n occurrences of the specified character. x{2,4}
          matches 2, 3, or 4 occurrences of x.
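The trailing-context operator (/) is the least obvious entry in the list above, so a hand-written C equivalent of the a*b/cc example may help. This is only a sketch of the matching behavior, not of how lex implements it; the function name match_ab_before_cc is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hand-written equivalent of the pattern a*b/cc applied at the start of s.
 * Returns the number of characters that would land in yytext (the a*b
 * part), or 0 if the pattern does not match. The trailing "cc" must be
 * present but is not counted, so it stays available for the next match. */
size_t match_ab_before_cc(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (s[n] == 'a')               /* a* : zero or more a characters */
        n++;
    if (s[n] != 'b')                  /* b  : exactly one b */
        return 0;
    n++;
    if (strncmp(s + n, "cc", 2) != 0) /* /cc : required, but not consumed */
        return 0;
    return n;
}
```

For the input aaabcc the function returns 4, the length of aaab, which agrees with the yytext contents described for that example.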
If a line begins with only a space, lex copies it to the lex.yy.c out‐
put file. If the line is in the definitions section of file, lex copies
it to the declarations section of lex.yy.c. If the line is in the rules
section, lex copies it to the program code section of lex.yy.c.
User Subroutines Section
The lex library has three subroutines defined as macros that you can
use in the rules.
input()     Reads a character from yyin.
unput()     Replaces a character after it is read.
output()    Writes a character to yyout.
You can override these three macros by writing your own code for these
routines in the user subroutines section. But if you write your own
routines, you must undefine these macros in the definitions section as
follows:
    %{
    #undef input
    #undef unput
    #undef output
    %}
When you are using lex as a simple transformer/recognizer for stdin to
stdout piping, you can avoid writing the framework by using libl.a (the
lex library). It has a main routine that calls yylex() for you.
External names generated by lex all begin with the prefix yy, as in
yyin, yyout, yylex, and yytext.
Putting Spaces in an Expression
Normally, spaces or tabs end a rule and, therefore, the expression that
defines a rule. However, you can enclose the spaces or tab characters
in "" (double quotes) to include them in the expression. Use quotes
around all spaces in expressions that are not already within sets of [
] (brackets).
Other Special Characters
The lex program recognizes many of the normal C language special char‐
acters. These character sequences are as follows:
Sequence    Meaning
\n          Newline
\t          Tab
\b          Backspace
\\          Backslash
\digits     The character whose encoding is represented
            by the three-digit octal number
\xdigits    The character whose encoding is represented
            by the hexadecimal integer
Do not use the actual newline character in an expression.
When using these special characters in an expression, you do not need
to enclose them in quotes. Every character, except these special char‐
acters and the previously described operator symbols, is always a text
character.
Matching Rules
When more than one expression can match the current input, lex chooses
the longest match first. Among rules that match the same number of
characters, the rule that occurs first is chosen. For example:
    integer    keyword action...;
    [a-z]+     identifier action...;
If the preceding rules are given in that order and integers is the
input word, lex matches the input as an identifier because [a-z]+
matches eight characters, while integer matches only seven. However,
if the input is integer, both rules match seven characters. The keyword
rule is selected because it occurs first. A shorter input, such as int,
does not match the expression rule integer and causes lex to select the
rule identifier.
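The selection between the two rules can be sketched in C as follows. This is a simplified model of the decision only, assuming ASCII lowercase letters; the names classify, match_keyword, and match_identifier are hypothetical and do not correspond to anything lex generates.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum token { TOK_NONE, TOK_KEYWORD, TOK_IDENTIFIER };

/* Length matched at the start of s by the literal rule "integer". */
static size_t match_keyword(const char *s)
{
    return strncmp(s, "integer", 7) == 0 ? 7 : 0;
}

/* Length matched at the start of s by the rule [a-z]+ (ASCII assumed). */
static size_t match_identifier(const char *s)
{
    size_t n = 0;

    while (s[n] >= 'a' && s[n] <= 'z')
        n++;
    return n;
}

/* Pick a rule as the text describes: the longest match wins, and on a
 * tie the rule listed first (the keyword rule) wins. */
enum token classify(const char *s, size_t *len)
{
    size_t kw, id;

    assert(s != NULL && len != NULL);
    kw = match_keyword(s);
    id = match_identifier(s);
    if (kw == 0 && id == 0) {
        *len = 0;
        return TOK_NONE;
    }
    if (kw >= id) {                 /* tie goes to the earlier rule */
        *len = kw;
        return TOK_KEYWORD;
    }
    *len = id;
    return TOK_IDENTIFIER;
}
```

With this model, integers is classified as an identifier of length 8, integer as a keyword of length 7, and int as an identifier of length 3.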
Matching a String with Wildcard Characters
Because lex chooses the longest match first, do not use rules
containing expressions like .* (for example, '.*').
The preceding rule might seem like a good way to recognize a string in
single quotes. However, the lexical analyzer reads far ahead, looking
for a distant single quote to complete the long match. If a lexical
analyzer with such a rule gets the following input, it matches the
whole string:
'first' quoted string here, 'second' here
To find the smaller strings, first and second, use the following rule:
'[^'\n]*'
This rule stops after matching 'first'.
Errors of this type are not far-reaching because the . (dot) operator
does not match a newline character. Therefore, expressions like .* stop
on the current line. Do not try to defeat this with expressions like
[.\n]+. The lexical analyzer tries to read the entire input file, and
an internal buffer overflow occurs.
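The difference between the two quoted-string rules can be shown with a pair of hand-written matchers in C. They model only the match lengths, under the assumption that matching starts at the first character of the input; the function names match_greedy_quote and match_short_quote are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length matched at the start of s by the greedy rule: an opening quote,
 * then as many non-newline characters as possible, ending at the LAST
 * single quote on the line. Returns 0 if there is no match. */
size_t match_greedy_quote(const char *s)
{
    size_t i, last = 0;

    assert(s != NULL);
    if (s[0] != '\'')
        return 0;
    for (i = 1; s[i] != '\0' && s[i] != '\n'; i++)
        if (s[i] == '\'')
            last = i;               /* remember the most distant quote */
    return last ? last + 1 : 0;
}

/* Length matched by the restricted rule, which excludes quotes and
 * newlines from the body and so stops at the FIRST closing quote. */
size_t match_short_quote(const char *s)
{
    size_t i = 1;

    assert(s != NULL);
    if (s[0] != '\'')
        return 0;
    while (s[i] != '\0' && s[i] != '\n' && s[i] != '\'')
        i++;
    return s[i] == '\'' ? i + 1 : 0;
}
```

On the sample input line, the greedy matcher covers everything from the first quote through the quote that closes 'second', while the restricted matcher covers only 'first'.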
Finding Strings within Strings
The lex program partitions the input stream and does not search for all
possible matches of each expression. Each character is accounted for
once and only once. For example, to count occurrences of both she and
he in an input text, try the following rules:
    she    s++;
    he     h++;
    \n     |
    .      ;
The last two rules ignore everything besides he and she. However,
because she includes he, lex does not recognize the instances of he
that are included in she.
To override this choice, use the REJECT action. This directive tells
lex to go to the next rule. The lex command then adjusts the position
of the input pointer to where it was before the first rule was exe‐
cuted, and executes the second choice rule. For example, to count the
included instances of he, use the following rules:
    she    {s++; REJECT;}
    he     {h++; REJECT;}
    \n     |
    .      ;
After counting the occurrences of she, lex rejects the input stream and
then counts the occurrences of he. In this case, you can omit the
REJECT action on he because she includes he but not vice versa. In
other cases, it may be difficult to determine which input characters
are in both classes.
In general, REJECT is useful whenever the purpose of lex is not to par‐
tition the input stream but to detect all examples of some items in the
input, and the instances of these items may overlap or include each
other.
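The counts that the REJECT rules produce can be reproduced by a plain C loop that tests every input position against both words. This mirrors the result of the rules, not the mechanism lex uses internally; the function name count_she_he is hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Count every occurrence of "she" and "he" in text, including each "he"
 * that is embedded in a "she". Results are stored through s and h. */
void count_she_he(const char *text, int *s, int *h)
{
    assert(text != NULL && s != NULL && h != NULL);
    *s = 0;
    *h = 0;
    for (; *text != '\0'; text++) {
        if (strncmp(text, "she", 3) == 0)
            (*s)++;
        if (strncmp(text, "he", 2) == 0)   /* also fires inside "she" */
            (*h)++;
    }
}
```

For the text "she and he" the function reports one she and two occurrences of he, because the he inside she is counted as well.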
NOTES
Because lex uses fixed names for intermediate and output files, you can
have only one lex-generated program in a given directory. If the -t
option is not specified, informational, error, and warning messages are
written to stdout. If the -t option is specified, informational, error,
and warning messages are written to stderr.
[Tru64 UNIX] The yytext array has a default dimension of 200, con‐
trolled by the constant YYLMAX. If the programmer needs to allow a
larger array, the YYLMAX constant may be redefined as follows from
within the lex command file:
    %{
    #undef YYLMAX
    #define YYLMAX 8192
    %}
Two other arrays, yysbuf and yylstate, also use YYLMAX.
The lex program can be compiled as a C program with -std0, -std, or
-std1 mode. It can also be compiled as a C++ program. If YY_NOPROTO is
defined on the compilation command line, function prototypes are not
generated.
EXAMPLES
The following command draws lex instructions from the file lexcommands
and places the output in lex.yy.c:
    lex lexcommands
The file lexcommands contains an example of a lex program that would be
put into a lex command file. The following program converts uppercase
to lowercase, removes spaces at the end of a line, and replaces
multiple spaces with single spaces:
    %%
    [A-Z]    putchar(tolower(yytext[0]));
    [ ]+$    ;
    [ ]+     putchar(' ');
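For comparison, the same three rules can be written out by hand in C. The sketch below is an approximation of what the generated scanner does with this input file, assuming ASCII input; the function name normalize_text and its buffer convention are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Apply the three rules of the example to in, writing the result to out.
 * out must have room for strlen(in) + 1 bytes; the output never grows. */
void normalize_text(const char *in, char *out)
{
    assert(in != NULL && out != NULL);
    while (*in != '\0') {
        if (*in == ' ') {
            while (*in == ' ')        /* [ ]+ : consume the whole run */
                in++;
            if (*in != '\n')          /* [ ]+$ : a run before newline is dropped */
                *out++ = ' ';
        } else if (*in >= 'A' && *in <= 'Z') {
            *out++ = (char)(*in - 'A' + 'a');   /* [A-Z] : to lowercase */
            in++;
        } else {
            *out++ = *in++;           /* no rule matched: echo the character */
        }
    }
    *out = '\0';
}
```

Given "Hello   WORLD   " followed by a newline and "Bye", the function produces "hello world", a newline, and "bye". A run of spaces at the very end of the input with no newline after it is reduced to one space rather than removed, since only the newline marks the end of a line in this sketch.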
ENVIRONMENT VARIABLES
The following environment variables affect the behavior of lex:
LANG          Provides a default value for the locale category variables
              that are not set or null.
LC_ALL        If set, overrides the values of all other locale variables.
LC_COLLATE    Determines the order in which output is sorted for the -x
              option.
LC_CTYPE      Determines the locale for the interpretation of byte
              sequences as characters (single-byte or multi-byte) in input
              parameters and files.
LC_MESSAGES   Determines the locale used to affect the format and contents
              of diagnostic messages displayed by the command.
NLSPATH       Determines the location of message catalogs for the
              processing of LC_MESSAGES.
FILES
Run-time library (libl.a).
Default C language skeleton finite state machine for lex
(/usr/ccs/lib/ncform).
Default C language skeleton finite state machine for lex, implemented
with the pointer definition of yytext (/usr/ccs/lib/ncpform).
Default RATFOR language skeleton finite state machine for lex.
SEE ALSO
Commands: yacc(1)
Standards: standards(5)
Programming Support Tools