Commit 4cf817a7 by Neil Booth

* doc/cppinternals.texi: Update.

From-SVN: r45839
parent ef1d8fc8
2001-09-27 Neil Booth <neil@daikokuya.demon.co.uk>
* doc/cppinternals.texi: Update.

2001-09-26 Neil Booth <neil@daikokuya.demon.co.uk>
* cpphash.h (struct cpp_pool): Remove locks and locked.
@@ -41,8 +41,8 @@ into another language, under the above conditions for modified versions.
@titlepage
@c @finalout
@title Cpplib Internals
@subtitle Last revised September 2001
@subtitle for GCC version 3.1
@author Neil Booth
@page
@vskip 0pt plus 1filll
@@ -69,14 +69,14 @@ into another language, under the above conditions for modified versions.
@node Top, Conventions,, (DIR)
@chapter Cpplib---the core of the GNU C Preprocessor

The GNU C preprocessor in GCC 3.x has been completely rewritten. It is
now implemented as a library, cpplib, so it can be easily shared between
a stand-alone preprocessor, and a preprocessor integrated with the C,
C++ and Objective-C front ends. It is also available for use by other
programs, though this is not recommended as its exposed interface has
not yet reached a point of reasonable stability.

The library has been written to be re-entrant, so that it can be used
to preprocess many files simultaneously if necessary. It has also been
written with the preprocessing token as the fundamental unit; the
preprocessor in previous versions of GCC would operate on text strings
@@ -86,8 +86,6 @@ This brief manual documents some of the internals of cpplib, and a few
tricky issues encountered. It also describes certain behaviour we would
like to preserve, such as the format and spacing of its output.

@menu
* Conventions::    Conventions used in the code.
* Lexer::          The combined C, C++ and Objective-C Lexer.
@@ -123,18 +121,106 @@ behaviour.
@node Lexer, Whitespace, Conventions, Top
@unnumbered The Lexer
@cindex lexer
@cindex tokens
@section Overview

The lexer is contained in the file @file{cpplex.c}. It is a hand-coded
lexer, and not implemented as a state machine. It can understand C, C++
and Objective-C source code, and has been extended to allow reasonably
successful preprocessing of assembly language. The lexer does not make
an initial pass to strip out trigraphs and escaped newlines, but handles
them as they are encountered in a single pass of the input file. It
returns preprocessing tokens individually, not a line at a time.

It is mostly transparent to users of the library, since the library's
interface for obtaining the next token, @code{cpp_get_token}, takes care
of lexing new tokens, handling directives, and expanding macros as
necessary. However, the lexer does expose some functions, such as
@code{cpp_spell_token} and @code{cpp_token_len}, so that clients of the
library can easily spell a given token. These functions are useful when
generating diagnostics, and for emitting the preprocessed output.
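To make this concrete, here is a rough sketch of a client loop driving
that interface. It is illustrative only: @code{pfile} is assumed to be a
@code{cpp_reader} obtained elsewhere, and the exact prototypes (for
instance, whether @code{cpp_get_token} returns a pointer or fills in a
caller-supplied token) have varied between cpplib revisions, so consult
@file{cpplib.h} rather than trusting the signatures shown here.

@smallexample
/* Illustrative only; see cpplib.h for the real prototypes.  */
const cpp_token *tok;
unsigned char *buf, *end;

for (;;)
  @{
    tok = cpp_get_token (pfile);   /* Lexes, runs directives, expands
                                      macros as necessary.  */
    if (tok->type == CPP_EOF)
      break;

    /* Spell the token, e.g. for a diagnostic or preprocessed output.  */
    buf = (unsigned char *) xmalloc (cpp_token_len (tok) + 1);
    end = cpp_spell_token (pfile, tok, buf);
    *end = '\0';
    /* ... use buf, then release it ...  */
    free (buf);
  @}
@end smallexample
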
@section Lexing a token
Lexing of an individual token is handled by @code{_cpp_lex_direct} and
its subroutines. In its current form the code is quite complicated,
with read ahead characters and suchlike, since it strives to not step
back in the character stream in preparation for handling non-ASCII file
encodings. The current plan is to convert any such files to UTF-8
before processing them. This complexity is therefore unnecessary and
will be removed, so I'll not discuss it further here.

The job of @code{_cpp_lex_direct} is simply to lex a token. It is not
responsible for issues like directive handling, returning lookahead
tokens directly, multiple-include optimisation, or conditional block
skipping. It necessarily has a minor r@^ole to play in memory
management of lexed lines. I discuss these issues in a separate section
(@pxref{Lexing a line}).

The lexer places the token it lexes into storage pointed to by the
variable @var{cur_token}, and then increments it. This variable is
important for correct diagnostic positioning. Unless a specific line
and column are passed to the diagnostic routines, they will examine the
@var{line} and @var{col} values of the token just before the location
that @var{cur_token} points to, and use that location to report the
diagnostic.

The lexer does not consider whitespace to be a token in its own right.
If whitespace (other than a new line) precedes a token, it sets the
@code{PREV_WHITE} bit in the token's flags. Each token has its
@var{line} and @var{col} variables set to the line and column of the
first character of the token. This line number is the line number in
the translation unit, and can be converted to a source (file, line) pair
using the line map code.

The first token on a logical, i.e.@: unescaped, line has the flag
@code{BOL} set for beginning-of-line. This flag is intended for
internal use, both to distinguish a @samp{#} that begins a directive
from one that doesn't, and to generate a callback to clients that want
to be notified about the start of every non-directive line with tokens
on it. Clients cannot reliably determine this for themselves: the first
token might be a macro, and the tokens of a macro expansion do not have
the @code{BOL} flag set. The macro expansion may even be empty, and the
next token on the line certainly won't have the @code{BOL} flag set.
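A contrived example (not taken from the testsuite) shows why:

@smallexample
#define EMPTY
EMPTY # define f(x) x
@end smallexample

@noindent
On the second line only the @code{EMPTY} token carries @code{BOL}; its
expansion produces no tokens at all, so the @samp{#} that eventually
reaches the client has the flag clear and is correctly treated as a
stray @samp{#} rather than as the start of a @code{#define} directive.
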
New lines are treated specially; exactly how the lexer handles them is
context-dependent. The C standard mandates that directives are
terminated by the first unescaped newline character, even if it appears
in the middle of a macro expansion. Therefore, if the state variable
@var{in_directive} is set, the lexer returns a @code{CPP_EOF} token,
which is normally used to indicate end-of-file, to indicate
end-of-directive. In a directive a @code{CPP_EOF} token never means
end-of-file. Conveniently, if the caller was @code{collect_args}, it
already handles @code{CPP_EOF} as if it were end-of-file, and reports an
error about an unterminated macro argument list.
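For example (a hypothetical input), in

@smallexample
#define f(x) x
#if f(1
)
#endif
@end smallexample

@noindent
the @code{#if} directive ends at the newline after @samp{f(1}, so
@code{collect_args} meets @code{CPP_EOF} while still looking for the
closing parenthesis and reports an unterminated argument list; the
@samp{)} on the following line is not part of the directive.
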
The C standard also specifies that a new line in the middle of the
arguments to a macro is treated as whitespace. This white space is
important in case the macro argument is stringified. The state variable
@code{parsing_args} is non-zero when the preprocessor is collecting the
arguments to a macro call. It is set to 1 when looking for the opening
parenthesis to a function-like macro, and 2 when collecting the actual
arguments up to the closing parenthesis, since these two cases need to
be distinguished sometimes. One such time is here: the lexer sets the
@code{PREV_WHITE} flag of a token if it meets a new line when
@code{parsing_args} is set to 2. It doesn't set it if it meets a new
line when @code{parsing_args} is 1, since then code like
@smallexample
#define foo() bar
foo
baz
@end smallexample

@noindent
would be output with an erroneous space before @samp{baz}:

@smallexample
foo
 baz
@end smallexample
This is a good example of the subtlety of getting token spacing correct
in the preprocessor; there are plenty of tests in the testsuite for
corner cases like this.

The most painful aspect of lexing ISO-standard C and C++ is handling
trigraphs and backslash-escaped newlines. Trigraphs are processed before
@@ -148,62 +234,56 @@ within the characters of an identifier, and even between the @samp{*}
and @samp{/} that terminates a comment. Moreover, you cannot be sure
there is just one---there might be an arbitrarily long sequence of them.

So, for example, the routine that lexes a number, @code{parse_number},
cannot assume that it can scan forwards until the first non-number
character and be done with it, because this could be the @samp{\}
introducing an escaped newline, or the @samp{?} introducing the trigraph
sequence that represents the @samp{\} of an escaped newline. If it
encounters a @samp{?} or @samp{\}, it calls @code{skip_escaped_newlines}
to skip over any potential escaped newlines before checking whether the
number has been finished.
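For example, with trigraphs enabled the three source lines below (an
artificial case) form the single preprocessing number @samp{123}: the
first newline is escaped directly, and the second through @samp{??/},
the trigraph for @samp{\}. @code{parse_number} has to call
@code{skip_escaped_newlines} at each @samp{\} or @samp{?} to discover
this.

@smallexample
1\
2??/
3
@end smallexample
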
Similarly, code in the main body of @code{_cpp_lex_direct} cannot simply
check for a @samp{=} after a @samp{+} character to determine whether it
has a @samp{+=} token; it needs to be prepared for an escaped newline of
some sort. Such cases use the function @code{get_effective_char}, which
returns the first character after any intervening escaped newlines.
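So, to take a contrived example, the input

@smallexample
a +\
= b;
@end smallexample

@noindent
lexes as the tokens @samp{a}, @samp{+=}, @samp{b} and @samp{;}: having
seen the @samp{+}, the lexer uses @code{get_effective_char} to look past
the escaped newline and finds the @samp{=}.
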
The lexer needs to keep track of the correct column position, including
counting tabs as specified by the @option{-ftabstop=} option. This
should be done even within C-style comments; they can appear in the
middle of a line, and we want to report diagnostics in the correct
position for text appearing after the end of the comment.

Some identifiers, such as @code{__VA_ARGS__} and poisoned identifiers,
may be invalid and require a diagnostic. However, if they appear in a
macro expansion we don't want to complain with each use of the macro.
It is therefore best to catch them during the lexing stage, in
@code{parse_identifier}. In both cases, whether a diagnostic is needed
or not is dependent upon the lexer's state. For example, we don't want
to issue a diagnostic for re-poisoning a poisoned identifier, or for
using @code{__VA_ARGS__} in the expansion of a variable-argument macro.
Therefore @code{parse_identifier} makes use of state flags to determine
whether a diagnostic is appropriate. Since we change state on a
per-token basis, and don't lex whole lines at a time, this is not a
problem.
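For instance, given input along these lines (an invented example),

@smallexample
#pragma GCC poison printf
#define err(...)  fprintf (stderr, __VA_ARGS__)
@end smallexample

@noindent
a later use of @code{printf} draws a diagnostic from
@code{parse_identifier}, but poisoning @code{printf} a second time, or
using @code{__VA_ARGS__} in the replacement list of the variadic macro
@code{err}, should not.
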
Another place where state flags are used to change behaviour is whilst
lexing header names. Normally, a @samp{<} would be lexed as a single
token. After a @code{#include} directive, though, it should be lexed as
a single token as far as the nearest @samp{>} character. Note that we
don't allow the terminators of header names to be escaped; the first
@samp{"} or @samp{>} terminates the header name.
Interpretation of some character sequences depends upon whether we are
lexing C, C++ or Objective-C, and on the revision of the standard in
force. For example, @samp{::} is a single token in C++, but in C it is
two separate @samp{:} tokens and almost certainly a syntax error. Such
cases are handled by @code{_cpp_lex_direct} based upon command-line
flags stored in the @code{cpp_options} structure.
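For example, the fragment

@smallexample
a::b
@end smallexample

@noindent
lexes as the three tokens @samp{a}, @samp{::} and @samp{b} when the
library is lexing C++, but as the four tokens @samp{a}, @samp{:},
@samp{:} and @samp{b} when it is lexing C.
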
@anchor{Lexing a line}
@section Lexing a line
@node Whitespace, Hash Nodes, Lexer, Top
@unnumbered Whitespace