* doc/cppinternals.texi: Update.

From-SVN: r46009

d3d43aab · Neil Booth · Neil Booth · 3054eeed · d3d43aab · d3d43aab
Commit d3d43aab authored Oct 04, 2001 by Neil Booth Committed by Neil Booth Oct 04, 2001

Showing with 312 additions and 117 deletions

gcc/ChangeLog
+4 -0

gcc/doc/cppinternals.texi
+308 -117

--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
+2001-10-04  Neil Booth  <neil@daikokuya.demon.co.uk>
+	* doc/cppinternals.texi: Update.
 2001-10-04  Eric Christopher  <echristo@redhat.com>
 	* config/mips/mips.c (init_cumulative_args): Remember to set

--- a/gcc/doc/cppinternals.texi
+++ b/gcc/doc/cppinternals.texi
@@ -66,7 +66,8 @@ into another language, under the above conditions for modified versions.
 @contents
 @page
-@node Top, Conventions,, (DIR)
+@node Top
+@top
 @chapter Cpplib---the core of the GNU C Preprocessor
 The GNU C preprocessor in GCC 3.x has been completely rewritten.  It is
@@ -87,16 +88,18 @@ tricky issues encountered.  It also describes certain behaviour we would
 like to preserve, such as the format and spacing of its output.
 @menu
-* Conventions::	    Conventions used in the code.
-* Lexer::	    The combined C, C++ and Objective-C Lexer.
-* Whitespace::      Input and output newlines and whitespace.
-* Hash Nodes::      All identifiers are hashed.
-* Macro Expansion:: Macro expansion algorithm.
-* Files::	    File handling.
-* Index::           Index.
+* Conventions::         Conventions used in the code.
+* Lexer::               The combined C, C++ and Objective-C Lexer.
+* Hash Nodes::          All identifiers are entered into a hash table.
+* Macro Expansion::     Macro expansion algorithm.
+* Token Spacing::       Spacing and paste avoidance issues.
+* Line Numbering::      Tracking location within files.
+* Guard Macros::        Optimizing header files with guard macros.
+* Files::               File handling.
+* Index::               Index.
 @end menu
-@node Conventions, Lexer, Top, Top
+@node Conventions
 @unnumbered Conventions
 @cindex interface
 @cindex header files
@@ -118,9 +121,11 @@ change internals in the future without worrying whether library clients
 are perhaps relying on some kind of undocumented implementation-specific
 behaviour.
-@node Lexer, Whitespace, Conventions, Top
+@node Lexer
 @unnumbered The Lexer
 @cindex lexer
+@cindex newlines
+@cindex escaped newlines
 @section Overview
 The lexer is contained in the file @file{cpplex.c}.  It is a hand-coded
@@ -143,7 +148,7 @@ output.
 @section Lexing a token
 Lexing of an individual token is handled by @code{_cpp_lex_direct} and
 its subroutines.  In its current form the code is quite complicated,
-with read ahead characters and suchlike, since it strives to not step
+with read ahead characters and such-like, since it strives to not step
 back in the character stream in preparation for handling non-ASCII file
 encodings.  The current plan is to convert any such files to UTF-8
 before processing them.  This complexity is therefore unnecessary and
@@ -175,7 +180,7 @@ using the line map code.
 The first token on a logical, i.e.@: unescaped, line has the flag
 @code{BOL} set for beginning-of-line.  This flag is intended for
 internal use, both to distinguish a @samp{#} that begins a directive
-from one that doesn't, and to generate a callback to clients that want
+from one that doesn't, and to generate a call-back to clients that want
 to be notified about the start of every non-directive line with tokens
 on it.  Clients cannot reliably determine this for themselves: the first
 token might be a macro, and the tokens of a macro expansion do not have
@@ -219,9 +224,28 @@ foo
 @end smallexample
 This is a good example of the subtlety of getting token spacing correct
-in the preprocessor; there are plenty of tests in the testsuite for
+in the preprocessor; there are plenty of tests in the test-suite for
 corner cases like this.
+The lexer is written to treat each of @samp{\r}, @samp{\n}, @samp{\r\n}
+and @samp{\n\r} as a single new line indicator.  This allows it to
+transparently preprocess MS-DOS, Macintosh and Unix files without their
+needing to pass through a special filter beforehand.
+We also decided to treat a backslash, either @samp{\} or the trigraph
+@samp{??/}, separated from one of the above newline indicators by
+non-comment whitespace only, as intending to escape the newline.  It
+tends to be a typing mistake, and cannot reasonably be mistaken for
+anything else in any of the C-family grammars.  Since handling it this
+way is not strictly conforming to the ISO standard, the library issues a
+warning wherever it encounters it.
+Handling newlines like this is made simpler by doing it in one place
+only.  The function @code{handle_newline} takes care of all newline
+characters, and @code{skip_escaped_newlines} takes care of arbitrarily
+long sequences of escaped newlines, deferring to @code{handle_newline}
+to handle the newlines themselves.
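
The newline rule described in the added text can be illustrated with a small C sketch. This is not cpplib's `handle_newline`; the helper names `newline_length` and `count_newlines` are hypothetical, and the sketch ignores escaped newlines and trigraphs entirely. It only shows how `\r`, `\n`, `\r\n` and `\n\r` can each be consumed as a single newline indicator.

```c
#include <stddef.h>

/* Hypothetical helper, not part of cpplib.  P points at a '\r' or '\n'.
   Return how many bytes form one newline indicator: 2 for the mixed
   pairs "\r\n" and "\n\r", otherwise 1.  */
static size_t
newline_length (const char *p, const char *end)
{
  if (p + 1 < end && (p[1] == '\r' || p[1] == '\n') && p[1] != p[0])
    return 2;
  return 1;
}

/* Count newline indicators in BUF, treating each of "\r", "\n",
   "\r\n" and "\n\r" as exactly one line ending.  */
static unsigned int
count_newlines (const char *buf, size_t len)
{
  const char *p = buf, *end = buf + len;
  unsigned int lines = 0;

  while (p < end)
    {
      if (*p == '\r' || *p == '\n')
        {
          p += newline_length (p, end);
          lines++;
        }
      else
        p++;
    }
  return lines;
}
```

Note that two identical characters in a row, such as `\n\n`, count as two line endings; only a mixed pair is merged.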
 The most painful aspect of lexing ISO-standard C and C++ is handling
 trigraphs and backslash-escaped newlines.  Trigraphs are processed before
 any interpretation of the meaning of a character is made, and unfortunately
@@ -255,6 +279,7 @@ should be done even within C-style comments; they can appear in the
 middle of a line, and we want to report diagnostics in the correct
 position for text appearing after the end of the comment.
+@anchor{Invalid identifiers}
 Some identifiers, such as @code{__VA_ARGS__} and poisoned identifiers,
 may be invalid and require a diagnostic.  However, if they appear in a
 macro expansion we don't want to complain with each use of the macro.
@@ -282,94 +307,100 @@ two separate @samp{:} tokens and almost certainly a syntax error.  Such
 cases are handled by @code{_cpp_lex_direct} based upon command-line
 flags stored in the @code{cpp_options} structure.
+Once a token has been lexed, it leads an independent existence.  The
+spelling of numbers, identifiers and strings is copied to permanent
+storage from the original input buffer, so a token remains valid and
+correct even if its source buffer is freed with @code{_cpp_pop_buffer}.
+The storage holding the spellings of such tokens remains until the
+client program calls @code{cpp_destroy}, probably at the end of the translation
+unit.
 @anchor{Lexing a line}
 @section Lexing a line
+@cindex token run
-@node Whitespace, Hash Nodes, Lexer, Top
-@unnumbered Whitespace
+When the preprocessor was changed to return pointers to tokens, one
-@cindex whitespace
+feature I wanted was some sort of guarantee regarding how long a
-@cindex newlines
+returned pointer remains valid.  This is important to the stand-alone
-@cindex escaped newlines
+preprocessor, the future direction of the C family front ends, and even
-@cindex paste avoidance
+to cpplib itself internally.
-@cindex line numbers
+Occasionally the preprocessor wants to be able to peek ahead in the
-The lexer has been written to treat each of @samp{\r}, @samp{\n},
+token stream.  For example, after the name of a function-like macro, it
-@samp{\r\n} and @samp{\n\r} as a single new line indicator.  This allows
+wants to check the next token to see if it is an opening parenthesis.
-it to transparently preprocess MS-DOS, Macintosh and Unix files without
+Another example is that, after reading the first few tokens of a
-their needing to pass through a special filter beforehand.
+@code{#pragma} directive and not recognising it as a registered pragma,
+it wants to backtrack and allow the user-defined handler for unknown
-We also decided to treat a backslash, either @samp{\} or the trigraph
+pragmas to access the full @code{#pragma} token stream.  The stand-alone
-@samp{??/}, separated from one of the above newline indicators by
+preprocessor wants to be able to test the current token with the
-non-comment whitespace only, as intending to escape the newline.  It
+previous one to see if a space needs to be inserted to preserve their
-tends to be a typing mistake, and cannot reasonably be mistaken for
+separate tokenization upon re-lexing (paste avoidance), so it needs to
-anything else in any of the C-family grammars.  Since handling it this
+be sure the pointer to the previous token is still valid.  The
-way is not strictly conforming to the ISO standard, the library issues a
+recursive-descent C++ parser wants to be able to perform tentative
-warning wherever it encounters it.
+parsing arbitrarily far ahead in the token stream, and then to be able
+to jump back to a prior position in that stream if necessary.
-Handling newlines like this is made simpler by doing it in one place
-only.  The function @samp{handle_newline} takes care of all newline
+The rule I chose, which is fairly natural, is to arrange that the
-characters, and @samp{skip_escaped_newlines} takes care of arbitrarily
+preprocessor lex all tokens on a line consecutively into a token buffer,
-long sequences of escaped newlines, deferring to @samp{handle_newline}
+which I call a @dfn{token run}, and when meeting an unescaped new line
-to handle the newlines themselves.
+(newlines within comments do not count either), to start lexing back at
+the beginning of the run.  Note that we do @emph{not} lex a line of
-Another whitespace issue only concerns the stand-alone preprocessor: we
+tokens at once; if we did that @code{parse_identifier} would not have
-want to guarantee that re-reading the preprocessed output results in an
+state flags available to warn about invalid identifiers (@pxref{Invalid
-identical token stream.  Without taking special measures, this might not
+identifiers}).
-be the case because of macro substitution.  We could simply insert a
-space between adjacent tokens, but ideally we would like to keep this to
+In other words, accessing tokens that appeared earlier in the current
-a minimum, both for aesthetic reasons and because it causes problems for
+line is valid, but since each logical line overwrites the tokens of the
-people who still try to abuse the preprocessor for things like Fortran
+previous line, tokens from prior lines are unavailable.  In particular,
-source and Makefiles.
+since a directive only occupies a single logical line, this means that
+the directive handlers like the @code{#pragma} handler can jump around
-The token structure contains a flags byte, and two flags are of interest
+in the directive's tokens if necessary.
-here: @samp{PREV_WHITE} and @samp{AVOID_LPASTE}.  @samp{PREV_WHITE}
-indicates that the token was preceded by whitespace; if this is the case
+Two issues remain: what about tokens that arise from macro expansions,
-we need not worry about it incorrectly pasting with its predecessor.
+and what happens when we have a long line that overflows the token run?
-The @samp{AVOID_LPASTE} flag is set by the macro expansion routines, and
-indicates that paste avoidance by insertion of a space to the left of
+Since we promise clients that we preserve the validity of pointers that
-the token may be necessary.  Recursively, the first token of a macro
+we have already returned for tokens that appeared earlier in the line,
-substitution, the first token after a macro substitution, the first
+we cannot reallocate the run.  Instead, on overflow it is expanded by
-token of a substituted argument, and the first token after a substituted
+chaining a new token run on to the end of the existing one.
-argument are all flagged @samp{AVOID_LPASTE} by the macro expander.
+The tokens forming a macro's replacement list are collected by the
-If a token flagged in this way does not have a @samp{PREV_WHITE} flag,
+@code{#define} handler, and placed in storage that is only freed by
-and the routine @code{cpp_avoid_paste} determines that it might be
+@code{cpp_destroy}.  So if a macro is expanded in our line of tokens,
-misinterpreted by the lexer if a space is not inserted between it and
+the pointers to the tokens of its expansion that we return will always
-the immediately preceding token, then stand-alone CPP's output routines
+remain valid.  However, macros are a little trickier than that, since
-will insert a space between them.  To avoid excessive spacing,
+they give rise to three sources of fresh tokens.  They are the built-in
-@code{cpp_avoid_paste} tries hard to only request a space if one is
+macros like @code{__LINE__}, and the @samp{#} and @samp{##} operators
-likely to be necessary, but for reasons of efficiency it is slightly
+for stringification and token pasting.  I handled this by allocating
-conservative and might recommend a space where one is not strictly
+space for these tokens from the lexer's token run chain.  This means
-needed.
+they automatically receive the same lifetime guarantees as lexed tokens,
+and we don't need to concern ourselves with freeing them.
-Finally, the preprocessor takes great care to ensure it keeps track of
-both the position of a token in the source file, for diagnostic
+Lexing into a line of tokens solves some of the token memory management
-purposes, and where it should appear in the output file, because using
+issues, but not all.  The opening parenthesis after a function-like
-CPP for other languages like assembler requires this.  The two positions
+macro name might lie on a different line, and the front ends definitely
-may differ for the following reasons:
+want the ability to look ahead past the end of the current line.  So
+cpplib only moves back to the start of the token run at the end of a
-@itemize @bullet
+line if the variable @code{keep_tokens} is zero.  Line-buffering is
-@item
+quite natural for the preprocessor, and as a result the only time cpplib
-Escaped newlines are deleted, so lines spliced in this way are joined to
+needs to increment this variable is whilst looking for the opening
-form a single logical line.
+parenthesis to, and reading the arguments of, a function-like macro.  In
+the near future cpplib will export an interface to increment and
-@item
+decrement this variable, so that clients can share full control over the
-A macro expansion replaces the tokens that form its invocation, but any
+lifetime of token pointers too.
-newlines appearing in the macro's arguments are interpreted as a single
-space, with the result that the macro's replacement appears in full on
+The routine @code{_cpp_lex_token} handles moving to new token runs,
-the same line that the macro name appeared in the source file.  This is
+calling @code{_cpp_lex_direct} to lex new tokens, or returning
-particularly important for stringification of arguments---newlines
+previously-lexed tokens if we stepped back in the token stream.  It also
-embedded in the arguments must appear in the string as spaces.
+checks each token for the @code{BOL} flag, which might indicate a
-@end itemize
+directive that needs to be handled, or require a start-of-line call-back
+to be made.  @code{_cpp_lex_token} also handles skipping over tokens in
-The source file location is maintained in the @code{lineno} member of the
+failed conditional blocks, and invalidates the control macro of the
-@code{cpp_buffer} structure, and the column number inferred from the
+multiple-include optimization if a token was successfully lexed outside
-current position in the buffer relative to the @code{line_base} buffer
+a directive.  In other words, its callers do not need to concern
-variable, which is updated with every newline whether escaped or not.
+themselves with such issues.
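
The token-run scheme described in the added "Lexing a line" text might be sketched roughly as follows. All type and function names here (`struct tokenrun`, `struct lexer`, `lexer_init`, `alloc_token`, `end_of_line`) are illustrative assumptions, not cpplib's actual definitions; only the `keep_tokens` counter is named in the text. The sketch shows the two properties the text relies on: overflow chains a new run instead of reallocating, so earlier pointers stay valid, and the lexer rewinds to the start of the first run at end of line only when `keep_tokens` is zero.

```c
#include <stdlib.h>

/* Illustrative types only; cpplib's real structures differ.  */
struct token { int type; };

struct tokenrun
{
  struct tokenrun *next;       /* Run chained on overflow, or NULL.  */
  struct token *base, *limit;  /* Bounds of this run's storage.  */
};

struct lexer
{
  struct tokenrun first;       /* Head of the chain.  */
  struct tokenrun *cur_run;    /* Run currently being filled.  */
  struct token *cur_token;     /* Next free slot.  */
  size_t run_size;             /* Tokens per run; must be non-zero.  */
  unsigned int keep_tokens;    /* Non-zero: do not rewind at end of line.  */
};

static void
init_run (struct tokenrun *run, size_t count)
{
  run->base = malloc (count * sizeof *run->base);
  if (run->base == NULL)
    abort ();
  run->limit = run->base + count;
  run->next = NULL;
}

static void
lexer_init (struct lexer *lx, size_t run_size)
{
  lx->run_size = run_size;
  lx->keep_tokens = 0;
  init_run (&lx->first, run_size);
  lx->cur_run = &lx->first;
  lx->cur_token = lx->first.base;
}

/* Hand out the next token slot.  On overflow, move to (or create) the
   next run in the chain; existing runs are never moved or resized.  */
static struct token *
alloc_token (struct lexer *lx)
{
  if (lx->cur_token == lx->cur_run->limit)
    {
      if (lx->cur_run->next == NULL)
        {
          struct tokenrun *run = malloc (sizeof *run);
          if (run == NULL)
            abort ();
          init_run (run, lx->run_size);
          lx->cur_run->next = run;
        }
      lx->cur_run = lx->cur_run->next;
      lx->cur_token = lx->cur_run->base;
    }
  return lx->cur_token++;
}

/* At an unescaped newline, reuse the storage from the start of the
   chain unless someone has asked for tokens to be kept.  */
static void
end_of_line (struct lexer *lx)
{
  if (lx->keep_tokens == 0)
    {
      lx->cur_run = &lx->first;
      lx->cur_token = lx->first.base;
    }
}
```

Chained runs are kept after a rewind, so a long line only pays the allocation cost once.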
-TODO: Finish this.
+@node Hash Nodes
-@node Hash Nodes, Macro Expansion, Whitespace, Top
 @unnumbered Hash Nodes
 @cindex hash table
 @cindex identifiers
@@ -377,12 +408,12 @@ TODO: Finish this.
 @cindex assertions
 @cindex named operators
-When cpplib encounters an ``identifier'', it generates a hash code for it
+When cpplib encounters an ``identifier'', it generates a hash code for
-and stores it in the hash table.  By ``identifier'' we mean tokens with
+it and stores it in the hash table.  By ``identifier'' we mean tokens
-type @samp{CPP_NAME}; this includes identifiers in the usual C sense, as
+with type @code{CPP_NAME}; this includes identifiers in the usual C
-well as keywords, directive names, macro names and so on.  For example,
+sense, as well as keywords, directive names, macro names and so on.  For
-all of @samp{pragma}, @samp{int}, @samp{foo} and @samp{__GNUC__} are identifiers and hashed
+example, all of @code{pragma}, @code{int}, @code{foo} and
-when lexed.
+@code{__GNUC__} are identifiers and hashed when lexed.
 Each node in the hash table contains various information about the
 identifier it represents, for example its length and type.  At any one
@@ -392,12 +423,12 @@ time, each identifier falls into exactly one of three categories:
 @item Macros
 These have been declared to be macros, either on the command line or
-with @code{#define}.  A few, such as @samp{__TIME__} are builtins
+with @code{#define}.  A few, such as @code{__TIME__} are built-ins
 entered in the hash table during initialisation.  The hash node for a
 normal macro points to a structure with more information about the
 macro, such as whether it is function-like, how many arguments it takes,
-and its expansion.  Builtin macros are flagged as special, and instead
+and its expansion.  Built-in macros are flagged as special, and instead
-contain an enum indicating which of the various builtin macros it is.
+contain an enum indicating which of the various built-in macros it is.
 @item Assertions
@@ -413,7 +444,7 @@ currently a macro, or a macro that has since been undefined with
 @code{#undef}.
 When preprocessing C++, this category also includes the named operators,
-such as @samp{xor}.  In expressions these behave like the operators they
+such as @code{xor}.  In expressions these behave like the operators they
 represent, but in contexts where the spelling of a token matters they
 are spelt differently.  This spelling distinction is relevant when they
 are operands of the stringizing and pasting macro operators @code{#} and
@@ -429,13 +460,173 @@ hash node with the index of that argument.  This makes duplicated
 argument checking an O(1) operation for each argument.  Similarly, for
 each identifier in the macro's expansion, lookup to see if it is an
 argument, and which argument it is, is also an O(1) operation.  Further,
-each directive name, such as @samp{endif}, has an associated directive
+each directive name, such as @code{endif}, has an associated directive
 enum stored in its hash node, so that directive lookup is also O(1).
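
The O(1) claims in this paragraph follow from keeping per-identifier data in the hash node itself. A minimal sketch, assuming a simple chained hash table: the names `struct hashnode`, `lookup`, `note_parameter`, the `directive` and `arg_index` fields and the enum values are all hypothetical and do not reflect cpplib's real layout.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical subset of directive codes.  */
enum directive { D_NONE = 0, D_DEFINE, D_ENDIF };

/* Hypothetical hash node: one per distinct identifier spelling.  */
struct hashnode
{
  struct hashnode *next;     /* Bucket chain.  */
  char *name;
  size_t len;
  enum directive directive;  /* D_NONE unless a directive name.  */
  unsigned int arg_index;    /* 1-based macro parameter index, or 0.  */
};

#define NBUCKETS 64
static struct hashnode *buckets[NBUCKETS];

/* Find the node for NAME, creating it on first sight.  */
static struct hashnode *
lookup (const char *name, size_t len)
{
  unsigned int h = 0;
  size_t i;
  struct hashnode *node;

  for (i = 0; i < len; i++)
    h = h * 67 + (unsigned char) name[i];
  h %= NBUCKETS;

  for (node = buckets[h]; node != NULL; node = node->next)
    if (node->len == len && memcmp (node->name, name, len) == 0)
      return node;

  node = calloc (1, sizeof *node);
  if (node == NULL)
    abort ();
  node->name = malloc (len + 1);
  if (node->name == NULL)
    abort ();
  memcpy (node->name, name, len);
  node->name[len] = '\0';
  node->len = len;
  node->next = buckets[h];
  buckets[h] = node;
  return node;
}

/* Record NODE as parameter number INDEX of the macro being defined.
   Return 0 if it was already a parameter (a duplicate), else 1.
   Given the node, this is a constant-time check.  */
static int
note_parameter (struct hashnode *node, unsigned int index)
{
  if (node->arg_index != 0)
    return 0;
  node->arg_index = index;
  return 1;
}
```

Once the lexer has produced the node for an identifier, checking whether it is a directive name or a macro parameter is a field read rather than a second string comparison. A real implementation would also have to clear `arg_index` when the macro definition ends; the sketch omits that.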
-@node Macro Expansion, Files, Hash Nodes, Top
+@node Macro Expansion
 @unnumbered Macro Expansion Algorithm
-@node Files, Index, Macro Expansion, Top
+@c TODO
+@node Token Spacing
+@unnumbered Token Spacing
+@cindex paste avoidance
+@cindex spacing
+@cindex token spacing
+First, let's look at an issue that only concerns the stand-alone
+preprocessor: we want to guarantee that re-reading its preprocessed
+output results in an identical token stream.  Without taking special
+measures, this might not be the case because of macro substitution.  For
+example:
+@smallexample
+#define PLUS +
+#define EMPTY
+#define f(x) =x=
+PLUS -EMPTY- PLUS+ f(=)
+        @expansion{} + + - - + + = = =
+@emph{not}
+        @expansion{} ++ -- ++ ===
+@end smallexample
+One solution would be to simply insert a space between all adjacent
+tokens.  However, we would like to keep space insertion to a minimum,
+both for aesthetic reasons and because it causes problems for people who
+still try to abuse the preprocessor for things like Fortran source and
+Makefiles.
+For now, just notice that the only places we need to be careful about
+@dfn{paste avoidance} are when tokens are added (or removed) from the
+original token stream.  This only occurs because of macro expansion, but
+care is needed in many places: before @strong{and} after each macro
+replacement, each argument replacement, and additionally each token
+created by the @samp{#} and @samp{##} operators.
+Let's look at how the preprocessor gets whitespace output correct
+normally.  The @code{cpp_token} structure contains a flags byte, and one
+of those flags is @code{PREV_WHITE}.  This is flagged by the lexer, and
+indicates that the token was preceded by whitespace of some form other
+than a new line.  The stand-alone preprocessor can use this flag to
+decide whether to insert a space between tokens in the output.
+Now consider the following:
+@smallexample
+#define add(x, y, z) x + y +z;
+sum = add (1,2, 3);
+        @expansion{} sum = 1 + 2 +3;
+@end smallexample
+The interesting thing here is that the tokens @samp{1} and @samp{2} are
+output with a preceding space, and @samp{3} is output without a
+preceding space, but when lexed none of these tokens had that property.
+Careful consideration reveals that @samp{1} gets its preceding
+whitespace from the space preceding @samp{add} in the macro
+@emph{invocation}, @samp{2} gets its whitespace from the space preceding
+the parameter @samp{y} in the macro @emph{replacement list}, and
+@samp{3} has no preceding space because parameter @samp{z} has none in
+the replacement list.
+Once lexed, tokens are effectively fixed and cannot be altered, since
+pointers to them might be held in many places, in particular by
+in-progress macro expansions.  So instead of modifying the two tokens
+above, the preprocessor inserts a special token, which I call a
+@dfn{padding token}, into the token stream in front of every macro
+expansion and expanded macro argument, to indicate that the subsequent
+token should assume its @code{PREV_WHITE} flag from a different
+@dfn{source token}.  In the above example, the source tokens are
+@samp{add} in the macro invocation, and @samp{y} and @samp{z} in the
+macro replacement list, respectively.
+It is quite easy to get multiple padding tokens in a row, for example if
+a macro's first replacement token expands straight into another macro.
+@smallexample
+#define foo bar
+#define bar baz
+[foo]
+        @expansion{} [baz]
+@end smallexample
+Here, two padding tokens with sources @samp{foo} between the brackets,
+and @samp{bar} from foo's replacement list, are generated.  Clearly the
+first padding token is the one that matters.  But what if we happen to
+leave a macro expansion?  Adjusting the above example slightly:
+@smallexample
+#define foo bar
+#define bar EMPTY baz
+#define EMPTY
+[foo] EMPTY;
+        @expansion{} [ baz] ;
+@end smallexample
+As shown, now there should be a space before baz and the semicolon.  Our
+initial algorithm fails for the former, because we would see three
+padding tokens, one per macro invocation, followed by @samp{baz}, which
+would inherit its spacing from the original source, @samp{foo},
+which has no leading space.  Note that it is vital that cpplib get
+spacing correct in these examples, since any of these macro expansions
+could be stringified, where spacing matters.
+So, I have demonstrated that not just entering macro and argument
+expansions, but leaving them requires special handling too.  So cpplib
+inserts a padding token with a @code{NULL} source token when leaving
+macro expansions and after each replaced argument in a macro's
+replacement list.  It also inserts appropriate padding tokens on either
+side of tokens created by the @samp{#} and @samp{##} operators.
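
One way to read the padding-token rules above is as a small state machine in the output routine. The sketch below is an assumption about how such a routine could work, not a copy of the stand-alone preprocessor's code: the names `struct tok`, `T_PADDING`, `spell_stream` and `append` are invented for illustration, and the token stream is built by hand rather than produced by macro expansion. The rule it implements: the first padding token's source decides the spacing, except that a `NULL`-source padding token discards a source that carries no `PREV_WHITE`, so a later token or padding token can supply the spacing instead.

```c
#include <stddef.h>
#include <string.h>

#define PREV_WHITE 1u

enum ttype { T_OTHER, T_PADDING };

/* Illustrative token; cpplib's cpp_token is richer.  */
struct tok
{
  enum ttype type;
  unsigned int flags;
  const struct tok *source;  /* T_PADDING only; may be NULL.  */
  const char *spelling;      /* T_OTHER only.  */
};

/* Append S to OUT if it fits in CAP bytes; otherwise drop it.  */
static void
append (char *out, size_t cap, const char *s)
{
  size_t used = strlen (out), len = strlen (s);

  if (used + len < cap)
    memcpy (out + used, s, len + 1);
}

/* Spell TOKS[0..N) into OUT, taking each token's leading space from
   the governing padding source when there is one, and from the token
   itself otherwise.  */
static char *
spell_stream (const struct tok *toks, size_t n, char *out, size_t cap)
{
  const struct tok *source = NULL;
  size_t i;

  if (cap == 0)
    return out;
  out[0] = '\0';

  for (i = 0; i < n; i++)
    {
      const struct tok *t = &toks[i];

      if (t->type == T_PADDING)
        {
          if (source == NULL
              || (!(source->flags & PREV_WHITE) && t->source == NULL))
            source = t->source;
          continue;
        }

      if (source == NULL)
        source = t;
      if (source->flags & PREV_WHITE)
        append (out, cap, " ");
      append (out, cap, t->spelling);
      source = NULL;
    }
  return out;
}
```

Feeding it a hand-built stream for the `[foo] EMPTY;` example (padding tokens for entering `foo`, `bar` and `EMPTY`, and `NULL`-source padding tokens for leaving each) gives `[ baz] ;`, matching the expansion shown above.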
+Now we can see the relationship with paste avoidance: we have to be
+careful about paste avoidance in exactly the same locations we take care
+to get white space correct.  This makes implementation of paste
+avoidance easy: wherever the stand-alone preprocessor is fixing up
+spacing because of padding tokens, and it turns out that no space is
+needed, it has to take the extra step to check that a space is not
+needed after all to avoid an accidental paste.  The function
+@code{cpp_avoid_paste} advises whether a space is required between two
+consecutive tokens.  To avoid excessive spacing, it tries hard to only
+require a space if one is likely to be necessary, but for reasons of
+efficiency it is slightly conservative and might recommend a space where
+one is not strictly needed.
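
To make the idea of a conservative paste check concrete, here is a deliberately simplified sketch. It is not `cpp_avoid_paste`: the function name `needs_space` is hypothetical, it looks only at the last character of the previous spelling and the first character of the next one, and it is not exhaustive (for instance, it ignores digraphs and preprocessing numbers such as `1e` followed by `+`).

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical, simplified check: return non-zero if writing a token
   ending in PREV_LAST directly before a token starting with NEXT_FIRST
   could make the two lex as one different token.  */
static int
needs_space (char prev_last, char next_first)
{
  unsigned char a = (unsigned char) prev_last;
  unsigned char b = (unsigned char) next_first;

  /* Identifiers and numbers run together.  */
  if ((isalnum (a) || a == '_') && (isalnum (b) || b == '_'))
    return 1;

  /* Many operators absorb a following '='.  */
  if (b == '=' && a != '\0' && strchr ("=!<>+-*/%&|^", a) != NULL)
    return 1;

  switch (a)
    {
    case '+': return b == '+';
    case '-': return b == '-' || b == '>';
    case '<': return b == '<' || b == ':' || b == '%';
    case '>': return b == '>';
    case '&': return b == '&';
    case '|': return b == '|';
    case '/': return b == '/' || b == '*';
    case ':': return b == ':' || b == '>';
    case '#': return b == '#';
    case '.': return b == '.' || isdigit (b);
    default:  return 0;
    }
}
```

For the `PLUS -EMPTY- PLUS+ f(=)` example earlier in this section, such a check asks for a space between the two `+`, the two `-` and the adjacent `=` tokens, but not between, say, `)` and `(`.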
+@node Line Numbering
+@unnumbered Line numbering
+@cindex line numbers
+The preprocessor takes great care to ensure it keeps track of both the
+position of a token in the source file, for diagnostic purposes, and
+where it should appear in the output file, because using CPP for other
+languages like assembler requires this.  The two positions may differ
+for the following reasons:
+@itemize @bullet
+@item
+Escaped newlines are deleted, so lines spliced in this way are joined to
+form a single logical line.
+@item
+A macro expansion replaces the tokens that form its invocation, but any
+newlines appearing in the macro's arguments are interpreted as a single
+space, with the result that the macro's replacement appears in full on
+the same line that the macro name appeared in the source file.  This is
+particularly important for stringification of arguments---newlines
+embedded in the arguments must appear in the string as spaces.
+@end itemize
+The source file location is maintained in the @code{lineno} member of the
+@code{cpp_buffer} structure, and the column number inferred from the
+current position in the buffer relative to the @code{line_base} buffer
+variable, which is updated with every newline whether escaped or not.
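
The relationship between the line number, `line_base` and the column can be sketched as below. The struct and function names are hypothetical, only `\n` is treated as a newline to keep the example short, and the 1-based column is an assumption of this sketch rather than something the text states.

```c
/* Hypothetical mirror of the two buffer members discussed above.  */
struct buffer_pos
{
  const char *line_base;  /* First character of the current line.  */
  unsigned int lineno;    /* Current physical source line.  */
};

/* Column of CUR, inferred from its distance to line_base
   (1-based here by assumption).  */
static unsigned int
column_of (const struct buffer_pos *pos, const char *cur)
{
  return (unsigned int) (cur - pos->line_base) + 1;
}

/* Scan [FROM, TO), updating the position at every newline, whether
   or not a backslash precedes it.  */
static void
advance_to (struct buffer_pos *pos, const char *from, const char *to)
{
  const char *p;

  for (p = from; p < to; p++)
    if (*p == '\n')
      {
        pos->lineno++;
        pos->line_base = p + 1;
      }
}
```

Because `line_base` moves at escaped newlines too, a diagnostic for a token on the continuation of a spliced line still reports the physical line and column where the token actually sits.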
+@c FINISH THIS
+@node Guard Macros
+@unnumbered The Multiple-Include Optimization
+@c TODO
+@node Files
 @unnumbered File Handling
 @cindex files
@@ -459,10 +650,10 @@ filesystem queries whilst searching for the correct file.
 For each file we try to open, we store the constructed path in a splay
 tree.  This path first undergoes simplification by the function
 @code{_cpp_simplify_pathname}.  For example,
-@samp{/usr/include/bits/../foo.h} is simplified to
+@file{/usr/include/bits/../foo.h} is simplified to
-@samp{/usr/include/foo.h} before we enter it in the splay tree and try
+@file{/usr/include/foo.h} before we enter it in the splay tree and try
 to @code{open ()} the file.  CPP will then find subsequent uses of
-@samp{foo.h}, even as @samp{/usr/include/foo.h}, in the splay tree and
+@file{foo.h}, even as @file{/usr/include/foo.h}, in the splay tree and
 save system calls.
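
The kind of simplification described here can be sketched for the easy case of an absolute path. This is an illustration only, not `_cpp_simplify_pathname`: the function name `simplify_absolute_path` is hypothetical, it handles only paths beginning with `/`, and it ignores the fact that removing `..` textually can be wrong when symbolic links are involved.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper.  Simplify PATH in place by dropping empty and
   "." components and resolving ".." against the preceding component.
   Paths that do not start with '/' are left untouched.  */
static void
simplify_absolute_path (char *path)
{
  char *in = path, *out = path;

  if (path[0] != '/')
    return;

  while (*in != '\0')
    {
      char *start;
      size_t len;

      while (*in == '/')
        in++;
      start = in;
      while (*in != '\0' && *in != '/')
        in++;
      len = (size_t) (in - start);

      if (len == 0 || (len == 1 && start[0] == '.'))
        continue;
      if (len == 2 && start[0] == '.' && start[1] == '.')
        {
          /* Back up over the previous component, if any.  */
          while (out > path && *--out != '/')
            ;
          continue;
        }

      /* The output never overtakes the input, so this is safe.  */
      *out++ = '/';
      memmove (out, start, len);
      out += len;
    }

  if (out == path)
    *out++ = '/';
  *out = '\0';
}
```

Using the simplified string as the splay tree key means different spellings of the same location share one entry.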
 Further, it is likely the file contents have also been cached, saving a
@@ -486,7 +677,7 @@ directory on a per-file basis is handled by the function
 Note that a header included with a directory component, such as
 @code{#include "mydir/foo.h"} and opened as
-@samp{/usr/local/include/mydir/foo.h}, will have the complete path minus
+@file{/usr/local/include/mydir/foo.h}, will have the complete path minus
 the basename @samp{foo.h} as the current directory.
 Enough information is stored in the splay tree that CPP can immediately
@@ -503,7 +694,7 @@ command line (or system) include directories to which the mapping
 applies.  This may be higher up the directory tree than the full path to
 the file minus the base name.
-@node Index,, Files, Top
+@node Index
 @unnumbered Index
 @printindex cp