Commit d3d43aab, authored Oct 04, 2001 by Neil Booth (committed by Neil Booth, Oct 04, 2001)
* doc/cppinternals.texi: Update.
From-SVN: r46009
Parent: 3054eeed

Showing 2 changed files with 312 additions and 117 deletions:

gcc/ChangeLog (+4 -0)
gcc/doc/cppinternals.texi (+308 -117)
gcc/ChangeLog

+2001-10-04  Neil Booth  <neil@daikokuya.demon.co.uk>
+
+	* doc/cppinternals.texi: Update.
+
 2001-10-04  Eric Christopher  <echristo@redhat.com>
 
 	* config/mips/mips.c (init_cumulative_args): Remember to set
...
gcc/doc/cppinternals.texi

@@ -66,7 +66,8 @@ into another language, under the above conditions for modified versions.
 @contents
 @page
 
-@node Top, Conventions,, (DIR)
+@node Top
+@top
 @chapter Cpplib---the core of the GNU C Preprocessor
 
 The GNU C preprocessor in GCC 3.x has been completely rewritten. It is
@@ -87,16 +88,18 @@ tricky issues encountered. It also describes certain behaviour we would
 like to preserve, such as the format and spacing of its output.
 
 @menu
 * Conventions::         Conventions used in the code.
 * Lexer::               The combined C, C++ and Objective-C Lexer.
-* Whitespace::          Input and output newlines and whitespace.
-* Hash Nodes::          All identifiers are hashed.
+* Hash Nodes::          All identifiers are entered into a hash table.
 * Macro Expansion::     Macro expansion algorithm.
+* Token Spacing::       Spacing and paste avoidance issues.
+* Line Numbering::      Tracking location within files.
+* Guard Macros::        Optimizing header files with guard macros.
 * Files::               File handling.
 * Index::               Index.
 @end menu
 
-@node Conventions, Lexer, Top, Top
+@node Conventions
 @unnumbered Conventions
 @cindex interface
 @cindex header files
@@ -118,9 +121,11 @@ change internals in the future without worrying whether library clients
 are perhaps relying on some kind of undocumented implementation-specific
 behaviour.
 
-@node Lexer, Whitespace, Conventions, Top
+@node Lexer
 @unnumbered The Lexer
 @cindex lexer
+@cindex newlines
+@cindex escaped newlines
 
 @section Overview
 The lexer is contained in the file @file{cpplex.c}. It is a hand-coded
@@ -143,7 +148,7 @@ output.
 @section Lexing a token
 Lexing of an individual token is handled by @code{_cpp_lex_direct} and
 its subroutines. In its current form the code is quite complicated,
-with read ahead characters and suchlike, since it strives to not step
+with read ahead characters and such-like, since it strives to not step
 back in the character stream in preparation for handling non-ASCII file
 encodings. The current plan is to convert any such files to UTF-8
 before processing them. This complexity is therefore unnecessary and
@@ -175,7 +180,7 @@ using the line map code.
 The first token on a logical, i.e.@: unescaped, line has the flag
 @code{BOL} set for beginning-of-line. This flag is intended for
 internal use, both to distinguish a @samp{#} that begins a directive
-from one that doesn't, and to generate a callback to clients that want
+from one that doesn't, and to generate a call-back to clients that want
 to be notified about the start of every non-directive line with tokens
 on it. Clients cannot reliably determine this for themselves: the first
 token might be a macro, and the tokens of a macro expansion do not have
@@ -219,9 +224,28 @@ foo
 @end smallexample
 
 This is a good example of the subtlety of getting token spacing correct
-in the preprocessor; there are plenty of tests in the testsuite for
+in the preprocessor; there are plenty of tests in the test-suite for
 corner cases like this.
 
+The lexer is written to treat each of @samp{\r}, @samp{\n},
+@samp{\r\n} and @samp{\n\r} as a single new line indicator. This
+allows it to transparently preprocess MS-DOS, Macintosh and Unix files
+without their needing to pass through a special filter beforehand.
+
+We also decided to treat a backslash, either @samp{\} or the trigraph
+@samp{??/}, separated from one of the above newline indicators by
+non-comment whitespace only, as intending to escape the newline. It
+tends to be a typing mistake, and cannot reasonably be mistaken for
+anything else in any of the C-family grammars. Since handling it this
+way is not strictly conforming to the ISO standard, the library issues a
+warning wherever it encounters it.
+
+Handling newlines like this is made simpler by doing it in one place
+only. The function @code{handle_newline} takes care of all newline
+characters, and @code{skip_escaped_newlines} takes care of arbitrarily
+long sequences of escaped newlines, deferring to @code{handle_newline}
+to handle the newlines themselves.
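
The newline rule in the added paragraphs above can be sketched in a few lines of C. This is a minimal illustration, not cpplib's code: `skip_newline` and `count_newlines` are hypothetical helpers invented here, and the real `handle_newline` (whose signature this diff does not show) also has to maintain line-number state.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: if P points at a newline indicator, return a
   pointer just past it, treating "\r\n" and "\n\r" as ONE newline.
   Returns P unchanged when *P is neither '\r' nor '\n'.  */
static const char *
skip_newline (const char *p)
{
  if (*p != '\r' && *p != '\n')
    return p;
  /* A different newline character immediately afterwards pairs with
     the first one; the same character again starts another line.  */
  if ((p[1] == '\r' || p[1] == '\n') && p[1] != p[0])
    return p + 2;
  return p + 1;
}

/* Count logical newlines in a NUL-terminated buffer.  */
static size_t
count_newlines (const char *buf)
{
  size_t n = 0;
  while (*buf)
    {
      const char *next = skip_newline (buf);
      if (next != buf)
        {
          n++;
          buf = next;
        }
      else
        buf++;
    }
  return n;
}
```

With this rule a Unix file (`\n`), an MS-DOS file (`\r\n`) and a classic Macintosh file (`\r`) containing the same two lines all report two newlines, which is the "no special filter" property the text describes.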
+
 The most painful aspect of lexing ISO-standard C and C++ is handling
 trigraphs and backslash-escaped newlines. Trigraphs are processed before
 any interpretation of the meaning of a character is made, and unfortunately
@@ -255,6 +279,7 @@ should be done even within C-style comments; they can appear in the
 middle of a line, and we want to report diagnostics in the correct
 position for text appearing after the end of the comment.
 
+@anchor{Invalid identifiers}
 Some identifiers, such as @code{__VA_ARGS__} and poisoned identifiers,
 may be invalid and require a diagnostic. However, if they appear in a
 macro expansion we don't want to complain with each use of the macro.
@@ -282,94 +307,100 @@ two separate @samp{:} tokens and almost certainly a syntax error. Such
 cases are handled by @code{_cpp_lex_direct} based upon command-line
 flags stored in the @code{cpp_options} structure.
 
+Once a token has been lexed, it leads an independent existence. The
+spelling of numbers, identifiers and strings is copied to permanent
+storage from the original input buffer, so a token remains valid and
+correct even if its source buffer is freed with @code{_cpp_pop_buffer}.
+The storage holding the spellings of such tokens remains until the
+client program calls cpp_destroy, probably at the end of the translation
+unit.
+
 @anchor{Lexing a line}
 @section Lexing a line
-
-@node Whitespace, Hash Nodes, Lexer, Top
-@unnumbered Whitespace
-@cindex whitespace
-@cindex newlines
-@cindex escaped newlines
-@cindex paste avoidance
-@cindex line numbers
-
-The lexer has been written to treat each of @samp{\r}, @samp{\n},
-@samp{\r\n} and @samp{\n\r} as a single new line indicator. This allows
-it to transparently preprocess MS-DOS, Macintosh and Unix files without
-their needing to pass through a special filter beforehand.
-
-We also decided to treat a backslash, either @samp{\} or the trigraph
-@samp{??/}, separated from one of the above newline indicators by
-non-comment whitespace only, as intending to escape the newline. It
-tends to be a typing mistake, and cannot reasonably be mistaken for
-anything else in any of the C-family grammars. Since handling it this
-way is not strictly conforming to the ISO standard, the library issues a
-warning wherever it encounters it.
-
-Handling newlines like this is made simpler by doing it in one place
-only. The function @samp{handle_newline} takes care of all newline
-characters, and @samp{skip_escaped_newlines} takes care of arbitrarily
-long sequences of escaped newlines, deferring to @samp{handle_newline}
-to handle the newlines themselves.
-
-Another whitespace issue only concerns the stand-alone preprocessor: we
-want to guarantee that re-reading the preprocessed output results in an
-identical token stream. Without taking special measures, this might not
-be the case because of macro substitution. We could simply insert a
-space between adjacent tokens, but ideally we would like to keep this to
-a minimum, both for aesthetic reasons and because it causes problems for
-people who still try to abuse the preprocessor for things like Fortran
-source and Makefiles.
-
-The token structure contains a flags byte, and two flags are of interest
-here: @samp{PREV_WHITE} and @samp{AVOID_LPASTE}. @samp{PREV_WHITE}
-indicates that the token was preceded by whitespace; if this is the case
-we need not worry about it incorrectly pasting with its predecessor.
-The @samp{AVOID_LPASTE} flag is set by the macro expansion routines, and
-indicates that paste avoidance by insertion of a space to the left of
-the token may be necessary. Recursively, the first token of a macro
-substitution, the first token after a macro substitution, the first
-token of a substituted argument, and the first token after a substituted
-argument are all flagged @samp{AVOID_LPASTE} by the macro expander.
-
-If a token flagged in this way does not have a @samp{PREV_WHITE} flag,
-and the routine @code{cpp_avoid_paste} determines that it might be
-misinterpreted by the lexer if a space is not inserted between it and
-the immediately preceding token, then stand-alone CPP's output routines
-will insert a space between them. To avoid excessive spacing,
-@code{cpp_avoid_paste} tries hard to only request a space if one is
-likely to be necessary, but for reasons of efficiency it is slightly
-conservative and might recommend a space where one is not strictly
-needed.
-
-Finally, the preprocessor takes great care to ensure it keeps track of
-both the position of a token in the source file, for diagnostic
-purposes, and where it should appear in the output file, because using
-CPP for other languages like assembler requires this. The two positions
-may differ for the following reasons:
-
-@itemize @bullet
-@item
-Escaped newlines are deleted, so lines spliced in this way are joined to
-form a single logical line.
-
-@item
-A macro expansion replaces the tokens that form its invocation, but any
-newlines appearing in the macro's arguments are interpreted as a single
-space, with the result that the macro's replacement appears in full on
-the same line that the macro name appeared in the source file. This is
-particularly important for stringification of arguments---newlines
-embedded in the arguments must appear in the string as spaces.
-@end itemize
-
-The source file location is maintained in the @code{lineno} member of the
-@code{cpp_buffer} structure, and the column number inferred from the
-current position in the buffer relative to the @code{line_base} buffer
-variable, which is updated with every newline whether escaped or not.
-
-TODO: Finish this.
+@cindex token run
+
+When the preprocessor was changed to return pointers to tokens, one
+feature I wanted was some sort of guarantee regarding how long a
+returned pointer remains valid. This is important to the stand-alone
+preprocessor, the future direction of the C family front ends, and even
+to cpplib itself internally.
+
+Occasionally the preprocessor wants to be able to peek ahead in the
+token stream. For example, after the name of a function-like macro, it
+wants to check the next token to see if it is an opening parenthesis.
+Another example is that, after reading the first few tokens of a
+@code{#pragma} directive and not recognising it as a registered pragma,
+it wants to backtrack and allow the user-defined handler for unknown
+pragmas to access the full @code{#pragma} token stream. The stand-alone
+preprocessor wants to be able to test the current token with the
+previous one to see if a space needs to be inserted to preserve their
+separate tokenization upon re-lexing (paste avoidance), so it needs to
+be sure the pointer to the previous token is still valid. The
+recursive-descent C++ parser wants to be able to perform tentative
+parsing arbitrarily far ahead in the token stream, and then to be able
+to jump back to a prior position in that stream if necessary.
+
+The rule I chose, which is fairly natural, is to arrange that the
+preprocessor lex all tokens on a line consecutively into a token buffer,
+which I call a @dfn{token run}, and when meeting an unescaped new line
+(newlines within comments do not count either), to start lexing back at
+the beginning of the run. Note that we do @emph{not} lex a line of
+tokens at once; if we did that @code{parse_identifier} would not have
+state flags available to warn about invalid identifiers (@pxref{Invalid
+identifiers}).
+
+In other words, accessing tokens that appeared earlier in the current
+line is valid, but since each logical line overwrites the tokens of the
+previous line, tokens from prior lines are unavailable. In particular,
+since a directive only occupies a single logical line, this means that
+the directive handlers like the @code{#pragma} handler can jump around
+in the directive's tokens if necessary.
+
+Two issues remain: what about tokens that arise from macro expansions,
+and what happens when we have a long line that overflows the token run?
+
+Since we promise clients that we preserve the validity of pointers that
+we have already returned for tokens that appeared earlier in the line,
+we cannot reallocate the run. Instead, on overflow it is expanded by
+chaining a new token run on to the end of the existing one.
+
+The tokens forming a macro's replacement list are collected by the
+@code{#define} handler, and placed in storage that is only freed by
+@code{cpp_destroy}. So if a macro is expanded in our line of tokens,
+the pointers to the tokens of its expansion that we return will always
+remain valid. However, macros are a little trickier than that, since
+they give rise to three sources of fresh tokens. They are the built-in
+macros like @code{__LINE__}, and the @samp{#} and @samp{##} operators
+for stringification and token pasting. I handled this by allocating
+space for these tokens from the lexer's token run chain. This means
+they automatically receive the same lifetime guarantees as lexed tokens,
+and we don't need to concern ourselves with freeing them.
+
+Lexing into a line of tokens solves some of the token memory management
+issues, but not all. The opening parenthesis after a function-like
+macro name might lie on a different line, and the front ends definitely
+want the ability to look ahead past the end of the current line. So
+cpplib only moves back to the start of the token run at the end of a
+line if the variable @code{keep_tokens} is zero. Line-buffering is
+quite natural for the preprocessor, and as a result the only time cpplib
+needs to increment this variable is whilst looking for the opening
+parenthesis to, and reading the arguments of, a function-like macro. In
+the near future cpplib will export an interface to increment and
+decrement this variable, so that clients can share full control over the
+lifetime of token pointers too.
+
+The routine @code{_cpp_lex_token} handles moving to new token runs,
+calling @code{_cpp_lex_direct} to lex new tokens, or returning
+previously-lexed tokens if we stepped back in the token stream. It also
+checks each token for the @code{BOL} flag, which might indicate a
+directive that needs to be handled, or require a start-of-line call-back
+to be made. @code{_cpp_lex_token} also handles skipping over tokens in
+failed conditional blocks, and invalidates the control macro of the
+multiple-include optimization if a token was successfully lexed outside
+a directive. In other words, its callers do not need to concern
+themselves with such issues.
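
The token-run rule in the added "Lexing a line" text (fill a buffer per line, chain a new run on overflow instead of reallocating, rewind at end of line unless `keep_tokens` is non-zero) can be sketched as follows. This is only an illustration under assumptions: the struct layouts, `RUN_SIZE`, `next_slot` and `end_of_line` are hypothetical names chosen here, and only `keep_tokens` is taken from the text.

```c
#include <assert.h>
#include <stdlib.h>

#define RUN_SIZE 4              /* Tiny, to force chaining in a demo.  */

struct token { int value; };

struct tokenrun
{
  struct token slots[RUN_SIZE];
  struct tokenrun *next;        /* Run chained on overflow, or NULL.  */
};

struct lexer
{
  struct tokenrun *first;       /* Start of the chain.  */
  struct tokenrun *cur;         /* Run currently being filled.  */
  int used;                     /* Slots used in CUR.  */
  int keep_tokens;              /* Non-zero: do not rewind at end of line.  */
};

static struct tokenrun *
new_run (void)
{
  struct tokenrun *run = calloc (1, sizeof *run);
  if (!run)
    abort ();
  return run;
}

static void
lexer_init (struct lexer *lex)
{
  lex->first = lex->cur = new_run ();
  lex->used = 0;
  lex->keep_tokens = 0;
}

/* Return the slot for the next token.  On overflow, move to the next
   run in the chain, allocating it if needed.  Existing runs are never
   reallocated, so pointers handed out earlier stay valid.  */
static struct token *
next_slot (struct lexer *lex)
{
  if (lex->used == RUN_SIZE)
    {
      if (!lex->cur->next)
        lex->cur->next = new_run ();
      lex->cur = lex->cur->next;
      lex->used = 0;
    }
  return &lex->cur->slots[lex->used++];
}

/* At an unescaped newline, start lexing back at the beginning of the
   chain unless a caller asked for tokens to be kept.  */
static void
end_of_line (struct lexer *lex)
{
  if (lex->keep_tokens == 0)
    {
      lex->cur = lex->first;
      lex->used = 0;
    }
}
```

Chaining rather than calling `realloc` is what makes the lifetime promise cheap to keep: a pointer to the first token of a long line still refers to the same storage after the line has spilled into a second or third run.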
 
-@node Hash Nodes, Macro Expansion, Whitespace, Top
+@node Hash Nodes
 @unnumbered Hash Nodes
 @cindex hash table
 @cindex identifiers
@@ -377,12 +408,12 @@ TODO: Finish this.
 @cindex assertions
 @cindex named operators
 
-When cpplib encounters an ``identifier'', it generates a hash code for it
-and stores it in the hash table. By ``identifier'' we mean tokens with
-type @samp{CPP_NAME}; this includes identifiers in the usual C sense, as
-well as keywords, directive names, macro names and so on. For example,
-all of @samp{pragma}, @samp{int}, @samp{foo} and @samp{__GNUC__} are identifiers and hashed
-when lexed.
+When cpplib encounters an ``identifier'', it generates a hash code for
+it and stores it in the hash table. By ``identifier'' we mean tokens
+with type @code{CPP_NAME}; this includes identifiers in the usual C
+sense, as well as keywords, directive names, macro names and so on. For
+example, all of @code{pragma}, @code{int}, @code{foo} and
+@code{__GNUC__} are identifiers and hashed when lexed.
 
 Each node in the hash table contains various information about the
 identifier it represents. For example, its length and type. At any one
@@ -392,12 +423,12 @@ time, each identifier falls into exactly one of three categories:
 @item Macros
 These have been declared to be macros, either on the command line or
-with @code{#define}. A few, such as @samp{__TIME__} are builtins
+with @code{#define}. A few, such as @code{__TIME__} are built-ins
 entered in the hash table during initialisation. The hash node for a
 normal macro points to a structure with more information about the
 macro, such as whether it is function-like, how many arguments it takes,
-and its expansion. Builtin macros are flagged as special, and instead
-contain an enum indicating which of the various builtin macros it is.
+and its expansion. Built-in macros are flagged as special, and instead
+contain an enum indicating which of the various built-in macros it is.
 
 @item Assertions
@@ -413,7 +444,7 @@ currently a macro, or a macro that has since been undefined with
 @code{#undef}.
 
 When preprocessing C++, this category also includes the named operators,
-such as @samp{xor}. In expressions these behave like the operators they
+such as @code{xor}. In expressions these behave like the operators they
 represent, but in contexts where the spelling of a token matters they
 are spelt differently. This spelling distinction is relevant when they
 are operands of the stringizing and pasting macro operators @code{#} and
@@ -429,13 +460,173 @@ hash node with the index of that argument. This makes duplicated
 argument checking an O(1) operation for each argument. Similarly, for
 each identifier in the macro's expansion, lookup to see if it is an
 argument, and which argument it is, is also an O(1) operation. Further,
-each directive name, such as @samp{endif}, has an associated directive
+each directive name, such as @code{endif}, has an associated directive
 enum stored in its hash node, so that directive lookup is also O(1).
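
The O(1) claims in the context lines above follow from every spelling mapping to exactly one hash node, so per-identifier data is a field read once the node is in hand. A minimal sketch of that idea follows; the node layout, the `directive` enum and `intern` are hypothetical and are not cpplib's definitions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum directive { D_NONE = 0, D_DEFINE, D_ENDIF, D_PRAGMA };

struct hashnode
{
  struct hashnode *next;        /* Bucket chain.  */
  char *name;
  enum directive dir;           /* D_NONE unless NAME is a directive.  */
  int arg_index;                /* 1-based macro argument index, or 0.  */
};

#define NBUCKETS 64
static struct hashnode *table[NBUCKETS];

static unsigned
hash (const char *s)
{
  unsigned h = 0;
  while (*s)
    h = h * 31 + (unsigned char) *s++;
  return h % NBUCKETS;
}

/* Find NAME in the table, entering it if absent.  Each spelling has
   exactly one node, so data hung off the node is found without any
   further string comparison.  */
static struct hashnode *
intern (const char *name)
{
  unsigned h = hash (name);
  struct hashnode *n;

  for (n = table[h]; n; n = n->next)
    if (strcmp (n->name, name) == 0)
      return n;

  n = calloc (1, sizeof *n);
  if (!n)
    abort ();
  n->name = malloc (strlen (name) + 1);
  if (!n->name)
    abort ();
  strcpy (n->name, name);
  n->next = table[h];
  table[h] = n;
  return n;
}
```

While a macro definition is being parsed, a duplicated parameter shows up as a node whose `arg_index` is already non-zero, which is the constant-time duplicate check the text mentions.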
 
-@node Macro Expansion, Files, Hash Nodes, Top
+@node Macro Expansion
 @unnumbered Macro Expansion Algorithm
-
-@node Files, Index, Macro Expansion, Top
+@c TODO
+
+@node Token Spacing
+@unnumbered Token Spacing
+@cindex paste avoidance
+@cindex spacing
+@cindex token spacing
+
+First, let's look at an issue that only concerns the stand-alone
+preprocessor: we want to guarantee that re-reading its preprocessed
+output results in an identical token stream. Without taking special
+measures, this might not be the case because of macro substitution. For
+example:
+
+@smallexample
+#define PLUS +
+#define EMPTY
+#define f(x) =x=
++PLUS -EMPTY- PLUS+ f(=)
+@expansion{} + + - - + + = = =
+@emph{not}
+@expansion{} ++ -- ++ ===
+@end smallexample
+
+One solution would be to simply insert a space between all adjacent
+tokens. However, we would like to keep space insertion to a minimum,
+both for aesthetic reasons and because it causes problems for people who
+still try to abuse the preprocessor for things like Fortran source and
+Makefiles.
+
+For now, just notice that the only places we need to be careful about
+@dfn{paste avoidance} are when tokens are added (or removed) from the
+original token stream. This only occurs because of macro expansion, but
+care is needed in many places: before @strong{and} after each macro
+replacement, each argument replacement, and additionally each token
+created by the @samp{#} and @samp{##} operators.
+
+Let's look at how the preprocessor gets whitespace output correct
+normally. The @code{cpp_token} structure contains a flags byte, and one
+of those flags is @code{PREV_WHITE}. This is flagged by the lexer, and
+indicates that the token was preceded by whitespace of some form other
+than a new line. The stand-alone preprocessor can use this flag to
+decide whether to insert a space between tokens in the output.
+
+Now consider the following:
+
+@smallexample
+#define add(x, y, z) x + y +z;
+sum = add (1,2, 3);
+@expansion{} sum = 1 + 2 +3;
+@end smallexample
+
+The interesting thing here is that the tokens @samp{1} and @samp{2} are
+output with a preceding space, and @samp{3} is output without a
+preceding space, but when lexed none of these tokens had that property.
+Careful consideration reveals that @samp{1} gets its preceding
+whitespace from the space preceding @samp{add} in the macro
+@emph{invocation}, @samp{2} gets its whitespace from the space preceding
+the parameter @samp{y} in the macro @emph{replacement list}, and
+@samp{3} has no preceding space because parameter @samp{z} has none in
+the replacement list.
+
+Once lexed, tokens are effectively fixed and cannot be altered, since
+pointers to them might be held in many places, in particular by
+in-progress macro expansions. So instead of modifying the two tokens
+above, the preprocessor inserts a special token, which I call a
+@dfn{padding token}, into the token stream in front of every macro
+expansion and expanded macro argument, to indicate that the subsequent
+token should assume its @code{PREV_WHITE} flag from a different
+@dfn{source token}. In the above example, the source tokens are
+@samp{add} in the macro invocation, and @samp{y} and @samp{z} in the
+macro replacement list, respectively.
+
+It is quite easy to get multiple padding tokens in a row, for example if
+a macro's first replacement token expands straight into another macro.
+
+@smallexample
+#define foo bar
+#define bar baz
+[foo]
+@expansion{} [baz]
+@end smallexample
+
+Here, two padding tokens with sources @samp{foo} between the brackets,
+and @samp{bar} from foo's replacement list, are generated. Clearly the
+first padding token is the one that matters. But what if we happen to
+leave a macro expansion? Adjusting the above example slightly:
+
+@smallexample
+#define foo bar
+#define bar EMPTY baz
+#define EMPTY
+[foo] EMPTY;
+@expansion{} [ baz] ;
+@end smallexample
+
+As shown, now there should be a space before baz and the semicolon. Our
+initial algorithm fails for the former, because we would see three
+padding tokens, one per macro invocation, followed by @samp{baz}, which
+would inherit its spacing from the original source, @samp{foo},
+which has no leading space. Note that it is vital that cpplib get
+spacing correct in these examples, since any of these macro expansions
+could be stringified, where spacing matters.
+
+So, I have demonstrated that not just entering macro and argument
+expansions, but leaving them requires special handling too. So cpplib
+inserts a padding token with a @code{NULL} source token when leaving
+macro expansions and after each replaced argument in a macro's
+replacement list. It also inserts appropriate padding tokens on either
+side of tokens created by the @samp{#} and @samp{##} operators.
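
The padding-token scheme in the added text can be exercised with a small output loop. The sketch below is one reading of the prose, chosen because it reproduces the `add` and `[foo] EMPTY;` examples; the commit does not show cpplib's actual output routine, and the `token` layout, `TOK_PADDING` and `output_tokens` are hypothetical. Paste avoidance is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PREV_WHITE 1

enum kind { TOK_TEXT, TOK_PADDING };

struct token
{
  enum kind kind;
  unsigned flags;               /* PREV_WHITE or 0; text tokens only.  */
  const char *spelling;         /* Text tokens only.  */
  const struct token *source;   /* Padding tokens only; may be NULL.  */
};

/* Write the spellings of TOKS[0..N-1] to OUT (assumed large enough).
   A text token takes its leading space from the governing padding
   token's source when there is one, and from its own flag otherwise.  */
static void
output_tokens (const struct token *toks, size_t n, char *out)
{
  const struct token *source = NULL;
  size_t i;

  *out = '\0';
  for (i = 0; i < n; i++)
    {
      const struct token *t = &toks[i];

      if (t->kind == TOK_PADDING)
        {
          /* The first padding token in a row wins, except that a
             NULL-source padding (leaving an expansion) cancels a
             source that carries no whitespace.  */
          if (source == NULL
              || (!(source->flags & PREV_WHITE) && t->source == NULL))
            source = t->source;
          continue;
        }

      if (source == NULL)
        source = t;
      if (source->flags & PREV_WHITE)
        strcat (out, " ");
      strcat (out, t->spelling);
      source = NULL;
    }
}
```

Feeding it the token stream for `sum = add (1,2, 3);`, with a padding token in front of the expansion and of each argument and a `NULL`-source padding after each argument and at the end of the expansion, yields `sum = 1 + 2 +3;` as in the example above.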
+
+Now we can see the relationship with paste avoidance: we have to be
+careful about paste avoidance in exactly the same locations we take care
+to get white space correct. This makes implementation of paste
+avoidance easy: wherever the stand-alone preprocessor is fixing up
+spacing because of padding tokens, and it turns out that no space is
+needed, it has to take the extra step to check that a space is not
+needed after all to avoid an accidental paste. The function
+@code{cpp_avoid_paste} advises whether a space is required between two
+consecutive tokens. To avoid excessive spacing, it tries hard to only
+require a space if one is likely to be necessary, but for reasons of
+efficiency it is slightly conservative and might recommend a space where
+one is not strictly needed.
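
To make "slightly conservative" concrete, here is a hypothetical spelling-based check in the same spirit. It is not `cpp_avoid_paste`, whose logic this diff does not show: `might_paste` and its operator table are assumptions made for illustration, and it only looks at the last character of one spelling and the first character of the next.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Characters that can continue an identifier or a preprocessing
   number.  Including '.' is deliberately conservative.  */
static int
is_word_char (int c)
{
  return isalnum (c) || c == '_' || c == '.';
}

/* Return non-zero if writing spelling A directly followed by spelling
   B might lex differently from the two separate tokens.  */
static int
might_paste (const char *a, const char *b)
{
  static const char *const pairs[] = {
    "++", "--", "+=", "-=", "->", "==", "!=", "<=", ">=", "<<", ">>",
    "&&", "||", "&=", "|=", "^=", "*=", "/=", "%=", "##", "::", "//",
    "/*", "<:", "<%", "%:", "%>", ":>", NULL
  };
  size_t alen = strlen (a);
  size_t i;
  char last, first;

  if (alen == 0 || *b == '\0')
    return 0;
  last = a[alen - 1];
  first = b[0];

  /* Two word-like tokens would merge into one.  */
  if (is_word_char ((unsigned char) last)
      && is_word_char ((unsigned char) first))
    return 1;

  /* Two punctuators whose junction spells a longer operator or a
     comment opener.  */
  for (i = 0; pairs[i]; i++)
    if (pairs[i][0] == last && pairs[i][1] == first)
      return 1;
  return 0;
}
```

For the `PLUS` example earlier in this section, `might_paste ("+", "+")` reports a risk, so a space is kept and the output stays `+ +` rather than `++`. The check also asks for a space between `x` and `.`, which is not strictly needed; that is the kind of over-caution the text accepts in exchange for a cheap test.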
+
+@node Line Numbering
+@unnumbered Line numbering
+@cindex line numbers
+
+The preprocessor takes great care to ensure it keeps track of both the
+position of a token in the source file, for diagnostic purposes, and
+where it should appear in the output file, because using CPP for other
+languages like assembler requires this. The two positions may differ
+for the following reasons:
+
+@itemize @bullet
+@item
+Escaped newlines are deleted, so lines spliced in this way are joined to
+form a single logical line.
+
+@item
+A macro expansion replaces the tokens that form its invocation, but any
+newlines appearing in the macro's arguments are interpreted as a single
+space, with the result that the macro's replacement appears in full on
+the same line that the macro name appeared in the source file. This is
+particularly important for stringification of arguments---newlines
+embedded in the arguments must appear in the string as spaces.
+@end itemize
+
+The source file location is maintained in the @code{lineno} member of the
+@code{cpp_buffer} structure, and the column number inferred from the
+current position in the buffer relative to the @code{line_base} buffer
+variable, which is updated with every newline whether escaped or not.
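
The last added paragraph says the column is not stored but inferred from the read position relative to `line_base`. A minimal sketch of that bookkeeping, assuming a NUL-terminated buffer, `\n`-only newlines and 1-based columns (all assumptions of this note); the `buffer` struct is a hypothetical stand-in for `cpp_buffer`, keeping only the `lineno` and `line_base` names from the text.

```c
#include <assert.h>

struct buffer
{
  const char *cur;              /* Next character to read.  */
  const char *line_base;        /* Start of the current physical line.  */
  unsigned lineno;              /* Current physical line, 1-based.  */
};

static void
buffer_init (struct buffer *b, const char *text)
{
  b->cur = b->line_base = text;
  b->lineno = 1;
}

/* Consume one character.  Every newline, escaped or not, starts a new
   physical line and moves LINE_BASE.  Returns 0 at end of buffer.  */
static int
buffer_getc (struct buffer *b)
{
  int c = (unsigned char) *b->cur;
  if (c == '\0')
    return 0;
  b->cur++;
  if (c == '\n')
    {
      b->lineno++;
      b->line_base = b->cur;
    }
  return c;
}

/* 1-based column of the next character to be read, inferred from the
   distance to LINE_BASE rather than stored.  */
static unsigned
buffer_column (const struct buffer *b)
{
  return (unsigned) (b->cur - b->line_base) + 1;
}
```

Because `line_base` moves on escaped newlines too, a diagnostic for a token on the second physical line of a spliced logical line still reports that token's position in the source file, while the output position stays on the first line.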
+@c FINISH THIS
+
+@node Guard Macros
+@unnumbered The Multiple-Include Optimization
+@c TODO
+
+@node Files
 @unnumbered File Handling
 @cindex files
@@ -459,10 +650,10 @@ filesystem queries whilst searching for the correct file.
 For each file we try to open, we store the constructed path in a splay
 tree. This path first undergoes simplification by the function
 @code{_cpp_simplify_pathname}. For example,
-@samp{/usr/include/bits/../foo.h} is simplified to
-@samp{/usr/include/foo.h} before we enter it in the splay tree and try
+@file{/usr/include/bits/../foo.h} is simplified to
+@file{/usr/include/foo.h} before we enter it in the splay tree and try
 to @code{open ()} the file. CPP will then find subsequent uses of
-@samp{foo.h}, even as @samp{/usr/include/foo.h}, in the splay tree and
+@file{foo.h}, even as @file{/usr/include/foo.h}, in the splay tree and
 save system calls.
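
The simplification step in the context lines above can be illustrated with a purely lexical, in-place routine. `simplify_path` below is a hypothetical stand-in written for this note, not `_cpp_simplify_pathname`; it assumes `/` separators, a writable NUL-terminated buffer, and it ignores symbolic links.

```c
#include <assert.h>
#include <string.h>

/* Lexically simplify PATH in place: drop "." and empty components and
   cancel "dir/.." pairs.  Leading ".." components of a relative path
   are kept.  The result is never longer than the input.  */
static void
simplify_path (char *path)
{
  int absolute = (path[0] == '/');
  int empty_input = (path[0] == '\0');
  char *start = path + absolute;
  char *out = start;            /* End of the output written so far.  */
  char *floor = start;          /* Output before FLOOR is never removed.  */
  const char *in = start;

  while (*in)
    {
      const char *end = in;
      size_t len;

      while (*end && *end != '/')
        end++;
      len = (size_t) (end - in);

      if (len == 0 || (len == 1 && in[0] == '.'))
        ;                       /* Skip empty and "." components.  */
      else if (len == 2 && in[0] == '.' && in[1] == '.' && out > floor)
        {
          /* Cancel the previous component and its separator.  */
          char *p = out;
          while (p > floor && p[-1] != '/')
            p--;
          out = (p > floor) ? p - 1 : p;
        }
      else if (len == 2 && in[0] == '.' && in[1] == '.' && absolute)
        ;                       /* "/.." is the root itself.  */
      else
        {
          int is_dotdot = (len == 2 && in[0] == '.' && in[1] == '.');
          if (out > start)
            *out++ = '/';
          memmove (out, in, len);
          out += len;
          if (is_dotdot)
            floor = out;        /* A kept ".." cannot be cancelled.  */
        }
      in = *end ? end + 1 : end;
    }

  if (out == start && !absolute && !empty_input)
    *out++ = '.';               /* For example "a/.." becomes ".".  */
  *out = '\0';
}
```

Applied to `/usr/include/bits/../foo.h` it produces `/usr/include/foo.h`, so both spellings reach the same splay-tree key. A lexical rewrite like this changes meaning if `bits` is a symbolic link, so treat it as a sketch of the idea only.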
 
 Further, it is likely the file contents have also been cached, saving a
@@ -486,7 +677,7 @@ directory on a per-file basis is handled by the function
 
 Note that a header included with a directory component, such as
 @code{#include "mydir/foo.h"} and opened as
-@samp{/usr/local/include/mydir/foo.h}, will have the complete path minus
+@file{/usr/local/include/mydir/foo.h}, will have the complete path minus
 the basename @samp{foo.h} as the current directory.
 
 Enough information is stored in the splay tree that CPP can immediately
@@ -503,7 +694,7 @@ command line (or system) include directories to which the mapping
 applies. This may be higher up the directory tree than the full path to
 the file minus the base name.
 
-@node Index,, Files, Top
+@node Index
 @unnumbered Index
 @printindex cp