Commit 50668cf6, authored and committed by Geoffrey Keating

Index: gcc/ChangeLog

2005-03-14  Geoffrey Keating  <geoffk@apple.com>

	* doc/cppopts.texi (-fexec-charset): Add concept index entry.
	(-fwide-exec-charset): Likewise.
	(-finput-charset): Likewise.
	* doc/invoke.texi (Warning Options): Document -Wnormalized=.
	* c-opts.c (c_common_handle_option): Handle -Wnormalized=.
	* c.opt (Wnormalized): New.

Index: libcpp/ChangeLog
2005-03-14  Geoffrey Keating  <geoffk@apple.com>

	* init.c (cpp_create_reader): Default warn_normalize to normalized_C.
	* charset.c: Update for new format of ucnid.h.
	(ucn_valid_in_identifier): Update for new format of ucnid.h.
	Add NST parameter, and update it; update callers.
	(cpp_valid_ucn): Add NST parameter, update callers.  Replace abort
	with cpp_error.
	(convert_ucn): Pass normalize_state to cpp_valid_ucn.
	* internal.h (struct normalize_state): New.
	(INITIAL_NORMALIZE_STATE): New.
	(NORMALIZE_STATE_RESULT): New.
	(NORMALIZE_STATE_UPDATE_IDNUM): New.
	(_cpp_valid_ucn): New.
	* lex.c (warn_about_normalization): New.
	(forms_identifier_p): Add normalize_state parameter, update callers.
	(lex_identifier): Add normalize_state parameter, update callers.  Keep
	the state current.
	(lex_number): Likewise.
	(_cpp_lex_direct): Pass normalize_state to subroutines.  Check
	it with warn_about_normalization.
	* makeucnid.c: New.
	* ucnid.h: Replace.
	* ucnid.pl: Remove.
	* ucnid.tab: Make appropriate for input to makeucnid.c.  Remove
	comments about obsolete version of C++.
	* include/cpplib.h (enum cpp_normalize_level): New.
	(struct cpp_options): Add warn_normalize field.

Index: gcc/testsuite/ChangeLog
2005-03-14  Geoffrey Keating  <geoffk@apple.com>

	* gcc.dg/cpp/normalize-1.c: New.
	* gcc.dg/cpp/normalize-2.c: New.
	* gcc.dg/cpp/normalize-3.c: New.
	* gcc.dg/cpp/normalize-4.c: New.
	* gcc.dg/cpp/ucnid-4.c: New.
	* gcc.dg/cpp/ucnid-5.c: New.
	* g++.dg/cpp/normalize-1.C: New.
	* g++.dg/cpp/ucnid-1.C: New.

From-SVN: r96459
parent cd8b38b9
2005-03-14 Geoffrey Keating <geoffk@apple.com>
* doc/cppopts.texi (-fexec-charset): Add concept index entry.
(-fwide-exec-charset): Likewise.
(-finput-charset): Likewise.
* doc/invoke.texi (Warning Options): Document -Wnormalized=.
* c-opts.c (c_common_handle_option): Handle -Wnormalized=.
* c.opt (Wnormalized): New.
2005-03-14 Devang Patel <dpatel@apple.com>
* doc/invoke.texi: Add reference to Visibility document.
......
......@@ -460,6 +460,19 @@ c_common_handle_option (size_t scode, const char *arg, int value)
cpp_opts->warn_multichar = value;
break;
case OPT_Wnormalized_:
if (!value || (arg && strcasecmp (arg, "none") == 0))
cpp_opts->warn_normalize = normalized_none;
else if (!arg || strcasecmp (arg, "nfkc") == 0)
cpp_opts->warn_normalize = normalized_KC;
else if (strcasecmp (arg, "id") == 0)
cpp_opts->warn_normalize = normalized_identifier_C;
else if (strcasecmp (arg, "nfc") == 0)
cpp_opts->warn_normalize = normalized_C;
else
error ("argument %qs to %<-Wnormalized%> not recognized", arg);
break;
case OPT_Wreturn_type:
warn_return_type = value;
break;
......
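As a reading aid, the argument mapping in the `OPT_Wnormalized_` case above can be sketched as a standalone function. The names `parse_wnormalized`, `norm_level`, and the `LEVEL_*` constants are hypothetical stand-ins for `c_common_handle_option` and `enum cpp_normalize_level`; the real code reports a bad argument through `error ()` rather than returning -1. As the hunk is written, a missing argument selects the NFKC level, and the `-Wno-` form (`value == 0`) always selects "none".

```c
#include <assert.h>
#include <stddef.h>
#include <strings.h>   /* strcasecmp (POSIX) */

/* Stand-in for enum cpp_normalize_level: most restrictive first.  */
enum norm_level { LEVEL_KC = 0, LEVEL_C, LEVEL_ID_C, LEVEL_NONE };

/* Hypothetical helper.  VALUE is 0 for the -Wno- form, ARG is the text
   after '=' (or NULL).  Returns 0 and stores the level in *OUT on
   success, or -1 if ARG is not recognized.  */
static int
parse_wnormalized (int value, const char *arg, enum norm_level *out)
{
  assert (out != NULL);
  if (!value || (arg && strcasecmp (arg, "none") == 0))
    *out = LEVEL_NONE;
  else if (!arg || strcasecmp (arg, "nfkc") == 0)
    *out = LEVEL_KC;
  else if (strcasecmp (arg, "id") == 0)
    *out = LEVEL_ID_C;
  else if (strcasecmp (arg, "nfc") == 0)
    *out = LEVEL_C;
  else
    return -1;
  return 0;
}
```

Because the comparison is case-insensitive, `parse_wnormalized (1, "NFC", &lv)` stores `LEVEL_C` just as `"nfc"` would.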
......@@ -285,6 +285,10 @@ Wnonnull
C ObjC Var(warn_nonnull)
Warn about NULL being passed to argument slots marked as requiring non-NULL
Wnormalized=
C ObjC C++ ObjC++ Joined
-Wnormalized=<id|nfc|nfkc> Warn about non-normalised Unicode strings
Wold-style-cast
C++ ObjC++ Var(warn_old_style_cast)
Warn if a C-style cast is used in a program
......
......@@ -530,12 +530,14 @@ ignored. The default is 8.
@item -fexec-charset=@var{charset}
@opindex fexec-charset
@cindex character set, execution
Set the execution character set, used for string and character
constants. The default is UTF-8. @var{charset} can be any encoding
supported by the system's @code{iconv} library routine.
@item -fwide-exec-charset=@var{charset}
@opindex fwide-exec-charset
@cindex character set, wide execution
Set the wide execution character set, used for wide string and
character constants. The default is UTF-32 or UTF-16, whichever
corresponds to the width of @code{wchar_t}. As with
......@@ -545,6 +547,7 @@ problems with encodings that do not fit exactly in @code{wchar_t}.
@item -finput-charset=@var{charset}
@opindex finput-charset
@cindex character set, input
Set the input character set, used for translation from the character
set of the input file to the source character set used by GCC@. If the
locale does not specify, or GCC cannot get this information from the
......
......@@ -3039,6 +3039,51 @@ Do not warn if a multicharacter constant (@samp{'FOOF'}) is used.
Usually they indicate a typo in the user's code, as they have
implementation-defined values, and should not be used in portable code.
@item -Wnormalized=<none|id|nfc|nfkc>
@opindex Wnormalized
@cindex NFC
@cindex NFKC
@cindex character set, input normalization
In ISO C and ISO C++, two identifiers are different if they are
different sequences of characters. However, sometimes when characters
outside the basic ASCII character set are used, you can have two
different character sequences that look the same. To avoid confusion,
the ISO 10646 standard sets out some @dfn{normalization rules} which
when applied ensure that two sequences that look the same are turned into
the same sequence. GCC can warn you if you are using identifiers which
have not been normalized; this option controls that warning.
There are four levels of warning that GCC supports. The default is
@option{-Wnormalized=nfc}, which warns about any identifier which is
not in the ISO 10646 ``C'' normalized form, @dfn{NFC}. NFC is the
recommended form for most uses.
Unfortunately, there are some characters which ISO C and ISO C++ allow
in identifiers that when turned into NFC aren't allowable as
identifiers. That is, there's no way to use these symbols in portable
ISO C or C++ and have all your identifiers in NFC.
@option{-Wnormalized=id} suppresses the warning for these characters.
It is hoped that future versions of the standards involved will correct
this, which is why this option is not the default.
You can switch the warning off for all characters by writing
@option{-Wnormalized=none}. You would only want to do this if you
were using some other normalization scheme (like ``D''), because
otherwise you can easily create bugs that are literally impossible to see.
Some characters in ISO 10646 have distinct meanings but look identical
in some fonts or display methodologies, especially once formatting has
been applied. For instance @code{\u207F}, ``SUPERSCRIPT LATIN SMALL
LETTER N'', will display just like a regular @code{n} which has been
placed in a superscript. ISO 10646 defines the @dfn{NFKC}
normalisation scheme to convert all these into a standard form as
well, and GCC will warn if your code is not in NFKC if you use
@option{-Wnormalized=nfkc}. This warning is comparable to warning
about every identifier that contains the letter O because it might be
confused with the digit 0, and so is not the default, but may be
useful as a local coding convention if the programming environment is
unable to be fixed to display these characters distinctly.
@item -Wno-deprecated-declarations
@opindex Wno-deprecated-declarations
Do not warn about uses of functions, variables, and types marked as
......
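To illustrate why the `-Wnormalized=` documentation above talks about sequences that look the same, here is a minimal sketch comparing two UTF-8 spellings of the Bengali vowel sign O: the single character U+09CB, and the pair U+09C7 U+09BE, which the new tests expect to be flagged as not in NFC. The helper `same_spelling` and both array names are hypothetical; the byte values are the UTF-8 encodings of those code points.

```c
#include <assert.h>
#include <string.h>

/* UTF-8 for U+09CB, the composed (NFC) spelling.  */
static const char composed[] = "\xE0\xA7\x8B";
/* UTF-8 for U+09C7 followed by U+09BE, the decomposed spelling.  */
static const char decomposed[] = "\xE0\xA7\x87\xE0\xA6\xBE";

/* Returns nonzero if A and B are the same byte sequence, which is the
   only sense in which a compiler considers two spellings identical.  */
static int
same_spelling (const char *a, const char *b)
{
  assert (a != NULL && b != NULL);
  return strcmp (a, b) == 0;
}
```

The two spellings would normally render identically, yet `same_spelling (composed, decomposed)` is zero, so identifiers built from them would name different entities.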
2005-03-14 Geoffrey Keating <geoffk@apple.com>
* gcc.dg/cpp/normalize-1.c: New.
* gcc.dg/cpp/normalize-2.c: New.
* gcc.dg/cpp/normalize-3.c: New.
* gcc.dg/cpp/normalize-4.c: New.
* gcc.dg/cpp/ucnid-4.c: New.
* gcc.dg/cpp/ucnid-5.c: New.
* g++.dg/cpp/normalize-1.C: New.
* g++.dg/cpp/ucnid-1.C: New.
2005-03-14 Alexandre Oliva <aoliva@redhat.com>
* gcc.dg/pr18628.c: New.
......
/* { dg-do preprocess } */
/* { dg-options "-Wnormalized=id" } */
\u00AA
\u00B7
\u0F43 /* { dg-warning "not in NFC" } */
a\u05B8\u05B9\u05B9\u05BBb
a\u05BB\u05B9\u05B8\u05B9b /* { dg-warning "not in NFC" } */
\u09CB
\u09C7\u09BE /* { dg-warning "not in NFC" } */
\u0B4B
\u0B47\u0B3E /* { dg-warning "not in NFC" } */
\u0BCA
\u0BC6\u0BBE /* { dg-warning "not in NFC" } */
\u0BCB
\u0BC7\u0BBE /* { dg-warning "not in NFC" } */
\u0CCA
\u0CC6\u0CC2 /* { dg-warning "not in NFC" } */
\u0D4A
\u0D46\u0D3E /* { dg-warning "not in NFC" } */
\u0D4B
\u0D47\u0D3E /* { dg-warning "not in NFC" } */
K
\u212A
\u03AC
\u1F71 /* { dg-warning "not in NFC" } */
\uAC00
\u1100\u1161
\uAC01
\u1100\u1161\u11A8
\uAC00\u11A8
/* { dg-do preprocess } */
/* { dg-options "-pedantic" } */
\u00AA /* { dg-error "not valid in an identifier" } */
\u00AB /* { dg-error "not valid in an identifier" } */
\u00B6 /* { dg-error "not valid in an identifier" } */
\u00BA /* { dg-error "not valid in an identifier" } */
\u00C0
\u00D6
\u0384
\u0669 /* { dg-error "not valid in an identifier" } */
A\u0669 /* { dg-error "not valid in an identifier" } */
0\u00BA /* { dg-error "not valid in an identifier" } */
0\u0669 /* { dg-error "not valid in an identifier" } */
\u0E59
A\u0E59
/* { dg-do preprocess } */
/* { dg-options "-std=c99" } */
\u00AA
\u00B7
\u0F43 /* { dg-warning "not in NFC" } */
a\u05B8\u05B9\u05B9\u05BBb
a\u05BB\u05B9\u05B8\u05B9b /* { dg-warning "not in NFC" } */
\u09CB
\u09C7\u09BE /* { dg-warning "not in NFC" } */
\u0B4B
\u0B47\u0B3E /* { dg-warning "not in NFC" } */
\u0BCA
\u0BC6\u0BBE /* { dg-warning "not in NFC" } */
\u0BCB
\u0BC7\u0BBE /* { dg-warning "not in NFC" } */
\u0CCA
\u0CC6\u0CC2 /* { dg-warning "not in NFC" } */
\u0D4A
\u0D46\u0D3E /* { dg-warning "not in NFC" } */
\u0D4B
\u0D47\u0D3E /* { dg-warning "not in NFC" } */
K
\u212A /* { dg-warning "not in NFC" } */
\u03AC
\u1F71 /* { dg-warning "not in NFC" } */
\uAC00
\u1100\u1161 /* { dg-warning "not in NFC" } */
\uAC01
\u1100\u1161\u11A8 /* { dg-warning "not in NFC" } */
\uAC00\u11A8 /* { dg-warning "not in NFC" } */
/* { dg-do preprocess } */
/* { dg-options "-std=c99 -Wnormalized=nfkc" } */
\u00AA /* { dg-warning "not in NFKC" } */
\u00B7
\u0F43 /* { dg-warning "not in NFC" } */
a\u05B8\u05B9\u05B9\u05BBb
a\u05BB\u05B9\u05B8\u05B9b /* { dg-warning "not in NFC" } */
\u09CB
\u09C7\u09BE /* { dg-warning "not in NFC" } */
\u0B4B
\u0B47\u0B3E /* { dg-warning "not in NFC" } */
\u0BCA
\u0BC6\u0BBE /* { dg-warning "not in NFC" } */
\u0BCB
\u0BC7\u0BBE /* { dg-warning "not in NFC" } */
\u0CCA
\u0CC6\u0CC2 /* { dg-warning "not in NFC" } */
\u0D4A
\u0D46\u0D3E /* { dg-warning "not in NFC" } */
\u0D4B
\u0D47\u0D3E /* { dg-warning "not in NFC" } */
K
\u212A /* { dg-warning "not in NFC" } */
\u03AC
\u1F71 /* { dg-warning "not in NFC" } */
\uAC00
\u1100\u1161 /* { dg-warning "not in NFC" } */
\uAC01
\u1100\u1161\u11A8 /* { dg-warning "not in NFC" } */
\uAC00\u11A8 /* { dg-warning "not in NFC" } */
/* { dg-do preprocess } */
/* { dg-options "-std=c99 -Wnormalized=id" } */
\u00AA
\u00B7
\u0F43 /* { dg-warning "not in NFC" } */
a\u05B8\u05B9\u05B9\u05BBb
a\u05BB\u05B9\u05B8\u05B9b /* { dg-warning "not in NFC" } */
\u09CB
\u09C7\u09BE /* { dg-warning "not in NFC" } */
\u0B4B
\u0B47\u0B3E /* { dg-warning "not in NFC" } */
\u0BCA
\u0BC6\u0BBE /* { dg-warning "not in NFC" } */
\u0BCB
\u0BC7\u0BBE /* { dg-warning "not in NFC" } */
\u0CCA
\u0CC6\u0CC2 /* { dg-warning "not in NFC" } */
\u0D4A
\u0D46\u0D3E /* { dg-warning "not in NFC" } */
\u0D4B
\u0D47\u0D3E /* { dg-warning "not in NFC" } */
K
\u212A
\u03AC
\u1F71 /* { dg-warning "not in NFC" } */
\uAC00
\u1100\u1161
\uAC01
\u1100\u1161\u11A8
\uAC00\u11A8
/* { dg-do preprocess } */
/* { dg-options "-std=c99 -Wnormalized=none" } */
\u00AA
\u00B7
\u0F43
a\u05B8\u05B9\u05B9\u05BBb
a\u05BB\u05B9\u05B8\u05B9b
\u09CB
\u09C7\u09BE
\u0B4B
\u0B47\u0B3E
\u0BCA
\u0BC6\u0BBE
\u0BCB
\u0BC7\u0BBE
\u0CCA
\u0CC6\u0CC2
\u0D4A
\u0D46\u0D3E
\u0D4B
\u0D47\u0D3E
K
\u212A
\u03AC
\u1F71
\uAC00
\u1100\u1161
\uAC01
\u1100\u1161\u11A8
\uAC00\u11A8
/* { dg-do preprocess } */
/* { dg-options "-std=c99" } */
\u00AA
\u00AB /* { dg-error "not valid in an identifier" } */
\u00B6 /* { dg-error "not valid in an identifier" } */
\u00BA
\u00C0
\u00D6
\u0384
\u0669 /* { dg-error "not valid at the start of an identifier" } */
A\u0669
0\u00BA
0\u0669
\u0E59 /* { dg-error "not valid at the start of an identifier" } */
A\u0E59
/* { dg-do preprocess } */
/* { dg-options "-std=c99 -pedantic" } */
\u00AA
\u00AB /* { dg-error "not valid in an identifier" } */
\u00B6 /* { dg-error "not valid in an identifier" } */
\u00BA
\u00C0
\u00D6
\u0384 /* { dg-error "not valid in an identifier" } */
\u0669 /* { dg-error "not valid at the start of an identifier" } */
A\u0669
0\u00BA
0\u0669
\u0E59 /* { dg-error "not valid at the start of an identifier" } */
A\u0E59
2005-03-14 Geoffrey Keating <geoffk@apple.com>
* init.c (cpp_create_reader): Default warn_normalize to normalized_C.
* charset.c: Update for new format of ucnid.h.
(ucn_valid_in_identifier): Update for new format of ucnid.h.
Add NST parameter, and update it; update callers.
(cpp_valid_ucn): Add NST parameter, update callers. Replace abort
with cpp_error.
(convert_ucn): Pass normalize_state to cpp_valid_ucn.
* internal.h (struct normalize_state): New.
(INITIAL_NORMALIZE_STATE): New.
(NORMALIZE_STATE_RESULT): New.
(NORMALIZE_STATE_UPDATE_IDNUM): New.
(_cpp_valid_ucn): New.
* lex.c (warn_about_normalization): New.
(forms_identifier_p): Add normalize_state parameter, update callers.
(lex_identifier): Add normalize_state parameter, update callers. Keep
the state current.
(lex_number): Likewise.
(_cpp_lex_direct): Pass normalize_state to subroutines. Check
it with warn_about_normalization.
* makeucnid.c: New.
* ucnid.h: Replace.
* ucnid.pl: Remove.
* ucnid.tab: Make appropriate for input to makeucnid.c. Remove
comments about obsolete version of C++.
* include/cpplib.h (enum cpp_normalize_level): New.
(struct cpp_options): Add warn_normalize field.
2005-03-11 Geoffrey Keating <geoffk@apple.com>
* directives.c (glue_header_name): Update call to cpp_spell_token.
......
......@@ -22,7 +22,6 @@ Foundation, 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. */
#include "system.h"
#include "cpplib.h"
#include "internal.h"
#include "ucnid.h"
/* Character set handling for C-family languages.
......@@ -786,43 +785,128 @@ width_to_mask (size_t width)
return ((size_t) 1 << width) - 1;
}
/* A large table of unicode character information. */
enum {
/* Valid in a C99 identifier? */
C99 = 1,
/* Valid in a C99 identifier, but not as the first character? */
DIG = 2,
/* Valid in a C++ identifier? */
CXX = 4,
/* NFC representation is not valid in an identifier? */
CID = 8,
/* Might be valid NFC form? */
NFC = 16,
/* Might be valid NFKC form? */
NKC = 32,
/* Certain preceding characters might make it not valid NFC/NFKC form? */
CTX = 64
};
static const struct {
/* Bitmap of flags above. */
unsigned char flags;
/* Combining class of the character. */
unsigned char combine;
/* Last character in the range described by this entry. */
unsigned short end;
} ucnranges[] = {
#include "ucnid.h"
};
/* Returns 1 if C is valid in an identifier, 2 if C is valid except at
the start of an identifier, and 0 if C is not valid in an
identifier. We assume C has already gone through the checks of
_cpp_valid_ucn. The algorithm is a simple binary search on the
table defined in cppucnid.h. */
_cpp_valid_ucn. Also update NST for C if returning nonzero. The
algorithm is a simple binary search on the table defined in
ucnid.h. */
static int
ucn_valid_in_identifier (cpp_reader *pfile, cppchar_t c)
ucn_valid_in_identifier (cpp_reader *pfile, cppchar_t c,
struct normalize_state *nst)
{
int mn, mx, md;
mn = -1;
mx = ARRAY_SIZE (ucnranges);
while (mx - mn > 1)
if (c > 0xFFFF)
return 0;
mn = 0;
mx = ARRAY_SIZE (ucnranges) - 1;
while (mx != mn)
{
md = (mn + mx) / 2;
if (c < ucnranges[md].lo)
if (c <= ucnranges[md].end)
mx = md;
else if (c > ucnranges[md].hi)
mn = md;
else
goto found;
mn = md + 1;
}
return 0;
found:
/* When -pedantic, we require the character to have been listed by
the standard for the current language. Otherwise, we accept the
union of the acceptable sets for C++98 and C99. */
if (! (ucnranges[mn].flags & (C99 | CXX)))
return 0;
if (CPP_PEDANTIC (pfile)
&& ((CPP_OPTION (pfile, c99) && !(ucnranges[md].flags & C99))
&& ((CPP_OPTION (pfile, c99) && !(ucnranges[mn].flags & C99))
|| (CPP_OPTION (pfile, cplusplus)
&& !(ucnranges[md].flags & CXX))))
&& !(ucnranges[mn].flags & CXX))))
return 0;
/* Update NST. */
if (ucnranges[mn].combine != 0 && ucnranges[mn].combine < nst->prev_class)
nst->level = normalized_none;
else if (ucnranges[mn].flags & CTX)
{
bool safe;
cppchar_t p = nst->previous;
/* Easy cases from Bengali, Oriya, Tamil, Kannada, and Malayalam. */
if (c == 0x09BE)
safe = p != 0x09C7; /* Use 09CB instead of 09C7 09BE. */
else if (c == 0x0B3E)
safe = p != 0x0B47; /* Use 0B4B instead of 0B47 0B3E. */
else if (c == 0x0BBE)
safe = p != 0x0BC6 && p != 0x0BC7; /* Use 0BCA/0BCB instead. */
else if (c == 0x0CC2)
safe = p != 0x0CC6; /* Use 0CCA instead of 0CC6 0CC2. */
else if (c == 0x0D3E)
safe = p != 0x0D46 && p != 0x0D47; /* Use 0D4A/0D4B instead. */
/* For Hangul, characters in the range AC00-D7A3 are NFC/NFKC,
and are combined algorithmically from a sequence of the form
1100-1112 1161-1175 11A8-11C2
(if the third is not present, it is treated as 11A7, which is not
really a valid character).
Unfortunately, C99 allows (only) the NFC form, but C++ allows
only the combining characters. */
else if (c >= 0x1161 && c <= 0x1175)
safe = p < 0x1100 || p > 0x1112;
else if (c >= 0x11A8 && c <= 0x11C2)
safe = (p < 0xAC00 || p > 0xD7A3 || (p - 0xAC00) % 28 != 0);
else
{
/* Uh-oh, someone updated ucnid.h without updating this code. */
cpp_error (pfile, CPP_DL_ICE, "Character %x might not be NFKC", c);
safe = true;
}
if (!safe && c < 0x1161)
nst->level = normalized_none;
else if (!safe)
nst->level = MAX (nst->level, normalized_identifier_C);
}
else if (ucnranges[mn].flags & NKC)
;
else if (ucnranges[mn].flags & NFC)
nst->level = MAX (nst->level, normalized_C);
else if (ucnranges[mn].flags & CID)
nst->level = MAX (nst->level, normalized_identifier_C);
else
nst->level = normalized_none;
nst->previous = c;
nst->prev_class = ucnranges[mn].combine;
/* In C99, UCN digits may not begin identifiers. */
if (CPP_OPTION (pfile, c99) && (ucnranges[md].flags & DIG))
if (CPP_OPTION (pfile, c99) && (ucnranges[mn].flags & DIG))
return 2;
return 1;
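The rewritten lookup in `ucn_valid_in_identifier` searches a table in which each entry stores only the last code point of its range, so every code point up to 0xFFFF falls into exactly one entry. A minimal sketch of that search follows; `struct range`, `demo_ranges`, and `find_range` are hypothetical names, and the table contents are made up for illustration rather than taken from `ucnid.h`.

```c
#include <assert.h>
#include <stddef.h>

/* One entry covers every code point after the previous entry's END,
   up to and including its own END.  */
struct range { unsigned char flags; unsigned short end; };

/* Illustrative data only: flags 1 = "valid", 0 = "not valid".  */
static const struct range demo_ranges[] = {
  { 0, 0x00BF },   /* 0000-00BF */
  { 1, 0x00D6 },   /* 00C0-00D6 */
  { 0, 0x00D7 },   /* 00D7      */
  { 1, 0x00F6 },   /* 00D8-00F6 */
  { 0, 0xFFFF },   /* 00F7-FFFF */
};

/* Returns the index of the entry containing C.  The last entry must end
   at 0xFFFF so that the search always terminates on a covering entry.  */
static size_t
find_range (const struct range *tab, size_t n, unsigned int c)
{
  size_t mn = 0, mx;
  assert (n > 0 && tab[n - 1].end == 0xFFFF && c <= 0xFFFF);
  mx = n - 1;
  while (mx != mn)
    {
      size_t md = (mn + mx) / 2;
      if (c <= tab[md].end)
        mx = md;          /* C is in entry MD or an earlier one.  */
      else
        mn = md + 1;      /* C is strictly after entry MD.  */
    }
  return mn;
}
```

Since the ranges tile the whole 16-bit space, there is no "not found" outcome; the caller inspects the flags of the returned entry instead, which is why the patch drops the old `goto found` path.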
......@@ -853,7 +937,8 @@ ucn_valid_in_identifier (cpp_reader *pfile, cppchar_t c)
cppchar_t
_cpp_valid_ucn (cpp_reader *pfile, const uchar **pstr,
const uchar *limit, int identifier_pos)
const uchar *limit, int identifier_pos,
struct normalize_state *nst)
{
cppchar_t result, c;
unsigned int length;
......@@ -873,7 +958,10 @@ _cpp_valid_ucn (cpp_reader *pfile, const uchar **pstr,
else if (str[-1] == 'U')
length = 8;
else
abort();
{
cpp_error (pfile, CPP_DL_ICE, "In _cpp_valid_ucn but not a UCN");
length = 4;
}
result = 0;
do
......@@ -915,10 +1003,11 @@ _cpp_valid_ucn (cpp_reader *pfile, const uchar **pstr,
CPP_OPTION (pfile, warn_dollars) = 0;
cpp_error (pfile, CPP_DL_PEDWARN, "'$' in identifier or number");
}
NORMALIZE_STATE_UPDATE_IDNUM (nst);
}
else if (identifier_pos)
{
int validity = ucn_valid_in_identifier (pfile, result);
int validity = ucn_valid_in_identifier (pfile, result, nst);
if (validity == 0)
cpp_error (pfile, CPP_DL_ERROR,
......@@ -950,9 +1039,10 @@ convert_ucn (cpp_reader *pfile, const uchar *from, const uchar *limit,
int rval;
struct cset_converter cvt
= wide ? pfile->wide_cset_desc : pfile->narrow_cset_desc;
struct normalize_state nst = INITIAL_NORMALIZE_STATE;
from++; /* Skip u/U. */
ucn = _cpp_valid_ucn (pfile, &from, limit, 0);
ucn = _cpp_valid_ucn (pfile, &from, limit, 0, &nst);
rval = one_cppchar_to_utf8 (ucn, &bufp, &bytesleft);
if (rval)
......
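The Hangul branch of `ucn_valid_in_identifier` above relies on the algorithmic composition of syllables in the range AC00-D7A3. The following sketch works through that arithmetic as defined by the Unicode Standard; the function and macro names are hypothetical and are not part of libcpp.

```c
#include <assert.h>

#define S_BASE  0xAC00u   /* first precomposed syllable */
#define L_BASE  0x1100u   /* first leading consonant */
#define V_BASE  0x1161u   /* first vowel */
#define T_BASE  0x11A7u   /* placeholder meaning "no trailing consonant" */
#define V_COUNT 21u       /* vowels 1161-1175 */
#define T_COUNT 28u       /* 11A7 plus trailing consonants 11A8-11C2 */

/* Compose leading consonant L, vowel V, and trailing consonant T into
   one syllable.  Pass T as 0 when there is no trailing consonant.  */
static unsigned int
hangul_compose (unsigned int l, unsigned int v, unsigned int t)
{
  assert (l >= 0x1100 && l <= 0x1112);
  assert (v >= 0x1161 && v <= 0x1175);
  assert (t == 0 || (t >= 0x11A8 && t <= 0x11C2));
  if (t == 0)
    t = T_BASE;
  return S_BASE + ((l - L_BASE) * V_COUNT + (v - V_BASE)) * T_COUNT
         + (t - T_BASE);
}

/* Nonzero if S is a precomposed syllable with no trailing consonant,
   i.e. one that a following 11A8-11C2 character could still combine
   with.  This is the (p - 0xAC00) % 28 test used in the patch.  */
static int
hangul_is_lv (unsigned int s)
{
  return s >= 0xAC00 && s <= 0xD7A3 && (s - S_BASE) % T_COUNT == 0;
}
```

This matches the sequences exercised by the normalization tests: 1100 1161 composes to AC00, and both 1100 1161 11A8 and AC00 11A8 correspond to AC01.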
......@@ -236,6 +236,19 @@ typedef CPPCHAR_SIGNED_T cppchar_signed_t;
/* Style of header dependencies to generate. */
enum cpp_deps_style { DEPS_NONE = 0, DEPS_USER, DEPS_SYSTEM };
/* The possible normalization levels, from most restrictive to least. */
enum cpp_normalize_level {
/* In NFKC. */
normalized_KC = 0,
/* In NFC. */
normalized_C,
/* In NFC, except for subsequences where being in NFC would make
the identifier invalid. */
normalized_identifier_C,
/* Not normalized at all. */
normalized_none
};
/* This structure is nested inside struct cpp_reader, and
carries all the options visible to the command line. */
struct cpp_options
......@@ -373,6 +386,10 @@ struct cpp_options
/* Holds the name of the input character set. */
const char *input_charset;
/* The minimum permitted level of normalization before a warning
is generated. */
enum cpp_normalize_level warn_normalize;
/* True to warn about precompiled header files we couldn't use. */
bool warn_invalid_pch;
......
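Because `enum cpp_normalize_level` is ordered from most to least restrictive, the warning decision reduces to one integer comparison between the configured threshold (`warn_normalize`) and the level computed for a token. A minimal sketch of that decision, including which form the message names, is given below; `norm_level`, `norm_warning`, and `normalization_warning` are hypothetical names, and the logic is modeled on `warn_about_normalization` in this patch (ignoring its check for skipped code).

```c
#include <assert.h>

/* Stand-in for enum cpp_normalize_level: most restrictive first.  */
enum norm_level { LEVEL_KC = 0, LEVEL_C, LEVEL_ID_C, LEVEL_NONE };

enum norm_warning { WARN_NOTHING, WARN_NOT_NFKC, WARN_NOT_NFC };

/* THRESHOLD is the level requested on the command line; RESULT is the
   level the token actually reached.  A token is reported only when it
   is less normalized than the threshold allows.  */
static enum norm_warning
normalization_warning (enum norm_level threshold, enum norm_level result)
{
  assert (threshold <= LEVEL_NONE && result <= LEVEL_NONE);
  if (!(threshold < result))
    return WARN_NOTHING;
  /* A token that is in NFC can only have failed the NFKC check.  */
  return result == LEVEL_C ? WARN_NOT_NFKC : WARN_NOT_NFC;
}
```

With the default threshold `LEVEL_C`, a token that is only "identifier NFC" (`LEVEL_ID_C`) is reported as not in NFC, while the same token passes under `-Wnormalized=id`.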
......@@ -153,6 +153,7 @@ cpp_create_reader (enum c_lang lang, hash_table *table,
CPP_OPTION (pfile, dollars_in_ident) = 1;
CPP_OPTION (pfile, warn_dollars) = 1;
CPP_OPTION (pfile, warn_variadic_macros) = 1;
CPP_OPTION (pfile, warn_normalize) = normalized_C;
/* Default CPP arithmetic to something sensible for the host for the
benefit of dumb users like fix-header. */
......
......@@ -564,8 +564,31 @@ extern unsigned char *_cpp_copy_replacement_text (const cpp_macro *,
extern size_t _cpp_replacement_text_len (const cpp_macro *);
/* In charset.c. */
/* The normalization state at this point in the sequence.
It starts initialized to all zeros, and at the end
'level' is the normalization level of the sequence. */
struct normalize_state
{
/* The previous character. */
cppchar_t previous;
/* The combining class of the previous character. */
unsigned char prev_class;
/* The lowest normalization level so far. */
enum cpp_normalize_level level;
};
#define INITIAL_NORMALIZE_STATE { 0, 0, normalized_KC }
#define NORMALIZE_STATE_RESULT(st) ((st)->level)
/* We saw a character that matches ISIDNUM(), update a
normalize_state appropriately. */
#define NORMALIZE_STATE_UPDATE_IDNUM(st) \
((st)->previous = 0, (st)->prev_class = 0)
extern cppchar_t _cpp_valid_ucn (cpp_reader *, const unsigned char **,
const unsigned char *, int);
const unsigned char *, int,
struct normalize_state *state);
extern void _cpp_destroy_iconv (cpp_reader *);
extern unsigned char *_cpp_convert_input (cpp_reader *, const char *,
unsigned char *, size_t, size_t,
......
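The `normalize_state` structure above is threaded through the lexer one character at a time. The sketch below shows, under simplifying assumptions, how such a state could accumulate: a combining character whose class is lower than the previous character's class marks the sequence as not normalized, and otherwise the level only ever moves toward "less normalized". The names `norm_state`, `NORM_STATE_INIT`, and `norm_state_feed` are hypothetical, the per-character level is passed in by the caller instead of being looked up in `ucnid.h`, and the context-dependent (`CTX`) cases are omitted.

```c
#include <assert.h>
#include <stddef.h>

enum norm_level { LEVEL_KC = 0, LEVEL_C, LEVEL_ID_C, LEVEL_NONE };

struct norm_state
{
  unsigned int previous;      /* the previous character */
  unsigned char prev_class;   /* its combining class */
  enum norm_level level;      /* least normalized level seen so far */
};

#define NORM_STATE_INIT { 0, 0, LEVEL_KC }

/* Feed character C, with combining class CLS and standalone level OWN,
   into the running state ST.  */
static void
norm_state_feed (struct norm_state *st, unsigned int c,
                 unsigned char cls, enum norm_level own)
{
  assert (st != NULL);
  if (cls != 0 && cls < st->prev_class)
    st->level = LEVEL_NONE;   /* combining marks out of canonical order */
  else if (own > st->level)
    st->level = own;
  st->previous = c;
  st->prev_class = cls;
}
```

For the Hebrew points used in the tests, the canonical combining classes of U+05B8, U+05B9, and U+05BB are 18, 19, and 20 in the Unicode Character Database, so feeding them in that order keeps the state at `LEVEL_KC`, whereas starting with U+05BB followed by U+05B9 drops it to `LEVEL_NONE`.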
......@@ -53,9 +53,6 @@ static const struct token_spelling token_spellings[N_TTYPES] = { TTYPE_TABLE };
static void add_line_note (cpp_buffer *, const uchar *, unsigned int);
static int skip_line_comment (cpp_reader *);
static void skip_whitespace (cpp_reader *, cppchar_t);
static cpp_hashnode *lex_identifier (cpp_reader *, const uchar *, bool);
static void lex_number (cpp_reader *, cpp_string *);
static bool forms_identifier_p (cpp_reader *, int);
static void lex_string (cpp_reader *, cpp_token *, const uchar *);
static void save_comment (cpp_reader *, cpp_token *, const uchar *, cppchar_t);
static void create_literal (cpp_reader *, cpp_token *, const uchar *,
......@@ -430,10 +427,36 @@ name_p (cpp_reader *pfile, const cpp_string *string)
return 1;
}
/* After parsing an identifier or other sequence, produce a warning about
sequences not in NFC/NFKC. */
static void
warn_about_normalization (cpp_reader *pfile,
const cpp_token *token,
const struct normalize_state *s)
{
if (CPP_OPTION (pfile, warn_normalize) < NORMALIZE_STATE_RESULT (s)
&& !pfile->state.skipping)
{
/* Make sure that the token is printed using UCNs, even
if we'd otherwise happily print UTF-8. */
unsigned char *buf = xmalloc (cpp_token_len (token));
size_t sz;
sz = cpp_spell_token (pfile, token, buf, false) - buf;
if (NORMALIZE_STATE_RESULT (s) == normalized_C)
cpp_error_with_line (pfile, CPP_DL_WARNING, token->src_loc, 0,
"`%.*s' is not in NFKC", sz, buf);
else
cpp_error_with_line (pfile, CPP_DL_WARNING, token->src_loc, 0,
"`%.*s' is not in NFC", sz, buf);
}
}
/* Returns TRUE if the sequence starting at buffer->cur is valid in
an identifier. FIRST is TRUE if this starts an identifier. */
static bool
forms_identifier_p (cpp_reader *pfile, int first)
forms_identifier_p (cpp_reader *pfile, int first,
struct normalize_state *state)
{
cpp_buffer *buffer = pfile->buffer;
......@@ -457,7 +480,8 @@ forms_identifier_p (cpp_reader *pfile, int first)
&& (buffer->cur[1] == 'u' || buffer->cur[1] == 'U'))
{
buffer->cur += 2;
if (_cpp_valid_ucn (pfile, &buffer->cur, buffer->rlimit, 1 + !first))
if (_cpp_valid_ucn (pfile, &buffer->cur, buffer->rlimit, 1 + !first,
state))
return true;
buffer->cur -= 2;
}
......@@ -467,7 +491,8 @@ forms_identifier_p (cpp_reader *pfile, int first)
/* Lex an identifier starting at BUFFER->CUR - 1. */
static cpp_hashnode *
lex_identifier (cpp_reader *pfile, const uchar *base, bool starts_ucn)
lex_identifier (cpp_reader *pfile, const uchar *base, bool starts_ucn,
struct normalize_state *nst)
{
cpp_hashnode *result;
const uchar *cur;
......@@ -482,13 +507,16 @@ lex_identifier (cpp_reader *pfile, const uchar *base, bool starts_ucn)
cur++;
}
pfile->buffer->cur = cur;
if (starts_ucn || forms_identifier_p (pfile, false))
if (starts_ucn || forms_identifier_p (pfile, false, nst))
{
/* Slower version for identifiers containing UCNs (or $). */
do {
while (ISIDNUM (*pfile->buffer->cur))
pfile->buffer->cur++;
} while (forms_identifier_p (pfile, false));
{
pfile->buffer->cur++;
NORMALIZE_STATE_UPDATE_IDNUM (nst);
}
} while (forms_identifier_p (pfile, false, nst));
result = _cpp_interpret_identifier (pfile, base,
pfile->buffer->cur - base);
}
......@@ -524,7 +552,8 @@ lex_identifier (cpp_reader *pfile, const uchar *base, bool starts_ucn)
/* Lex a number to NUMBER starting at BUFFER->CUR - 1. */
static void
lex_number (cpp_reader *pfile, cpp_string *number)
lex_number (cpp_reader *pfile, cpp_string *number,
struct normalize_state *nst)
{
const uchar *cur;
const uchar *base;
......@@ -537,11 +566,14 @@ lex_number (cpp_reader *pfile, cpp_string *number)
/* N.B. ISIDNUM does not include $. */
while (ISIDNUM (*cur) || *cur == '.' || VALID_SIGN (*cur, cur[-1]))
cur++;
{
cur++;
NORMALIZE_STATE_UPDATE_IDNUM (nst);
}
pfile->buffer->cur = cur;
}
while (forms_identifier_p (pfile, false));
while (forms_identifier_p (pfile, false, nst));
number->len = cur - base;
dest = _cpp_unaligned_alloc (pfile, number->len + 1);
......@@ -897,9 +929,13 @@ _cpp_lex_direct (cpp_reader *pfile)
case '0': case '1': case '2': case '3': case '4':
case '5': case '6': case '7': case '8': case '9':
result->type = CPP_NUMBER;
lex_number (pfile, &result->val.str);
break;
{
struct normalize_state nst = INITIAL_NORMALIZE_STATE;
result->type = CPP_NUMBER;
lex_number (pfile, &result->val.str, &nst);
warn_about_normalization (pfile, result, &nst);
break;
}
case 'L':
/* 'L' may introduce wide characters or strings. */
......@@ -922,7 +958,12 @@ _cpp_lex_direct (cpp_reader *pfile)
case 'S': case 'T': case 'U': case 'V': case 'W': case 'X':
case 'Y': case 'Z':
result->type = CPP_NAME;
result->val.node = lex_identifier (pfile, buffer->cur - 1, false);
{
struct normalize_state nst = INITIAL_NORMALIZE_STATE;
result->val.node = lex_identifier (pfile, buffer->cur - 1, false,
&nst);
warn_about_normalization (pfile, result, &nst);
}
/* Convert named operators to their proper types. */
if (result->val.node->flags & NODE_OPERATOR)
......@@ -1067,8 +1108,10 @@ _cpp_lex_direct (cpp_reader *pfile)
result->type = CPP_DOT;
if (ISDIGIT (*buffer->cur))
{
struct normalize_state nst = INITIAL_NORMALIZE_STATE;
result->type = CPP_NUMBER;
lex_number (pfile, &result->val.str);
lex_number (pfile, &result->val.str, &nst);
warn_about_normalization (pfile, result, &nst);
}
else if (*buffer->cur == '.' && buffer->cur[1] == '.')
buffer->cur += 2, result->type = CPP_ELLIPSIS;
......@@ -1151,11 +1194,13 @@ _cpp_lex_direct (cpp_reader *pfile)
case '\\':
{
const uchar *base = --buffer->cur;
struct normalize_state nst = INITIAL_NORMALIZE_STATE;
if (forms_identifier_p (pfile, true))
if (forms_identifier_p (pfile, true, &nst))
{
result->type = CPP_NAME;
result->val.node = lex_identifier (pfile, base, true);
result->val.node = lex_identifier (pfile, base, true, &nst);
warn_about_normalization (pfile, result, &nst);
break;
}
buffer->cur++;
......
#! /usr/bin/perl -w
use strict;
# Convert cppucnid.tab to cppucnid.h. We use two arrays of length
# 65536 to represent the table, since this is nice and simple. The
# first array holds the tags indicating which ranges are valid in
# which contexts. The second array holds the language name associated
# with each element.
our(@tags, @names);
@tags = ("") x 65536;
@names = ("") x 65536;
# Array mapping tag numbers to standard #defines
our @stds;
# Current standard and language
our($curstd, $curlang);
# First block of the file is a template to be saved for later.
our @template;
while (<>) {
chomp;
last if $_ eq '%%';
push @template, $_;
};
# Second block of the file is the UCN tables.
# The format looks like this:
#
# [std]
#
# ; language
# xxxx-xxxx xxxx xxxx-xxxx ....
#
# with comment lines starting with #.
while (<>) {
chomp;
/^#/ and next;
/^\s*$/ and next;
/^\[(.+)\]$/ and do {
$curstd = $1;
next;
};
/^; (.+)$/ and do {
$curlang = $1;
next;
};
process_range(split);
}
# Print out the template, inserting as requested.
$\ = "\n";
for (@template) {
print("/* Automatically generated from cppucnid.tab, do not edit */"),
next if $_ eq "[dne]";
print_table(), next if $_ eq "[table]";
print;
}
sub print_table {
my($lo, $hi);
my $prevname = "";
for ($lo = 0; $lo <= $#tags; $lo = $hi) {
$hi = $lo;
$hi++ while $hi <= $#tags
&& $tags[$hi] eq $tags[$lo]
&& $names[$hi] eq $names[$lo];
# Range from $lo to $hi-1.
# Don't make entries for ranges that are not valid idchars.
next if ($tags[$lo] eq "");
my $tag = $tags[$lo];
$tag = " ".$tag if $tag =~ /^C99/;
if ($names[$lo] eq $prevname) {
printf(" { 0x%04x, 0x%04x, %-11s },\n",
$lo, $hi-1, $tag);
} else {
printf(" { 0x%04x, 0x%04x, %-11s }, /* %s */\n",
$lo, $hi-1, $tag, $names[$lo]);
}
$prevname = $names[$lo];
}
}
# The line is a list of four-digit hexadecimal numbers or
# pairs of such numbers. Each is a valid identifier character
# from the given language, under the given standard.
sub process_range {
for my $range (@_) {
if ($range =~ /^[0-9a-f]{4}$/) {
my $i = hex($range);
if ($tags[$i] eq "") {
$tags[$i] = $curstd;
} else {
$tags[$i] = $curstd . "|" . $tags[$i];
}
if ($names[$i] ne "" && $names[$i] ne $curlang) {
warn sprintf ("language overlap: %s/%s at %x (tag %d)",
$names[$i], $curlang, $i, $tags[$i]);
next;
}
$names[$i] = $curlang;
} elsif ($range =~ /^ ([0-9a-f]{4}) - ([0-9a-f]{4}) $/x) {
my ($start, $end) = (hex($1), hex($2));
my $i;
for ($i = $start; $i <= $end; $i++) {
if ($tags[$i] eq "") {
$tags[$i] = $curstd;
} else {
$tags[$i] = $curstd . "|" . $tags[$i];
}
if ($names[$i] ne "" && $names[$i] ne $curlang) {
warn sprintf ("language overlap: %s/%s at %x (tag %d)",
$names[$i], $curlang, $i, $tags[$i]);
next;
}
$names[$i] = $curlang;
}
} else {
warn "malformed range expression $range";
}
}
}
/* Table of UCNs which are valid in identifiers.
Copyright (C) 2003 Free Software Foundation, Inc.
This program is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
Free Software Foundation; either version 2, or (at your option) any
later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. */
[dne]
/* This file reproduces the table in ISO/IEC 9899:1999 (C99) Annex
D, which is itself a reproduction from ISO/IEC TR 10176:1998, and
the similar table from ISO/IEC 14882:1998 (C++98) Annex E, which is
a reproduction of ISO/IEC PDTR 10176. Unfortunately these tables
are not identical. */
#ifndef LIBCPP_UCNID_H
#define LIBCPP_UCNID_H
#define C99 1
#define CXX 2
#define DIG 4
struct ucnrange
{
unsigned short lo, hi;
unsigned short flags;
};
static const struct ucnrange ucnranges[] = {
[table]
};
#endif /* LIBCPP_UCNID_H */
%%
; Table of UCNs which are valid in identifiers.
; Copyright (C) 2003, 2005 Free Software Foundation, Inc.
;
; This program is free software; you can redistribute it and/or modify it
; under the terms of the GNU General Public License as published by the
; Free Software Foundation; either version 2, or (at your option) any
; later version.
;
; This program is distributed in the hope that it will be useful,
; but WITHOUT ANY WARRANTY; without even the implied warranty of
; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
; GNU General Public License for more details.
;
; You should have received a copy of the GNU General Public License
; along with this program; if not, write to the Free Software
; Foundation, 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
;
; This file reproduces the table in ISO/IEC 9899:1999 (C99) Annex
; D, which is itself a reproduction from ISO/IEC TR 10176:1998, and
; the similar table from ISO/IEC 14882:1998 (C++98) Annex E, which is
; a reproduction of ISO/IEC PDTR 10176. Unfortunately these tables
; are not identical.
[C99]
......@@ -141,7 +119,6 @@ ac00-d7a3
0b3d 1fbe 203f-2040 2102 2107 210a-2113 2115 2118-211d 2124 2126 2128
212a-2131 2133-2138 2160-2182 3005-3007 3021-3029
[C99|DIG]
; Digits
0660-0669 06f0-06f9 0966-096f 09e6-09ef 0a66-0a6f 0ae6-0aef 0b66-0b6f
0be7-0bef 0c66-0c6f 0ce6-0cef 0d66-0d6f 0e50-0e59 0ed0-0ed9 0f20-0f33
......@@ -201,16 +178,12 @@ ac00-d7a3
; Malayalam
0d05-0d0c 0d0e-0d10 0d12-0d28 0d2a-0d39 0d60-0d61
# CORRECTION: Exclude 0e50-0e59 from the Thai range and make a fake
# Digits range for it, to match C99. cppcharset.c knows that C++
# doesn't distinguish digits from other UCNs valid in identifiers.
; Thai
0e01-0e30 0e32-0e33 0e40-0e46 0e4f-0e49 0e5a-0e5b
0e01-0e30 0e32-0e33 0e40-0e46 0e4f-0e5b
; Digits
0e50-0e59
# CORRECTION: Change 0e0d to 0e8d (typo in standard; see C++ DR 131)
; Lao
0e81-0e82 0e84 0e87-0e88 0e8a 0e8d 0e94-0e97 0e99-0e9f 0ea1-0ea3 0ea5
0ea7 0eaa-0eab 0ead-0eb0 0eb2 0eb3 0ebd 0ec0-0ec4 0ec6
......@@ -224,7 +197,6 @@ ac00-d7a3
; Katakana
30a1-30fe
# CORRECTION: language spelled "Bopmofo" in C++98.
; Bopomofo
3105-312c
......