Tech/HTTP/What happens if a Form or Cookie key name contains "="?

id: 1002 Owner: msakamoto-sf Created: 2011-07-31 16:43:40
Categories: HTTP, Network

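At work, the question came up: what happens if the key name of a Form field or a Cookie contains "="? For example:

<input type="text" name="foo=bar" value="" />

or

Set-Cookie: foo=bar=baz

What behavior do we actually see in these cases?

So, for both Form and Cookie, this memo first looks at how the RFCs treat "=" and then checks the actual behavior.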

Form: the URL query and x-www-form-urlencoded
Cookie "NAME=VALUE"
- Investigating the Netscape spec and RFC6265
- Experiment with PHP

Form: the URL query and x-www-form-urlencoded

First, the case of a Form submit, that is, the query part of a URL and x-www-form-urlencoded.

Investigating RFC2396, RFC2616 and RFC3986

Let's start by checking whether the RFCs specify anything.
RFC2616 describes the URL used in HTTP as follows. (see: http://www.studyinghttp.net/cgi-bin/rfc.cgi?2616#Sec3.2.2 )

http_URL = "http:" "//" host [ ":" port ] [ abs_path [ "?" query ]]

Which characters may appear in "host" and "query" is defined in Appendix A of RFC2396.
(see: http://tools.ietf.org/html/rfc2396 )

A. Collected BNF for URI

     URI-reference = [ absoluteURI | relativeURI ] [ "#" fragment ]
     absoluteURI   = scheme ":" ( hier_part | opaque_part )
     relativeURI   = ( net_path | abs_path | rel_path ) [ "?" query ]
...
     query         = *uric
...
     uric          = reserved | unreserved | escaped
     reserved      = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
                     "$" | ","
     unreserved    = alphanum | mark
     mark          = "-" | "_" | "." | "!" | "~" | "*" | "'" |
                     "(" | ")"

     escaped       = "%" hex hex
     hex           = digit | "A" | "B" | "C" | "D" | "E" | "F" |
                             "a" | "b" | "c" | "d" | "e" | "f"

Here we can see that "=" is part of "reserved". So what is the difference between "reserved" and "unreserved"? That is also described in RFC2396, from "2.2. Reserved Characters" onward.

2.2. Reserved Characters

  Many URI include components consisting of or delimited by, certain
  special characters.  These characters are called "reserved", since
  their usage within the URI component is limited to their reserved
  purpose.  If the data for a URI component would conflict with the
  reserved purpose, then the conflicting data must be escaped before
  forming the URI.
...
2.3. Unreserved Characters

  Data characters that are allowed in a URI but do not have a reserved
  purpose are called unreserved.  These include upper and lower case
  letters, decimal digits, and a limited set of punctuation marks and
  symbols.

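Roughly speaking, the characters in the "reserved" set are used as delimiters within the parts that make up a URL, so if you want to use one of them as data you must "escape" it. "Escape" here means what is commonly called percent-encoding ("%" followed by two hexadecimal digits); see "2.4. Escape Sequences" of the same RFC for details.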
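
As a small illustration of the escaping rule above, here is a minimal sketch using Python's standard urllib.parse module (Python is used purely for illustration; the key name "foo=bar" is just the example from the introduction).

```python
from urllib.parse import quote, unquote

# "=" is in RFC2396's "reserved" set, so a key that contains it as data
# has to be escaped before it is placed in the query component.
raw_key = "foo=bar"

# safe="" asks quote() to escape every reserved character, including "/".
escaped_key = quote(raw_key, safe="")
print(escaped_key)            # foo%3Dbar

# Decoding restores the original data.
print(unquote(escaped_key))   # foo=bar
```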

URLs, together with URNs, were later subsumed under URI in RFC3986. Let's look at "query" in RFC3986.
(see : http://tools.ietf.org/html/rfc3986#section-3.4 )

3.4.  Query

  The query component contains non-hierarchical data that, along with
  data in the path component (Section 3.3), serves to identify a
  resource within the scope of the URI's scheme and naming authority
  (if any).  The query component is indicated by the first question
  mark ("?") character and terminated by a number sign ("#") character
  or by the end of the URI.

     query       = *( pchar / "/" / "?" )

  The characters slash ("/") and question mark ("?") may represent data
  within the query component.  Beware that some older, erroneous
  implementations may not handle such data correctly when it is used as
  the base URI for relative references (Section 5.1), apparently
  because they fail to distinguish query data from path data when
  looking for hierarchical separators.  However, as query components
  are often used to carry identifying information in the form of
  "key=value" pairs and one frequently used value is a reference to
  another URI, it is sometimes better for usability to avoid percent-
  encoding those characters.

For the concrete character sets, see "Appendix A. Collected ABNF for URI":

Appendix A.  Collected ABNF for URI
  URI           = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
 ...
  pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

  query         = *( pchar / "/" / "?" )
...
  pct-encoded   = "%" HEXDIG HEXDIG
  unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
...
  sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                / "*" / "+" / "," / ";" / "="

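What RFC2396 lumped together as "reserved" is now split up somewhat, into "sub-delims" plus ":" and "@". Even so, "=" is still classified as "sub-delims", which can be read as meaning that percent-encoding is required when it is used as data.

That said, RFC2616, RFC2396 and RFC3986 go no further than saying that "=" is a delimiter within the query. They do not say what it delimits; in other words, they do not go into the internal structure of the query itself.

Investigating CGI, W3C HTML 4.01 and RFC1866

On the Web, the query is used for dynamic processing such as CGI. So let's see whether the definition of CGI's QUERY_STRING gives any hints. (see : http://www.studyinghttp.net/cgi#QUERY_STRING )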

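The URL syntax for a search string is described in section 3 of RFC 2396 [2]. The QUERY_STRING value is case-sensitive.
QUERY_STRING = query-string
query-string = *uric
uric         = reserved | unreserved | escaped
When parsing and decoding the query string, the details of the parsing, the reserved characters and the support for non US-ASCII characters depend on the context. For example, form submission from an HTML document [18] uses application/x-www-form-urlencoded encoding, in which the characters "+", "&" and "=" are reserved, and the ISO 8859-1 encoding may be used for non US-ASCII characters.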

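Again, the detailed format inside the query is not defined. So how is form submission from an HTML document specified? The reference "[18]" in the CGI text above points to the W3C "HTML 4.01 Specification", so let's look at the form specification in HTML 4.01 first.

In "17.13.4 Form content types" we finally find the description we were looking for.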
http://www.w3.org/TR/html401/interact/forms.html#h-17.13.4

application/x-www-form-urlencoded

This is the default content type. Forms submitted with this content type must be encoded as follows:
1. Control names and values are escaped. Space characters are replaced by `+', and then reserved characters are escaped as described in [RFC1738], section 2.2: Non-alphanumeric characters are replaced by `%HH', a percent sign and two hexadecimal digits representing the ASCII code of the character. Line breaks are represented as "CR LF" pairs (i.e., `%0D%0A').
2. The control names/values are listed in the order they appear in the document. The name is separated from the value by `=' and name/value pairs are separated from each other by `&'.

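The sentence "Control names and values are escaped." specifies that the name attribute of a form control is escaped as well.
In "17.4 The INPUT element" on the same page, the name attribute of the input tag is declared as CDATA, so there is a lot of freedom in what it may contain.

"application/x-www-form-urlencoded" apparently first appeared in RFC1866 (HTML 2.0).

http://tools.ietf.org/html/rfc1866
- see "8.2. Form Submission"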
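
The HTML 4.01 encoding rules quoted above can be tried out with a short sketch. It assumes Python's urllib.parse.urlencode behaves closely enough to a browser for this simple case; the single field mirrors the <input name="foo=bar"> example from the introduction.

```python
from urllib.parse import urlencode, parse_qsl

# One control whose name contains "=" and whose value is empty,
# like <input type="text" name="foo=bar" value="" />.
fields = [("foo=bar", "")]

# The "=" inside the name is escaped; the "=" separating name and value is not.
body = urlencode(fields)
print(body)  # foo%3Dbar=

# keep_blank_values=True keeps the pair even though the value is empty.
decoded = parse_qsl(body, keep_blank_values=True)
print(decoded)  # [('foo=bar', '')]
```

Because only the data "=" is escaped, the receiver can split each pair at the one remaining literal "=" and still recover the original name.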

Experiment

Create a simple Form and mix line breaks, spaces, "=" and so on into the name attribute.

<form action="./formtest.php" method="">
<input type="text" name=" ?abc
def =
ghi /
" value="foo" /><br />
<input type="submit" />
</form>

Submit this form from Chrome, IE9 and Firefox5.

Chrome, Firefox5.0:

...?+%3Fabc%0D%0Adef+%3D%0D%0Aghi+%2F%0D%0A+=foo

IE9:

...?+%3Fabc%250D%250Adef+%3D%250D%250Aghi+%2F%250D%250A+=foo

Only IE9 percent-encodes the line breaks twice...
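
To make the difference concrete, the two captured key strings can be decoded once each. This is only a sketch with Python's urllib.parse.unquote_plus standing in for a server-side form decoder; the two input strings are copied from the results above.

```python
from urllib.parse import unquote_plus

# Key part of the query string as sent by each browser (copied from above).
chrome_key = "+%3Fabc%0D%0Adef+%3D%0D%0Aghi+%2F%0D%0A+"
ie9_key = "+%3Fabc%250D%250Adef+%3D%250D%250Aghi+%2F%250D%250A+"

chrome_decoded = unquote_plus(chrome_key)
ie9_decoded = unquote_plus(ie9_key)

print(repr(chrome_decoded))  # ' ?abc\r\ndef =\r\nghi /\r\n '
print(repr(ie9_decoded))     # ' ?abc%0D%0Adef =%0D%0Aghi /%0D%0A '

# A second decoding pass on the IE9 key yields the Chrome/Firefox result.
print(unquote_plus(ie9_decoded) == chrome_decoded)  # True
```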

How does PHP decode this?

<?php var_export($_GET);

Chrome, Firefox5.0:

array (
  '?abc
def_=
ghi_/
_' => 'foo',
)

IE9:

array (
  '?abc%0D%0Adef_=%0D%0Aghi_/%0D%0A_' => 'foo',
)

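The leading space of the key name is removed, and the other spaces are replaced with "_". As for the line breaks, with Chrome and Firefox they are decoded as is, while with IE the double percent-encoding ends up decoded only once.

In PHP, dots and spaces are replaced with "_". This is probably a legacy of the days when register_globals automatically expanded request parameters into variables, since a dot is not among the characters allowed in a variable name.
http://jp.php.net/manual/ja/language.variables.external.php

Note:
Dots and spaces in variable names are converted to underscores.
For example <input name="a.b" /> becomes $_REQUEST["a_b"].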
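
The key rewriting observed above can be imitated roughly as follows. This is not PHP's actual implementation; mangle_key_like_php is a hypothetical helper that only models the two effects seen in this experiment (leading spaces dropped, remaining spaces and dots turned into "_"). PHP's handling of "[" and "]" for arrays is not modeled.

```python
def mangle_key_like_php(key: str) -> str:
    """Rough emulation of the key rewriting observed above (not PHP's real code)."""
    # Leading spaces are dropped.
    key = key.lstrip(" ")
    # Remaining spaces and dots become underscores.
    return key.replace(" ", "_").replace(".", "_")


print(mangle_key_like_php("a.b"))  # a_b

# The decoded key from the Chrome/Firefox submission above.
print(repr(mangle_key_like_php(" ?abc\r\ndef =\r\nghi /\r\n ")))
# '?abc\r\ndef_=\r\nghi_/\r\n_'
```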

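Cookie "NAME=VALUE"

What happens if the "NAME=VALUE" of a Cookie contains an unencoded "="?

Investigating the Netscape spec and RFC6265

First, the Netscape specification (Japanese translation).
http://www.futomi.com/lecture/cookie/specification.html

NAME=VALUE

This is a string that excludes semicolons, commas and spaces. If data containing a semicolon, comma or space needs to be set, some kind of encoding such as URL encoding is recommended. However, no encoding is actually defined, nor is one required. (When handling Japanese text, URL encoding is necessary.)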

This text does not tell us how an extreme example such as

NA=M%3DE=VA%3DLU=E

should be handled.

Just in case, let's check the original text preserved on web.archive.org.
http://web.archive.org/web/20020803110822/http://wp.netscape.com/newsref/std/cookie_spec.html

NAME=VALUE
This string is a sequence of characters excluding semi-colon, comma and white space. If there is a need to place such data in the name or value, some encoding method such as URL style %XX encoding is recommended, though no encoding is defined or required.

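It says the same as the Japanese version. That semicolons, commas and whitespace are treated as reserved characters and need some encoding such as percent-encoding is understandable.
But is "=" not a reserved character?

The latest Cookie specification is RFC6265, so let's look at that.
http://www.ietf.org/rfc/rfc6265.txt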

4.1.  Set-Cookie

   The Set-Cookie HTTP response header is used to send cookies from the
   server to the user agent.

4.1.1.  Syntax

   Informally, the Set-Cookie response header contains the header name
   "Set-Cookie" followed by a ":" and a cookie.  Each cookie begins with
   a name-value-pair, followed by zero or more attribute-value pairs.
   Servers SHOULD NOT send Set-Cookie headers that fail to conform to
   the following grammar:

 set-cookie-header = "Set-Cookie:" SP set-cookie-string
 set-cookie-string = cookie-pair *( ";" SP cookie-av )
 cookie-pair       = cookie-name "=" cookie-value
 cookie-name       = token
 cookie-value      = *cookie-octet / ( DQUOTE *cookie-octet DQUOTE )
 cookie-octet      = %x21 / %x23-2B / %x2D-3A / %x3C-5B / %x5D-7E
                       ; US-ASCII characters excluding CTLs,
                       ; whitespace DQUOTE, comma, semicolon,
                       ; and backslash
 token             = <token, defined in [RFC2616], Section 2.2>

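Looking first at "cookie-octet", which makes up "cookie-value", "=" (0x3D) falls within the "%x3C-5B" range. In other words it gets no special treatment and, just like alphanumeric characters, may appear in "cookie-value" without any problem.
For "token", which makes up "cookie-name", see Section 2.2 of RFC2616.
http://www.ietf.org/rfc/rfc2616.txt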

2.2 Basic Rules
(...)
Many HTTP/1.1 header field values consist of words separated by LWS
or special characters. These special characters MUST be in a quoted
string to be used within a parameter value (as defined in section
3.6).

token          = 1*<any CHAR except CTLs or separators>
separators     = "(" | ")" | "<" | ">" | "@"
               | "," | ";" | ":" | "\" | <">
               | "/" | "[" | "]" | "?" | "="
               | "{" | "}" | SP | HT

According to this, "=" is a separator and therefore cannot appear in a "token". In other words, "cookie-name" cannot contain "=".
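
The two grammar rules read so far ("token" from RFC2616 and "cookie-octet" from RFC6265) can be written down as small checks. This is a sketch transcribed by hand from the ABNF quoted above; is_token and is_cookie_octet are hypothetical helper names.

```python
# Separators from RFC2616 section 2.2 (SP and HT included).
SEPARATORS = set('()<>@,;:\\"/[]?={} \t')


def is_token(s: str) -> bool:
    """True if s matches token = 1*<any CHAR except CTLs or separators>."""
    if not s:
        return False
    for ch in s:
        code = ord(ch)
        # CHAR is US-ASCII 0-127; CTLs are 0-31 and 127.
        if code < 32 or code > 126 or ch in SEPARATORS:
            return False
    return True


def is_cookie_octet(ch: str) -> bool:
    """True if ch falls in the cookie-octet ranges of RFC6265 section 4.1.1."""
    code = ord(ch)
    return (code == 0x21 or 0x23 <= code <= 0x2B or 0x2D <= code <= 0x3A
            or 0x3C <= code <= 0x5B or 0x5D <= code <= 0x7E)


print(is_token("abc"))        # True
print(is_token("abc=def"))    # False: "=" is a separator
print(is_cookie_octet("="))   # True: allowed inside cookie-value
print(is_cookie_octet(";"))   # False
```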

So, for the example

na=me=va=lue

the "cookie-name" cannot contain "=", which suggests the following split:

cookie-name : "na"
"="
cookie-value : "me=va=lue"

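Back in RFC6265, "5.2. The Set-Cookie Header" contains text that backs up this guess. This section describes how a user agent such as a browser should process the "Set-Cookie" response header.
It is a little long, but the part that describes the handling of "cookie-name" and "cookie-value" is quoted in full below.
http://www.ietf.org/rfc/rfc6265.txt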

5.2.  The Set-Cookie Header

   When a user agent receives a Set-Cookie header field in an HTTP
   response, the user agent MAY ignore the Set-Cookie header field in
   its entirety.  For example, the user agent might wish to block
   responses to "third-party" requests from setting cookies (see
   Section 7.1).

   If the user agent does not ignore the Set-Cookie header field in its
   entirety, the user agent MUST parse the field-value of the Set-Cookie
   header field as a set-cookie-string (defined below).

   NOTE: The algorithm below is more permissive than the grammar in
   Section 4.1.  For example, the algorithm strips leading and trailing
   whitespace from the cookie name and value (but maintains internal
   whitespace), whereas the grammar in Section 4.1 forbids whitespace in
   these positions.  User agents use this algorithm so as to
   interoperate with servers that do not follow the recommendations in
   Section 4.

   A user agent MUST use an algorithm equivalent to the following
   algorithm to parse a "set-cookie-string":

   1.  If the set-cookie-string contains a %x3B (";") character:

          The name-value-pair string consists of the characters up to,
          but not including, the first %x3B (";"), and the unparsed-
          attributes consist of the remainder of the set-cookie-string
          (including the %x3B (";") in question).

       Otherwise:

          The name-value-pair string consists of all the characters
          contained in the set-cookie-string, and the unparsed-
          attributes is the empty string.

   2.  If the name-value-pair string lacks a %x3D ("=") character,
       ignore the set-cookie-string entirely.

   3.  The (possibly empty) name string consists of the characters up
       to, but not including, the first %x3D ("=") character, and the
       (possibly empty) value string consists of the characters after
       the first %x3D ("=") character.

   4.  Remove any leading or trailing WSP characters from the name
       string and the value string.

   5.  If the name string is empty, ignore the set-cookie-string
       entirely.

   6.  The cookie-name is the name string, and the cookie-value is the
       value string.

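The key point is step "3.": the name string is everything up to, but not including, the first %x3D ("=") character, and the value string is everything after that first %x3D ("=") character.
This finally confirms that, per the RFC,

na=me=va=lue

is interpreted as

cookie-name : "na"
"="
cookie-value : "me=va=lue"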
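
Steps 1 to 6 quoted from RFC6265 section 5.2 can be sketched as follows. This is a minimal illustration, not a complete cookie parser: parse_name_value is a hypothetical helper, and the unparsed-attributes part (everything from the first ";") is simply discarded.

```python
from typing import Optional, Tuple

WSP = " \t"  # WSP = SP / HTAB


def parse_name_value(set_cookie_string: str) -> Optional[Tuple[str, str]]:
    # Step 1: the name-value-pair is everything before the first ";".
    name_value_pair = set_cookie_string.split(";", 1)[0]
    # Step 2: without any "=", the whole set-cookie-string is ignored.
    if "=" not in name_value_pair:
        return None
    # Step 3: split at the FIRST "=" only; later "=" stay in the value.
    name, value = name_value_pair.split("=", 1)
    # Step 4: strip leading and trailing whitespace from both strings.
    name, value = name.strip(WSP), value.strip(WSP)
    # Step 5: an empty name means the set-cookie-string is ignored.
    if not name:
        return None
    # Step 6: these are the cookie-name and cookie-value.
    return name, value


print(parse_name_value("na=me=va=lue"))
# ('na', 'me=va=lue')
print(parse_name_value("abc=def%3Dghi=ABC=DEF%3DGHI; Path=/"))
# ('abc', 'def%3Dghi=ABC=DEF%3DGHI')
print(parse_name_value("novalue"))
# None
```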

Experiment with PHP

Call setrawcookie() in PHP as follows.

setrawcookie("abc=def%3Dghi", "ABC=DEF%3DGHI");

On PHP 4.4.9 this worked without problems, but on PHP 5.2.17 the following Warning was raised and no Set-Cookie header was output.

PHP Warning:  Cookie names can not contain any of the following '=,; \t\r\n\013\014' in ...

The same happens with setcookie().
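
The character list in the warning message can be turned into a simple check. This sketch only mirrors the list printed in the warning (where \013 and \014 are octal escapes for vertical tab and form feed); find_forbidden is a hypothetical helper and says nothing about how PHP implements the check internally.

```python
# Characters listed in the warning: '=,; \t\r\n\013\014'
FORBIDDEN_IN_NAME = set("=,; \t\r\n\x0b\x0c")


def find_forbidden(name: str) -> list:
    """Return the forbidden characters found in a cookie name, in order."""
    return [ch for ch in name if ch in FORBIDDEN_IN_NAME]


print(find_forbidden("abc=def%3Dghi"))  # ['=']
print(find_forbidden("abc"))            # []
```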

So I had no choice but to output Set-Cookie directly with header() and see how each browser reacts.

header('Set-Cookie: abc=def%3Dghi=ABC=DEF%3DGHI');

→

HTTP/1.1 200 OK
...
Set-Cookie: abc=def%3Dghi=ABC=DEF%3DGHI

On the PHP side, dump $_COOKIE as well.

var_export($_COOKIE);

Chrome

- Options → Advanced settings → Privacy → Content settings → Cookies → All cookies and site data...

Name: abc
Content: def%3Dghi=ABC=DEF%3DGHI

- Cookie header sent by the browser after the Set-Cookie:

Cookie: abc=def%3Dghi=ABC=DEF%3DGHI

- $_COOKIE on the PHP side:

array( 'abc' => 'def=ghi=ABC=DEF=GHI' )

IE9

- Enter "javascript:prompt('', document.cookie)" in the address bar:

abc=def%3Dghi=ABC=DEF%3DGHI

- Cookie header sent by the browser after the Set-Cookie:
(same as Chrome)

- $_COOKIE on the PHP side:
(same as Chrome)

Firefox5

- "Tools" → "Options" → "Privacy" → "History" → "remove individual cookies"

Name: abc
Content: def%3Dghi=ABC=DEF%3DGHI

- Cookie header sent by the browser after the Set-Cookie:
(same as Chrome)

- $_COOKIE on the PHP side:
(same as Chrome)
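
The $_COOKIE dump above is consistent with a server that splits each pair at the first "=" and then URL-decodes the value. The following sketch reproduces that result under this assumption; it is not PHP's actual code, and urllib.parse.unquote is only a stand-in for PHP's decoding.

```python
from urllib.parse import unquote

# Value of the Cookie request header sent by all three browsers.
cookie_header_value = "abc=def%3Dghi=ABC=DEF%3DGHI"

cookies = {}
for pair in cookie_header_value.split(";"):
    pair = pair.strip()
    if "=" not in pair:
        continue
    # Split at the first "=" only, then percent-decode the value.
    name, value = pair.split("=", 1)
    cookies[name] = unquote(value)

print(cookies)  # {'abc': 'def=ghi=ABC=DEF=GHI'}
```

After decoding, the "=" that was sent as %3D can no longer be told apart from the literal ones, which matches what PHP showed.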

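Summary in three lines:

Form: the browser automatically percent-encodes "=" for you.
Cookie: everything up to the first "=" is the name, and everything after it is the value.
Some platforms, such as PHP, define their own rules, so be careful.

Personally, unless I need to pass something in a special form such as an array, I would rather get by without fancy symbols in key names.
For arrays and the like you have to follow the format defined by the platform, but even then I would avoid symbols that stray from it or cause confusion.
You never know where a landmine is buried, so I prefer to stick to the safe route whenever possible.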