[Mew-dist 08981] Correction for RFC2047 (Re: 悩ましいこと)

1999年 5月 26日 (水) 04:17:22 JST

後藤＠太陽計測です

もしかしたら かずさんが現在行っている修正とぶつかるかもしれないけど、
ヘッダでの rfc2047 encoding/decodingに関する不具合報告とパッチです。


>>>>> at Thu, 20 May 1999 16:30:12 +0900, kazz <kazu at example.com> said,

kazz> (1) RFC 2047 のヘッダ符号

kazz> - 復号： Subject: で text なので、mew-decode-quoted は関係ないのですが、
kazz> 今は関係してしまっている。

これに関する件と、プラスアルファのパッチです。(for mew-1.94b29)

なのですが、

kazz> - 符号： 折り返しが甘い。符号化と折り返しを同時にやるのではなく、とり
kazz> あえず符号化し、あとから折り返すところを見付けるアルゴリズムの方がいい
kazz> ような気がしています。こうすると、復号のアルゴリズムと対称になりますし。

このあたりにからんで、かずさんのほうですでに手を入れているとややこしいの
で、一応パッチ自体は参考までに、ということで、本パッチが対処している問題
点と具体例を挙げておきます。


1. non-structuredな text部で double quote <">文字が特別扱いしてdecode
   してしまっている問題。
   (mew-header-decode-regionを修正)

   これは既に報告されている問題ですのですね。

  (例)n
     Subject: " =?ISO-2022-JP?B?GyRCJCIbKEI=?= "
  => Subject: " =?ISO-2022-JP?B?GyRCJCIbKEI=?= "
  (パッチ後)
  => Subject: " あ "


2. decode時、闇雲に "\\?=[ \t]=\\?" を探し出してこれを "?==?"に
   連結してしまう問題。
   (mew-header-decode-regionを修正)

  (例1)
     Subject: =?ISO-8859-1?Q?a?= ?= =?
  => Subject: a ?==?
  (パッチ後)
  => Subject: a ?= =?

  (例2)
    Subject: =?ISO-2022-JP?B?GyRCJCIbKEI=?= =?= =?ISO-2022-JP?B?GyRCJCIbKEI=?=
  =>Subject: あ=?= あ
  (パッチ後)
  =>Subject: あ =?= あ


3. structuredな phrase部で quoted-pairな double quote <">の判定が
   あまく、正しい処理を行っていなかった問題。
   (mew-header-decode-regionを修正)

  (例)
     From: "\"\\" =?ISO-2022-JP?B?GyRCJCIbKEI=?= "\"\\" <akr at example.com>
  => From: "\"\\" =?ISO-2022-JP?B?GyRCJCIbKEI=?= "\"\\" <akr at example.com>
  (パッチ後)
  => From: "\"\\" あ "\"\\" <akr at example.com> 


4. encode時、mew-header-encode-split後に、要encodeな文字列を
   encodeするさい、その文字列が単一のcharsetで表現できない場合に
   elispレベルのエラーとなりencodeできなかった問題。
   (mew-header-encode-textを修正)

   具体的には、iso-8859-8の文字とiso-2022-jpの文字が連続している場合。
   mew-charset-guess-region が"unknown" を返し、その際の処理を考慮して
   いないため、エラーとなっていた。
   パッチでは単一エンコードに収まるような小片に分割するための
   mew-header-encode-split-by-charsetを追加した。

  (例)
   iso-8859-8の code 0xE0 の <aleph>  
   iso-2022-jpの「零」

     Subject: <aleph>零" 
   =>error
  (パッチ後)
   =>Subject: =?iso-8859-8?B?4A==?=
      =?iso-2022-jp?B?GyRCTm0bKEI=?=
  (正解)
   =>Subject: =?iso-8859-8?B?4A==?= =?iso-2022-jp?B?GyRCTm0bKEI=?=

  修正個所を多くしないパッチのため、「正解」になるようなコードは
  とりあえず保留


これらの例は Tanaka Akira <akr at example.com>さんの doodleに含まれる
テストパターンを試して、そのうちの致命的で、なおかつMewの現コードの
構造に沿って修正できるものを対処したつもりです。

このパターンに関しては、つい先ごろ 

Subject: [Mew-dist 08751] Re: B-encoded header routine
From: Tanaka Akira <akr at example.com>
Date: Sat, 8 May 1999 02:59:35 +0900
Message-Id: <rsqzp3gkbg9.fsf at example.com>

でも話題になったものです。

ちなみにこの意地悪テストを通すと他にもいっぱいfailになります。(^^;

かずさんもテストパターンをお持ちと思いますが、このパターン、試して
みませんか？ doodle付属のTESTPATをmewにかけるための若干のコードを
作ってあります。 ご要望とあらばテスト結果もすぐ送れますがいかがでしょうか。

--- Regards,
 Shun-ichi Goto  <gotoh at example.com>
   R&D Group, TAIYO Corp., Tokyo, JAPAN

-------------- next part --------------
--- mew-bq.el.orig	Thu May 13 04:54:26 1999
+++ mew-bq.el	Tue May 25 14:13:16 1999
@@ -30,6 +30,27 @@
 (defconst mew-header-decode-regex 
   "=\\?\\([^?]+\\)\\?\\(.\\)\\?\\([^?]+\\)\\?=")
 
+(defvar mew-cs-coding-database nil)
+
+(let ((db mew-cs-database)
+      elem charset csl cs)
+  (while db
+    (setq charset (car (car db))
+	  csl (car (cdr (car db)))
+	  db (cdr db))
+    (while csl
+      (setq cs (car csl)
+	    csl (cdr csl))
+      (if (eq cs 'ascii)
+	  ()				; do nothing
+	(setq elem (assq cs mew-cs-coding-database))
+	(if elem
+	    ;; append enabled charset
+	    (nconc elem (list charset))
+	  ;; newly create
+	  (setq mew-cs-coding-database 
+		(cons (list cs charset) mew-cs-coding-database)))))))
+	  
 ;;;
 ;;;
 ;;;
@@ -263,7 +284,7 @@
 ;; of results. If it is longer than 75, `str' is split and `substr1' and
 ;; `substr2' are encoded.... Repeat this recursively but not so deeply.
 
-(defun mew-header-encode-string (str)
+(defun mew-header-encode-string-subr (str)
   (let ((encoded-word (mew-header-encode str)))
     (if (> (length encoded-word) mew-encode-word-max-length)
         (let ((med (/ (length str) 2))
@@ -271,10 +292,57 @@
           (while (< i med)
             (setq i (+ i (mew-charlen (mew-aref str i)))))
           (append
-           (mew-header-encode-string (substring str 0 i))
-           (mew-header-encode-string (substring str i nil))))
+           (mew-header-encode-string-subr (substring str 0 i))
+           (mew-header-encode-string-subr (substring str i nil))))
       (list encoded-word))))
 
+(defun mew-header-encode-string (str)
+  ;; split string with charset
+  (let ((str-cs-list (mew-header-encode-split-string-by-charset str))
+	result)
+    (while str-cs-list
+      (setq result (nconc result 
+			  (mew-header-encode-string-subr (car str-cs-list)))
+	    str-cs-list (cdr str-cs-list)))
+    result))
+
+
+(defun mew-header-encode-split-string-by-charset (str)
+  "Split STR to be encoded as single charset."
+  (let ((len (length str))
+	(last 0)
+	(pos 0)
+	result cs ch available using charsets and tmpl)
+    (while (< pos len)
+      (setq ch (mew-aref str pos)
+	    cs (char-charset ch))
+      (if (null available)
+	  (setq available (cdr (assq cs mew-cs-coding-database))))
+      (if (or (eq cs 'ascii) (memq cs using))
+	  (setq pos (+ pos (mew-charlen ch))) ; accept
+	;; update for new charset
+	(setq charsets (cdr (assq cs mew-cs-coding-database))
+	      and nil
+	      tmpl available)
+	;; find common charset in AVAILABLE and CHARSETS
+	(while tmpl
+	  (if (member (car tmpl) charsets)
+	      (setq and (cons (car tmpl) and)))
+	  (setq tmpl (cdr tmpl)))
+	;; check it
+	(if and
+	    ;; accept this char
+	    (setq available and
+		  using (cons cs using)
+		  pos (+ pos (mew-charlen ch)))
+	  ;; Do not accept this char, split at this point
+	  (setq result (cons (substring str last pos) result)
+		last pos
+		available nil
+		using nil))))
+    (setq result (cons (substring str last pos) result))
+    (nreverse result)))
+
 (defun mew-header-encode-split-string (str)
   "Split STR to need-to-encode string and non-encode-string."
   (let ((start 0) beg end ret)
@@ -595,31 +663,49 @@
     (if (and (not unfold) 
 	     (not (re-search-forward mew-header-decode-regex nil t)))
 	(mew-header-unfold-region type)
-      (let ((regex (concat "\"\\|" mew-header-decode-regex))
-	    beg end cs-str)
+      (let ((endq-regex "\\(\\\\*\\)\"")
+	    (next-regex (concat "[ \t]*" mew-header-decode-regex))
+	    regex beg end cs-str esc )
+	(if (or mew-decode-quoted (memq type '(text comma-text)))
+	    ;; unstructured or allow decoding in quoted-string
+	    (setq regex mew-header-decode-regex)
+	  ;; structured or deny decoding in quoted-string
+	  (setq regex (concat endq-regex "\\|" mew-header-decode-regex)))
 	(mew-header-unfold-region type)
 	;; In Page 10 of RFC 2047 says, "When displaying a particular 
 	;; header field that contains multiple 'encoded-word's, any 
 	;; 'linear-white-space' that separates a pair of adjacent 
 	;; 'encoded-word's is ignored".
 	(goto-char (point-min))
-	(while (re-search-forward "\\?=[ \t]+=\\?" nil t)
-	  (replace-match "?==?" nil t))
-	(goto-char (point-min))
 	(while (re-search-forward regex nil t)
 	  ;; encoded-word in quoted-string should not be decoded
 	  ;; according to RFC 2047. However, if users wish
 	  ;; (ie mew-decode-quoted is *non-nil*), decode it.
-	  (if (eq (char-after (match-beginning 0)) ?\")
-	      (if (or mew-decode-quoted (re-search-forward "[^\\]\"" nil t))
-		  (goto-char (match-end 0)))
-	    (setq beg (match-beginning 0)
-		  end (match-end 0)
-		  cs-str (mew-header-decode (mew-match 1)
-					    (mew-match 2)
-					    (mew-match 3)))
-	    (delete-region beg end)
-	    (insert cs-str)))))
+	  (goto-char (match-beginning 0))
+	  ;; count escapes leads to double quote char
+	  (if (not (eq (following-char) ?\\))
+	      (setq esc 0)
+	    (setq esc (- (match-end 1) (match-beginning 1))))
+	  (forward-char esc)
+	  ;; Now pointer is skipped leading escapes
+	  (if (not (eq (following-char) ?\"))
+	    (while (looking-at next-regex)
+	      (setq beg (match-beginning 0)
+		    end (match-end 0)
+		    cs-str (mew-header-decode (mew-match 1)
+					      (mew-match 2)
+					      (mew-match 3)))
+	      (delete-region beg end)
+	      (insert cs-str))
+	    ;; beginning of quoted string
+	    (goto-char (match-end 0))
+	    (if (= 0 (% esc 2))
+	      ;; skip to end of quoted-string
+		(while (and (re-search-forward endq-regex nil t)
+			    (setq esc (- (match-end 1) (match-beginning 1)))
+			    (= 1 (% esc 2)))
+		  (goto-char (match-end 0))))
+	    ))))
     (goto-char (point-max))))
 
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;