utf-8 - Java XMLReader not clearing multi-byte UTF-8 encoded attributes


I've got a strange situation where a SAX ContentHandler is being handed bad Attributes by the XMLReader. The document being parsed is UTF-8, with multi-byte characters inside XML attributes. What appears to happen is that these attributes are accumulated each time my handler is called. Rather than being passed in succession, each value is concatenated onto the previous node's value.

Here is an example that demonstrates this using public data (Wikipedia).

    public class MyContentHandler extends org.xml.sax.helpers.DefaultHandler {

        public static void main(String[] args) {
            try {
                org.xml.sax.XMLReader reader = org.xml.sax.helpers.XMLReaderFactory.createXMLReader();
                reader.setContentHandler(new MyContentHandler());
                reader.parse("http://en.wikipedia.org/w/api.php?format=xml&action=query&list=allpages&apfilterredir=redirects&apdir=descending");
            } catch (Exception ex) {
                ex.printStackTrace();
            }
        }

        // Print the "title" attribute of every <p> element.
        public void startElement(String uri, String localName, String qName, org.xml.sax.Attributes attributes) {
            if ("p".equals(qName)) {
                String title = attributes.getValue("title");
                System.out.println(title);
            }
        }
    }

Update: this complete example produces the following (apologies to Cantonese speakers for the vulgar output):

    𩧢
    𩧢𨳒
    𩧢𨳒🛅
    𩧢𨳒🛅🛄
    𩧢𨳒🛅🛄🛃
    𩧢𨳒🛅🛄🛃🛂
    𩧢𨳒🛅🛄🛃🛂🛁
    𩧢𨳒🛅🛄🛃🛂🛁🛀
    𩧢𨳒🛅🛄🛃🛂🛁🛀🚿
    𩧢𨳒🛅🛄🛃🛂🛁🛀🚿🚾

Does anyone have a clue what is happening, and how to fix it? What comes in from the document doesn't match what I see happening when I debug through this snippet.

This seems to be a bug in the version of Xerces included in the JRE (com.sun.org.apache.xerces.internal.parsers.SAXParser). Below are my notes.

The version bundled with JRE 1.6.0_24, as well as v2.4.0, v2.5.0, and v2.6.0, does this accumulation of attributes.

Xerces-J v1.4.4 does not appear to have this bug.

Xerces2-J v2.6.1, v2.6.2, v2.9.0, and v2.11.0 do not appear to have this bug.
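To check whether a particular parser build shows the accumulation, a small offline test may be easier than hitting the Wikipedia API each time. The following is only a sketch: the class `AccumulationCheck` and its methods are hypothetical names, the sample document is made up, and it uses `SAXParserFactory` rather than the `XMLReaderFactory` call from the question. The prefix test is a heuristic, so titles that happen to extend one another would be misreported.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of the original post.
class AccumulationCheck {

    // Collects the "title" attribute of every <p> element, like the handler in the question.
    static List<String> collectTitles(String xml) {
        final List<String> titles = new ArrayList<String>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attributes) {
                if ("p".equals(qName)) {
                    titles.add(attributes.getValue("title"));
                }
            }
        };
        try {
            // Feed the parser UTF-8 bytes so the multi-byte decoding path is exercised.
            byte[] utf8 = xml.getBytes(StandardCharsets.UTF_8);
            SAXParserFactory.newInstance().newSAXParser().parse(new ByteArrayInputStream(utf8), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return titles;
    }

    // Heuristic: true when every value after the first is longer than,
    // and starts with, the value before it.
    static boolean looksAccumulated(List<String> titles) {
        if (titles.size() < 2) {
            return false;
        }
        for (int i = 1; i < titles.size(); i++) {
            String prev = titles.get(i - 1);
            String cur = titles.get(i);
            if (cur.length() <= prev.length() || !cur.startsWith(prev)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Two characters outside the Basic Multilingual Plane (4 bytes each in UTF-8).
        String a = new String(Character.toChars(0x1F6C5));
        String b = new String(Character.toChars(0x1F6C4));
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<r><p title=\"" + a + "\"/><p title=\"" + b + "\"/></r>";
        System.out.println("accumulated: " + looksAccumulated(collectTitles(xml)));
    }
}
```

Running `main` against each parser version on the classpath gives a quick yes/no signal, which is one way the bisecting described here could be automated.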

You can tell which versions I tested from the way I bisected the version history. The bug appears to have been fixed between v2.6.0 and v2.6.1. I'm kind of surprised the JRE hasn't been updated, given that this was fixed in the main Xerces codebase 7 years ago!
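Given those findings, one possible workaround is to put a newer standalone Xerces2-J jar (`xercesImpl.jar`) on the classpath and ask for its parser explicitly instead of the JRE-bundled one. The sketch below assumes that jar is present and falls back to the platform default when it is not. `ParserChooser` is a hypothetical name, and I have not verified this against every Xerces release listed above.

```java
import javax.xml.parsers.FactoryConfigurationError;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;

// Hypothetical helper, not part of the original post.
class ParserChooser {

    // JAXP factory class shipped in the standalone Xerces2-J jar (assumed to be on the classpath).
    static final String XERCES_FACTORY = "org.apache.xerces.jaxp.SAXParserFactoryImpl";

    static XMLReader newReader() {
        SAXParserFactory factory;
        try {
            // Ask for the standalone Xerces implementation by name.
            factory = SAXParserFactory.newInstance(XERCES_FACTORY, ParserChooser.class.getClassLoader());
        } catch (FactoryConfigurationError e) {
            // Jar not available: fall back to whatever the JRE provides.
            factory = SAXParserFactory.newInstance();
        }
        try {
            return factory.newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Printing the class name shows which implementation was actually picked up.
        System.out.println(newReader().getClass().getName());
    }
}
```

The returned reader can be used in place of the one from `XMLReaderFactory.createXMLReader()` in the question. Unlike that reader, one obtained this way is not namespace-aware by default, which does not matter for a handler that only looks at `qName`.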

