I was working on a Java application recently when I got the following exception:

org.xml.sax.SAXParseException: An invalid XML character (Unicode: 0x1a) was found in the element content of the document.

I was using Castor in an attempt to unmarshal (XML -> Java objects) an XML string. The XML originated from a bookmarks file that was uploaded into the application by the user, tidied up, transformed and then stored in an Oracle database. The error occurred at the point where the XML string was returned from the database and was being converted into Java objects.

I immediately assumed it to be some kind of character encoding conversion problem.

With some databases (e.g. MySQL) it is possible to have data passed as UTF-8 by supplying certain settings via the JDBC URL. I do not use MySQL, so I do not know whether this would have fixed my problem, although I suspect it would not.

I tried various methods to convert my XML string into valid UTF8 and I was pretty sure that I had achieved satisfactory UTF8 conversion but I still got the error.

This is when I discovered that not all valid UTF8 characters are valid XML characters, which probably makes sense (what with control characters and such) but I have never had to think about this before.
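
To make that distinction concrete, here is a minimal sketch. The class and method names are hypothetical, chosen only for illustration. It shows that the character from the exception (0x1a) survives a UTF-8 round trip without complaint, yet falls outside the Char ranges of the XML 1.0 specification.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class XmlCharCheck {

    /** Returns true if the code point is allowed by the XML 1.0 Char production. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Encodes the string as UTF-8 and decodes it again; true if nothing changed. */
    static boolean survivesUtf8(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8).equals(s);
    }
}
```

In other words, converting to UTF-8 cannot fix this class of error: the encoding is fine, and it is the character itself that XML forbids.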

After several hours spent trying numerous UTF-8 conversion techniques, I eventually found a solution on the Xalan mailing list. I am reproducing it here because it was not mentioned in the context of the "Unicode: 0x1a" error; if it had been, I would have found it more quickly. The XML standard specifies which Unicode characters are valid in XML documents, so it is possible to take a document and filter out all the invalid characters using a method like this:

    /**
     * This method ensures that the output String has only
     * valid XML unicode characters as specified by the
     * XML 1.0 standard. For reference, please see
     * <a href="http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char">the
     * standard</a>. This method will return an empty
     * String if the input is null or empty.
     *
     * @param in The String whose non-valid characters we want to remove.
     * @return The in String, stripped of non-valid characters.
     */
    public String stripNonValidXMLCharacters(String in) {
        StringBuffer out = new StringBuffer(); // Used to hold the output.
        char current; // Used to reference the current character.

        if (in == null || ("".equals(in))) return ""; // vacancy test.
        for (int i = 0; i < in.length(); i++) {
            current = in.charAt(i); // NOTE: No IndexOutOfBoundsException caught here; it should not happen.
            if ((current == 0x9) ||
                (current == 0xA) ||
                (current == 0xD) ||
                ((current >= 0x20) && (current <= 0xD7FF)) ||
                ((current >= 0xE000) && (current <= 0xFFFD)) ||
                ((current >= 0x10000) && (current <= 0x10FFFF)))
                out.append(current);
        }
        return out.toString();
    }    

I hope somebody else finds this useful and that it saves them a few hours of head scratching. Alternatively, if people reading this know of a better solution, then please do let me know!

Comments
 
Thank you so very much, I was searching the internet almost this whole afternoon.... Thanks!
I think that the solution is broken because Java chars never exceed 0xFFFF, so some of the comparisons are useless. The idea is to compare Unicode code points, not UTF-16 code units (the Java char type). Here is my solution:

    /**
     * This method ensures that the output String has only valid XML unicode characters as specified by the
     * XML 1.0 standard. For reference, please see the
     * standard. This method will return an empty String if the input is null or empty.
     *
     * @author Donoiu Cristian, GPL
     * @param s The String whose non-valid characters we want to remove.
     * @return The in String, stripped of non-valid characters.
     */
    public static String removeInvalidXMLCharacters(String s) {
        if (s == null || s.length() == 0) return "";            // Vacancy test, as promised in the Javadoc.
        StringBuilder out = new StringBuilder();                // Used to hold the output.
        int codePoint;                                          // Used to reference the current character.
        //String ss = "\ud801\udc00";                           // This is actually one Unicode character, represented by two code units!
        //System.out.println(ss.codePointCount(0, ss.length()));// See: 1
        int i = 0;
        while (i < s.length()) {
            codePoint = s.codePointAt(i);                       // This is the Unicode code point of the character.
            if ((codePoint == 0x9) ||                           // Consider testing larger ranges first to improve speed.
                    (codePoint == 0xA) ||
                    (codePoint == 0xD) ||
                    ((codePoint >= 0x20) && (codePoint <= 0xD7FF)) ||
                    ((codePoint >= 0xE000) && (codePoint <= 0xFFFD)) ||
                    ((codePoint >= 0x10000) && (codePoint <= 0x10FFFF))) {
                out.append(Character.toChars(codePoint));
            }
            i += Character.charCount(codePoint);                // Advance by the number of code units (Java chars) needed to represent this code point.
        }
        return out.toString();
    }

Hope it helps.
Thanks a ton for this code. It really saved a lot of search and work !!
We're using Java 1.4.2 and a few methods can't seem to be found. Is there an equivalent for Character.toChars() and Character.charCount()? I used the following: codePoint = s.charAt(i);
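
For the Java 1.4.2 question above, one possible approach is to recognise surrogate pairs by hand, using nothing but char comparisons. This is only a sketch: the class name is hypothetical and the code has not been run on a 1.4 JVM, although it deliberately avoids every API added in Java 5. It relies on the fact that a well-formed surrogate pair always encodes a code point between 0x10000 and 0x10FFFF, all of which XML 1.0 allows.

```java
// Hypothetical helper written in a Java 1.4 style: no code point methods, no StringBuilder.
final class LegacyXmlFilter {

    /** Removes characters that XML 1.0 does not allow; returns "" for null input. */
    static String strip(String s) {
        if (s == null) return "";
        StringBuffer out = new StringBuffer(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            boolean high = c >= 0xD800 && c <= 0xDBFF;
            if (high && i + 1 < s.length()) {
                char next = s.charAt(i + 1);
                if (next >= 0xDC00 && next <= 0xDFFF) {
                    // Well-formed pair: keep both code units and skip past them.
                    out.append(c).append(next);
                    i += 2;
                    continue;
                }
            }
            // Single code unit. Unpaired surrogates fail this test and are dropped.
            if (c == 0x9 || c == 0xA || c == 0xD
                    || (c >= 0x20 && c <= 0xD7FF)
                    || (c >= 0xE000 && c <= 0xFFFD)) {
                out.append(c);
            }
            i++;
        }
        return out.toString();
    }
}
```

Simply using codePoint = s.charAt(i) keeps working for the Basic Multilingual Plane, but it would then drop both halves of every surrogate pair, because neither half falls inside the allowed ranges on its own.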
Yeay, thanks for the Get-Out-Of-Jail-Free card :)
Thanks a lot! It saved my day too :)
Excellent ! Exactly what I was needing.
Here's the .NET version:

        public static string stripNonValidXMLCharacters(string s)
        {
            if (string.IsNullOrEmpty(s)) return string.Empty; // vacancy test; must run before s is dereferenced.

            StringBuilder _validXML = new StringBuilder(s.Length, s.Length); // Used to hold the output.
            char current; // Used to reference the current character.
            char[] charArray = s.ToCharArray();

            for (int i = 0; i < charArray.Length; i++)
            {
                current = charArray[i];
                if ((current == 0x9) ||
                    (current == 0xA) ||
                    (current == 0xD) ||
                    ((current >= 0x20) && (current <= 0xD7FF)) ||
                    ((current >= 0xE000) && (current <= 0xFFFD)) ||
                    ((current >= 0x10000) && (current <= 0x10FFFF)))
                    _validXML.Append(current);
            }
            return _validXML.ToString();
        }

Also wanted to say, thanks! This just saved me a serious parsing headache.
THANK YOU THANK YOU THANK YOU Who ever posted the php function saved me.



function strip_invalid_xml_chars2( $in )
{
    $out = "";
    $length = strlen($in);
    for ( $i = 0; $i < $length; $i++)
    {
        $current = ord($in[$i]); // square brackets: the $in{$i} offset syntax was removed in PHP 8
        if ( ($current == 0x9) || ($current == 0xA) || ($current == 0xD) || (($current >= 0x20) && ($current <= 0xD7FF)) || (($current >= 0xE000) && ($current <= 0xFFFD)) || (($current >= 0x10000) && ($current <= 0x10FFFF)))
        {
            $out .= chr($current);
        }
        else
        {
            $out .= " ";
        }
    }
    return $out;
}

You can eliminate the for loop with regular expressions:

    $clean_string = preg_replace('/[^\x9\xA\xD\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u', '', $input_string);
Thank you!!! Saved me a lot of time and frustration with this function.
Thank you, I was indeed looking for such a solution for long
Just wanted to add my thanks for this - and add some google bait. I was seeing An invalid XML character (Unicode: 0xb) was found in the CDATA section for my XML, now fixed using this snippet. Dave
Hi Mark, Just wanted to say "Thanks!" I am pulling data from an AS400 and this saved the day for me!! Cheers! Bob
Thank you so much for sharing this code. That saved me on a major project I am working on.
Thanks for the solution, this saved me a lot of time!
Thanks, found your blog after javax.servlet.ServletException: javax.xml.transform.TransformerException: com.sun.org.apache.xml.internal.utils.WrappedRuntimeException: An invalid XML character (Unicode: 0x0) was found in the element content of the document.
Thanks. Such a big problem was solved in a few minutes. Thank you very much, dude.
Thanks!!
Thank you very much Mark
Thanks... This really helped!
tnx!
Hi Mark and everyone, Thank you so bloody much! Questions: Is the character 'range' up-to-date? And does it apply to CData Sections as well?
Hi Håkan,

I think the character ranges are correct according to the latest XML specification (Fourth edition, last edited 29 September 2006).

http://www.w3.org/TR/xml/#charsets

Also, CDATA sections are about storing strings of characters that are not to be treated as markup. I am almost certain that they should not contain out of range characters (as this would invalidate the entire XML document).

Mark, Thanks so much I have one more question. Can I use this range of characters for validation of XML documents not using UTF-8 as encoding (sorry if this is a newbie question)? The XML documents I deal with are client product feeds and may not be encoded in UTF-8.

I found this a very difficult question to answer. I find reading detailed specifications hard at the best of times! I have done some digging around and I am still not certain that I have the right answer. My guess is whatever character encoding you use, valid XML must only contain characters from the defined range. I base my guess on the following:

"Remember that character encodings, despite their name do not apply to characters - they apply to byte sequences which represent characters. If you have a char variable in Java it has no character encoding as far as you are concerned, it’s just that character."

However, if you do decide to opt for a character encoding other than UTF-8/UTF-16 you will have additional constraints in order to satisfy XML validity.

  • You must have an encoding declaration
  • All data must be encoded in the encoding named in the declaration
  • All byte sequences must be legal for that encoding
Thank you for this function, here is the PHP version:

function strip_invalid_xml_chars( $in )
{
    $out = ""; // Used to hold the output.

    if ( $in === null || $in === "" )
    {
        return ""; // vacancy test; empty() would also reject the string "0".
    }

    $length = strlen($in);
    for ( $i = 0; $i < $length; $i++)
    {
        $current = ord($in[$i]); // Used to reference the current character.
        if ( ($current == 0x9) || ($current == 0xA) || ($current == 0xD) || (($current >= 0x20) && ($current <= 0xD7FF)) || (($current >= 0xE000) && ($current <= 0xFFFD)) || (($current >= 0x10000) && ($current <= 0x10FFFF)))
        {
            $out .= chr($current);
        }
        else
        {
            $out .= " ";
        }
    }
    return $out;
}

Greetings, Thanks a lot Mark, I was able to fix an XML transformation problem by stripping all the non-valid XML characters.
Mil gracias.
If you just want to validate a string (and not replace the characters), you can do it easily with a regular expression:

public static boolean isValidXMLText(String xml) {
    boolean valid = true;

    // NOTE: the pattern describes UTF-8 byte sequences, so it only behaves as intended
    // when every char of the String holds a single byte value (0x00-0xFF).
    if (xml != null) {
        valid = xml.matches("^([\\x09\\x0A\\x0D\\x20-\\x7E]|" //# ASCII
            + "[\\xC2-\\xDF][\\x80-\\xBF]|" //# non-overlong 2-byte
            + "\\xE0[\\xA0-\\xBF][\\x80-\\xBF]|" //# excluding overlongs
            + "[\\xE1-\\xEC\\xEE\\xEF][\\x80-\\xBF]{2}|" //# straight 3-byte
            + "\\xED[\\x80-\\x9F][\\x80-\\xBF]|" //# excluding surrogates
            + "\\xF0[\\x90-\\xBF][\\x80-\\xBF]{2}|" //# planes 1-3
            + "[\\xF1-\\xF3][\\x80-\\xBF]{3}|" //# planes 4-15
            + "\\xF4[\\x80-\\x8F][\\x80-\\xBF]{2})*$"); //# plane 16
    }

    return valid;
}

(borrowed from Here)
Could you please help me write the regular expression in Java?
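
One way the regular expression could look in Java is sketched below. The class and method names are hypothetical. The pattern is a negated character class over the XML 1.0 Char ranges, and the supplementary range is written as two literal surrogate pairs (U+10000 and U+10FFFF) so that the pattern string itself contains supplementary characters. Treat it as a starting point rather than a tested library routine.

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class XmlRegexFilter {

    // Matches any single code point that XML 1.0 does NOT allow.
    // "\uD800\uDC00" is U+10000 and "\uDBFF\uDFFF" is U+10FFFF.
    private static final Pattern INVALID = Pattern.compile(
            "[^\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\uD800\uDC00-\uDBFF\uDFFF]");

    /** True if the string is non-null and contains only valid XML 1.0 characters. */
    static boolean isValid(String s) {
        return s != null && !INVALID.matcher(s).find();
    }

    /** Removes every invalid character; returns "" for null input. */
    static String strip(String s) {
        return s == null ? "" : INVALID.matcher(s).replaceAll("");
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the expression on every call, which is what String.replaceAll would do.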
Anybody ever done something like this over an entire file?
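
For a whole file, one option is to stream the characters through the same range test instead of loading everything into a String. The following is a minimal sketch with hypothetical names. It assumes the caller opens the Reader and Writer with the correct charset, and it holds back a high surrogate until its partner arrives so that pairs are never split.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;

// Hypothetical helper, for illustration only.
final class XmlStreamFilter {

    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Copies in to out, dropping invalid characters; returns how many code units were dropped. */
    static long copyFiltered(Reader in, Writer out) throws IOException {
        long dropped = 0;
        int pendingHigh = -1; // a high surrogate waiting for its low partner, or -1
        int c;
        while ((c = in.read()) != -1) {
            char ch = (char) c;
            if (pendingHigh != -1) {
                if (Character.isLowSurrogate(ch)) {
                    out.write(pendingHigh); // well-formed pair: always a valid XML character
                    out.write(ch);
                    pendingHigh = -1;
                    continue;
                }
                dropped++; // the pending high surrogate was unpaired
                pendingHigh = -1;
            }
            if (Character.isHighSurrogate(ch)) {
                pendingHigh = ch;
            } else if (isXmlChar(ch)) { // unpaired low surrogates fail this test
                out.write(ch);
            } else {
                dropped++;
            }
        }
        if (pendingHigh != -1) dropped++; // input ended on an unpaired high surrogate
        return dropped;
    }
}
```

In practice the Reader and Writer would be buffered, for example a BufferedReader and BufferedWriter opened with an explicit charset, since this loop reads one character at a time.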
I used the method, but could not get it to work. I still get $#19; an invalid XML char. I decoded the String using UTF-8, and str.charAt(i) returned $, #, 1, 9 and ; respectively. Individually, these are valid XML chars. Could you advise what I did wrong? Thanks
I'm working on doing that; I'll post it when I'm done.
In the case of marshalling binary files, the solution in the end is to encode the PDF stream in Base64. More information here: http://www.javaworld.com/javaworld/javatips/jw-javatip117.html In my case, I use sun.misc.BASE64Encoder and sun.misc.BASE64Decoder.
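
A small sketch of that idea follows, with hypothetical names. It uses java.util.Base64, which needs Java 8 or later, instead of the internal sun.misc classes mentioned above. The encoded text contains only letters, digits, '+', '/' and '=', all of which are valid XML characters, and decoding returns the original bytes unchanged.

```java
import java.util.Base64;

// Hypothetical helper, for illustration only.
final class BinaryPayload {

    /** Turns arbitrary bytes (e.g. a PDF) into text that is safe to place in an XML element. */
    static String encode(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    /** Recovers the original bytes from the Base64 text. */
    static byte[] decode(String text) {
        return Base64.getDecoder().decode(text);
    }
}
```

The important design point is that the binary data is carried as a byte array and never as a String of raw bytes, so there is nothing for a character filter to corrupt.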
Please note: your code is broken.
The condition "current <= 0x10FFFF" will not work because a char has only 16 bits.
Java stores Strings as UTF-16 and exposes this encoding in charAt(x).
This is at least a quirk in Java; for example, String.length() counts UTF-16 code units rather than characters.
Fixing the code would require decoding UTF-16.
Couldn't you also do this? We did not want to actually remove the characters just replace them. Feedback/corrections welcome:

    public String stripNonValidXMLCharacters(String in) {
        StringBuffer out = new StringBuffer(); // Used to hold the output.
        char current; // Used to reference the current character.

        if (in == null || ("".equals(in))) {
            return "";
        } // vacancy test.
        for (int i = 0; i < in.length(); i++) {
            current = in.charAt(i); // NOTE: No IndexOutOfBoundsException caught here; it should not happen.
            if ((current == 0x9) ||
                (current == 0xA) ||
                (current == 0xD) ||
                ((current >= 0x20) && (current <= 0xD7FF)) ||
                ((current >= 0xE000) && (current <= 0xFFFD)) ||
                ((current >= 0x10000) && (current <= 0x10FFFF))) {
                out.append(current);
            } else {
                out.append("_");
            }
        }

        return out.toString();
    }

Thank you so much. I had a huge data set that we converted into XML, and about 300 XML files failed due to invalid chars in them. I was trying to code some method to do this, but what you have is much better than what I had in mind. THANK YOU!!!
Hi, thanks for your solution. (I apologize, I do not speak very good English.) I have the same problem; I use JAXB and JAX-WS. I have a web service that builds an Adobe PDF file and forwards it in a String object, and a service client that rebuilds this PDF file on the other side. When the web service client unmarshals the response stream, it generates this exception: An invalid XML character (Unicode: 0x2) was found in the element content of the document. The PDF document stored in the String object contains characters that are valid for the PDF reader but invalid for the XML parser. When I use your solution, the invalid characters are removed and the web service works correctly, but the PDF I recover after calling my service is corrupted, because the removed characters are needed to read the PDF properly. I tried multiple ways of converting the String object to UTF-8 encoding, but nothing works; those characters are always present and obviously necessary. Is there a way to replace (not remove) the invalid characters in the original String so that it passes the XML parser, and then restore them after parsing to rebuild a message identical to the original? Thanks for your help.
Works like a charm... This was a real time saver for me. Thanks
Good piece of code! Had problems getting MSXML to parse some apparently valid characters, but it seems they weren't. Cheers.
Mark, I owe you much beer. Can I use this function (while giving you credit) in my AntiSamy project? It's published under the BSD license. Cheers, Arshan
Arshan, please use it freely with my blessing. (it wasn't my code originally anyhow, I just reproduced it in this context). Mark
Absolutely brilliant! Saved me from a big headache! Thanks!
Very nice, it fixed some problems we have over here. Here is a Delphi version:

function ValidXmlString(Value :  wideString) : WideString;
var
  NewLen,
  idx     : integer;
  CurChar : word;
begin
  // initialize
  Result := '';
  NewLen := 0;

  // check for empty string
  if Length(Value) = 0 then
     exit;

  // init length of result
  Result := stringofchar(' ',length(Value));

  // loop and check valid xml characters
  for idx := 1 to length(Value) do
  begin
    CurChar := ord(Value[idx]);
    if (CurChar   = $9)     or  (CurChar =  $A)       or (CurChar = $D) or
       ((CurChar >= $20)    and (CurChar <= $D7FF))   or
       ((CurChar >= $E000)  and (CurChar <= $FFFD))   or
       ((CurChar >= $10000) and (CurChar <= $10FFFF)) then
    begin
       inc(NewLen);
       Result[NewLen] := Value[idx];
    end;
  end;

  // adjust size of result
  Result := copy(Result,1,NewLen);
end; // ValidXmlString

Thank you very much!
I am using the service in my code, but I am not able to test it :( Please post if somebody has tested the service before. The service will trim 0x13; if somebody has this character, please post.
Thank you very much, I am sure you have saved at least 10 hours of our time. Keep posting this kind of useful stuff.
If you just want to get rid of control characters you can use a regex. In C#: xml = Regex.Replace(xml, "&\\#x(?:0[0-8BCEF]|1[0-9A-F]);", "");
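
A rough Java counterpart is sketched below, with hypothetical names. It goes a little further than the C# one-liner: it looks at decimal as well as hexadecimal numeric character references, decodes each one, and removes only those whose value is not a valid XML 1.0 character. References with more digits than the pattern allows are left untouched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class XmlCharRefFilter {

    // Group 1: hexadecimal digits of &#x...;   Group 2: decimal digits of &#...;
    private static final Pattern CHAR_REF =
            Pattern.compile("&#(?:x([0-9A-Fa-f]{1,8})|([0-9]{1,10}));");

    static boolean isXmlChar(long cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Removes numeric character references that point at invalid XML characters. */
    static String stripInvalidCharRefs(String xml) {
        Matcher m = CHAR_REF.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            long cp = m.group(1) != null
                    ? Long.parseLong(m.group(1), 16)
                    : Long.parseLong(m.group(2));
            // Keep valid references verbatim; replace invalid ones with nothing.
            m.appendReplacement(out, isXmlChar(cp) ? Matcher.quoteReplacement(m.group()) : "");
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

This complements the character filter from the post: that one removes raw invalid characters, while this one removes escaped references to them, which a parser rejects just the same.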
More than one year later and the solution you've posted is still up to date! Thanks a lot for the code Mark. It certainly saved me a lot of headaches! Cheers, Ricardo
Thanks for the help!
Thanks a lot!

function stripNonValidXMLCharacters(in)
{
    var current;
    var out = "";

    for (var i = 0; i < in.length; i++)
    {
        current = in.charCodeAt(i);

        if ((current == 0x9) ||
            (current == 0xA) ||
            (current == 0xD) ||
            ((current >= 0x20) && (current <= 0xD7FF)) ||
            ((current >= 0xE000) && (current <= 0xFFFD)) ||
            ((current >= 0x10000) && (current <= 0x10FFFF)))
            out += in.charAt(i);
    }

    return out;
}

Thanks for helping me identify this problem.
hey Leni.. isn't 'in' a reserved keyword in javascript ?
"in" is a reserved word in JavaScript!
Thanks a Lot. I saved a lot of time and it worked.otherwise, I would have wasted so much of my time trying to find a solution for this problem Thanks once again
Thanks very much Mark!
Anyone know why XML parsers or deserialization objects throw when an invalid character is found? Just curious about the motivation for throwing an exception and not ignoring it.
Hi Bock,

Good question! I will try to answer it! In the above error the XML parser is correctly reporting that this data does not conform to the strict guidelines of the XML specification. It is the parser's job to report any non-compliance to any extent (even if it seems minor). However, in my particular case it is sufficient to discard the non-standard characters and continue working with the data BUT in another situation this course of action might not be appropriate.

Mark
Thanks very much Mark!
Saved me a bunch of time. Thanks!
Thanks a lot.
Doesn't filter this out: § The double S looking character breaks an XML document.
Or the Scala Version (I'm a scala noob, so don't rib me too hard)...

class ValidXmlChar(chr:Char) {
    def isValidXmlChr() = {
        ((chr == 0x9) ||
         (chr == 0xA) ||
         (chr == 0xD) ||
         ((chr >= 0x20) && (chr <= 0xD7FF)) ||
         ((chr >= 0xE000) && (chr <= 0xFFFD)) ||
         ((chr >= 0x10000) && (chr <= 0x10FFFF)))
    }
}

object ValidXmlChar {
    implicit def char2ValidXMLChar(chr:Char) = new ValidXmlChar(chr)
}

import ValidXmlChar.{_}

object Main {

    def main(args: Array[String]) :Unit = {
        // Assuming args(0) is a string you want to clean
        println(stripNonValidXMLCharacters(args(0)))
    }

    def stripNonValidXMLCharacters(s:String) = {
        val sb = new StringBuffer()
        s.filter(_.isValidXmlChr).foreach(sb.append(_))
        sb
    }

}

Thanks a lot
worked perfectly! Thanks
Thanks a lot! This function is very useful
Thanks a ton! This code snippet helped us big time!!
Very efficient, thanks a mill
I appreciate your help, I was having the very same problem with JAXB when unmarshalling objects. Thx a lot.
Saving me too, thanks a lot :D
Thank you!!!!!!!