I was working on a Java application recently when I got the following exception:
org.xml.sax.SAXParseException: An invalid XML character (Unicode: 0x1a) was found in the element content of the document.
I was using Castor in an attempt to unmarshal (XML->Java Objects) an XML string. The XML originated from a bookmarks file that was uploaded into the application by the user, tidied up, transformed and then stored in an Oracle database. The error occurred at the point where the XML string was returned from the database and was being converted into Java objects.
I immediately assumed it to be some kind of character encoding conversion problem.
With some databases (e.g. MySQL) it is possible to have data passed as UTF-8 by adding certain settings to the JDBC URL. I do not use MySQL so I do not know whether this would fix my problem, although I suspect it would not.
I tried various methods to convert my XML string into valid UTF-8, and I was pretty sure that I had achieved a satisfactory conversion, but I still got the error.
This is when I discovered that not all valid UTF-8 characters are valid XML characters, which probably makes sense (what with control characters and such), but I had never had to think about this before.
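To make the distinction concrete, here is a minimal sketch that assumes the parser bundled with the JDK (not Castor). The class and method names are made up for illustration. The character U+001A encodes to UTF-8 without complaint, yet the XML parser still rejects the document.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

// Hypothetical demo class, not part of the original application.
class InvalidXmlCharDemo {

    /** Returns true if the text, encoded as UTF-8, parses as an XML document. */
    static boolean parses(String xml) {
        // Encoding U+001A as UTF-8 is no problem: it becomes the single byte 0x1A.
        byte[] utf8 = xml.getBytes(StandardCharsets.UTF_8);
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(utf8));
            return true;
        } catch (SAXException e) {
            // The parser rejected the document (for example, an invalid XML character).
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String clean = "<a>bookmark</a>";
        String dirty = "<a>book" + (char) 0x1A + "mark</a>";
        System.out.println("clean parses: " + parses(clean));
        System.out.println("dirty parses: " + parses(dirty));
    }
}
```

The default error handler may also print the parser's own message to standard error for the second document.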
After spending several hours messing with numerous UTF-8 conversion techniques I eventually found a solution on the Xalan mailing list. I am reproducing it here because it was not mentioned in the context of the "Unicode: 0x1a" error; if it had been, I would have found it more quickly. The XML standard specifies which Unicode characters are valid in XML documents, so it is possible to take a string and filter out all the invalid characters using a method like this:
/**
 * This method ensures that the output String has only
 * valid XML unicode characters as specified by the
 * XML 1.0 standard. For reference, please see
 * <a href="http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char">the
 * standard</a>. This method will return an empty
 * String if the input is null or empty.
 *
 * @param in The String whose non-valid characters we want to remove.
 * @return The in String, stripped of non-valid characters.
 */
public String stripNonValidXMLCharacters(String in) {
    StringBuffer out = new StringBuffer(); // Used to hold the output.
    char current; // Used to reference the current character.

    if (in == null || ("".equals(in))) return ""; // vacancy test.
    for (int i = 0; i < in.length(); i++) {
        current = in.charAt(i); // NOTE: No IndexOutOfBoundsException caught here; it should not happen.
        if ((current == 0x9) ||
            (current == 0xA) ||
            (current == 0xD) ||
            ((current >= 0x20) && (current <= 0xD7FF)) ||
            ((current >= 0xE000) && (current <= 0xFFFD)) ||
            ((current >= 0x10000) && (current <= 0x10FFFF)))
            out.append(current);
    }
    return out.toString();
}
I hope somebody else finds this useful and that it saves them a few hours of head scratching. Alternatively, if people reading this know of a better solution then please do let me know!
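One possible alternative, offered as a sketch rather than a definitive answer, is to express the same character ranges as a regular expression. It assumes Java 7 or later, where java.util.regex accepts the \x{...} code point syntax and matches by code point rather than by char. The class name XmlCharFilter is hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the ranges come from the XML 1.0 Char production:
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
class XmlCharFilter {

    // Matches any single code point that is NOT in the ranges above.
    private static final Pattern INVALID_XML_CHARS = Pattern.compile(
            "[^\\x09\\x0A\\x0D\\x20-\\x{D7FF}\\x{E000}-\\x{FFFD}\\x{10000}-\\x{10FFFF}]");

    /** Returns the input without invalid XML characters; "" for null or empty input. */
    static String strip(String in) {
        if (in == null || in.isEmpty()) {
            return "";
        }
        return INVALID_XML_CHARS.matcher(in).replaceAll("");
    }

    public static void main(String[] args) {
        String dirty = "book" + (char) 0x1A + "mark";
        System.out.println(strip(dirty));
    }
}
```

Because the pattern is compiled once and reused, this should be no more work per call than the hand-written loop, though I have not measured it.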
/**
 * This method ensures that the output String has only valid XML unicode characters as specified by the
 * XML 1.0 standard. For reference, please see the
 * standard. This method will return an empty String if the input is null or empty.
 *
 * @author Donoiu Cristian, GPL
 * @param s The String whose non-valid characters we want to remove.
 * @return The in String, stripped of non-valid characters.
 */
public static String removeInvalidXMLCharacters(String s) {
    if (s == null || s.isEmpty()) return ""; // vacancy test, as promised in the Javadoc.
    StringBuilder out = new StringBuilder(); // Used to hold the output.
    int codePoint; // Used to reference the current character.
    //String ss = "\ud801\udc00"; // This is actually one unicode character, represented by two code units!
    //System.out.println(ss.codePointCount(0, ss.length())); // Prints: 1
    int i = 0;
    while (i < s.length()) {
        codePoint = s.codePointAt(i); // This is the unicode code of the character.
        if ((codePoint == 0x9) || // Consider testing larger ranges first to improve speed.
            (codePoint == 0xA) ||
            (codePoint == 0xD) ||
            ((codePoint >= 0x20) && (codePoint <= 0xD7FF)) ||
            ((codePoint >= 0xE000) && (codePoint <= 0xFFFD)) ||
            ((codePoint >= 0x10000) && (codePoint <= 0x10FFFF))) {
            out.append(Character.toChars(codePoint));
        }
        i += Character.charCount(codePoint); // Increment with the number of code units (java chars) needed to represent a Unicode char.
    }
    return out.toString();
}
Hope it helps.
public static string stripNonValidXMLCharacters(string s)
{
    if (string.IsNullOrEmpty(s)) return string.Empty; // vacancy test; must come before s is dereferenced.
    StringBuilder _validXML = new StringBuilder(s.Length, s.Length); // Used to hold the output.
    char current; // Used to reference the current character.
    char[] charArray = s.ToCharArray();
    for (int i = 0; i < charArray.Length; i++)
    {
        current = charArray[i]; // NOTE: No IndexOutOfRangeException caught here; it should not happen.
        if ((current == 0x9) ||
            (current == 0xA) ||
            (current == 0xD) ||
            ((current >= 0x20) && (current <= 0xD7FF)) ||
            ((current >= 0xE000) && (current <= 0xFFFD)) ||
            ((current >= 0x10000) && (current <= 0x10FFFF)))
            _validXML.Append(current);
    }
    return _validXML.ToString();
}
function strip_invalid_xml_chars2( $in )
{
    $out = "";
    $length = strlen($in);
    for ( $i = 0; $i < $length; $i++)
    {
        // ord() returns a single byte value (0-255), so this loop inspects bytes, not code points.
        $current = ord($in[$i]); // $in{$i} was removed in PHP 8; use square brackets.
        if ( ($current == 0x9) || ($current == 0xA) || ($current == 0xD) || (($current >= 0x20) && ($current <= 0xD7FF)) || (($current >= 0xE000) && ($current <= 0xFFFD)) || (($current >= 0x10000) && ($current <= 0x10FFFF)))
        {
            $out .= chr($current);
        }
        else
        {
            $out .= " ";
        }
    }
    return $out;
}
I think the character ranges are correct according to the latest XML specification (Fourth edition, last edited 29 September 2006).
http://www.w3.org/TR/xml/#charsets
Also, CDATA sections are about storing strings of characters that are not to be treated as markup. I am almost certain that they should not contain out of range characters (as this would invalidate the entire XML document).
I found this a very difficult question to answer. I find reading detailed specifications hard at the best of times! I have done some digging around and I am still not certain that I have the right answer. My guess is whatever character encoding you use, valid XML must only contain characters from the defined range. I base my guess on the following:
However, if you do decide to opt for a character encoding other than UTF-8/UTF-16 you will have additional constraints in order to satisfy XML validity.
- You must have an encoding declaration
- All data must be encoded in the encoding named in the declaration
- All byte sequences must be legal for that encoding
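The first two constraints can be illustrated with a small sketch. It assumes the JDK's built-in DOM parser and uses ISO-8859-1 purely as an example of a non-UTF encoding. The class and method names are hypothetical. The same bytes are parsed once with an encoding declaration and once without.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

// Hypothetical demo class for the encoding-declaration constraints.
class EncodingDeclarationDemo {

    /** Encodes the text as ISO-8859-1 bytes. */
    static byte[] latin1(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    /** Returns the root element's text, or null if the parser rejects the bytes. */
    static String rootText(byte[] document) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(document))
                    .getDocumentElement()
                    .getTextContent();
        } catch (SAXException | IOException e) {
            // Either a well-formedness error or an illegal byte sequence.
            return null;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String declaration = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>";
        String body = "<a>caf" + (char) 0xE9 + "</a>"; // U+00E9 is the single byte 0xE9 in ISO-8859-1

        // Declared and encoded consistently: the parser decodes 0xE9 correctly.
        System.out.println(rootText(latin1(declaration + body)));

        // No declaration: the parser assumes UTF-8, where 0xE9 followed by '<' is illegal.
        System.out.println(rootText(latin1(body)));
    }
}
```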
public static boolean iSValidXMLText(String xml) {
    // NOTE: these ranges describe UTF-8 byte sequences. String.matches() works on
    // UTF-16 chars, so the pattern only behaves as intended if each char in xml
    // holds one raw byte (e.g. bytes that were decoded as ISO-8859-1).
    boolean valid = true;
    if (xml != null) {
        valid = xml.matches("^([\\x09\\x0A\\x0D\\x20-\\x7E]|" //# ASCII
            + "[\\xC2-\\xDF][\\x80-\\xBF]|" //# non-overlong 2-byte
            + "\\xE0[\\xA0-\\xBF][\\x80-\\xBF]|" //# excluding overlongs
            + "[\\xE1-\\xEC\\xEE\\xEF][\\x80-\\xBF]{2}|" //# straight 3-byte
            + "\\xED[\\x80-\\x9F][\\x80-\\xBF]|" //# excluding surrogates
            + "\\xF0[\\x90-\\xBF][\\x80-\\xBF]{2}|" //# planes 1-3
            + "[\\xF1-\\xF3][\\x80-\\xBF]{3}|" //# planes 4-15
            + "\\xF4[\\x80-\\x8F][\\x80-\\xBF]{2})*$"); //# plane 16
    }
    return valid;
}
(borrowed from Here)
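For a String that has already been decoded, a check on code points may be simpler than a byte-oriented pattern. The following is a sketch only, assuming Java 8 or later for String.codePoints(). The class and method names are made up. Like the method above, it treats null as valid.

```java
// Hypothetical validator working on decoded text rather than on UTF-8 bytes.
class XmlTextValidator {

    /** True if the code point is in the XML 1.0 Char production. */
    static boolean isXmlChar(int codePoint) {
        return codePoint == 0x9
                || codePoint == 0xA
                || codePoint == 0xD
                || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    /** True if every code point of the text is a valid XML character. */
    static boolean isValidXmlText(String text) {
        // codePoints() joins surrogate pairs; an unpaired surrogate is reported
        // as its own value in 0xD800-0xDFFF and therefore fails the check.
        return text == null || text.codePoints().allMatch(XmlTextValidator::isXmlChar);
    }

    public static void main(String[] args) {
        System.out.println(isValidXmlText("plain text"));
        System.out.println(isValidXmlText("bad" + (char) 0x1A));
    }
}
```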
The condition "current <= 0x10FFFF" will never be true, because a Java char is only 16 bits wide.
Java stores Strings as UTF-16 and exposes this encoding through charAt(x).
This is a known quirk of Java: for example, String.length() counts UTF-16 code units, so it returns too much for text containing characters above 0xFFFF.
Fixing the code would require decoding the UTF-16 surrogate pairs.
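The quirk is easy to see with a single character above 0xFFFF. This is a small sketch using only standard String and Character methods. The class name and the choice of U+10400 as the sample character are mine.

```java
// Hypothetical demo of how Java exposes UTF-16 code units through charAt().
class Utf16QuirkDemo {

    /** Summarises how a non-empty string looks as chars and as code points. */
    static String describe(String s) {
        return String.format(
                "length=%d codePoints=%d firstChar=0x%X firstCodePoint=0x%X",
                s.length(),                       // UTF-16 code units
                s.codePointCount(0, s.length()),  // actual Unicode characters
                (int) s.charAt(0),                // first code unit (a surrogate here)
                s.codePointAt(0));                // decoded code point
    }

    public static void main(String[] args) {
        // U+10400 is one character, stored as the surrogate pair D801 DC00.
        String s = new String(Character.toChars(0x10400));
        System.out.println(describe(s));
        // prints: length=2 codePoints=1 firstChar=0xD801 firstCodePoint=0x10400
    }
}
```

Since 0xD801 and 0xDC00 both fall in the gap between 0xD7FF and 0xE000, a filter that tests each char on its own drops both halves of the pair, even though U+10400 itself is a valid XML character.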
public String stripNonValidXMLCharacters(String in) {
    StringBuffer out = new StringBuffer(); // Used to hold the output.
    char current; // Used to reference the current character.
    if (in == null || ("".equals(in))) {
        return "";
    } // vacancy test.
    for (int i = 0; i < in.length(); i++) {
        current = in.charAt(i); // NOTE: No IndexOutOfBoundsException caught here; it should not happen.
        if ((current == 0x9) ||
            (current == 0xA) ||
            (current == 0xD) ||
            ((current >= 0x20) && (current <= 0xD7FF)) ||
            ((current >= 0xE000) && (current <= 0xFFFD)) ||
            ((current >= 0x10000) && (current <= 0x10FFFF))) {
            out.append(current);
        } else {
            out.append("_");
        }
    }
    return out.toString();
}
function ValidXmlString(Value : WideString) : WideString;
var
  NewLen,
  idx : integer;
  CurChar : word;
begin
  // initialize
  Result := '';
  NewLen := 0;
  // check for empty string
  if Length(Value) = 0 then
    exit;
  // init length of result
  Result := stringofchar(' ',length(Value));
  // loop and check valid xml characters
  for idx := 1 to length(Value) do
  begin
    CurChar := ord(Value[idx]);
    if (CurChar = $9) or (CurChar = $A) or (CurChar = $D) or
       ((CurChar >= $20) and (CurChar <= $D7FF)) or
       ((CurChar >= $E000) and (CurChar <= $FFFD)) or
       ((CurChar >= $10000) and (CurChar <= $10FFFF)) then
    begin
      inc(NewLen);
      Result[NewLen] := Value[idx];
    end;
  end;
  // adjust size of result
  Result := copy(Result,1,NewLen);
end; // ValidXmlString
function stripNonValidXMLCharacters(input) // "in" is a reserved word in JavaScript, so the parameter is named "input".
{
    var current;
    var out = "";
    for (var i = 0; i < input.length; i++)
    {
        current = input.charCodeAt(i);
        if ((current == 0x9) ||
            (current == 0xA) ||
            (current == 0xD) ||
            ((current >= 0x20) && (current <= 0xD7FF)) ||
            ((current >= 0xE000) && (current <= 0xFFFD)) ||
            ((current >= 0x10000) && (current <= 0x10FFFF)))
            out += input.charAt(i);
    }
    return out;
}
Good question! I will try to answer it! In the above error the XML parser is correctly reporting that this data does not conform to the strict guidelines of the XML specification. It is the parser's job to report any non-compliance to any extent (even if it seems minor). However, in my particular case it is sufficient to discard the non-standard characters and continue working with the data BUT in another situation this course of action might not be appropriate.
Mark
class ValidXmlChar(chr:Char) {
  def isValidXmlChr() = {
    ((chr == 0x9) ||
     (chr == 0xA) ||
     (chr == 0xD) ||
     ((chr >= 0x20) && (chr <= 0xD7FF)) ||
     ((chr >= 0xE000) && (chr <= 0xFFFD)) ||
     ((chr >= 0x10000) && (chr <= 0x10FFFF)))
  }
}
object ValidXmlChar {
  implicit def char2ValidXMLChar(chr:Char) = new ValidXmlChar(chr)
}
import ValidXmlChar.{_}
object Main {
  def main(args: Array[String]) :Unit = {
    // Assuming args(0) is a string you want to clean
    println(stripNonValidXMLCharacters(args(0)))
  }
  def stripNonValidXMLCharacters(s:String) = {
    val sb = new StringBuffer()
    s.filter(_.isValidXmlChr).foreach(sb.append(_))
    sb
  }
}





