RssFeed.Create(): hexadecimal value 0x1F, is an invalid character

Topics: Argotic.Core
Aug 3, 2007 at 1:39 AM
When I call:

feed = RssFeed.Create(feedUri)

I'm getting the following error thrown:

A first chance exception of type 'System.Xml.XmlException' occurred in Argotic.Core.dll
ERROR: '', hexadecimal value 0x1F, is an invalid character. Line 1, position 1.

but only intermittently, on some feeds:

http://feeds.feedburner.com/aprclive
http://feeds.feedburner.com/OTN_TechCasts
http://feeds.feedburner.com/TopOfThePods
http://feeds.gigavox.com/gigavox/channel/podcastacademy

The problem generally seems to be with Feedburner feeds, and is very intermittent. A feed that doesn't work today may work tomorrow.
Coordinator
Aug 6, 2007 at 11:52 PM
Bruce,

This typically occurs when non-UTF-8 characters are used in a UTF-8 encoded feed. I had thought I had fixed the framework to convert or strip invalid characters, but obviously I missed something. The next time you see this exception, could you extract the raw XML of the feed that raised it and either email it to me or post it on this thread? I will then try to get a fix for this issue.
Developer
Jan 20, 2008 at 6:58 AM
Hi Oppositional,

I have the latest version and am experiencing the same problem on a lot of feeds (again, especially Feedburner).


http://google-latlong.blogspot.com/feeds/posts/default


is an example where the error occurs.


Thanks,

Gary
Coordinator
Feb 20, 2008 at 10:30 PM
Created work item "Feeds that contain invalid hexadecimal characters fail to load" to ensure this issue is tested and fixed in the next release of the framework.
Coordinator
Feb 27, 2008 at 11:05 PM
Edited Feb 27, 2008 at 11:05 PM
Dear Bruce,

It would be extremely helpful if you could save the actual XML of a feed that is failing and post it here so that I can validate the fix. I will attempt to mock up a sample feed with bad character data as well. Currently all feeds listed above load fine with the CTP bits.
Feb 28, 2008 at 12:19 AM
Edited Feb 28, 2008 at 12:21 AM
Hi Oppositional,

I've not seen this problem for months now. I don't know what has changed, but my aggregator software (and hence the Argotic library) definitely has not, and I check these feeds every day using Argotic.

Bruce.
Coordinator
Feb 28, 2008 at 1:51 PM
The ephemeral nature of this issue is frustrating, but please let me know if you encounter it again.
Developer
Apr 5, 2008 at 5:22 AM
Hi Oppositional,

The latest 2008 release of the framework still has this bug, but I think I've fixed it.

Some RSS feeds come back gzip-compressed even if you don't ask for it, so the parser fails when it tries to load the compressed bytes.

I changed SyndicationEncodingUtility.cs at line 252 to the following, and it appears to have resolved it:


{
    // Requires: using System.IO; using System.IO.Compression; using System.Net;
    using (HttpWebResponse response = (HttpWebResponse)webRequest.GetResponse())
    {
        Stream responseStream = response.GetResponseStream();

        // Some servers compress the body even when the request did not ask for it,
        // so wrap the stream according to the Content-Encoding response header.
        string contentEncoding = (response.ContentEncoding ?? String.Empty).ToLowerInvariant();
        if (contentEncoding.Contains("gzip"))
        {
            responseStream = new GZipStream(responseStream, CompressionMode.Decompress);
        }
        else if (contentEncoding.Contains("deflate"))
        {
            responseStream = new DeflateStream(responseStream, CompressionMode.Decompress);
        }

        if (encoding != null)
        {
            return SyndicationEncodingUtility.CreateSafeNavigator(responseStream, encoding);
        }
        else
        {
            return SyndicationEncodingUtility.CreateSafeNavigator(responseStream);
        }
    }
}
Coordinator
Apr 6, 2008 at 4:34 PM
garazy,

I think you have just provided the missing piece to the puzzle that is this issue. Thanks! I had been banging my head on this, and hadn't considered gzip compressed resources as the cause. I will implement handling for the case when the resource is gzip compressed, and will open a work item for this issue.
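
The gzip explanation also fits the exact text of the original exception: a gzip stream always begins with the two bytes 0x1F 0x8B, so an XML parser handed compressed data fails on 0x1F at line 1, position 1. Below is a minimal sketch of checking for that signature before parsing, using only System.IO.Compression. The class and method names are hypothetical, and this is not the Argotic implementation.

```csharp
using System;
using System.IO;
using System.IO.Compression;

public static class GzipSniffer
{
    // Returns a stream of decompressed bytes when the payload starts with the
    // gzip magic number (0x1F 0x8B); otherwise returns the payload unchanged.
    public static Stream OpenPossiblyGzipped(Stream source)
    {
        // Buffer the payload so the first two bytes can be inspected even
        // when the source (for example a network stream) cannot seek.
        MemoryStream buffer = new MemoryStream();
        source.CopyTo(buffer);
        buffer.Position = 0;

        bool isGzip = buffer.Length >= 2
            && buffer.ReadByte() == 0x1F
            && buffer.ReadByte() == 0x8B;
        buffer.Position = 0;

        return isGzip
            ? (Stream)new GZipStream(buffer, CompressionMode.Decompress)
            : buffer;
    }
}
```

The Content-Encoding check in the patch above remains the primary signal, since deflate data has no comparable signature. For requests made with HttpWebRequest, setting AutomaticDecompression to DecompressionMethods.GZip | DecompressionMethods.Deflate is another option, as it lets the framework decompress the response itself.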
Jun 17, 2008 at 12:06 PM
Edited Jun 23, 2008 at 11:59 AM
Hey,
I'm Jan and I'm making a website where the content comes from RSS.
I have release 2008.0.1.0 of Argotic and I still have the problem that some characters are read as a "rectangle" and saved in the SQL database as "?". We also sometimes get Arabic and other special characters like Â.

Maybe it's useful to say that we are from Belgium and all our RSS items are in Dutch (Nederlands).

I looked into the source code of our Argotic version and found that garazy's solution (for the gzip problem) has been applied:
    if (contentEncoding.Contains("GZIP"))
    {
        stream = new GZipStream(response.GetResponseStream(), CompressionMode.Decompress);
    }



It's really difficult to give you an example because the RSS items change too much.
URL example on 23 June 2008: http://www.site.kifkif.be/kifkif/rss.php?cat_id=4&page_class=three&open_menu_id=24
This item from the RSS feed, "De controverse rond l’affaire Guigue", gets read as "De controverse rond l[]affaire Guigue" (I use [] for the rectangle, because I can't copy and paste the rectangle character in this text editor) and is saved in the SQL database as "De controverse rond l?affaire Guigue".

A part of the html source: 
<?xml version="1.0" encoding="iso-8859-1" ?><rss version="2.00" xmlns:msxsl="urn:schemas-microsoft-com:xslt">
...
<div style="clear: both;"/><div class="entry"><h3><a href="http://site.kifkif.be/kifkif/nieuws.php?nws_id=1729&amp;page_class=three&amp;open_menu_id=24">De controverse rond l’affaire Guigue</a>


Has anyone found a working solution for this problem?
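
One possible explanation for the rectangle, offered as a guess rather than a confirmed diagnosis: the feed declares iso-8859-1, but the curly apostrophe was written as the Windows-1252 byte 0x92. ISO-8859-1 has no printable character at 0x92, so decoding according to the declaration yields the control character U+0092, which most fonts draw as a box. The sketch below reproduces that with hard-coded bytes; the class and method names are hypothetical.

```csharp
using System;
using System.Text;

public static class DeclaredEncodingDemo
{
    // Decodes bytes as ISO-8859-1, which maps every byte value to the
    // Unicode code point with the same number.
    public static string DecodeAsLatin1(byte[] raw)
    {
        return Encoding.GetEncoding("iso-8859-1").GetString(raw);
    }

    // True when the byte 0x92 (the right single quotation mark in
    // Windows-1252) comes out of ISO-8859-1 decoding as a control character.
    public static bool QuoteByteBecomesControlChar()
    {
        byte[] raw = { (byte)'l', 0x92, (byte)'a' };
        string decoded = DecodeAsLatin1(raw);
        return decoded[1] == '\u0092' && char.IsControl(decoded[1]);
    }
}
```

If that is what is happening, decoding the same bytes as Windows-1252 instead would give the expected ’ (U+2019).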
Coordinator
Jul 1, 2008 at 6:04 PM
The framework will preprocess feed data to remove common issues like invalid hexadecimal characters in XML. It does an inspection of the encoding attribute of the XML document
to determine which encoding to use when processing the stream data. If the feed content is encoded in a format that is different than that specified in the XML document, you can pass a
SyndicationResourceLoadSettings instance as a parameter that allows you to specify the correct encoding for the feed.

Hope this helps. If not, please open a new issue and, if possible, provide sample XML data to test the issue against.
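
For illustration, removing characters that XML 1.0 does not allow can be sketched as below. This is a minimal sketch under the XML 1.0 character rules, not the framework's actual preprocessing code, and the class and method names are hypothetical.

```csharp
using System;
using System.Text;

public static class XmlCharFilter
{
    // XML 1.0 allows #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD and
    // #x10000-#x10FFFF (the last range appears as surrogate pairs in UTF-16).
    public static string StripInvalidXmlChars(string text)
    {
        StringBuilder clean = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (i + 1 < text.Length && char.IsSurrogatePair(c, text[i + 1]))
            {
                // A valid surrogate pair encodes a character above U+FFFF.
                clean.Append(c).Append(text[i + 1]);
                i++;
            }
            else if (c == '\t' || c == '\n' || c == '\r'
                || (c >= '\u0020' && c <= '\uD7FF')
                || (c >= '\uE000' && c <= '\uFFFD'))
            {
                clean.Append(c);
            }
            // Anything else (control characters, lone surrogates) is dropped.
        }
        return clean.ToString();
    }
}
```

A filter like this only helps with genuine stray control characters in decoded text. It cannot rescue a gzip-compressed payload or text decoded with the wrong encoding, because by then the damage is no longer limited to individual characters.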
Jul 16, 2008 at 7:40 AM
Edited Jul 16, 2008 at 7:45 AM
Thank you for the reply.
I have applied your solution and set the CharacterEncoding to Default (Unicode gives an error).

            Dim settings As New SyndicationResourceLoadSettings
            settings.CharacterEncoding = System.Text.Encoding.Default

            ' No need to create an empty RssFeed first; Create returns the loaded feed.
            Dim feed As Argotic.Syndication.RssFeed
            feed = Argotic.Syndication.RssFeed.Create(New Uri(rssItem.Hyperlink), settings)


This resolves the rectangle and ? character conversion problem.
But now we have problems with other characters: é is translated to Ã© and ë to Ã«.

                 ?John Mayer wijst huwelijksaanzoek af?  (before)
                 ’John Mayer wijst huwelijksaanzoek af’   (after)

example:   Meintjes café nog één keer open    (this is before we applied the solution with the settings)
                Meintjes cafÃ© nog Ã©Ã©n keer open        (this is with the settings at the default CharacterEncoding)

                Oud België    (before)
                Oud BelgiÃ«  (after)

Is there a way to autodetect the character encoding and read the feed in that encoding?
Because we have a very large number of feeds (around 1500), it's difficult to get everything right with the same setting.
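
An accented letter turning into two odd characters is the usual signature of UTF-8 bytes being decoded with a single-byte encoding, which would happen if some of the feeds are UTF-8 while the setting forces a single-byte code page for all of them. One common heuristic, sketched below as an assumption rather than anything Argotic provides, is to try strict UTF-8 first and fall back to a single-byte encoding only when the bytes are not valid UTF-8. The class and method names are hypothetical.

```csharp
using System;
using System.Text;

public static class FeedTextDecoder
{
    // Decodes as UTF-8 when the bytes are valid UTF-8; otherwise falls back
    // to the supplied encoding (for example ISO-8859-1).
    public static string Decode(byte[] data, Encoding fallback)
    {
        // throwOnInvalidBytes: true makes malformed UTF-8 raise an exception
        // instead of silently producing U+FFFD replacement characters.
        UTF8Encoding strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            return strictUtf8.GetString(data);
        }
        catch (DecoderFallbackException)
        {
            return fallback.GetString(data);
        }
    }

    // Shows the failure mode: UTF-8 bytes read as ISO-8859-1.
    public static string MisdecodeUtf8AsLatin1(string text)
    {
        byte[] utf8 = Encoding.UTF8.GetBytes(text);
        return Encoding.GetEncoding("iso-8859-1").GetString(utf8);
    }
}
```

With this sketch, MisdecodeUtf8AsLatin1 turns "café" into "cafÃ©", while Decode returns "café" for both the UTF-8 and the ISO-8859-1 byte sequences of that word. The heuristic works because text containing accented letters in a single-byte encoding is very rarely valid UTF-8, but it is still a heuristic and can guess wrong on unusual input.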
Coordinator
Jul 16, 2008 at 2:21 PM
Solving the pre-processing of invalid hexadecimal characters prior to an XmlReader/XPathNavigator involves processing the feed content using a StreamReader, which in turn requires the use of the proper Encoding. The framework utilizes the explicitly specified character encoding in the settings, and if not present will do an inspection of the 'encoding' attribute of the xml declaration and use the value of that attribute to determine the encoding to use.
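
The declaration-based lookup described above can be sketched roughly as follows: check for a byte order mark, then read the encoding attribute from the XML declaration, and default to UTF-8. This is an illustrative sketch with hypothetical names, not the framework's code, and it ignores UTF-32 and declarations written in UTF-16 without a byte order mark.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class FeedEncodingSniffer
{
    // Matches the encoding attribute of an XML declaration at the very start.
    private static readonly Regex DeclaredEncoding = new Regex(
        "^<\\?xml[^>]*encoding\\s*=\\s*[\"']([A-Za-z0-9._-]+)[\"']",
        RegexOptions.IgnoreCase);

    public static Encoding Detect(byte[] data)
    {
        // 1. A byte order mark takes precedence over the declaration.
        if (data.Length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
            return Encoding.UTF8;
        if (data.Length >= 2 && data[0] == 0xFF && data[1] == 0xFE)
            return Encoding.Unicode;            // UTF-16 little endian
        if (data.Length >= 2 && data[0] == 0xFE && data[1] == 0xFF)
            return Encoding.BigEndianUnicode;   // UTF-16 big endian

        // 2. The declaration itself is ASCII-compatible in the common
        //    encodings, so it can be read before the real encoding is known.
        string head = Encoding.ASCII.GetString(data, 0, Math.Min(data.Length, 256));
        Match match = DeclaredEncoding.Match(head);
        if (match.Success)
        {
            try
            {
                return Encoding.GetEncoding(match.Groups[1].Value);
            }
            catch (ArgumentException)
            {
                // Unknown or unsupported encoding name: use the default below.
            }
        }

        // 3. XML without a byte order mark or declaration defaults to UTF-8.
        return Encoding.UTF8;
    }
}
```

The limitation is the one this thread keeps running into: the declaration is only a claim made by the feed, so a feed that declares one encoding and emits bytes in another will still be decoded incorrectly.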

Please mail me a few sample feed files or links to feeds you are having an issue with and I will work on solving this problem. I am beginning to think I need to reconsider how I am attempting to provide feed data source safety, as it is the likely root of this character encoding issue.

Thanks for bringing this to my attention.