Be cautious when using XPathDocument with DTDs.

by Sander Gerz July 27, 2006 09:29

From the docs:

“There are two ways to read an XML document in the System.Xml.XPath namespace. One is to read an XML document using the read-only XPathDocument class and the other is to read an XML document using the editable XmlDocument class in the System.Xml namespace.“

Further on, on the same page, it says:

“ The following example illustrates using the XPathDocument class's string constructor to read an XML document.“

However, there is no example. Perhaps that's the reason for an average rating of 2.31 out of 9. By the way, the string passed to this constructor is in fact interpreted as a URI. So why they didn't choose to implement a constructor that takes the Uri class is beyond me.

What it fails to mention is that you need to be extra careful when reading an XML document with XPathDocument, especially when it contains a reference to a DTD, and even worse, when you don't control the location where that DTD is supposed to live.

Consider this simple XML file:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE Library SYSTEM "http://www.devtips.net/bijlagen/sample.dtd">
<Library>
  <family>
    <title>Our Family</title>
    <parent role="mother">Christina</parent>
    <parent role="father">Jean Luc</parent>
    <child role="daughter">Sofia</child>
    <child role="son">Pedro</child>
  </family>
</Library>

Will this load with:

XPathDocument doc = new XPathDocument(@"d:\temp\test.xml");

? Yes, it will. But when an administrator decides to rename the DTD, or the server running www.devtips.net goes offline, or, as happened recently, a firewall blocks you from accessing the server... it will fail! During construction of the XPathDocument, an HTTP GET request is issued to check whether the DTD can be reached. Nothing is actually done with the DTD; the request is made just in case. So while XPathDocument is set up to be a faster alternative to XmlDocument, you get the additional overhead of an HTTP request that needs to be resolved. Imagine that server being on the other side of the globe!

You will also get an exception (a WebException) if the DTD cannot be reached for any reason, breaking your app, because the XML file will not load.
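
If you do need XPathDocument for a document like this, one way around the remote lookup (a sketch, not something the docs prescribe) is to construct it from an XmlReader whose resolver is switched off, so the DOCTYPE is accepted but the external DTD is never fetched:

// requires the System.Xml and System.Xml.XPath namespaces
XmlReaderSettings settings = new XmlReaderSettings();
settings.ProhibitDtd = false;   // accept the DOCTYPE declaration...
settings.XmlResolver = null;    // ...but never fetch the external DTD over HTTP

using (XmlReader reader = XmlReader.Create(@"d:\temp\test.xml", settings))
{
    XPathDocument doc = new XPathDocument(reader);
    XPathNavigator nav = doc.CreateNavigator();
    // query as usual, e.g. nav.SelectSingleNode("/Library/family/title")
}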

What about when an XML Schema is referenced? Well, then there's no problem: it does not try to fetch the remote schema.

In conclusion, be careful using XPathDocument; it may not be as fast as you thought. And avoid it when your XML file references a DTD at a location you have no control over.

 


Get GoogleBot to crash your .NET 2.0 site - update

by Sander Gerz July 11, 2006 09:36

Good news... I hope.

A few days ago, I posted a story on a bug that Nix found in ASP.NET 2.0. According to this thread, the bug was confirmed and they're working on a solution.


Tags: .NET

Tip/Trick: List Running ASP.NET Worker Processes and Kill/Restart them from the command-line [a better way]

by Sander Gerz July 02, 2006 19:37

Scott Guthrie posts a trick on a quick way to kill a process on your system, or kill and restart an ASP.NET or IIS worker process.  I tried to post a comment on his trick, but the commenting system is not working. So I'll give my opinion here, leaving me with a bit more room to elaborate.

Scott suggests that you use taskkill to kill the process running the application pool. That's all nice and neat, but how do you know which process to kill? If you have multiple application pools, you might just kill the wrong one. A much better solution is to use the little-known iisapp command. In fact, iisapp is a VBScript located in %winnt%\system32. Run it from the command prompt without parameters, and you get a list of application pools with their associated process IDs.

C:\WINDOWS\system32>iisapp
W3WP.exe PID: 3328   AppPoolId: DefaultAppPool
W3WP.exe PID: 232   AppPoolId: AppPool ASPNET2.0

The command IIsApp /a DefaultAppPool /r will recycle the specified application pool. Not only is this a lot easier, it's also less error-prone and therefore safer to use: with taskkill you could kill the wrong process, either by mistyping the PID or because the application pool has already recycled since you listed the processes.
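
For example, to recycle the second pool from the listing above, whose name contains a space, quoting the name should do the trick (a sketch; substitute your own pool name):

C:\WINDOWS\system32>iisapp /a "AppPool ASPNET2.0" /r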

There are a few other commands that few people are aware of. For example:

iisweb /query

This will give you a list of configured websites, their status, port number and host header. You can also use iisweb to create, pause, stop, start and delete websites. iisvdir will do something similar for virtual folders.

With iisback you can back up and restore the IIS configuration. In fact, if you do a listing of .vbs files from within %winnt%\system32 you may find some other hidden gems.

Hope this helps... too.

Sander


Tags: .NET

Get GoogleBot to crash your .NET 2.0 site

by Sander Gerz July 01, 2006 21:31

If you’re developing in ASP.NET 2.0 and you’re using url rewriting, you should proceed with caution, especially if you value your ranking in search engines. I’m posting this as a follow-up and with reference to the original find in this post. The issue came about in a thread on the Community Server 2.0 forums. I was quick to post a solution to the problem, but obviously, it’s more about working around the issue than actually solving the root cause.

URL rewriting is mostly used to make URLs friendlier to read. Also, if you have migrated from one site to another and want to accommodate people still linking to old URLs, rewriting is a common practice. Nonetheless, if you use this on a public internet website, it won’t be long until you see the following exception listed in your Event Log:

   Exception type: HttpException
   Exception message: Cannot use a leading .. to exit above the top directory.
...
   Stack trace:    at System.Web.Util.UrlPath.ReduceVirtualPath(String path)
...

For any site of significance, and www.codes-sources.com is most certainly one of them, this exception was logged more than 2000 times … every hour. Now, that’s something to notice, right? And the effect on your ranking in the search engines? Well, within a few days, your site is either kicked out of the index completely, or the index contains nothing more than the URL of your site without any content. Nobody looking for content you may have on your site will be directed to it. Worried? Read on.

My personal instinct would be: it’s something I did wrong. So you spend a long time trying to figure out what that is. But in fact, it’s a bug in a .NET component that’s not easy to trace and reproduce. If you don’t check the event logs every once in a while, it can easily be missed. Let’s take a look at what’s going on.

A first note for people trying to reproduce the issue: the bug does not appear when using Cassini (the built-in web server in Visual Studio 2005). You need a running IIS 6 web server on Windows 2003; it doesn’t matter whether it’s in a VPC or on an actual server.

If you’re using url rewriting in .NET 2.0, you have the Context.RewritePath method at your disposal. Here’s a sample project for testing.

1. First you create a page, say page.aspx.

2. In this page, you can put whatever you want; it doesn’t really matter. For example:

<%=Request("ID")%>

3. Then you add your rewriting HttpModule, with the following implementation:

Public Class Rewriter
    Implements System.Web.IHttpModule

    Public Sub Dispose() Implements System.Web.IHttpModule.Dispose
    End Sub

    Public Sub Init(ByVal context As System.Web.HttpApplication) Implements System.Web.IHttpModule.Init
        AddHandler context.BeginRequest, AddressOf Me.HandleBeginRequest
    End Sub

    Private Sub HandleBeginRequest(ByVal [source] As [Object], ByVal e As EventArgs)
        Dim app As System.Web.HttpApplication = CType([source], System.Web.HttpApplication)
        app.Context.RewritePath("~/page.aspx?ID=1", False) ' side note: same effect when using "/page.aspx?ID=1"
    End Sub

End Class

As you can see, it’s a simple example that rewrites all URLs to page.aspx?ID=1. It does not serve a specific function other than to show the problem at hand. Now, register the HttpModule in the Web.config file.
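
For reference, the registration could look roughly like this, assuming the Rewriter class lives in App_Code without a namespace (adjust the type attribute to match your project):

<configuration>
  <system.web>
    <httpModules>
      <add name="Rewriter" type="Rewriter" />
    </httpModules>
  </system.web>
</configuration>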

With Fiddler (available at www.fiddlertool.com), you can create web requests and analyze the results in great detail. It’s especially useful in this case, as you can craft a request with a specific user-agent. So download the tool and set up your ASP.NET 2.0 site in an IIS 6.0 environment. One thing to note as well is that the site needs to be running under its own host header, not as a virtual directory.
Once installed, you take your web browser and go to

http://localsitename/default.aspx

The page default.aspx will be rewritten as page.aspx?ID=1 and everything works just fine.

Now, open up Fiddler and create the following request:

Accept: */*
Accept-Encoding: gzip, x-gzip
User-Agent: Mozilla/4.0

Set the url to

http://localsitename/default.aspx

and hit Execute. You should get status code 200, meaning OK. Now set the url to

http://localsitename/justafolder/default.aspx

and after you hit Execute again, you will get a 200 code. No problems so far.

Now, change the request to

User-Agent: Mozilla/5.0
instead of
User-Agent: Mozilla/4.0

Hit Execute and bang… error 500, indicating an application error.
Here’s a list of user-agent entries that will result in an error:

Mozilla/1.0
Mozilla/2.0
Mozilla/5.0
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
Yahoo-Blogs/v3.9 (compatible; Mozilla 4.0; MSIE 5.5; http://help.yahoo.com/help/us/ysearch/crawling/crawling-02.html )
Mozilla/2.0 (compatible; Ask Jeeves/Teoma; +http://sp.ask.com/docs/about/tech_crawling.html)
Mozilla/5.0 (compatible; BecomeBot/3.0; MSIE 6.0 compatible; +http://www.become.com/site_owners.html)
Mozilla/5.0 (compatible; Konqueror/.... (all the Konqueror user agents I tested crash)
Etc...

Some funny details:
Mozilla/5.0 (Windows; U; Windows NT 5.1; fr; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1 <= no error
Mozilla/5.0 (Windows; U; Windows NT 5.1; fr; rv:1.8.0.1) <= error 500!

OK, so let’s try to explain what happens. If you call RewritePath with the rebaseClientPath parameter set to True, the virtual path is reset. So why set it to False? Well, the setting of rebaseClientPath affects the action attribute of a form.

If I have a URL http://mysite/myfolder/mypage.aspx which is rewritten to http://mysite/page.aspx?id=mypage, the form tag will be set as follows.

With rebaseClientPath set to true:

<form name="form1" method="post" action="page.aspx?ID=mypage" id="form1">

But with rebaseClientPath set to false:

<form name="form1" method="post" action="../page.aspx?ID=mypage" id="form1">

In the case of a postback, the action in the latter situation (with rebaseClientPath set to false) is correct; the action in the first is not, because there is no page.aspx in the subfolder /myfolder.

Now, a workaround would be to manually set the UrlPostback to the correct location, but the ramifications are significant, and it may affect the JavaScript that manipulates the UrlPostback on a number of browsers.

What’s really troublesome is that it affects only the production version of a website. It’s not visible during development (usually you don’t let search engines index your development and test environment, right?). Also, it happens only under IIS 6.0, using ASP.NET 2.0. And the only variable in this case is the user agent with which the site is accessed.

Now that we know what the issue is, how do we resolve it? Well, doing it yourself is not easy and not without risk, but here goes.

It has to do with the capabilities of the user-agent. When the site is hit by a user-agent having Mozilla/5.0 in the string, ASP.NET will be using System.Web.UI.Html32TextWriter. I’ve traced the bug with Intercept Studio and can confirm this. If you use another user-agent, for example Mozilla/4.0, System.Web.UI.HtmlTextWriter will be used. This is called adaptive rendering, which can lead to weird behavior.
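
If you want to see for yourself which writer ASP.NET picks for a given user-agent, a quick check (a sketch, not part of the original repro) is to dump the tagwriter browser capability on the test page:

<%=Request.Browser("tagwriter")%>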

So the problem lies in Html32TextWriter, but it’s unclear where this goes wrong exactly. The exception is thrown in System.Web.Util.UrlPath.ReduceVirtualPath(), but by the time the stack reaches this method, we’re already six method calls away from the last usage of Html32TextWriter (being used in System.Web.UI.HtmlControls.HtmlForm.RenderAttributes). If you have time to walk the stack with Reflector, go ahead. There are about 23 method calls in all.

You may want to wait for a fix from Microsoft, but if you can’t wait that long, there’s a hack of a solution to try yourself. Say, for instance, you want to fix the issue for the Yahoo search bot, Yahoo! Slurp (you would need to apply this to all affected user-agents in a similar fashion).

Since Visual Studio 2005, we have the ability to create .browser files. These .browser files contain a definition of the capabilities of browsers. In your web project, add a folder called “App_Browsers” and create a new file (e.g. yahooslurp.browser). In this file, you put:

<!--
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
-->
<browsers>
  <browser id="Slurp" parentID="Mozilla">
    <identification>
      <userAgent match="Slurp" />
    </identification>
    <capabilities>
      <capability name="browser" value="Yahoo!Slurp" />
      <capability name="Version" value="4.0" />
      <capability name="MajorVersion" value="4" />
      <capability name="MinorVersionString" value="" />
      <capability name="MinorVersion" value=".0" />
      <capability name="activexcontrols" value="true" />
      <capability name="backgroundsounds" value="true" />
      <capability name="cookies" value="true" />
      <capability name="css1" value="true" />
      <capability name="css2" value="true" />
      <capability name="ecmascriptversion" value="1.2" />
      <capability name="frames" value="true" />
      <capability name="javaapplets" value="true" />
      <capability name="javascript" value="true" />
      <capability name="jscriptversion" value="5.0" />
      <capability name="supportsCallback" value="true" />
      <capability name="supportsFileUpload" value="true" />
      <capability name="supportsMultilineTextBoxDisplay" value="true" />
      <capability name="supportsMaintainScrollPositionOnPostback" value="true" />
      <capability name="supportsVCard" value="true" />
      <capability name="supportsXmlHttp" value="true" />
      <capability name="tables" value="true" />
      <capability name="vbscript" value="true" />
      <capability name="w3cdomversion" value="1.0" />
      <capability name="xml" value="true" />
      <capability name="tagwriter" value="System.Web.UI.HtmlTextWriter" />
    </capabilities>
  </browser>
</browsers>

Now, restart your website and redo the tests with Fiddler, setting the user-agent to

Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

Voilà, no more 500 error message. Repeat the steps for the different user-agents out there.

There may be other possibilities to fix this problem, but this one seems to be the most straightforward to implement and doesn’t require you to recompile the code. With solutions like Community Server, you simply don’t have that option anyway. It suffices to add the App_Browsers folder and the various .browser files to the root folder of the website, and it will work.

At the end of the original post by Nix, there are some closing remarks. For example, if you create the site as a virtual directory under a website root, the problem does not appear. Changing the user-agent in a header, as suggested by Poppyto, might work, but the consequences are uncertain. Also, you would need to recompile.

Now, you might think: oh well, just add the appropriate .browser files and we’re done. But that’s a bit short-sighted. You never know when Google or any other search engine decides to change its user-agent string. Your first notification would come from the exception messages in your event log. I manage about 7 big Windows 2003 servers in our datacenter (perhaps not much, but I know what issues can come along every once in a while). Both as a developer and a system administrator, I would rather prevent issues than handle them afterwards.

If you want to debate the issue, please go to the thread at the Community Server forums.

PS: Why did I write this post? Well, the entire explanation is written in French, and not everyone can read this. Also, Nix and I know each other and he’s been helpful on a number of occasions, so if there’s anything I can do in return, like explaining the issue in English, I’m more than happy to help. Finally, this problem needs a proper solution, and the sites and servers I host can be just as much affected. I was fortunate not to have too much url rewriting in my sites, but that doesn’t affect the scope of the issue.

 PS: sorry if this post is popping up in your rss-reader again, but the text-editor in .Text keeps mangling my writing.


Tags: .NET
