Handling URL errors detected by Googlebot.
We use Google Webmaster Tools to monitor our main site, and in particular the Crawl Errors that it detects. We are sometimes unsure where the errors come from: in some cases the 'source' URL is the very page reported as being in error, and in others the source page contains no link, as far as we can find, to the URL reported as being in error.
That said, it has proved generally useful, and most errors are trivial to fix. What has been difficult to find is a good reference guide to Search Engine Friendly (SEF) URLs. While acknowledging that SEF can be quite involved, our searches have yet to turn up a comprehensive article on the best design and implementation mechanisms. It is even more difficult to find a good guide to resolving problems. Having found nothing suitable, we decided to write this post as a record of our investigations, and perhaps as a guide for others.
Errors reported by Google Webmaster Tools generally fall into three categories.
The first is the 'standard' 404 error. These are (usually) the easiest to resolve, and involve setting up suitable redirects within Joomla, either with the standard 'Redirect Manager' or with a third-party component such as SH404SEF or Akeeba Admin Tools Pro, to mention only two.
The second involves 'soft errors'. These typically occur where Joomla has returned an error and perhaps displayed a page with an 'access' error. One example is an attempt to access a non-existent Kunena forum post or a deleted thread. Another example we have seen is an attempt to display a menu item for which the required access group permissions are not correct. The standard Joomla tools do not seem to affect the result, so we resolve these using the same method as for the third group below.
The third type involves '500' system errors. These result from trying to access a URL which does not exist, such as the display of an item for which a 'view' does not exist. This type of error cannot be resolved by a URL redirect within Joomla, so an external redirect has to be used, for example with Apache mod_rewrite and the .htaccess file.
It is beyond the scope of this post to go into the complexity of .htaccess entries. We did, however, find two very useful websites which help in determining the correct entries for the .htaccess file.
The first is '.HtAccess 301 Redirect Generator Tool', which lets one enter a list of URL pairs separated by commas, with the first entry being the 'original' URL and the second being the desired resultant URL. The tool then creates a list of the commands to be entered into the .htaccess file. The entries are very specific, as the tool is designed to do exact URL redirecting. If one wants something more generic, one can always edit the supplied commands, if one is feeling competent.
The second is 'htaccess tester', which lets one test the rules against selected source URLs and see what the output URL would actually look like.
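To illustrate the difference between an exact redirect and a more generic one, here is a minimal sketch using mod_rewrite. It does not reproduce the output of either tool; the paths old-section and new-section are hypothetical, and we assume the .htaccess file sits in the site root.

```apache
RewriteEngine On

# Exact redirect: one specific old URL goes to one specific new URL.
# (Both paths are made-up examples.)
RewriteRule ^old-section/old-page\.html$ /new-section/new-page.html [R=301,L]

# Generic redirect: any other page under old-section/ goes to the
# page of the same name under new-section/.
RewriteRule ^old-section/(.+)$ /new-section/$1 [R=301,L]
```

The exact rule is placed first so that it takes effect before the generic rule has a chance to match the same URL.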
Webmaster Tools describes the 500 error as follows: "Googlebot couldn't access the contents of this URL because the server had an internal error when trying to process the request. These errors tend to be with the server itself, not with the request."
The URLs that generated 500 system errors for us were addresses with an additional 'view=' string appended after the .html in the URL, which is usually referred to as a query string. The 500 error message is often shown as:
'500 - View not found [name, type, prefix]: xxxxxx, html, contentView'
where xxxxxx is the name of the view being searched for. These messages (to us at least) seem to indicate that the URL is at fault, which is why we are seeing the error. Leaving aside the question of where Googlebot obtains the addresses (and we have sometimes been unable to discover the source), it is often desirable to resolve these errors.
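As a minimal sketch of the kind of rule that can deal with such addresses, something like the following redirects the faulty URL to the same page without the query string. It assumes Apache with mod_rewrite enabled, an .htaccess file in the site root, and that no legitimate .html page on the site depends on a 'view=' parameter; none of these is guaranteed for every site.

```apache
RewriteEngine On

# Only act when the query string contains a view= parameter.
RewriteCond %{QUERY_STRING} (^|&)view= [NC]
# Redirect the .html page to the same path; the trailing "?" in the
# target discards the original query string.
RewriteRule ^(.+\.html)$ /$1? [R=301,L]
```

Note that this discards the whole query string, not only the 'view=' part, so it is only suitable where the .html pages need no other parameters. On Apache 2.4 the QSD flag can be used in place of the trailing "?".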
The method we have adopted involves the following steps:
- Export the list of addresses from Webmaster Tools and load it into a spreadsheet such as Excel.
- Inspect the URLs in the list and create the 'from -> to' pairs required by the '.HtAccess 301 Redirect Generator Tool'.
- Copy the generated .htaccess commands and, using an editor, modify them as required.
- Using the existing '.htaccess' file, enter the above commands at the appropriate location(s) in the file.
- Paste the newly created '.htaccess' file into the 'htaccess tester' and test it against a variety of URLs. It is important to test not only the URLs that were formerly in error, which should ideally now work correctly, but also other URLs that worked correctly originally, to ensure that we have not inadvertently caused any new problems.
One thing to bear in mind is that the order in which commands are entered in the .htaccess file is critical. When testing, it is therefore important that the .htaccess file is considered in its entirety; otherwise the final result on the live site will not necessarily be correct.
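As an illustration of why order matters, the sketch below assumes the layout of the stock Joomla htaccess.txt, in which a catch-all rule hands every request that is not a real file or folder to index.php. The Joomla section is abbreviated here and the custom redirect is only an example; the point is that a custom redirect placed after the catch-all rule would not be reached for such requests.

```apache
RewriteEngine On

## Custom redirects (illustrative example): these must come first.
RewriteCond %{QUERY_STRING} (^|&)view= [NC]
RewriteRule ^(.+\.html)$ /$1? [R=301,L]

## Joomla core SEF section (abbreviated).
# Anything that is not index.php, a real file or a real folder is
# passed to index.php, and [L] ends the current pass through the rules.
RewriteCond %{REQUEST_URI} !^/index\.php
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule .* index.php [L]
```

Testing only the custom redirect on its own would not reveal this interaction, which is why we paste the complete file into the tester.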
Disclaimer: We have no connection with the two tools mentioned above, but wish to thank their creators for making them freely available on the web.