#! /usr/bin/env python

# Original code by Guido van Rossum; extensive changes by Sam Bayer,
# including code to check URL fragments.

"""Web tree checker.

This utility is handy to check a subweb of the world-wide web for
errors. A subweb is specified by giving one or more ``root URLs''; a
page belongs to the subweb if one of the root URLs is an initial
prefix of it.

File URL extension:

In order to ease the checking of subwebs via the local file system,
the interpretation of ``file:'' URLs is extended to mimic the behavior
of your average HTTP daemon: if a directory pathname is given, the
file index.html in that directory is returned if it exists, otherwise
a directory listing is returned. Now, you can point webchecker to the
document tree in the local file system of your HTTP daemon, and have
most of it checked. In fact the default works this way if your local
web tree is located at /usr/local/etc/httpd/htdocs (the default for
the NCSA HTTP daemon and probably others).

Report printed:

When done, it reports pages with bad links within the subweb. When
interrupted, it reports on the pages that it has checked so far.

In verbose mode, additional messages are printed during the
information gathering phase. By default, it prints a summary of its
work status every 50 URLs (adjustable with the -r option), and it
reports errors as they are encountered. Use the -q option to disable
this output.

Checkpoint feature:

Whether interrupted or not, it dumps its state (a Python pickle) to a
checkpoint file and the -R option allows it to restart from the
checkpoint (assuming that the pages on the subweb that were already
processed haven't changed). Even when it has run till completion, -R
can still be useful -- it will print the reports again, and -Rq prints
the errors only. In this case, the checkpoint file is not written
again. The checkpoint file can be set with the -d option.

The checkpoint file is written as a Python pickle. Remember that
Python's pickle module is currently quite slow. Give it the time it
needs to load and save the checkpoint file. When interrupted while
writing the checkpoint file, the old checkpoint file is not
overwritten, but all work done in the current run is lost.

Miscellaneous:

- You may find the (Tk-based) GUI version easier to use. See wcgui.py.

- Webchecker honors the "robots.txt" convention. Thanks to Skip
Montanaro for his robotparser.py module (included in this directory)!
The agent name is hardwired to "webchecker". URLs that are disallowed
by the robots.txt file are reported as external URLs.

- Because the SGML parser is a bit slow, very large SGML files are
skipped. The size limit can be set with the -m option.

- When the server or protocol does not tell us a file's type, we guess
it based on the URL's suffix. The mimetypes.py module (also in this
directory) has a built-in table mapping most currently known suffixes,
and in addition attempts to read the mime.types configuration files in
the default locations of Netscape and the NCSA HTTP daemon.

- We follow links indicated by <A>, <FRAME> and <IMG> tags. We also
honor the <BASE> tag.

- We now check internal NAME anchor links, as well as toplevel links.

- Checking external links is now done by default; use -x to *disable*
this feature. External links are now checked during normal
processing. (XXX The status of a checked link could be categorized
better. Later...)

- If external links are not checked, you can use the -t flag to
provide specific overrides to -x.

Usage: webchecker.py [option] ... [rooturl] ...

Options:

-R        -- restart from checkpoint file
-d file   -- checkpoint filename (default %(DUMPFILE)s)
-m bytes  -- skip HTML pages larger than this size (default %(MAXPAGE)d)
-n        -- reports only, no checking (use with -R)
-q        -- quiet operation (also suppresses external links report)
-r number -- number of links processed per round (default %(ROUNDSIZE)d)
-t root   -- specify root dir which should be treated as internal (can repeat)
-v        -- verbose operation; repeating -v will increase verbosity
-x        -- don't check external links (these are often slow to check)
-a        -- don't check name anchors

Arguments:

rooturl   -- URL to start checking
             (default %(DEFROOT)s)

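For illustration, a typical run might look like this (the host and
path are hypothetical):

    webchecker.py -v http://www.example.com/docs/

and a later

    webchecker.py -R -n -q

reloads the checkpoint and reprints only the error report, without
rechecking anything.
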
"""


__version__ = "$Revision$"


import sys
import os
from types import *
import StringIO
import getopt
import pickle

import urllib
import urlparse
import sgmllib
import cgi

import mimetypes
import robotparser

# Extract real version number if necessary
if __version__[0] == '$':
    _v = __version__.split()
    if len(_v) == 3:
        __version__ = _v[1]


# Tunable parameters
DEFROOT = "file:/usr/local/etc/httpd/htdocs/"   # Default root URL
CHECKEXT = 1                            # Check external references (1 deep)
VERBOSE = 1                             # Verbosity level (0-3)
MAXPAGE = 150000                        # Ignore files bigger than this
ROUNDSIZE = 50                          # Number of links processed per round
DUMPFILE = "@webchecker.pickle"         # Pickled checkpoint
AGENTNAME = "webchecker"                # Agent name for robots.txt parser
NONAMES = 0                             # If true, skip name anchor checking (-a)


# Global variables


def main():
    checkext = CHECKEXT
    verbose = VERBOSE
    maxpage = MAXPAGE
    roundsize = ROUNDSIZE
    dumpfile = DUMPFILE
    restart = 0
    norun = 0

    try:
        opts, args = getopt.getopt(sys.argv[1:], 'Rd:m:nqr:t:vxa')
    except getopt.error, msg:
        sys.stdout = sys.stderr
        print msg
        print __doc__%globals()
        sys.exit(2)

    # Extra roots given with -t; they are only honored when -x disables
    # external-link checking (see below).
    extra_roots = []
    nonames = NONAMES

    for o, a in opts:
        if o == '-R':
            restart = 1
        if o == '-d':
            dumpfile = a
        if o == '-m':
            maxpage = int(a)
        if o == '-n':
            norun = 1
        if o == '-q':
            verbose = 0
        if o == '-r':
            roundsize = int(a)
        if o == '-t':
            extra_roots.append(a)
        if o == '-a':
            nonames = not nonames
        if o == '-v':
            verbose = verbose + 1
        if o == '-x':
            checkext = not checkext

    if verbose > 0:
        print AGENTNAME, "version", __version__

    if restart:
        c = load_pickle(dumpfile=dumpfile, verbose=verbose)
    else:
        c = Checker()

    c.setflags(checkext=checkext, verbose=verbose,
               maxpage=maxpage, roundsize=roundsize,
               nonames=nonames
               )

    if not restart and not args:
        args.append(DEFROOT)

    for arg in args:
        c.addroot(arg)

    # The -t flag is only needed if external links are not to be
    # checked.  So -t values are ignored unless -x was specified.
    if not checkext:
        for root in extra_roots:
            # Make sure it's terminated by a slash,
            # so that addroot doesn't discard the last
            # directory component.
            if root[-1] != "/":
                root = root + "/"
            c.addroot(root, add_to_do = 0)

    try:

        if not norun:
            try:
                c.run()
            except KeyboardInterrupt:
                if verbose > 0:
                    print "[run interrupted]"

        try:
            c.report()
        except KeyboardInterrupt:
            if verbose > 0:
                print "[report interrupted]"

    finally:
        if c.save_pickle(dumpfile):
            if dumpfile == DUMPFILE:
                print "Use ``%s -R'' to restart." % sys.argv[0]
            else:
                print "Use ``%s -R -d %s'' to restart." % (sys.argv[0],
                                                           dumpfile)


def load_pickle(dumpfile=DUMPFILE, verbose=VERBOSE):
    if verbose > 0:
        print "Loading checkpoint from %s ..." % dumpfile
    f = open(dumpfile, "rb")
    c = pickle.load(f)
    f.close()
    if verbose > 0:
        print "Done."
        print "Root:", "\n      ".join(c.roots)
    return c

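# A minimal programmatic sketch (the root URL is hypothetical): the same
# Checker class that main() drives above can also be used directly, e.g.
#
#     c = Checker()
#     c.setflags(verbose=0, checkext=0)
#     c.addroot("http://www.example.com/")
#     c.run()
#     c.report()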

class Checker:

    checkext = CHECKEXT
    verbose = VERBOSE
    maxpage = MAXPAGE
    roundsize = ROUNDSIZE
    nonames = NONAMES

    validflags = tuple(dir())

    def __init__(self):
        self.reset()

    def setflags(self, **kw):
        for key in kw.keys():
            if key not in self.validflags:
                raise NameError, "invalid keyword argument: %s" % str(key)
        for key, value in kw.items():
            setattr(self, key, value)

    def reset(self):
        self.roots = []
        self.todo = {}
        self.done = {}
        self.bad = {}

        # Add a name table, so that the name URLs can be checked.  Also
        # serves as an implicit cache for which URLs are done.
        self.name_table = {}

        self.round = 0
        # The following are not pickled:
        self.robots = {}
        self.errors = {}
        self.urlopener = MyURLopener()
        self.changed = 0

    def note(self, level, format, *args):
        if self.verbose > level:
            if args:
                format = format%args
            self.message(format)

    def message(self, format, *args):
        if args:
            format = format%args
        print format

    def __getstate__(self):
        return (self.roots, self.todo, self.done, self.bad, self.round)

    def __setstate__(self, state):
        self.reset()
        (self.roots, self.todo, self.done, self.bad, self.round) = state
        for root in self.roots:
            self.addrobot(root)
        for url in self.bad.keys():
            self.markerror(url)

    def addroot(self, root, add_to_do = 1):
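        # Trim the root to its directory so that sibling pages count as
        # internal.  For a hypothetical root of
        # "http://www.example.com/docs/index.html" the stored prefix becomes
        # "http://www.example.com/docs/", while the full URL given is still
        # queued below as the first link to check.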
        if root not in self.roots:
            troot = root
            scheme, netloc, path, params, query, fragment = \
                    urlparse.urlparse(root)
            i = path.rfind("/") + 1
            if 0 < i < len(path):
                path = path[:i]
                troot = urlparse.urlunparse((scheme, netloc, path,
                                             params, query, fragment))
            self.roots.append(troot)
            self.addrobot(root)
            if add_to_do:
                self.newlink((root, ""), ("<root>", root))

    def addrobot(self, root):
        root = urlparse.urljoin(root, "/")
        if self.robots.has_key(root): return
        url = urlparse.urljoin(root, "/robots.txt")
        self.robots[root] = rp = robotparser.RobotFileParser()
        self.note(2, "Parsing %s", url)
        rp.debug = self.verbose > 3
        rp.set_url(url)
        try:
            rp.read()
        except (OSError, IOError), msg:
            self.note(1, "I/O error parsing %s: %s", url, msg)

    def run(self):
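        # Each round pulls at most self.roundsize pending URLs, in sorted
        # order, and prints a status summary first; checking a page may add
        # new entries to self.todo for later rounds.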
        while self.todo:
            self.round = self.round + 1
            self.note(0, "\nRound %d (%s)\n", self.round, self.status())
            urls = self.todo.keys()
            urls.sort()
            del urls[self.roundsize:]
            for url in urls:
                self.dopage(url)

    def status(self):
        return "%d total, %d to do, %d done, %d bad" % (
            len(self.todo)+len(self.done),
            len(self.todo), len(self.done),
            len(self.bad))

    def report(self):
        self.message("")
        if not self.todo: s = "Final"
        else: s = "Interim"
        self.message("%s Report (%s)", s, self.status())
        self.report_errors()

    def report_errors(self):
        if not self.bad:
            self.message("\nNo errors")
            return
        self.message("\nError Report:")
        sources = self.errors.keys()
        sources.sort()
        for source in sources:
            triples = self.errors[source]
            self.message("")
            if len(triples) > 1:
                self.message("%d Errors in %s", len(triples), source)
            else:
                self.message("Error in %s", source)
            # Call self.format_url() instead of referring to the URL
            # directly, since the URL in these triples is now a
            # (URL, fragment) pair.  The value of the "source" variable
            # comes from the list of origins, and is a URL, not a pair.
            for url, rawlink, msg in triples:
                if rawlink != self.format_url(url): s = " (%s)" % rawlink
                else: s = ""
                self.message("  HREF %s%s\n    msg %s",
                             self.format_url(url), s, msg)

    def dopage(self, url_pair):

        # All printing of URLs uses format_url(); argument changed to
        # url_pair for clarity.
        if self.verbose > 1:
            if self.verbose > 2:
                self.show("Check ", self.format_url(url_pair),
                          "  from", self.todo[url_pair])
            else:
                self.message("Check %s", self.format_url(url_pair))
        url, local_fragment = url_pair
        if local_fragment and self.nonames:
            self.markdone(url_pair)
            return
        try:
            page = self.getpage(url_pair)
        except sgmllib.SGMLParseError, msg:
            msg = self.sanitize(msg)
            self.note(0, "Error parsing %s: %s",
                      self.format_url(url_pair), msg)
            # Don't actually mark the URL as bad -- it exists, we just
            # can't parse it!
            page = None
        if page:
            # Store the page which corresponds to this URL.
            self.name_table[url] = page
            # If there is a fragment in this url_pair, and it's not
            # in the list of names for the page, call setbad(), since
            # it's a missing anchor.
            if local_fragment and local_fragment not in page.getnames():
                self.setbad(url_pair, ("Missing name anchor `%s'" % local_fragment))
            for info in page.getlinkinfos():
                # getlinkinfos() now returns the fragment as well,
                # and we store that fragment here in the "todo" dictionary.
                link, rawlink, fragment = info
                # However, we don't want the fragment as the origin, since
                # the origin is logically a page.
                origin = url, rawlink
                self.newlink((link, fragment), origin)
        else:
            # If no page has been created yet, we want to
            # record that fact.
            self.name_table[url_pair[0]] = None
        self.markdone(url_pair)

    def newlink(self, url, origin):
        if self.done.has_key(url):
            self.newdonelink(url, origin)
        else:
            self.newtodolink(url, origin)

    def newdonelink(self, url, origin):
        if origin not in self.done[url]:
            self.done[url].append(origin)

        # Call self.format_url(), since the URL here
        # is now a (URL, fragment) pair.
        self.note(3, "  Done link %s", self.format_url(url))

        # Make sure that if it's bad, that the origin gets added.
        if self.bad.has_key(url):
            source, rawlink = origin
            triple = url, rawlink, self.bad[url]
            self.seterror(source, triple)

    def newtodolink(self, url, origin):
        # Call self.format_url(), since the URL here
        # is now a (URL, fragment) pair.
        if self.todo.has_key(url):
            if origin not in self.todo[url]:
                self.todo[url].append(origin)
            self.note(3, "  Seen todo link %s", self.format_url(url))
        else:
            self.todo[url] = [origin]
            self.note(3, "  New todo link %s", self.format_url(url))

    def format_url(self, url):
        link, fragment = url
        if fragment: return link + "#" + fragment
        else: return link

    def markdone(self, url):
        self.done[url] = self.todo[url]
        del self.todo[url]
        self.changed = 1

    def inroots(self, url):
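        # A URL counts as internal only if it extends one of the registered
        # root prefixes and the robots.txt for that root allows fetching it;
        # everything else is treated as an external link.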
        for root in self.roots:
            if url[:len(root)] == root:
                return self.isallowed(root, url)
        return 0

    def isallowed(self, root, url):
        root = urlparse.urljoin(root, "/")
        return self.robots[root].can_fetch(AGENTNAME, url)

    def getpage(self, url_pair):
        # The incoming argument is a (URL, fragment) pair.
        # The page may have been cached in the name_table variable.
        url, fragment = url_pair
        if self.name_table.has_key(url):
            return self.name_table[url]

        scheme, path = urllib.splittype(url)
        if scheme in ('mailto', 'news', 'javascript', 'telnet'):
            self.note(1, " Not checking %s URL" % scheme)
            return None
        isint = self.inroots(url)

        # Ensure that openpage gets the URL pair to
        # print out its error message and record the error pair
        # correctly.
        if not isint:
            if not self.checkext:
                self.note(1, " Not checking ext link")
                return None
            f = self.openpage(url_pair)
            if f:
                self.safeclose(f)
            return None
        text, nurl = self.readhtml(url_pair)

        if nurl != url:
            self.note(1, " Redirected to %s", nurl)
            url = nurl
        if text:
            return Page(text, url, maxpage=self.maxpage, checker=self)

    # These next three functions take (URL, fragment) pairs as
    # arguments, so that openpage() receives the appropriate tuple to
    # record error messages.
    def readhtml(self, url_pair):
        url, fragment = url_pair
        text = None
        f, url = self.openhtml(url_pair)
        if f:
            text = f.read()
            f.close()
        return text, url

    def openhtml(self, url_pair):
        url, fragment = url_pair
        f = self.openpage(url_pair)
        if f:
            url = f.geturl()
            info = f.info()
            if not self.checkforhtml(info, url):
                self.safeclose(f)
                f = None
        return f, url

    def openpage(self, url_pair):
        url, fragment = url_pair
        try:
            return self.urlopener.open(url)
        except (OSError, IOError), msg:
            msg = self.sanitize(msg)
            self.note(0, "Error %s", msg)
            if self.verbose > 0:
                self.show(" HREF ", url, "  from", self.todo[url_pair])
            self.setbad(url_pair, msg)
            return None

    def checkforhtml(self, info, url):
        if info.has_key('content-type'):
            ctype = cgi.parse_header(info['content-type'])[0].lower()
            if ';' in ctype:
                # handle content-type: text/html; charset=iso8859-1 :
                ctype = ctype.split(';', 1)[0].strip()
        else:
            if url[-1:] == "/":
                return 1
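            # No content-type header: fall back on the URL's suffix.  For a
            # hypothetical "http://host/logo.gif", mimetypes.guess_type()
            # returns ('image/gif', None), so the link is reported as not
            # being HTML and is not parsed further.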
            ctype, encoding = mimetypes.guess_type(url)
        if ctype == 'text/html':
            return 1
        else:
            self.note(1, " Not HTML, mime type %s", ctype)
            return 0

    def setgood(self, url):
        if self.bad.has_key(url):
            del self.bad[url]
            self.changed = 1
            self.note(0, "(Clear previously seen error)")

    def setbad(self, url, msg):
        if self.bad.has_key(url) and self.bad[url] == msg:
            self.note(0, "(Seen this error before)")
            return
        self.bad[url] = msg
        self.changed = 1
        self.markerror(url)

    def markerror(self, url):
        try:
            origins = self.todo[url]
        except KeyError:
            origins = self.done[url]
        for source, rawlink in origins:
            triple = url, rawlink, self.bad[url]
            self.seterror(source, triple)

    def seterror(self, url, triple):
        try:
            # Because of the way the URLs are now processed, I need to
            # check to make sure the URL hasn't been entered in the
            # error list.  The first element of the triple here is a
            # (URL, fragment) pair, but the URL key is not, since it's
            # from the list of origins.
            if triple not in self.errors[url]:
                self.errors[url].append(triple)
        except KeyError:
            self.errors[url] = [triple]

    # The following used to be toplevel functions; they have been
    # changed into methods so they can be overridden in subclasses.

    def show(self, p1, link, p2, origins):
        self.message("%s %s", p1, link)
        i = 0
        for source, rawlink in origins:
            i = i+1
            if i == 2:
                p2 = ' '*len(p2)
            if rawlink != link: s = " (%s)" % rawlink
            else: s = ""
            self.message("%s %s%s", p2, source, s)

    def sanitize(self, msg):
        if isinstance(IOError, ClassType) and isinstance(msg, IOError):
            # Do the other branch recursively
            msg.args = self.sanitize(msg.args)
        elif isinstance(msg, TupleType):
            if len(msg) >= 4 and msg[0] == 'http error' and \
               isinstance(msg[3], InstanceType):
                # Remove the Message instance -- it may contain
                # a file object which prevents pickling.
                msg = msg[:3] + msg[4:]
        return msg

    def safeclose(self, f):
        try:
            url = f.geturl()
        except AttributeError:
            pass
        else:
            if url[:4] == 'ftp:' or url[:7] == 'file://':
                # Apparently ftp connections don't like to be closed
                # prematurely...
                text = f.read()
        f.close()

    def save_pickle(self, dumpfile=DUMPFILE):
        if not self.changed:
            self.note(0, "\nNo need to save checkpoint")
        elif not dumpfile:
            self.note(0, "No dumpfile, won't save checkpoint")
        else:
            self.note(0, "\nSaving checkpoint to %s ...", dumpfile)
            newfile = dumpfile + ".new"
            f = open(newfile, "wb")
            pickle.dump(self, f)
            f.close()
            try:
                os.unlink(dumpfile)
            except os.error:
                pass
            os.rename(newfile, dumpfile)
            self.note(0, "Done.")
            return 1


class Page:

    def __init__(self, text, url, verbose=VERBOSE, maxpage=MAXPAGE, checker=None):
        self.text = text
        self.url = url
        self.verbose = verbose
        self.maxpage = maxpage
        self.checker = checker

        # The page is parsed here in __init__() so that the list of names
        # it contains is available immediately.  The parser is stored in an
        # instance variable, and the URL is passed on to MyHTMLParser().
        size = len(self.text)
        if size > self.maxpage:
            self.note(0, "Skip huge file %s (%.0f Kbytes)", self.url, (size*0.001))
            self.parser = None
            return
        self.checker.note(2, "  Parsing %s (%d bytes)", self.url, size)
        self.parser = MyHTMLParser(url, verbose=self.verbose,
                                   checker=self.checker)
        self.parser.feed(self.text)
        self.parser.close()

    def note(self, level, msg, *args):
        if self.checker:
            apply(self.checker.note, (level, msg) + args)
        else:
            if self.verbose >= level:
                if args:
                    msg = msg%args
                print msg

    # Method to retrieve names.
    def getnames(self):
        if self.parser:
            return self.parser.names
        else:
            return []

    def getlinkinfos(self):
        # File reading and parsing were done in __init__(); the parser
        # attribute records whether parsing succeeded.

        # If no parser was stored, fail.
        if not self.parser: return []
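        # Each entry returned below is a (link, rawlink, fragment) triple.
        # For a hypothetical href of "subdir/page.html#sec2" on a page whose
        # base is "http://host/docs/", that would be
        # ("http://host/docs/subdir/page.html", "subdir/page.html", "sec2").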

        rawlinks = self.parser.getlinks()
        base = urlparse.urljoin(self.url, self.parser.getbase() or "")
        infos = []
        for rawlink in rawlinks:
            t = urlparse.urlparse(rawlink)
            # DON'T DISCARD THE FRAGMENT! Instead, include
            # it in the tuples which are returned.  See Checker.dopage().
            fragment = t[-1]
            t = t[:-1] + ('',)
            rawlink = urlparse.urlunparse(t)
            link = urlparse.urljoin(base, rawlink)
            infos.append((link, rawlink, fragment))

        return infos


class MyStringIO(StringIO.StringIO):

    def __init__(self, url, info):
        self.__url = url
        self.__info = info
        StringIO.StringIO.__init__(self)

    def info(self):
        return self.__info

    def geturl(self):
        return self.__url


class MyURLopener(urllib.FancyURLopener):

    http_error_default = urllib.URLopener.http_error_default

    def __init__(*args):
        self = args[0]
        apply(urllib.FancyURLopener.__init__, args)
        self.addheaders = [
            ('User-agent', 'Python-webchecker/%s' % __version__),
            ]

    def http_error_401(self, url, fp, errcode, errmsg, headers):
        return None

    def open_file(self, url):
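        # Mimic the HTTP-daemon behavior described in the module docstring
        # for file: URLs -- serve index.html if the directory contains one,
        # otherwise synthesize a small HTML listing with one <A HREF> link
        # per directory entry.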
        path = urllib.url2pathname(urllib.unquote(url))
        if os.path.isdir(path):
            if path[-1] != os.sep:
                url = url + '/'
            indexpath = os.path.join(path, "index.html")
            if os.path.exists(indexpath):
                return self.open_file(url + "index.html")
            try:
                names = os.listdir(path)
            except os.error, msg:
                exc_type, exc_value, exc_tb = sys.exc_info()
                raise IOError, msg, exc_tb
            names.sort()
            s = MyStringIO("file:"+url, {'content-type': 'text/html'})
            s.write('<BASE HREF="file:%s">\n' %
                    urllib.quote(os.path.join(path, "")))
            for name in names:
                q = urllib.quote(name)
                s.write('<A HREF="%s">%s</A>\n' % (q, q))
            s.seek(0)
            return s
        return urllib.FancyURLopener.open_file(self, url)


class MyHTMLParser(sgmllib.SGMLParser):

    def __init__(self, url, verbose=VERBOSE, checker=None):
        self.myverbose = verbose # now unused
        self.checker = checker
        self.base = None
        self.links = {}
        self.names = []
        self.url = url
        sgmllib.SGMLParser.__init__(self)

    def check_name_id(self, attributes):
        """ Check the name or id attributes on an element.
        """
        # We must rescue the NAME or id (name is deprecated in XHTML)
        # attributes from the anchor, in order to
        # cache the internal anchors which are made
        # available in the page.
        for name, value in attributes:
            if name == "name" or name == "id":
                if value in self.names:
                    self.checker.message("WARNING: duplicate ID name %s in %s",
                                         value, self.url)
                else: self.names.append(value)
                break

    def unknown_starttag(self, tag, attributes):
        """ In XHTML, you can have id attributes on any element.
        """
        self.check_name_id(attributes)

    def start_a(self, attributes):
        self.link_attr(attributes, 'href')
        self.check_name_id(attributes)

    def end_a(self): pass

    def do_area(self, attributes):
        self.link_attr(attributes, 'href')
        self.check_name_id(attributes)

    def do_body(self, attributes):
        self.link_attr(attributes, 'background', 'bgsound')
        self.check_name_id(attributes)

    def do_img(self, attributes):
        self.link_attr(attributes, 'src', 'lowsrc')
        self.check_name_id(attributes)

    def do_frame(self, attributes):
        self.link_attr(attributes, 'src', 'longdesc')
        self.check_name_id(attributes)

    def do_iframe(self, attributes):
        self.link_attr(attributes, 'src', 'longdesc')
        self.check_name_id(attributes)

    def do_link(self, attributes):
        for name, value in attributes:
            if name == "rel":
                parts = value.lower().split()
                if (parts == ["stylesheet"]
                    or parts == ["alternate", "stylesheet"]):
                    self.link_attr(attributes, "href")
                    break
        self.check_name_id(attributes)

    def do_object(self, attributes):
        self.link_attr(attributes, 'data', 'usemap')
        self.check_name_id(attributes)

    def do_script(self, attributes):
        self.link_attr(attributes, 'src')
        self.check_name_id(attributes)

    def do_table(self, attributes):
        self.link_attr(attributes, 'background')
        self.check_name_id(attributes)

    def do_td(self, attributes):
        self.link_attr(attributes, 'background')
        self.check_name_id(attributes)

    def do_th(self, attributes):
        self.link_attr(attributes, 'background')
        self.check_name_id(attributes)

    def do_tr(self, attributes):
        self.link_attr(attributes, 'background')
        self.check_name_id(attributes)

    def link_attr(self, attributes, *args):
        for name, value in attributes:
            if name in args:
                if value: value = value.strip()
                if value: self.links[value] = None

    def do_base(self, attributes):
        for name, value in attributes:
            if name == 'href':
                if value: value = value.strip()
                if value:
                    if self.checker:
                        self.checker.note(1, "  Base %s", value)
                    self.base = value
        self.check_name_id(attributes)

    def getlinks(self):
        return self.links.keys()

    def getbase(self):
        return self.base


if __name__ == '__main__':
    main()