python - How to modify lxml autolink to be more liberal?


I am using the autolink function of the great lxml library, documented here: http://lxml.de/api/lxml.html.clean-module.html

My problem is that it only detects URLs that start with http://. I would like to use a broader URL detection regex, such as this one: http://daringfireball.net/2010/07/improved_regex_for_matching_urls

I tried to make that regex work with the lxml autolink function, without success. I end up with:

lxml\html\clean.py", line 571, in _link_text
    host = match.group('host')
IndexError: no such group

Are there any Python/regex gurus out there who know how to make this work?

There are two things to do in order to adapt the regexp to lxml's autolink. First, wrap the entire URL pattern in a (?P<body> .. ) group. This lets lxml know what goes inside the href="" attribute.
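To illustrate the role of the body group, here is a minimal sketch that uses only the standard re module. It is not lxml's implementation: the simplified pattern and the linkify helper are assumptions made up for this example. It only shows that whatever the named group body captures is what ends up in the href.

```python
import html
import re

# Simplified, hypothetical URL pattern (not the Gruber regex): the whole URL
# is wrapped in a named group called "body". The final character class keeps
# trailing punctuation such as a sentence-ending period out of the match.
pattern = re.compile(r"\b(?P<body>(?:https?://|www\.)[^\s<>]+[^\s<>.,;:!?])")


def linkify(text):
    """Wrap every match in an <a> tag, using the "body" group as the href."""
    def replace(match):
        url = match.group("body")
        return '<a href="%s">%s</a>' % (html.escape(url, quote=True), html.escape(url))
    return pattern.sub(replace, text)


print(linkify("go to www.example.com/a."))
```

Because the trailing period falls outside the body group, it stays outside the generated link.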

Next, wrap the host part in a (?P<host> .. ) group and pass the avoid_hosts=[] parameter when you call the autolink function. The reason is that the regexp pattern you're using doesn't always find a host (sometimes the host part will be None), since it also matches partial URLs and ambiguous URL-like patterns.
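The difference between a missing host group and an unmatched one can be reproduced with the standard re module alone. The sketch below uses two small, hypothetical patterns (not the full regex): asking for a group name the pattern does not define raises the IndexError from the question, while an optional group that took no part in the match simply yields None.

```python
import re

# Hypothetical pattern with no "host" group at all.
no_host = re.compile(r"(?P<body>\w+://\S+)")

# Hypothetical pattern where "host" exists only in the bare-domain branch,
# so it is None whenever the scheme branch produced the match.
with_host = re.compile(
    r"(?P<body>\w+://\S+|(?P<host>[a-z0-9.\-]+[.][a-z]{2,4})/\S*)"
)


def host_of(pattern, text):
    """Return the "host" group of the first match.

    Returns "no match" if nothing matches and "missing" if the pattern
    defines no group named "host".
    """
    match = pattern.search(text)
    if match is None:
        return "no match"
    try:
        return match.group("host")
    except IndexError:  # "no such group"
        return "missing"


print(host_of(no_host, "see rdar://1234"))    # group not defined
print(host_of(with_host, "see rdar://1234"))  # group defined but unmatched
print(host_of(with_host, "see bit.ly/foo"))   # group matched
```

A None host is presumably why the avoid_hosts=[] argument is needed: with an empty list there are no host patterns to test against a value that may be None.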

I've modified the regexp to include the above changes, and here is a snippet with a test case:

# -*- coding: utf-8 -*-
import re
import lxml.html
import lxml.html.clean

# Liberal URL regex with the named groups autolink expects:
# (?P<body>...) around the whole URL, (?P<host>...) around the bare-domain branch.
url_regexp = re.compile(r"""(?i)\b(?P<body>(?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|(?P<host>[a-z0-9.\-]+[.][a-z]{2,4}/))(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))""")

doc = """<html><body>
    http://foo.com/blah_blah
    http://foo.com/blah_blah/.
    http://www.extinguishedscholar.com/wpglob/?p=364.
    http://✪df.ws/1234
    rdar://1234
    rdar:/1234
    message://%3c330e7f840905021726r6a4ba78dkf1fd71420c1bf6ff@mail.gmail.com%3e
    &lt;mailto:gruber@daringfireball.net?subject=test&gt; (including brokets).
    bit.ly/foo
</body></html>"""

tree = lxml.html.fromstring(doc)
body = tree.find('body')
# avoid_hosts=[] disables the host check, because the host group may be None.
lxml.html.clean.autolink(body, [url_regexp], avoid_hosts=[])
print(lxml.html.tostring(tree))

Output:

<html><body>
    <a href="http://foo.com/blah_blah">http://foo.com/blah_blah</a>
    <a href="http://foo.com/blah_blah/">http://foo.com/blah_blah/</a>.
    <a href="http://www.extinguishedscholar.com/wpglob/?p=364">http://www.extinguishedscholar.com/wpglob/?p=364</a>.
    <a href="http://%c3%a2%c2%9c%c2%aadf.ws/1234">http://&#226;&#156;&#170;df.ws/1234</a>
    <a href="rdar://1234">rdar://1234</a>
    <a href="rdar:/1234">rdar:/1234</a>
    <a href="message://%3c330e7f840905021726r6a4ba78dkf1fd71420c1bf6ff@mail.gmail.com%3e">message://%3c330e7f840905021726r6a4ba78dkf1fd71420c1bf6ff@mail.gmail.com%3e</a>
    &lt;<a href="mailto:gruber@daringfireball.net?subject=test">mailto:gruber@daringfireball.net?subject=test</a>&gt; (including brokets).
    <a href="bit.ly/foo">bit.ly/foo</a>
</body></html>
