python - How to select only certain tags and text using XPath?
For example, given this HTML block:

    <p><b>text1</b> (<span><a href="#1">asdf</a>text2</span>)</p>

I need to select the "a" tags, while everything else must be plain text, as seen in the browser:

    result = ["text1", " (", <tag_a>, "text2", ")"]

or something like that.
I tried:

    hxs.select('.//a|text()')

In this case it finds all the "a" tags, but text is returned only from direct children.
At the same time:

    hxs.select('.//text()|a')

gets all the texts, but the "a" tags only from direct children.
Update:

    elements = []
    for i in hxs.select('.//node()'):
        try:
            tag_name = i.select('name()').extract()[0]
        except TypeError:
            tag_name = '_text'
        if tag_name == 'a':
            elements.append(i)
        elif tag_name == '_text':
            elements.append(i.extract())

Is there a better way?
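For comparison, the node-walking idea in the update can be sketched without Scrapy, using only the standard library's xml.etree.ElementTree. This is a rough illustration rather than a drop-in replacement: it assumes the fragment is well-formed XML, and flatten_keep is a hypothetical helper name, not part of any library.

```python
import xml.etree.ElementTree as ET


def flatten_keep(elem, keep="a"):
    """Return text fragments and whole `keep` elements in document order."""
    out = []
    if elem.text:
        out.append(elem.text)
    for child in elem:
        if child.tag == keep:
            # Keep the element itself; its inner text is not expanded.
            out.append(child)
        else:
            # Any other tag is flattened into its text and kept descendants.
            out.extend(flatten_keep(child, keep))
        if child.tail:
            # Text that follows the child inside the current element.
            out.append(child.tail)
    return out


block = ET.fromstring('<p><b>text1</b> (<span><a href="#1">asdf</a>text2</span>)</p>')
parts = flatten_keep(block)
print([p if isinstance(p, str) else "<%s>" % p.tag for p in parts])
```

Each kept element is still a live Element, so attributes such as href remain available on it.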
Is this the kind of thing you're looking for?
You can remove the descendant tags from the block using etree.strip_tags:
    from lxml import etree

    d = etree.HTML('<html><body><p><b>text1</b> (<span><a href="#1">asdf</a>text2</span>)</p></body></html>')
    block = d.xpath('/html/body/p')[0]

    # etree.strip_tags apparently takes a list of tags to strip, but that wasn't working for me
    for tag in set(x.tag for x in block.iterdescendants() if x.tag != 'a'):
        etree.strip_tags(block, tag)

    block.xpath('./text()|a')

which yields:

    ['text1', ' (', <Element a at fa4a48>, 'text2', ')']
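As a possible alternative to stripping tags, a single XPath expression may be enough: take every "a" element plus every text node that has no "a" ancestor. The sketch below assumes lxml under Python 3 and reuses the sample markup from the question. The expression is a suggestion that was not part of the original answer.

```python
from lxml import etree

doc = etree.HTML(
    '<html><body><p><b>text1</b> (<span><a href="#1">asdf</a>text2</span>)</p></body></html>'
)
block = doc.xpath('/html/body/p')[0]

# .//a                         -> every <a> element below the block
# .//text()[not(ancestor::a)]  -> every text node that is not inside an <a>
nodes = block.xpath('.//a | .//text()[not(ancestor::a)]')

for node in nodes:
    if isinstance(node, str):
        # Text results are str subclasses in lxml under Python 3.
        print(repr(str(node)))
    else:
        print(node.tag, node.get('href'))
```

Unlike etree.strip_tags, this leaves the parsed tree unmodified, so the same block can still be queried for other elements afterwards.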