python - How to select only certain tags and text using XPath?


For example, given this HTML block:

<p><b>text1</b> (<span><a href="#1">asdf</a>text2</span>)</p> 

I need to select only the "a" tags; the rest must be plain text, as you would see it in the browser:

result = ["text1", " (", <tag_a>, "text2", ")"] 

or something like that.

I tried:

hxs.select('.//a|text()') 

In this case it finds all the "a" tags, but text is returned only from direct children.

At the same time:

hxs.select('.//text()|a') 

gets all the texts, but "a" tags only from direct children.
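The reason both attempts behave this way is that `|` unions two complete path expressions, so a leading `.//` applies only to the operand it is written in. Here is a minimal sketch of that, assuming lxml rather than the Scrapy selector `hxs` from the question (the XPath semantics should be the same); the `describe` helper is hypothetical and only makes the results easy to compare.

```python
from lxml import etree

# The sample block from the question, wrapped in <html><body> for the HTML parser.
doc = etree.HTML(
    '<html><body><p><b>text1</b> (<span><a href="#1">asdf</a>text2</span>)</p></body></html>'
)
p = doc.xpath('/html/body/p')[0]


def describe(nodes):
    """Render text nodes as plain strings and elements as '<tag>'."""
    return [str(n) if isinstance(n, str) else '<%s>' % n.tag for n in nodes]


# './/a' searches all descendants, but 'text()' matches only direct children of <p>.
first_attempt = describe(p.xpath('.//a|text()'))

# './/text()' searches all descendants, but 'a' matches only direct children of <p>.
second_attempt = describe(p.xpath('.//text()|a'))

# Writing './/' on both operands selects every <a> and every text node.
both_deep = describe(p.xpath('.//a|.//text()'))

print(first_attempt)
print(second_attempt)
print(both_deep)
```

Note that the last expression also returns the link's own text (`asdf`) as a separate item, which is not wanted in the desired result.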

Update

    elements = []
    for i in hxs.select('.//node()'):
        try:
            # name() works for element nodes only
            tag_name = i.select('name()').extract()[0]
        except TypeError:
            # text nodes end up here
            tag_name = '_text'

        if tag_name == 'a':
            elements.append(i)
        elif tag_name == '_text':
            elements.append(i.extract())

Is there a better way?

Is this the kind of thing you're looking for?

You can remove the descendant tags from the block using etree.strip_tags:

    from lxml import etree

    d = etree.HTML('<html><body><p><b>text1</b> (<span><a href="#1">asdf</a>text2</span>)</p></body></html>')
    block = d.xpath('/html/body/p')[0]

    # etree.strip_tags apparently takes a list of tags to strip, but that wasn't working for me
    for tag in set(x.tag for x in block.iterdescendants() if x.tag != 'a'):
        etree.strip_tags(block, tag)

    block.xpath('./text()|a')

This yields:

    ['text1', ' (', <Element a at fa4a48>, 'text2', ')']
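Since etree.strip_tags modifies the tree in place, a single XPath expression may be preferable when the original structure is still needed afterwards. The following is only a sketch of an alternative, not part of the answer above: it assumes lxml and filters out the text inside links with an `ancestor::a` predicate.

```python
from lxml import etree

doc = etree.HTML(
    '<html><body><p><b>text1</b> (<span><a href="#1">asdf</a>text2</span>)</p></body></html>'
)
block = doc.xpath('/html/body/p')[0]

# Select every <a> element, plus every text node that is not inside an <a>.
# The union is returned in document order.
nodes = block.xpath('.//a | .//text()[not(ancestor::a)]')

# Text nodes come back as strings; elements stay as lxml elements.
summary = [n if isinstance(n, str) else n.tag for n in nodes]
print(summary)
```

For the sample block this should give the same five items as the strip_tags version, while leaving the `<b>` and `<span>` elements in the tree.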
