Python: Replacing an HTML element depending on its content

- July 15, 2014

I have an HTML document in which some elements contain stuff that I want to hide (like the Chinese government is doing, except that I want to hide confidential information). For example, say I have:

<div>
    <span> bkhiu jknd o so  yui iou 789 </span>
    <span>
        bkhiu
        <div> 56 898tr secret oij890 </div>
    </span>
</div>

and I want to find all the elements that contain the string secret and replace their whole content with ###:

<div>
    <span> bkhiu jknd o so  yui iou 789 </span>
    <span>
        bkhiu
        <div>###</div>
    </span>
</div>

I have thought of using minidom and re:

    import re
    from xml.dom import minidom

    xmldoc = minidom.parseString(my_html_string)
    # filtering nodes by their content
    sensitive_nodes = filter(lambda n: re.search('secret', n.nodeValue),
                             xmldoc.getElementsByTagName())
    # replacing content
    for node in sensitive_nodes:
        node.nodeValue = '###'
    # output
    my_html_string = xmldoc.toxml()

But first, the parsing doesn't succeed:

    ExpatError: mismatched tag: line 27, column 6

And .getElementsByTagName() needs a tag name parameter, while I don't care about the tag name and need all the nodes (in order to filter them by their content). This code doesn't work at all, but it tries to explain what I want to achieve.

Any idea how I could do that? With minidom or something different?
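For what it is worth, both problems in the attempt above can be sidestepped in minidom, but only when the input happens to be well-formed XML. This is a minimal sketch under that assumption: the helper name censor_xml, its parameters, and the sample string are made up for illustration. It relies on "*" being accepted by getElementsByTagName() as a wildcard, and on the fact that an element's nodeValue is None, so the text has to be read from its child text nodes.

```python
import re
from xml.dom import minidom


def censor_xml(markup, pattern="secret", mask="###"):
    """Mask every text node matching `pattern` (hypothetical helper)."""
    # parseString raises ExpatError if the markup is not well-formed XML,
    # which is what happens with most real-world HTML.
    doc = minidom.parseString(markup)
    regex = re.compile(pattern)
    # "*" is the wildcard tag name: it matches every element in the document.
    for element in doc.getElementsByTagName("*"):
        for child in element.childNodes:
            # Element nodes have nodeValue None; text lives in TEXT_NODE children.
            if child.nodeType == child.TEXT_NODE and regex.search(child.data):
                child.data = mask
    # Serialize the root element only, without the XML declaration.
    return doc.documentElement.toxml()


sample = ("<div><span> bkhiu 789 </span>"
          "<span>bkhiu<div> 56 secret oij890 </div></span></div>")
censored = censor_xml(sample)
```

This does not help with the "mismatched tag" error itself: an XML parser will keep rejecting HTML whose tags are not properly closed, so a more tolerant HTML parser is still needed for that kind of input.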

OK, I have found a simple way, using BeautifulSoup:

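    import re
    # BeautifulSoup 3 import; with the newer bs4 package it is
    # "from bs4 import BeautifulSoup"
    from BeautifulSoup import BeautifulSoup

    soup = BeautifulSoup(my_html)
    # find every text node whose content contains "secret"
    nodes_to_censor = soup.findAll(text=re.compile('.*secret.*'))
    # swap each matching text node for the mask
    for node in nodes_to_censor:
        node.replaceWith('###')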
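If installing BeautifulSoup is not an option, roughly the same idea (walk the text runs with a parser that tolerates sloppy HTML, and mask the ones that match) can be sketched with the standard library's html.parser. This is only an illustration, not the method used above: the class name Censor, the function censor_html, and the sample string are assumptions made up for this example, and the output is re-emitted tag by tag rather than built from a tree.

```python
import re
from html.parser import HTMLParser


class Censor(HTMLParser):
    """Re-emit HTML, masking text runs that match a pattern (illustrative)."""

    def __init__(self, pattern="secret", mask="###"):
        # convert_charrefs=False keeps entities such as &amp; intact on output.
        super().__init__(convert_charrefs=False)
        self.regex = re.compile(pattern)
        self.mask = mask
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Copy the start tag through exactly as it appeared in the source.
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Replace the whole text run if it contains the pattern.
        self.out.append(self.mask if self.regex.search(data) else data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def censor_html(markup, pattern="secret", mask="###"):
    parser = Censor(pattern, mask)
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


# Unclosed <p> and <br>: input of this kind makes minidom raise ExpatError.
messy = "<div><p>public<br><span> top secret </span></div>"
cleaned = censor_html(messy)
```

One difference from a tree-based approach: a text run is split wherever an entity such as &amp; occurs, so only the piece that actually contains the pattern is masked.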
