Python: Replacing an HTML element depending on its content
I have an HTML document in which some elements contain stuff I want to hide (like the Chinese government is doing, except that I want to hide confidential information). For example, say I have:
<div>
    <span> bkhiu jknd o so yui iou 789 </span>
    <span> bkhiu
        <div> 56 898tr secret oij890 </div>
    </span>
</div>

and I want to find all the elements that contain the string secret, and replace their whole content with ###:
<div>
    <span> bkhiu jknd o so yui iou 789 </span>
    <span> bkhiu
        <div>###</div>
    </span>
</div>

I have thought of using minidom and re:
import re
from xml.dom import minidom

xmldoc = minidom.parseString(my_html_string)
# filtering nodes by their content
sensitive_nodes = filter(lambda n: re.search('secret', n.nodeValue), xmldoc.getElementsByTagName())
# replacing content
for node in sensitive_nodes:
    node.nodeValue = '###'
# output
my_html_string = xmldoc.toxml()

But first, the parsing doesn't succeed:
ExpatError: mismatched tag: line 27, column 6

and .getElementsByTagName() needs a tag name parameter ... while I don't care about the tag name and need all the nodes (in order to filter by their content). This code doesn't work at all; it is only there to try to explain what I want to achieve.
Any idea how to do this? With minidom, or something completely different?
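A note on the minidom attempt, for well-formed input only: getElementsByTagName accepts "*" to match every element, and the text lives in child text nodes rather than in the element's own nodeValue (which is None for elements). The sketch below assumes the markup parses as XML, so it would not help with the mismatched-tag error above; the function name censor and its parameters are made up for illustration.

```python
from xml.dom import minidom


def censor(xml_string, needle="secret", mask="###"):
    """Replace every text node containing `needle` with `mask`.

    Hypothetical helper; only works when the input is well-formed XML.
    """
    doc = minidom.parseString(xml_string)
    # "*" matches all elements, whatever their tag name is.
    for element in doc.getElementsByTagName("*"):
        for child in element.childNodes:
            # Elements carry no text themselves; inspect their text children.
            if child.nodeType == child.TEXT_NODE and needle in child.data:
                child.data = mask
    # Serialize the root element only, without the XML declaration.
    return doc.documentElement.toxml()


sample = (
    "<div> <span> bkhiu jknd o so yui iou 789 </span> "
    "<span> bkhiu <div> 56 898tr secret oij890 </div> </span> </div>"
)
censored = censor(sample)
print(censored)
```

Real-world HTML is often not well-formed, which is why a forgiving HTML parser tends to be the more practical choice here.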
OK ... I have found a very simple way, using BeautifulSoup:
import re
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 import path

soup = BeautifulSoup(my_html)
# find every text node whose content matches the pattern
nodes_to_censor = soup.findAll(text=re.compile('.*secret.*'))
for node in nodes_to_censor:
    node.replaceWith('###')
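For anyone reading this with a current setup: the snippet above uses the BeautifulSoup 3 names. A rough equivalent for Beautiful Soup 4 might look like the sketch below, assuming the bs4 package is installed and using the standard-library "html.parser" backend. The sample string and variable names are only illustrative.

```python
import re

from bs4 import BeautifulSoup  # assumes Beautiful Soup 4 is installed

my_html = (
    "<div> <span> bkhiu jknd o so yui iou 789 </span> "
    "<span> bkhiu <div> 56 898tr secret oij890 </div> </span> </div>"
)

soup = BeautifulSoup(my_html, "html.parser")

# find_all(string=...) returns the matching text nodes as a list,
# so replacing them while looping does not disturb the iteration.
for node in soup.find_all(string=re.compile("secret")):
    node.replace_with("###")

result = str(soup)
print(result)
```

The pattern is applied as a search, so a plain 'secret' should behave the same as '.*secret.*' here.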