Python HTML parsing with Beautiful Soup and filtering stop words


I am parsing specific information out of a website into a file. Right now the program I have looks at a webpage, finds the right HTML tag, and parses out the right contents. Now I want to further filter these "results".

For example, on the site: http://allrecipes.com/recipe/slow-cooker-pork-chops-ii/detail.aspx

I am parsing out the ingredients located in the <div class="ingredients"...> tag. The parser does the job nicely, but I want to further process these results.

When I run the parser, it removes numbers, symbols, commas, and slashes (\ or /) but leaves all the text. When I run it on the website I get results like:

cup olive oil cup chicken broth cloves garlic minced tablespoon paprika 

Now I want to further process this by removing stop words like "cup", "cloves", "minced", and "tablespoon", among others. How do I do this? The code is written in Python, and I am not very good at it. I am using the parser to get information that I could enter manually, but I would rather not.

Any help on how to do this, in detail, would be appreciated! My code is below.

Code:

    import urllib2
    import BeautifulSoup

    def main():
        url = "http://allrecipes.com/recipe/slow-cooker-pork-chops-ii/detail.aspx"
        data = urllib2.urlopen(url).read()
        bs = BeautifulSoup.BeautifulSoup(data)

        # find the ingredients block, then trim digits and symbols from the ends of each item
        ingreds = bs.find('div', {'class': 'ingredients'})
        ingreds = [s.getText().strip('123456789.,/\\ ') for s in ingreds.findAll('li')]

        # write one ingredient per line
        fname = 'porkrecipe.txt'
        with open(fname, 'w') as outf:
            outf.write('\n'.join(ingreds))

    if __name__=="__main__":
        main()

    import urllib2
    import BeautifulSoup
    import string

    badwords = set([
        'cup','cups',
        'clove','cloves',
        'tsp','teaspoon','teaspoons',
        'tbsp','tablespoon','tablespoons',
        'minced'
    ])

    def cleaningred(s):
        # remove leading and trailing whitespace
        s = s.strip()
        # remove numbers and punctuation from both ends of the string
        s = s.strip(string.digits + string.punctuation)
        # remove unwanted words
        return ' '.join(word for word in s.split() if word not in badwords)

    def main():
        url = "http://allrecipes.com/recipe/slow-cooker-pork-chops-ii/detail.aspx"
        data = urllib2.urlopen(url).read()
        bs = BeautifulSoup.BeautifulSoup(data)

        ingreds = bs.find('div', {'class': 'ingredients'})
        ingreds = [cleaningred(s.getText()) for s in ingreds.findAll('li')]

        fname = 'porkrecipe.txt'
        with open(fname, 'w') as outf:
            outf.write('\n'.join(ingreds))

    if __name__=="__main__":
        main()

This results in:

olive oil chicken broth garlic, paprika garlic powder poultry seasoning dried oregano dried basil thick cut boneless pork chops salt and pepper to taste

I don't know why it has left the comma in; s.strip(string.punctuation) should have taken care of that.
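A likely explanation for the stray comma: str.strip only removes characters from the two ends of a string. For an item such as "2 cloves garlic, minced" (an assumed input, inferred from the output above), the comma sits in the middle of the string when strip runs, and it only ends up at the end once "minced" has been filtered out. Below is a minimal sketch of one possible fix, which trims each word separately before checking it against the stop list. The name clean_ingredient and the sample string are made up for illustration and are not part of the original code.

```python
import string

# Same stop list as in the answer above.
BADWORDS = set([
    'cup', 'cups',
    'clove', 'cloves',
    'tsp', 'teaspoon', 'teaspoons',
    'tbsp', 'tablespoon', 'tablespoons',
    'minced',
])

# Characters to trim from both ends of every word.
JUNK = string.digits + string.punctuation


def clean_ingredient(s):
    """Hypothetical helper: trim each word, then drop stop words."""
    kept = []
    for word in s.split():
        # "garlic," becomes "garlic"; "2" and "1/2" become empty strings
        word = word.strip(JUNK)
        if word and word.lower() not in BADWORDS:
            kept.append(word)
    return ' '.join(kept)


if __name__ == "__main__":
    print(clean_ingredient("2 cloves garlic, minced"))  # prints: garlic
```

Because the trimming happens per word, punctuation inside a word (for example a hyphen in "thick-cut") is left alone, and only the characters at the edges of each word are removed.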

