Python HTML parsing with Beautiful Soup and filtering stop words

- February 15, 2012

I am parsing specific information out of a website into a file. Right now the program I have looks at a webpage, finds the right HTML tag, and parses out the right contents. Now I want to further filter these "results".

For example, on this site: http://allrecipes.com/recipe/slow-cooker-pork-chops-ii/detail.aspx

I am parsing out the ingredients, which are located in the <div class="ingredients"...> tag. The parser does the job nicely, but I want to further process these results.

When I run the parser, it removes numbers, symbols, commas, and slashes (\ or /) but leaves all the text. When I run it on the website I get results like:

cup olive oil cup chicken broth cloves garlic minced tablespoon paprika

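Now I want to further process this by removing stop words such as "cup", "cloves", "minced", and "tablespoon", among others. How do I do this? The code is written in Python, which I am not very good at, and I am only using the parser to collect information that I could enter manually but would rather not.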

Any help on how to do this, in detail, would be appreciated! My code is below.

Code:

    import urllib2
    import BeautifulSoup

    def main():
        url = "http://allrecipes.com/recipe/slow-cooker-pork-chops-ii/detail.aspx"
        data = urllib2.urlopen(url).read()
        bs = BeautifulSoup.BeautifulSoup(data)

        # find the ingredients block, then trim each list item
        ingreds = bs.find('div', {'class': 'ingredients'})
        ingreds = [s.getText().strip('123456789.,/\ ') for s in ingreds.findAll('li')]

        # write one ingredient per line
        fname = 'porkrecipe.txt'
        with open(fname, 'w') as outf:
            outf.write('\n'.join(ingreds))

    if __name__=="__main__":
        main()
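To see what the strip call in the code above does on its own, here is a minimal sketch with no network access. The sample ingredient strings are made up for illustration and are not taken from the page. Note that str.strip only removes the listed characters from the two ends of a string, and that the character set used above does not include the digit 0.

```python
# Character set from the code above: digits 1-9, '.', ',', '/', backslash and space.
CHARS = '123456789.,/\\ '

def trim(s):
    # str.strip removes these characters from both ends only, never from the middle.
    return s.strip(CHARS)

# Hypothetical sample strings, for illustration only.
print(trim('1/4 cup olive oil'))   # cup olive oil
print(trim('10 cloves garlic'))    # 0 cloves garlic  (0 is not in CHARS)
```

Using string.digits instead of a hand-typed list of digits would avoid the leftover 0.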

    import urllib2
    import BeautifulSoup
    import string

    badwords = set([
        'cup','cups',
        'clove','cloves',
        'tsp','teaspoon','teaspoons',
        'tbsp','tablespoon','tablespoons',
        'minced'
    ])

    def cleaningred(s):
        # remove leading and trailing whitespace
        s = s.strip()
        # remove numbers and punctuation at the ends of the string
        s = s.strip(string.digits + string.punctuation)
        # remove unwanted words
        return ' '.join(word for word in s.split() if not word in badwords)

    def main():
        url = "http://allrecipes.com/recipe/slow-cooker-pork-chops-ii/detail.aspx"
        data = urllib2.urlopen(url).read()
        bs = BeautifulSoup.BeautifulSoup(data)

        ingreds = bs.find('div', {'class': 'ingredients'})
        ingreds = [cleaningred(s.getText()) for s in ingreds.findAll('li')]

        fname = 'porkrecipe.txt'
        with open(fname, 'w') as outf:
            outf.write('\n'.join(ingreds))

    if __name__=="__main__":
        main()

This results in:

olive oil chicken broth garlic, paprika garlic powder poultry seasoning dried oregano dried basil thick cut boneless pork chops salt and pepper to taste

I don't know why it has left a comma in there; s.strip(string.punctuation) should have taken care of that.
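A likely explanation for the leftover comma: str.strip only trims characters from the two ends of the whole string, so a comma in the middle (as in a line like "2 cloves garlic, minced") stays attached to "garlic," once "minced" is dropped. A minimal sketch of one way around it is to trim each word separately. The function name clean_ingredient and the sample strings are hypothetical, chosen for illustration; the word list is the one from the code above.

```python
import string

# Same stop-word list as in the code above.
BADWORDS = set([
    'cup', 'cups',
    'clove', 'cloves',
    'tsp', 'teaspoon', 'teaspoons',
    'tbsp', 'tablespoon', 'tablespoons',
    'minced',
])

def clean_ingredient(s):
    """Hypothetical variant of cleaningred that trims every word on its own."""
    words = []
    for word in s.split():
        # Trim digits and punctuation from both ends of this single word.
        word = word.strip(string.digits + string.punctuation)
        # Skip words that became empty (e.g. '1/4') and unwanted words.
        if word and word.lower() not in BADWORDS:
            words.append(word)
    return ' '.join(words)

# Hypothetical sample strings, for illustration only.
print(clean_ingredient('2 cloves garlic, minced'))  # garlic
print(clean_ingredient('1/4 cup olive oil'))        # olive oil
```

Punctuation inside a word (for example a hyphen in "thick-cut") is left alone by this approach, since only the ends of each word are trimmed.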
