mechanize - first Python script, a scraper; recommendations welcome
I have finished my first Python script, a scraper for election data from the Philippines. I do not have a programming background; I have used Stata for statistical analysis and have dabbled a bit in R lately, but I want to switch at some point. I want to learn Python to extract data from websites and other sources. So far I have browsed through the Python tutorial, and "Learning Python" from O'Reilly is waiting on my shelf. I wrote the following script, taking inspiration from other people's scripts and browsing the documentation of the included packages.
What I am looking for is general advice. The script works, but are there superfluous parts? Should I structure it differently? Are there typical (or plain dumb) beginner's mistakes?
I have compiled a few questions myself, which I have listed after the script.
```python
import mechanize
import lxml.html
import csv

site = "http://www.comelec.gov.ph/results/2004natl/2004electionresults_local.aspx"
br = mechanize.Browser()
response = br.open(site)
output = csv.writer(file(r'output.csv', 'wb'))

br.select_form(name="ctl00")
provinces = br.possible_items("provlist")
for prov in provinces:
    br.select_form(name="ctl00")
    br["provlist"] = [prov]
    response = br.submit()
    br.select_form(name="ctl00")
    pname = str(br.get_value_by_label("provlist")).strip("[]")
    municipalities = br.possible_items("munlist")
    for mun in municipalities:
        br.select_form(name="ctl00")
        br["munlist"] = [mun]
        response = br.submit(type="submit", name="ctl01")
        html = response.read()
        root = lxml.html.fromstring(html)
        try:
            table = root.get_element_by_id(id="dlistcandidates")
            data = [[td.text_content().strip() for td in row.findall("td")]
                    for row in table.findall("tr")]
        except KeyError:
            print "Results not available yet."
            data = [["." for i in range(5)]]
        br.select_form(name="ctl00")
        mname = str(br.get_value_by_label("munlist")).strip("[]")
        print pname, mname, data, "\n"
        for row in data:
            if row:
                row.append(pname)
                row.append(mname)
                output.writerow([s.encode('utf8') if type(s) is unicode else s for s in row])
```

When I execute the script, I get the message "DeprecationWarning: [item.name for item in self.items]". What is the reason for it, and should I worry about it?
I am looping over the provinces' number keys and fetching the name each time. Should I rather build a dictionary at the beginning and loop over that?
Is there an easy way to encode the "eñe" character (n with a tilde above it) directly as a normal n?
Instead of replacing "data" each time, how would I best collect all the rows and write them to the CSV file at the end? Would that be a better solution?
The site takes quite a while to respond to each request; getting the data takes about an hour. I could speed this up by executing several scripts and cutting up the provinces list. How would I go about sending parallel requests in one script? I want more data from the site, and it would be nice to speed up the process.
I have tried both BeautifulSoup and the lxml module, but I liked the lxml solution better. Which other modules are useful for these kinds of tasks?
Is there a central register of documentation/help files for both the built-in modules and others? It seemed to me the documentation is scattered everywhere, which is inconvenient. Writing help(something) often resulted in "something not found".
Any recommendations and critique are appreciated. English is not my native language; I hope I managed to keep the mistakes to a minimum.
The DeprecationWarning is coming from the mechanize module, and is being issued when you call possible_items. It's suggesting a better way to get the same effect; I don't know why the author didn't make it more explicit. As for building a dictionary at the start and looping over that: I don't think it makes any real difference.
You might want to look at http://effbot.org/zone/unicode-convert.htm .
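For the eñe character specifically, the standard library's unicodedata module can do this without third-party code: decompose the accented character and then drop the combining mark. A minimal sketch (Python 3 syntax; on Python 2 you would pass a unicode string and the helper name strip_accents is just illustrative):

```python
import unicodedata

def strip_accents(text):
    # NFKD decomposition splits "ñ" into "n" plus a combining tilde;
    # dropping the combining marks leaves the plain ASCII letter.
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents('Parañaque'))  # -> Paranaque
```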
Writing incrementally, as you're doing, looks fine to me. You could instead make a list of rows, append to it in the loop, and write the whole thing in one go at the end; the main advantage would be a slight increase in modularity. (Suppose you later wanted to do the same scraping but use the result in some other way; you could then reuse the code more easily.)
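Concretely, the loop would append to a list and a small helper would do all the writing at the end. The names below (scraped_rows, write_csv) are just illustrative, and this is Python 3 syntax (on Python 2 you would open the file in 'wb' mode without the newline argument):

```python
import csv

def write_csv(path, rows):
    # Write all collected rows in one go, after scraping has finished.
    with open(path, 'w', newline='') as f:
        csv.writer(f).writerows(rows)

scraped_rows = []
# ... inside the scraping loop you would do: scraped_rows.append(row)
scraped_rows.append(['CandidateX', '1234', 'Abra', 'Bangued'])
scraped_rows.append(['CandidateY', '5678', 'Abra', 'Bangued'])

write_csv('output.csv', scraped_rows)
```

Separating the scraping from the writing this way also makes the scraping part reusable if you later want the rows in some other format.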
(a) If the remote site is taking a long time to respond to each request, are you sure it will cope well with your hitting it with multiple requests in parallel at all? (b) You may want to check that the owners of the site in question don't object to this sort of scraping, both out of politeness and because if they do object they may notice what you're doing and block you. I'd guess that since it's a government site they're OK with it. (c) Take a look at the threading and multiprocessing modules in the Python standard library.

As for other useful modules: I don't know; sorry.
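A sketch of the parallel pattern using a thread pool (concurrent.futures is in the standard library from Python 3.2 onwards; for Python 2 there is a "futures" backport). The fetch_province function here is a stand-in for the real per-province scraping; note that a mechanize Browser is not safe to share between threads, so each worker should build its own Browser:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_province(province):
    # Stand-in for the real work: in the actual script this would
    # create its own mechanize.Browser(), submit the form for this
    # province, and return the scraped rows.
    return (province, 'rows for %s' % province)

provinces = ['Abra', 'Aklan', 'Albay', 'Antique']

# pool.map() preserves input order, so results line up with provinces.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fetch_province, provinces))

print(results[0])  # -> ('Abra', 'rows for Abra')
```

Keep max_workers small (a handful) so you don't hammer the site; the speedup comes from overlapping the slow waits for responses, not from raw CPU.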
No. (Unless you count Google.)
It looks as if you do a bit of back-and-forth just to determine the provinces and municipalities. If those don't change between invocations of the script, it might be worth saving them somewhere locally instead of asking the remote website every time. (The gain may not be worth the effort -- you might want to measure how long fetching that information actually takes.)
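A minimal sketch of that local caching, assuming the province list can be serialised as JSON; load_provinces and the cache filename are made-up names. The slow remote lookup is passed in as a function so it only runs on a cache miss:

```python
import json
import os

CACHE_FILE = 'provinces_cache.json'

def load_provinces(fetch_remote):
    # Use the cached copy if we have one; otherwise do the slow
    # remote lookup once and save the result for next time.
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE) as f:
            return json.load(f)
    provinces = fetch_remote()
    with open(CACHE_FILE, 'w') as f:
        json.dump(provinces, f)
    return provinces

provinces = load_provinces(lambda: ['Abra', 'Aklan', 'Albay'])
```

Delete the cache file whenever you want to force a fresh lookup.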
You might consider extracting the code that turns a blob of HTML into a list of candidates (if that's what it is) into a separate function.
You might also consider extracting something like this into a separate function:
```python
def select_item(br, form, listname, value, submit_form=None):
    br.select_form(form)
    br[listname] = [value]
    return br.submit(type="submit", name=(submit_form or form))
```

and maybe this:
```python
def get_name(br, formname, label):
    br.select_form(formname)
    return str(br.get_value_by_label(label)).strip("[]")
```