mechanize - first Python script, a scraper; recommendations welcome
I have finished my first Python script, a scraper for election data from the Philippines. I do not have a programming background; I have used Stata for statistical analysis and have dabbled a bit in R lately, but I want to switch at some point. I want to learn Python to extract data from websites and other sources. So far I have browsed through the Python tutorial, and "Learning Python" from O'Reilly is waiting on my shelf. I wrote the following script, taking inspiration from other people's scripts and browsing the documentation of the included packages.
What I am looking for is general advice. The script works, but are there superfluous parts? Should I structure it differently? Are there typical (or plain dumb) beginner's mistakes?
I have compiled a few questions myself, which I have listed after the script.
```python
import mechanize
import lxml.html
import csv

site = "http://www.comelec.gov.ph/results/2004natl/2004electionresults_local.aspx"
br = mechanize.Browser()
response = br.open(site)
output = csv.writer(file(r'output.csv', 'wb'))

br.select_form(name="ctl00")
provinces = br.possible_items("provlist")
for prov in provinces:
    br.select_form(name="ctl00")
    br["provlist"] = [prov]
    response = br.submit()
    br.select_form(name="ctl00")
    pname = str(br.get_value_by_label("provlist")).strip("[]")
    municipalities = br.possible_items("munlist")
    for mun in municipalities:
        br.select_form(name="ctl00")
        br["munlist"] = [mun]
        response = br.submit(type="submit", name="ctl01")
        html = response.read()
        root = lxml.html.fromstring(html)
        try:
            table = root.get_element_by_id(id="dlistcandidates")
            data = [[td.text_content().strip() for td in row.findall("td")]
                    for row in table.findall("tr")]
        except KeyError:
            print "Results not available yet."
            data = [["." for i in range(5)]]
        br.select_form(name="ctl00")
        mname = str(br.get_value_by_label("munlist")).strip("[]")
        print pname, mname, data, "\n"
        for row in data:
            if row:
                row.append(pname)
                row.append(mname)
                output.writerow([s.encode('utf8') if type(s) is unicode else s for s in row])
```

When I execute the script, I get the message "DeprecationWarning: [item.name for item in self.items]". What is the reason for it, and should I worry about it?
I am looping over the provinces' number keys and fetching the name each time. Should I rather build a dictionary at the beginning and loop over that?
Is there an easy way to encode the "eñe" character (n with a tilde above it) directly as a normal n?
Instead of replacing "data" each time, how would I best collect all the rows and write them to the CSV file at the end? Would that be a better solution?
The site takes quite a while to respond to each request; getting the data takes about an hour. I could speed this up by executing several scripts and cutting up the provinces list. How would I go about sending parallel requests in one script? I want more data from the site, and it would be nice to speed up the process.
I have tried both BeautifulSoup and the lxml module, but I liked the lxml solution better. Which other modules are useful for these kinds of tasks?
Is there a central register of documentation/help files for both the built-in modules and others? It seemed to me the documentation is scattered everywhere, which is inconvenient. Writing help(something) often resulted in "something not found".
Any recommendations and critique are appreciated. English is not my native language; I hope I managed to keep the mistakes to a minimum.
The DeprecationWarning is coming from the mechanize module, and is being issued when you call possible_items. It's suggesting a better way to get the same effect; I don't know why the author didn't make it more explicit. As for building a dictionary at the start and looping over that: I don't think it makes any real difference.
You might want to look at http://effbot.org/zone/unicode-convert.htm .
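For the eñe character specifically, the standard library's unicodedata module can do this without third-party code: decompose the accented character and then drop the combining mark. A minimal sketch (Python 3 syntax; on Python 2 you would pass a unicode string and the helper name strip_accents is just illustrative):

```python
import unicodedata

def strip_accents(text):
    # NFKD decomposition splits "ñ" into "n" plus a combining tilde;
    # dropping the combining marks leaves the plain ASCII letter.
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents('Parañaque'))  # -> Paranaque
```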
Writing incrementally, as you're doing, looks fine to me. You could instead make a list of rows, append to it in the loop, and write the whole thing in one go at the end; the main advantage would be a slight increase in modularity. (Suppose you later wanted to do the same scraping but use the result in some other way; you could then reuse the code more easily.)
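Concretely, the loop would append to a list and a small helper would do all the writing at the end. The names below (scraped_rows, write_csv) are just illustrative, and this is Python 3 syntax (on Python 2 you would open the file in 'wb' mode without the newline argument):

```python
import csv

def write_csv(path, rows):
    # Write all collected rows in one go, after scraping has finished.
    with open(path, 'w', newline='') as f:
        csv.writer(f).writerows(rows)

scraped_rows = []
# ... inside the scraping loop you would do: scraped_rows.append(row)
scraped_rows.append(['CandidateX', '1234', 'Abra', 'Bangued'])
scraped_rows.append(['CandidateY', '5678', 'Abra', 'Bangued'])

write_csv('output.csv', scraped_rows)
```

Separating the scraping from the writing this way also makes the scraping part reusable if you later want the rows in some other format.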
(a) If the remote site is taking a long time to respond to each request, are you sure it will cope well with your hitting it with multiple requests in parallel at all? (b) You may want to check that the owners of the site in question don't object to this sort of scraping, both out of politeness and because if they do object they may notice what you're doing and block you. I'd guess that since it's a government site they're OK with it. (c) Take a look at the threading and multiprocessing modules in the Python standard library.

As for other useful modules: I don't know; sorry.
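A sketch of the parallel pattern using a thread pool (concurrent.futures is in the standard library from Python 3.2 onwards; for Python 2 there is a "futures" backport). The fetch_province function here is a stand-in for the real per-province scraping; note that a mechanize Browser is not safe to share between threads, so each worker should build its own Browser:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_province(province):
    # Stand-in for the real work: in the actual script this would
    # create its own mechanize.Browser(), submit the form for this
    # province, and return the scraped rows.
    return (province, 'rows for %s' % province)

provinces = ['Abra', 'Aklan', 'Albay', 'Antique']

# pool.map() preserves input order, so results line up with provinces.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fetch_province, provinces))

print(results[0])  # -> ('Abra', 'rows for Abra')
```

Keep max_workers small (a handful) so you don't hammer the site; the speedup comes from overlapping the slow waits for responses, not from raw CPU.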
No. (Unless you count Google.)
It looks as if you do a bit of back-and-forth just to determine the provinces and municipalities. If those don't change between invocations of the script, it might be worth saving them somewhere locally instead of asking the remote website every time. (The gain may not be worth the effort -- you might want to measure how long fetching that information actually takes.)
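A minimal sketch of that local caching, assuming the province list can be serialised as JSON; load_provinces and the cache filename are made-up names. The slow remote lookup is passed in as a function so it only runs on a cache miss:

```python
import json
import os

CACHE_FILE = 'provinces_cache.json'

def load_provinces(fetch_remote):
    # Use the cached copy if we have one; otherwise do the slow
    # remote lookup once and save the result for next time.
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE) as f:
            return json.load(f)
    provinces = fetch_remote()
    with open(CACHE_FILE, 'w') as f:
        json.dump(provinces, f)
    return provinces

provinces = load_provinces(lambda: ['Abra', 'Aklan', 'Albay'])
```

Delete the cache file whenever you want to force a fresh lookup.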
You might consider extracting the code that turns a blob of HTML into a list of candidates (if that's what it is) into a separate function.
You might also consider extracting something like this into a separate function:
```python
def select_item(br, form, listname, value, submit_form=None):
    br.select_form(form)
    br[listname] = [value]
    return br.submit(type="submit", name=(submit_form or form))
```

and maybe this:
```python
def get_name(br, formname, label):
    br.select_form(formname)
    return str(br.get_value_by_label(label)).strip("[]")
```