Starting with Scraping in Python

Scraping is one of those things you end up needing all the time, whether for extracting articles, pulling data out of tables, or something else. I recently started scraping and used the mechanize library together with Beautiful Soup for processing the data. There is also the requests library for Python, which is well suited to almost any use, but it is a little harder to get started with than mechanize.

Mechanize makes it very easy to open URLs, submit forms on any page, or extract links. Doing the same with plain libraries such as urllib is a real pain. First you have to open the link, view the source code, note the field names, fill them in correctly, and then submit everything to the right URL.

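# Python 2 (urllib2 became urllib.request in Python 3)
import urllib, urllib2
# Field names have to be copied by hand from the form's HTML source
data = urllib.urlencode({'field1': 'value', 'field2': 'value', 'field3': 'value'})
headers = {'User-Agent': 'Mozilla something', 'Cookie': 'name=value; name2=value2'}
# Passing data makes this a POST request
req = urllib2.Request("http://example.com/form/submit/url", data, headers)
response = urllib2.urlopen(req)
# do something with response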

And even after all that, you may find it amounted to nothing because the website uses CSRF protection. Alas!
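
To make the CSRF problem concrete: a protected form usually carries a hidden input with a token, and the server rejects any submission that does not send it back. Below is a minimal sketch of what handling that by hand could look like, using only the standard library. It is written for Python 3, and the form markup, the `csrf_token` field name and the helper names `HiddenFieldCollector` and `build_post_body` are all made up for illustration. A real site would typically also tie the token to a session cookie, which this sketch does not cover.

```python
# Python 3 sketch: collect a form's hidden fields (for example a CSRF token)
# so they can be sent back together with the values you fill in yourself.
from html.parser import HTMLParser
from urllib.parse import urlencode

# Hypothetical form markup; a real page would have to be downloaded first.
SAMPLE_FORM = """
<form name="the_form" action="/form/submit/url" method="post">
  <input type="hidden" name="csrf_token" value="abc123">
  <input type="text" name="field1">
  <input type="text" name="field2">
</form>
"""


class HiddenFieldCollector(HTMLParser):
    """Remembers the name and value of every <input type="hidden">."""

    def __init__(self):
        super().__init__()
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("type") == "hidden" and "name" in attrs:
            self.hidden[attrs["name"]] = attrs.get("value") or ""


def build_post_body(form_html, user_values):
    """Merge the form's hidden fields with the user-supplied values."""
    collector = HiddenFieldCollector()
    collector.feed(form_html)
    fields = dict(collector.hidden)
    fields.update(user_values)
    # Sort the fields so the encoded body is the same on every run.
    return urlencode(sorted(fields.items()))


body = build_post_body(SAMPLE_FORM, {"field1": "value", "field2": "value"})
print(body)
```

The resulting string is what would be passed as the POST data of the request. Every extra hidden field a site adds means more of this bookkeeping.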

The good news is that mechanize takes care of most of this for you. All you have to do is find the URL, find the form, and note its identifier, and you are good to go. Mechanize parses the form, including hidden fields such as CSRF tokens, and keeps cookies between requests, so those values are sent back automatically when you submit. That gets you past many of the common measures websites use to keep bots out, though not all of them.

import mechanize
browser = mechanize.Browser()
browser.open('http://example.com/form/')
browser.select_form(name='the_form')
browser['field1'] = 'value'
browser['field2'] = 'value'
browser['field3'] = 'value'
browser.submit()

Once you have your data, Beautiful Soup makes it very easy to parse the HTML and walk through its tree of tags. Note that it only parses the markup; it does not run any JavaScript on the page.

from bs4 import BeautifulSoup
soup = BeautifulSoup(browser.response().read(), 'html.parser')
body_tag = soup.body
all_paragraphs = soup.find_all('p')
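
Since table extraction and counting matches come up so often, here is a small sketch of both with Beautiful Soup. It runs on an inline HTML string rather than a live page, is written for Python 3, and assumes the `bs4` package is installed. The markup, the `product` and `ad` class names, and the values in the table are invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical search-result markup, standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
  <p>Showing results for <b>some brand</b></p>
  <a class="product" href="/p/1">Model A</a>
  <a class="product" href="/p/2">Model B</a>
  <a class="ad" href="/ads/9">Sponsored</a>
  <table>
    <tr><th>Model</th><th>Price</th></tr>
    <tr><td>Model A</td><td>100</td></tr>
    <tr><td>Model B</td><td>200</td></tr>
  </table>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all returns a list-like ResultSet, so len() gives the match count.
products = soup.find_all("a", class_="product")
product_names = [a.get_text().strip() for a in products]
product_count = len(products)

# Each table row becomes a list of cell texts; the header row has no <td>,
# so it produces an empty list and is skipped.
rows = []
for tr in soup.find_all("tr"):
    cells = [td.get_text().strip() for td in tr.find_all("td")]
    if cells:
        rows.append(cells)

print(product_count, product_names)
print(rows)
```

One thing to keep in mind is that `find_all` always gives back a list of tags, even for a single match, so call `get_text()` on the individual tags and not on the list itself.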

Here is some sample code I wrote to find the count of phones of a particular brand on Flipkart. It is mostly self-explanatory.

import mechanize  
from bs4 import BeautifulSoup  
br = mechanize.Browser()  
# Start of browser options that make the script look like a regular browser
br.set_handle_equiv(True)  
br.set_handle_redirect(True)  
br.set_handle_referer(True)  
br.set_handle_robots(False)  
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)  
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Fedora/3.0.1-1.fc9 Firefox/3.0.1')]  
# End of browser options
br.open("http://www.flipkart.com/")   # Opens the url  
br.select_form(nr=1)    #selects the second form in the html code

# Alternatively, use the commented-out code below to pick a form by its id
'''
for form in br.forms():
    if form.attrs.get('id') == 'fk-top-search-box':   # .get avoids a KeyError on forms without an id
        br.form = form
        break
'''
br['q'] = 'motorola'                 # fills the input named q with the term you want to search for
response = br.submit().read()        # br.submit() submits the form, then read() returns the HTML of the results page
soup = BeautifulSoup(response, 'html.parser')       # parse the HTML into a bs4 object
# find_all returns a list of matching tags, so print the text of each one
results = soup.find_all("a", class_="Android Phones")
for result in results:
    print result.getText().strip()

I hope you find this post useful. Feel free to drop a comment with any suggestions or questions; I will be happy to help.
Cheers and happy coding!