Python Web Crawler to Increase SEO Traffic & Alexa Ranking
I was searching for a web crawler tool on the net, but with no luck. I mean, yes, there are lots of web crawler/spider tools, but they are not what I was looking for. Luckily, I found a simple web crawler coded in Python, and I recoded it to suit my needs.
This simple Python web crawler will crawl the links returned by a Google search. We can set which string to search for, and if it finds any link containing that string, the site link will be crawled thoroughly. The links that are found are saved in an array and processed one by one; duplicates are skipped, and the crawler moves on to the next link until the array is empty.
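As a rough sketch of that bookkeeping (the names found_links and processed below are just placeholders; the full script is in step 2):

# minimal sketch of the "skip duplicates" logic described above
processed = []                                                        # links that have already been crawled
found_links = ["http://a.com/", "http://b.com/", "http://a.com/"]     # example links

for link in found_links:
    if link in processed:
        continue                    # duplicate, skip it and go to the next link
    processed.append(link)
    print "crawling " + link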
This Python crawler uses the httplib.HTTPConnection class and its HTTPConnection.request() method. For more information about this class, see the Python documentation for httplib.
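For reference, here is a minimal sketch of how httplib.HTTPConnection is typically used on Python 2 (the host www.example.com is only a placeholder):

import httplib

# open a connection to the host and request a single page
conn = httplib.HTTPConnection("www.example.com")
conn.request("GET", "/")
res = conn.getresponse()
print res.status, res.reason    # e.g. 200 OK
contents = res.read()           # the raw HTML of the page
conn.close()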
Alright, now we understand how this Python script works. So, what is the purpose of this crawler? The main advantage of using it is that we can increase the SEO traffic or Alexa rating of our website. It sounds weird, yes, but it really works for me. I tested it on one of my sites and crawled everything found on Google with the string "my_url_site". Logically, when someone clicks a link from Google, the link gets a point in Google's search algorithm, which is tracked by the Google Analytics code, and this traffic is also used by Alexa to calculate the site rank.
Here I share my experience. Before I started using this Python web crawler, my site was at about 210,000 in the Alexa rank. Then I ran this script about 2-3 times a week, or maybe just 3-4 times a month. After 4-5 months, my site is now at about 134,000 in the Alexa world rank. Look at the difference: in just 4 months my site got a really good rank in Alexa, and of course that means more visitors come to my site. I didn't change anything in the web code, HTML header description, or meta tags; I just didn't touch it at all.
So, if you want to follow this method, you can continue reading.
1/ I recommend using Python 2.7, as it includes many improvements. The httplib.HTTPConnection class is an old class for making HTTP requests and works fine on Python 2.0, 2.4, 2.5, or 2.6. The newer module most programmers use nowadays is urllib2 (a short urllib2 example is shown below for comparison).
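For comparison, the same kind of request with urllib2 looks like this (a minimal sketch; www.example.com is only a placeholder host):

import urllib2

# fetch one page with urllib2 instead of httplib
response = urllib2.urlopen("http://www.example.com/")
contents = response.read()      # the raw HTML of the page
print str(len(contents)) + " bytes downloaded"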
2/ Create a file crawler.py and copy this script:
url = "http://www.google.com"
depth = 5
search = "mysamplesite.com"
# get the parameters or use defaults
if (len(sys.argv) > 1):
url = sys.argv
if (len(sys.argv) > 2):
depth = int(sys.argv)
if (len(sys.argv) > 3):
search = sys.argv
processed = 
def searchURL(url, depth, search):
# only do http links
if (url.startswith("http://") and (not url in processed)):
url = url.replace("http://", "", 1)
# split out the url into host and doc
host = url
path = "/"
urlparts = url.split("/")
if (len(urlparts) > 1):
host = urlparts
path = url.replace(host, "", 1)
# make the first request
print "crawling host: " + host + " path: " + path
conn = httplib.HTTPConnection(host)
req = conn.request("GET", path)
res = conn.getresponse()
# find the links
contents = res.read()
m = re.findall('href="(.*?)"', contents)
if (search in contents):
print "Found " + search + " at " + url
print str(depth) + ": processing " + str(len(m)) + " links"
for href in m:
# do relative urls
href = "http://" + host + href
# follow the links
if (depth > 0):
searchURL(href, depth-1, search)
print "skipping " + url
searchURL(url, depth, search)
3/ Things you need to change to suit your needs. Look at these lines:

url = "http://www.google.com"    # the site we use as the starting point of the crawl
depth = 5                        # the depth of the link subdirectories to follow
search = "mysamplesite.com"      # the string we want to find in the crawled links, here our own site link

We can also exclude site links that do not need to be crawled by changing this line:

if (url.startswith("http://") and (not url in processed)):

into:

if (url.startswith("http://") and url.find('excludedsites.com') < 0 and (not url in processed)):
4/ We are ready to run the Python web crawler. The crawling time depends on the depth we set earlier: a greater depth means more time to crawl. A sample invocation is shown below.
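For example (the command-line arguments simply override the url, depth, and search defaults at the top of the script):

# run with the defaults hard-coded in crawler.py
python crawler.py

# or pass the start URL, depth, and search string on the command line
python crawler.py "http://www.google.com" 3 "mysamplesite.com"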