franx47

Python Web Crawler to Increase SEO Traffic & Alexa Ranking

Posted in Anything, Python Programming, Web Crawler by franx47 on February 4, 2013

Hi,

I searched the net for a web crawler tool that fit my needs, but had no luck. There are plenty of web crawler/spider tools out there, but none of them did what I was looking for. Luckily, I found a simple web crawler written in Python and recoded it to suit my needs.

This simple Python web crawler follows the links it finds, starting from a Google search page. We can set a string to look for: whenever a crawled page contains that string, the script reports it, and the links on each page are followed down to the configured depth. The links that have been found are kept in a list and processed one by one; a link that has already been processed is skipped, and the crawl moves on to the next one until no links are left.
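The bookkeeping described above (a list of found links, processed one by one, duplicates skipped) can be sketched without any network access. This is only an illustration, written for Python 3: the `LINKS` dictionary is a made-up stand-in for the pages a crawler would download, and `crawl_order` is a hypothetical helper name. The actual script uses recursion rather than an explicit queue, so its visiting order can differ.

```python
from collections import deque

# Hypothetical link graph standing in for real pages: page -> links found on it.
LINKS = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://a.example/", "http://c.example/"],
    "http://c.example/": [],
}

def crawl_order(start, links):
    """Return the pages in the order they are processed, skipping duplicates."""
    pending = deque([start])   # links found but not processed yet
    seen = {start}             # every link that was ever queued
    processed = []             # links already handled, in order
    while pending:
        url = pending.popleft()
        processed.append(url)
        for href in links.get(url, []):
            if href not in seen:   # a duplicate is skipped
                seen.add(href)
                pending.append(href)
    return processed

print(crawl_order("http://a.example/", LINKS))
```

Each page appears once in the result even though `a` and `b` link to each other, which is the point of keeping track of what has already been seen.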

This Python crawler uses the httplib.HTTPConnection class and its HTTPConnection.request method. For more information about this class, see:

http://docs.python.org/2/library/httplib.html
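HTTPConnection takes a host name, and its request method takes a path, so a full URL has to be split into those two pieces first. Here is a minimal sketch of that split, assuming Python 3, where httplib was renamed to http.client. The helper name `split_host_path` is my own and is not part of httplib or of the script in this post.

```python
from urllib.parse import urlsplit

def split_host_path(url):
    """Split an http:// URL into the host and path that HTTPConnection needs."""
    parts = urlsplit(url)
    if parts.scheme != "http" or not parts.netloc:
        raise ValueError("expected an http:// URL, got: " + url)
    path = parts.path or "/"          # an empty path means the site root
    if parts.query:
        path += "?" + parts.query     # keep the query string with the path
    return parts.netloc, path

host, path = split_host_path("http://docs.python.org/2/library/httplib.html")
print(host, path)

# With the pieces separated, the request itself would look like this
# (not run here, because it needs network access):
#   import http.client
#   conn = http.client.HTTPConnection(host)
#   conn.request("GET", path)
#   response = conn.getresponse()
```

A URL without the http:// prefix raises ValueError here instead of being sent to the network with a wrong host name.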

Alright, now that we understand how this Python script works, what is the purpose of this crawler? The main advantage of using it is that it can increase the SEO traffic or Alexa ranking of our website. It sounds weird, yes, but it really worked for me. I tested it on one of my sites and crawled everything found on Google with the string “my_url_site”. My reasoning is that when someone clicks a link in Google’s results, that link earns a point in Google’s search algorithm, and that Alexa uses the same kind of visit data to calculate the site rank.

Here is my experience. Before I started using this Python web crawler, my site was at about 210,000 in the Alexa rank. I then ran the script about 2-3 times a week, or sometimes just 3-4 times in a month. After 4-5 months, my site is now at about 134,000 in the Alexa world rank. Look at the difference: in just a few months my site got a much better Alexa rank, which of course means more visitors coming to my site. I didn’t change anything in the web code, HTML header description, meta tags, or anything else. I just didn’t touch it at all.

So, if you want to follow my method, read on.

1/ I recommend Python 2.7, as it brings many improvements. The httplib.HTTPConnection class is an older, lower-level way of making HTTP requests, and it also works fine on Python 2.0, 2.4, 2.5, or 2.6. The newer module most programmers are using right now is urllib2.

2/ Create a file named crawler.py and copy this script into it:
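For comparison, here is a rough sketch of how a request is prepared with urllib2 instead of httplib. The import falls back to urllib.request, the Python 3 name for the same functionality. The helper name `build_request` and the User-Agent string are assumptions made for this example, and nothing is sent over the network.

```python
try:
    from urllib.request import Request   # Python 3
except ImportError:
    from urllib2 import Request          # Python 2.7

def build_request(url, user_agent="simple-crawler-example"):
    """Prepare a GET request for url without sending it."""
    return Request(url, headers={"User-Agent": user_agent})

req = build_request("http://www.example.com/index.html")
print(req.get_full_url())
print(req.get_method())

# Sending it would be urlopen(req).read(), with urlopen imported from the
# same module as Request. That step needs network access and is left out.
```

Unlike HTTPConnection, this interface takes the whole URL, so there is no need to split it into host and path by hand.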

import sys
import httplib
import re

url = "http://www.google.com"
depth = 5
search = "mysamplesite.com"

# get the parameters or use defaults
if (len(sys.argv) > 1):
    url = sys.argv[1]

if (len(sys.argv) > 2):
    depth = int(sys.argv[2])

if (len(sys.argv) > 3):
    search = sys.argv[3]

processed = []

def searchURL(url, depth, search):
    # only do http links
    if (url.startswith("http://") and (not url in processed)):
        processed.append(url)
        url = url.replace("http://", "", 1)

        # split out the url into host and doc
        host = url
        path = "/"

        urlparts = url.split("/")
        if (len(urlparts) > 1):
            host = urlparts[0]
            path = url.replace(host, "", 1)

        # make the first request
        print "crawling host: " + host + " path: " + path
        conn = httplib.HTTPConnection(host)
        req = conn.request("GET", path)
        res = conn.getresponse()

        # find the links
        contents = res.read()
        m = re.findall('href="(.*?)"', contents)

        if (search in contents):
            print "Found " + search + " at " + url

        print str(depth) + ": processing " + str(len(m)) + " links"
        for href in m:

            # do relative urls
            if (href.startswith("/")):
                href = "http://" + host + href

            # follow the links
            if (depth > 0):
                searchURL(href, depth-1, search)
    else:
        print "skipping " + url

searchURL(url, depth, search)
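The script finds links with a regular expression that only matches double-quoted href values, and it only resolves relative links that start with "/". As an alternative sketch, and not what the script above does, the standard library’s HTML parser and urljoin can handle both cases. This example assumes Python 3; `LinkCollector` and `extract_links` are hypothetical names, and only `<a>` tags are collected here, whereas the regex picks up href on any tag.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # urljoin resolves relative links against the page URL
                self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links

sample = (
    '<a href="/about">About</a> '
    "<a href='page2.html'>Next</a> "
    '<a href="http://other.example/">Other</a>'
)
print(extract_links(sample, "http://mysamplesite.com/blog/index.html"))
```

The single-quoted link and the one without a leading slash are both resolved against the page address, which the regex approach would miss.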

3/ Change these lines to suit your needs:

url = "http://www.google.com"  # the page the crawl starts from; here we use google.com

depth = 5  # how many levels of links to follow

search = "mysamplesite.com"  # the string to look for in the crawled pages; here we use our own site link

if (url.startswith("http://") and (not url in processed)):  # extend this condition to exclude sites that do not need to be crawled
e.g.:
if (url.startswith("http://") and url.find('excludedsites.com') < 0 and (not url in processed)):
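If there is more than one site to exclude, the condition gets long quickly. One way to keep it readable is to move the check into a small function, roughly as sketched below. The name `should_crawl` and the entries in `EXCLUDED` are made up for illustration.

```python
# Hypothetical list of sites that should not be crawled.
EXCLUDED = ["excludedsites.com", "another-excluded.example"]

def should_crawl(url, processed, excluded=EXCLUDED):
    """Same test as the script's if statement, with a list of excluded sites."""
    if not url.startswith("http://"):
        return False                  # only do http links
    if url in processed:
        return False                  # already crawled
    return not any(site in url for site in excluded)

processed = ["http://mysamplesite.com/"]
print(should_crawl("http://mysamplesite.com/chat", processed))
print(should_crawl("http://mysamplesite.com/", processed))
print(should_crawl("http://www.excludedsites.com/page", processed))
```

In the script, the if line would then become `if should_crawl(url, processed):`.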

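4/ We are ready to run the Python web crawler. Start it with python crawler.py, optionally passing the url, the depth, and the search string as arguments in that order. Crawling time depends on the depth we set: more depth means more time to crawl.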
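To get a feeling for how quickly the work grows with depth, here is a back-of-the-envelope calculation. It assumes every page has the same number of new links, which is a simplification: real pages vary, and duplicates are skipped, so this is an upper bound rather than a measurement. The function name `max_pages` is my own.

```python
def max_pages(links_per_page, depth):
    """Upper bound on pages requested: 1 + b + b^2 + ... + b^depth."""
    return sum(links_per_page ** level for level in range(depth + 1))

# Assumed figure of 10 links per page, purely for illustration.
for depth in (1, 3, 5):
    print(depth, max_pages(10, depth))
```

With the assumed 10 links per page, going from depth 3 to depth 5 raises the bound from 1,111 pages to 111,111 pages.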

Happy crawling!

9 Responses

  1. Steve said, on February 19, 2013 at 8:08 AM

    This is really interesting. It makes sense to me why it would work. One question though…do you have to change the IP address the web crawler is crawling from? I could see that being suspicious if the same IP address is clicking all the links periodically.

  2. franx47 said, on February 19, 2013 at 9:34 PM

    @Steve: No, I never changed my IP or used a proxy. Even though the script fetches every href found in the Google results, Google has never blocked my IP. I change the depth and keyword between runs, and I run it 2 or 3 times per week, so from the point of view of Google’s algorithm it is somehow considered normal searching. Moreover, once it starts on the first URL found, it takes some time to crawl that site, depending on the depth, before it starts on the next URL.

  3. Steve said, on February 20, 2013 at 3:44 AM

    Very cool! So how did it affect your google ranking on the keywords?

  4. franx47 said, on February 21, 2013 at 1:03 AM

    @Steve: As I said in my post, before I ran this script for one of my sites, it had a rank of about 210,000. After running the script 2-3 times a week, I checked again after 4-5 months and it had a rank of about 135,000. I didn’t change the SEO, HTML tags, or web description at all. Checked on Alexa, the traffic is increasing every month. As far as I know, the more a keyword is typed into Google, the more it is recorded as a popular search keyword. That means the site will get a higher rank in the Google results, or at least be among the top 20 results. So, this script helps us submit the search keyword to Google automatically.

  5. Steve said, on February 23, 2013 at 7:24 AM

    Sounds like it’s a pretty valid strategy for raising SEO rankings, thus bringing in more traffic. I think it could be offered as an automated service for small businesses wanting to keep high in the google rankings.

    Do you have data on how your specific keywords improved in rankings?

  6. franx47 said, on February 25, 2013 at 9:27 PM

    @Steve: Yes, it could be used like that. When I run this script, I change the url (to start the crawl from), the depth, and the search keyword. The search keywords can be any keywords from my website’s tags or from the description in the meta tag. For example, my site has a chat feature, so I use a search keyword like “mysite.com chat” or any keyword related to it. You can change the keyword to whatever you want as long as it reflects your website description. FYI, I run this script from multiple IP addresses or different network hosts. The change can be monitored on the Alexa page, which has a site traffic analysis with nice graphs.

  7. chad said, on March 5, 2013 at 2:28 PM

    I tried it on windows python 2.7
    crawling host: http: path: //www.google.com
    Traceback (most recent call last):
    File “C:\crawler\crawler.py”, line 39, in
    req = conn.request(“GET”, path)
    File “C:\Python27\lib\httplib.py”, line 958, in request
    self._send_request(method, url, body, headers)
    File “C:\Python27\lib\httplib.py”, line 992, in _send_request
    self.endheaders(body)
    File “C:\Python27\lib\httplib.py”, line 954, in endheaders
    self._send_output(message_body)
    File “C:\Python27\lib\httplib.py”, line 814, in _send_output
    self.send(msg)
    File “C:\Python27\lib\httplib.py”, line 776, in send
    self.connect()
    File “C:\Python27\lib\httplib.py”, line 757, in connect
    self.timeout, self.source_address)
    File “C:\Python27\lib\socket.py”, line 553, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
    socket.gaierror: [Errno 11004] getaddrinfo failed

  8. franx47 said, on March 5, 2013 at 2:36 PM

    @Chad: Make sure you set “url” to something like http://www.website.com. The script only handles URLs that start with http://

  9. Eileen said, on April 28, 2013 at 2:54 AM

    What’s up, I read your blog like every week. Your writing style is witty, keep it up!
