I made a blog scraper!
I am researching blogs for my master's paper, and my advisor suggested that I automate the process so that I don't have to go through hundreds of blogs by hand. Python seemed like the right fit, since it let me use some of the skills we learned earlier in the semester, particularly what we did for the Input/Output exercise. I figured I'd post it here since A) I used Python to make it and B) our guest today mentioned that learning to automate tasks would be a good takeaway for the course. Code below. It works just fine for me on Ubuntu with Python 2.
import urllib2
import time

numPosts = 1
for y in range(numPosts):
    print 'Scraping page...'
    # Fetch the "next blog" page and keep its HTML
    req = urllib2.Request('http://www.blogger.com/next-blog?navBar=true&blogID=1120725995044390317')
    response = urllib2.urlopen(req)
    html = response.read()
    # Save the raw HTML of the scraped page to a numbered file
    docName = 'blogger' + str(y) + '.html'
    f = open(docName, 'w')
    f.write(html)
    f.close()
    print "\033[1;31m" + docName + ' created ' + u"\u2764"
    # Trim the page down to a single blog entry: the first entry runs
    # from one date header to the next
    savedPage = open(docName, 'r')
    pageHTML = savedPage.read()
    savedPage.close()
    marker = "<h2 class='date-header'><span>"
    docStart = pageHTML.find(marker)
    if docStart == -1:
        print 'No entry found on this page, skipping it'
        continue
    # Search again starting past the first header so find() doesn't
    # just return the same position
    docEnd = pageHTML.find(marker, docStart + len(marker))
    if docEnd == -1:
        docEnd = len(pageHTML)  # only one entry on the page
    filteredHTML = pageHTML[docStart:docEnd] + '<!----- END POST ------>'
    # Sanity check: print the first 50 characters of the entry
    print "\033[31m" + filteredHTML[:50] + '...'
    # Append the entry to the combined Blogger document
    bloggerDocument = open('bloggerPosts.html', 'a')
    bloggerDocument.write(filteredHTML)
    bloggerDocument.close()
    # Wait between requests so the server doesn't block us
    if y < numPosts - 1:
        print "\033[0m" + 'Wait a minute please? *chu*~'
        time.sleep(60)
print "\033[0m" + 'Done!'
I'm not entirely sure how legal it is to have these sorts of things, to be honest, but since I'm using it for research and not to actually repost any of the content, I think I should be safe. I have to go through the IRB soon! But I still wanted to share with you all. I did run into some issues when trying to gather lots of posts one after another, which is why I added the time.sleep call. That way, the servers don't freak out and yell at me for being a robot.
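If a flat sixty-second wait ever isn't enough, one idea (just a sketch, not something I've actually tested against Blogger) would be to retry with a longer wait each time the server complains. The fetch_with_backoff name and the delay numbers here are just made up for illustration:

import time
import urllib2

def fetch_with_backoff(url, tries=3, delay=60):
    # Try the request a few times, doubling the wait after each
    # failure so the server stops treating us like an abusive robot.
    for attempt in range(tries):
        try:
            return urllib2.urlopen(url).read()
        except urllib2.URLError:
            print 'Server complained, waiting %d seconds...' % delay
            time.sleep(delay)
            delay *= 2  # wait twice as long next time
    return None  # give up after the last try

The scraping loop above could then call fetch_with_backoff(...) instead of urllib2.urlopen directly.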
Resources include:
- Active State: here
- silsHack: here
- Python for beginners: here
- Python documentation: here
- FileFormat.info: here
Here's what the output looks like for me: