So the full code for this thing is rather long and can be found here.
Excercise 1
def excercise1(book):
'''
proccess_file() got rather complicated and too sofisticated for this problem
So we're just going to rewrite it here. We could define a new method,
but my preference is to only do that if we need to solve the same problem 3 or more times
'''
fp = open(book)
longstring = str()
for line in fp:
line = line.replace('-', ' ')
for word in line.split():
word = word.strip(string.punctuation + string.whitespace)
word = word.lower()
word = re.sub('[\W_]+','', word) #regular expressions clean up wierd characters not included in string.punctuation
longstring = longstring + word + " "
return longstring
The excercise1 takes the argument of a text file and returns it stripped of all nonalphanumeric characters
Exercise 2
def def exercise2(book):
hist = process_file(book, True)
return "There are %d unique words in this work" %len(hist)
def process_file(filename, guten):
hist = dict()
fp = open(filename)
if guten:
header = True
if not guten:
header = False
for line in fp:
if line[:20] == "*** END OF THIS PROJ": # There must be a better way to escape the header and footer
header = True
if not header:
process_line(line, hist)
if line[:20] == "*END*THE SMALL PRINT" or line[:20]=="*** START OF THIS PR": #this is only for the shakespeares folios "00ws110.tt"
header = False
#print "header escaped" #woo, debugging
return hist
def process_line(line, hist):
line = line.replace('-', ' ') #clean hyphenated words
for word in line.split(): #re.split('[\W_]+', line) #could do the split with regex, but regex is magic and doesn't strip punctuation quite as nicely
word = word.strip(string.punctuation + string.whitespace)
word = word.lower()
word = re.sub('[\W_]+','', word) #this regular expression should get rid of the few special iso characters not in string.punctuation
hist[word] = hist.get(word, 0) + 1
Exercise 2 returns the number of unique words used in a body of text. A cursory examination of a couple of text from different eras suggest modern texts have fewer unique words. This owes mostly to orthographical errors and inconsistency. Though it also does have to do with a contracting vocabulary as older terms fell our of use
Exercise 3
def exercise3(book):
hist = process_file(book, True)
output = top_20(hist)
return output
def top_20(hist):
hist_sorted = sorted(hist.iteritems(), key=operator.itemgetter(1), reverse=True) #according to stackexchange this is a really fast way to sort a dicitonary
output ="The twenty most common terms in this work are:\n"
for i in range(0,20):
output += str(hist_sorted[i]) +"\n"
return output
Exercise 3 returns the 20 most common words in a given book. They are almost always "the" though sometimes Gutenberg makes an appearance if the header and footer ignore mechanism malfunctions.
Exercise 4
def exercise4(book):
output_list = return_true_words(compare_lists(process_file(book, True), process_file("words.txt", False)))
return output_list #long
def compare_lists(list1, list2):
new_list = dict()
for word1 in list1:
new_list[word1] = True
#print word1
if list2.has_key(word1):
new_list[word1] = False
return new_list
Exercise 4 takes two lists of words, a book and a list of common words, compares them and returns the words present in the book not present in the list of common words. Most of the returned words for newer texts (post ~1820ish) are proper nouns or specialized vocabulary. For older text orthography and vocabulary vary so widely that many more words are returned. Shakespeare and works of that same age often have more words excluded than included in the list of common words.
And Finally I wrote the following method to call all of the excercise() methods because I was too lazy to type out 'excerciseN()' four times
def writeitallout():
for i in range(1,5):
output = open("execise%d.txt" % i, "w")
method_name = "exercise%d" % i #because writing four method names is hard
outtext = eval(method_name) #eval() evaluates a string as python code
'''
eval() is kind of dangerous and has the potential to make it much easier
to excecute malicious, obfuscated code, but it works in this case.
I suppose this makes this bad code. The other methods I tried to solve this
problem did not work nearly as well.
'''
print "Writing exercise %d to file exercise%d.txt" % (i,i)
output.write(str(outtext('pg43791.txt')))
output.close()
As I note in the comments, eval() is a particularly dangerous method and according to the kind folks on StackExchange a mark of poor programming. It does pose serious security concerns as it does allow potentially malicious code to be generated in an obfuscated way.