Matt's Final Project Reflection

by Matt Zimo

21 Jun 2017

NOTE: This reflection was written before I fixed the downloadfile.py module. I fixed it when I was at work.

This program did not really turn out how I imagined. My first ideas for this project centered on text mining for machine learning purposes. I wanted to make a program that would parse a text and identify whether or not it is hate literature. I imagined I would have baseline metrics for words and phrases drawn from samples of non-hate speech and hate speech. The program would then compare the lexicons of the two samples to find identifying key words or phrases used in hate speech (besides the obvious ones, hopefully). I thought maybe I would use the Twitter API, since Twitter is popular with hateful people. But that plan never got off the ground. One of the reasons was that I was never quite sure what the program would look like. That was a problem I had for the majority of this project: I had a vague idea of features I wanted the program to have, but I never formed a complete vision of the final product. I knew I wanted to make a data analysis program because I thought it would be more useful for my studies. In retrospect, I would have had more fun had I made a turtle game.

I next thought that maybe I could create a program that downloads and parses the text from a website and identifies the reading level (or grade level) of the site. I thought I could compare sites like the New York Times to USA Today or CNN.com. I tried downloading a website to see what the raw data would look like, so I selected the New York Times homepage. When I did that with the basic urllib module, I got thousands of lines of unintelligible code. I couldn’t find the headlines or body paragraphs from the articles, just lots of code for the advertisements and column widths and so forth. I got scared and rethought my plans.
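For anyone curious, that first experiment looked roughly like this; a minimal sketch using only the standard library (the URL is just the one I tried, and some sites refuse plain urllib requests without a browser-style User-Agent header):

    import urllib.request

    # Fetch a page and look at the raw source. What comes back is the
    # full HTML of the page, not the article text.
    url = "https://www.nytimes.com/"
    with urllib.request.urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")

    print(len(html.splitlines()), "lines of raw HTML")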

My next idea for the project was to make the program analyze a Project Gutenberg .txt file, parse out the words, sentences, and paragraphs, and calculate a reading level for the book. I wanted to be able to enter the URL for a specific Project Gutenberg book, and the program could download it and save it. This would solve the issue of finding the main text, because each Project Gutenberg book is stored in a text file with standardized (or so I thought) metadata at the beginning and end of the book that I could easily strip out for the analysis. Once more, I did not have a very good idea of what I meant by reading level, or where I would find information about it. I also didn’t know how I could visualize that data. I still feel like the data visualization aspect of my program is more of an afterthought than a critical feature. I also don’t know how different the reading levels would be across books: since most books in Project Gutenberg are novels from over 100 years ago, I suspect that most of them would score about the same.
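If I were starting over, I would probably anchor “reading level” to an established formula. The Flesch-Kincaid grade level is one standard choice; this is a sketch of that published formula, not something my program implements, and the counts (especially syllables) would need their own estimators:

    def flesch_kincaid_grade(total_words, total_sentences, total_syllables):
        # Flesch-Kincaid grade level: higher means harder to read.
        return (0.39 * (total_words / total_sentences)
                + 11.8 * (total_syllables / total_words)
                - 15.59)

    # e.g. 10,000 words, 500 sentences, 14,000 syllables -> about grade 8.7
    print(round(flesch_kincaid_grade(10000, 500, 14000), 1))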

When I discussed my idea with my group, it was obvious from the start that Justyn, Natasha, and Aaron were way ahead of me. Natasha and Aaron even had some prepared code and datasets. I felt ashamed. I was far behind everybody else in my group for the entire experience. I think the daily updates were not very good for my self-esteem or progress, but it’s a summer class, so I had to press on. I did get a couple of good ideas by looking at some of my groupmates’ code. Aaron’s printer module had a neat function that printed a message like “calculating” while the program worked through big calculations that took a long time. I used that in my code as well. Natasha had an ingenious method of ensuring her tables printed their columns evenly: subtract the length of the string from the column width and add that many spaces to the end of the string. I used that idea in my table of words arranged by frequency of use.
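In case the padding trick is not clear, here is a minimal sketch of the idea as I understood it (the function name and sample rows are just for illustration; Python’s built-in str.ljust does the same thing):

    def pad(text, width):
        # Append (width - len(text)) spaces so the next
        # field always starts at the same position.
        return text + " " * (width - len(text))

    print(pad("word", 12) + pad("count", 8) + "length")
    print(pad("whale", 12) + pad("87", 8) + "5")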

Speaking of that table, I spent way too much time trying to get it to work. The table is supposed to show each word, the number of times it appears in the book, and the length of the word. I already have all that information stored in a dictionary called “words_used” in the Book class definition module. I spent half a day trying to find a way to sort the dictionary by the number of times each word is used. The reason I couldn’t get “words_used” sorted was either that Trinket does not have some required module like itertools, or that “words_used” had too many values per key, or (the most likely scenario) that I did not understand the solution provided on Stack Overflow. In the end, I created the table by essentially duplicating the function that creates the “words_used” dictionary. First, I created a method that generates a list of all the words used in descending order of use (words_used_order). Next, I created a method that loops through the list from words_used_order, using each word as the key to look up the count value in the words_used dictionary.
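Writing this now, I believe the standard approach needs no itertools at all, just sorted() with a key function. A sketch, assuming words_used maps each word to its count (if the values are really tuples, the key function would index into them instead):

    words_used = {"the": 1203, "whale": 487, "sea": 260}  # toy stand-in

    # Sort the (word, count) pairs by count, highest first.
    by_frequency = sorted(words_used.items(), key=lambda pair: pair[1], reverse=True)

    for word, count in by_frequency:
        print(word, count, len(word))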

One of the first issues I ran into when I began to parse my sample Project Gutenberg books was that not every book has exactly the same formatting in the metadata at the top. All books begin with an all-caps sentence flanked by three asterisks announcing the beginning of the book text. In Mencken.txt, you see it on line 25: “*** START OF THE PROJECT GUTENBERG EBOOK A BOOK OF BURLESQUES ***” I first wrote code that would find the start of that line and use that position as a starting point for finding the second set of three asterisks (at the end of the line), so I could isolate the text after them. However, one of my books began with three asterisks followed by a space and “START OF THE PROJECT…” while the other began with three asterisks, no space, and “START OF THIS PROJECT…” I didn’t notice the missing space between the asterisks and the sentence until later, which was another source of confusion. I spent a lot of time trying to figure out how to use regular expressions to overcome the issue of non-standardized data, but I couldn’t get them to work. When I showed Zach my problem at the TriPython meeting, he said I should just clean up the data so that both my sample book files would have the same formatting. So that is what I did, but I was worried that when I added the downloading functionality later on, my program would not read the files correctly.
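A regular expression that tolerates both variations I hit might look like this (a sketch, not exhaustive; Project Gutenberg headers vary more than these two cases, and the second sample title below is made up):

    import re

    # Optional space after the asterisks, and either "THE" or "THIS".
    start_marker = re.compile(r"\*{3}\s?START OF TH(?:E|IS) PROJECT GUTENBERG EBOOK")

    for line in ("*** START OF THE PROJECT GUTENBERG EBOOK A BOOK OF BURLESQUES ***",
                 "***START OF THIS PROJECT GUTENBERG EBOOK SOME OTHER BOOK***"):
        print(bool(start_marker.search(line)))  # True for both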

When I finally got all my analyses to work, I began working on my stretch goal of downloading a book off of Project Gutenberg and writing it to a file in the program. This turned out to be another source of frustration. I tried using the regular urllib.request method to get access to the Project Gutenberg file, but I kept getting Error 403 Forbidden messages. I tried looking for other ways to get it, but they required special modules not available in Trinket. I finally asked for help from you, and you informed me that it might be a security feature on the server’s end that I can “outwit” by masking my request as coming from a web browser. Unfortunately, this still didn’t work. I finally googled how to retrieve Project Gutenberg books from Python applications, and I read that Project Gutenberg has mirror servers across the globe that permit access. I looked on the list of mirrors for a server based in the United States and tried to access it in my program, and voilà, it worked! Unfortunately, the file gets saved as a single line of text, and I could not figure out how to save it with the same formatting as the original ebook (i.e., with the lines separated and special characters like line breaks not printed literally). My analysis tools do not work on downloaded files yet. That is a major disappointment and an unmet objective.
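Putting the two lessons together, the working request looked roughly like this. The mirror URL below is made up (substitute a real one from Project Gutenberg’s mirror list), and I suspect, though I have not verified, that my single-line problem came from writing the raw bytes rather than a decoded string:

    import urllib.request

    # Hypothetical mirror URL -- substitute a real Project Gutenberg mirror.
    url = "http://gutenberg-mirror.example.org/1/2/3/123/123.txt"

    # The User-Agent header masks the request as coming from a web browser.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request) as response:
        text = response.read().decode("utf-8", errors="replace")

    # Writing the decoded string (not the bytes or their repr)
    # should preserve the original line breaks.
    with open("book.txt", "w") as f:
        f.write(text)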

Adding the histogram with matplotlib.pyplot was not nearly as hard as I imagined. The documentation sort of elided where you are supposed to put the data and how the data should be formatted, so I had to guess. I decided to store the data (word/sentence length) in a list and see if that worked. Well, imagine my surprise when it did. It was the first thing that went right on the first try in the entire process. I changed the color of the bars to blue to make the chart accessible to red/green colorblind people, though I guess it does not really matter when all of the bars are one color.
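For the record, the pattern that worked is simply passing the raw measurements as a flat list and letting plt.hist do the binning (the data here is made up for illustration):

    import matplotlib.pyplot as plt

    # Each entry is one measurement; plt.hist counts them into bins.
    word_lengths = [3, 5, 4, 7, 2, 5, 6, 4, 3, 8, 5, 4]

    plt.hist(word_lengths, bins=range(1, 10), color="blue")
    plt.xlabel("Word length")
    plt.ylabel("Frequency")
    plt.show()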

Overall, this project did not come close to meeting my goals. I do not have a “grade calculation evaluator.” The program does not properly import ebooks from the Project Gutenberg mirror server. Despite my best efforts, the program still includes some punctuation marks at the beginnings and ends of words, which makes the word length count inaccurate. My code also separates sentences at periods, so a sentence like “He spoke with Mr. Smith about the thing.” gets split into two sentences in the sentence count. Now that I think of it, I guess I could have added a few formatting conditionals to the code to fix this (something like text = text.replace("Mr. ", "Mr "), keeping the trailing space so the words don’t run together; see the sketch below). I think if I had clearer goals for the project from the start, I would have done a much better job. Despite all of this, when everything started to come together at the end, I got pretty excited, and I think I’m not too far off from having a good program here.
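Here is a minimal sketch of that formatting-conditional idea, with a short abbreviation list I might start from (the list and variable names are just illustrative):

    # Neutralize the period in common abbreviations before splitting,
    # keeping the space so "Mr. Smith" becomes "Mr Smith", not "MrSmith".
    text = "He spoke with Mr. Smith about the thing. Then he left."

    for abbrev in ("Mr. ", "Mrs. ", "Dr. "):
        text = text.replace(abbrev, abbrev.replace(". ", " "))

    sentences = text.split(". ")
    print(len(sentences))  # 2, as expected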

Matt Zimo is an information science grad student at UNC Chapel Hill. Go Vols! Find Matt Zimo on Twitter, GitHub, and on the web.