Matt's final project proposal and workplan

by Matt Zimo

13 Jun 2017

I want to build a data analysis tool that asks the user for a URL, scrapes the text of that page, and determines the most common words used in it (interesting words, not “the”, “a”, “an”, etc.). I would also like to see the average word length, sentence length, and paragraph length (in words). Ideally, I could compare those numbers against some standardized formula that will help me determine the grade level of the writing.

I am also interested in studying hate speech online, and I wonder whether I could find a way to help identify it automatically. I could load a dictionary or list of words from known works of hate literature, and then compare the text at the user’s URL to those works. This might be a way to flag heinous online content.
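As a rough idea of what the "most common interesting words" step could look like, here is a minimal sketch in Python. The function name `most_common_words` and the tiny `STOP_WORDS` set are placeholders I made up for illustration; a real version would probably want a much longer stop-word list.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stop-word list, not a complete one.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in",
              "is", "it", "on", "for", "with", "that", "this"}

def most_common_words(text, n=10):
    """Return the n most frequent non-stop words as (word, count) pairs."""
    # Lowercase everything and keep runs of letters and apostrophes as words.
    words = re.findall(r"[a-z']+", text.lower())
    interesting = [w for w in words if w not in STOP_WORDS]
    return Counter(interesting).most_common(n)

if __name__ == "__main__":
    sample = "The cat and the dog. The cat sat on the mat."
    print(most_common_words(sample, 3))
```

`Counter.most_common` does the sorting, so the only real decision here is what counts as a word and what counts as "uninteresting".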

My initial milestones are:

  • Read and scrape webpage
  • Calculate average word, sentence, paragraph lengths
  • Find "grade level" writing statistics to compare webpage to
  • Calculate grade level of webpage
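The length and grade-level milestones above could be sketched roughly as follows. I am assuming the Flesch-Kincaid grade level as the "standardized formula" here, which is one common choice and not necessarily the one the project will end up using. The syllable counter is a crude vowel-group heuristic, and the names `count_syllables` and `text_stats` are hypothetical.

```python
import re

def count_syllables(word):
    """Rough heuristic (an assumption): one syllable per group of vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def text_stats(text):
    """Average lengths plus a Flesch-Kincaid grade estimate.

    Assumes the text is non-empty, paragraphs are separated by blank
    lines, and sentences end with '.', '!' or '?'.
    """
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    return {
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "avg_sentence_length": words_per_sentence,
        "avg_paragraph_length": len(words) / len(paragraphs),
        # Flesch-Kincaid grade level formula
        "grade_level": 0.39 * words_per_sentence
                       + 11.8 * (syllables / len(words)) - 15.59,
    }

if __name__ == "__main__":
    sample = "The cat sat. The dog ran.\n\nIt was fun."
    for name, value in text_stats(sample).items():
        print(name, round(value, 2))
```

Very short, simple text can produce a grade level below zero with this formula, so the number is only meaningful on longer passages.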

Advanced milestones:

  • Find common words/phrases from known examples of hate literature
  • Compare website content to hate speech
  • Generate "hate score" of website
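One very simple way to turn the comparison above into a score is to measure what fraction of a page's words appear in a reference word list. This is only a sketch under that assumption: `lexicon_score` is a hypothetical name, and `flagged_terms` holds harmless placeholder strings standing in for a list built from real source material.

```python
import re

def lexicon_score(text, lexicon):
    """Fraction of words in text that appear in lexicon (0.0 to 1.0)."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in lexicon)
    return hits / len(words)

if __name__ == "__main__":
    # Placeholder terms only; a real list would come from the reference texts.
    flagged_terms = {"badword", "slur"}
    print(lexicon_score("This page has one badword in it", flagged_terms))
```

A plain word-overlap score like this ignores context, so it would likely flag pages that quote or discuss hateful terms as well as pages that use them.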

I’m not sure if I bit off more than I can chew with this project. I think I’m going to just have it import .txt files from a webpage, unless I can find an easy tool for scraping the text of a website. It also kind of stinks that I am so busy during the work week, so I don’t have any time to really work on the project until the weekend. I’m honestly really discouraged. I think the fast pace of this course is too much for me.
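For what it's worth, Python's standard library may already be enough for basic scraping. The sketch below uses `html.parser` and `urllib.request` to pull the visible text out of a page; the class and function names (`TextExtractor`, `html_to_text`, `scrape_text`) are made up for illustration, and it assumes the page is UTF-8 and does not need JavaScript to render its text.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is not inside script/style.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def html_to_text(html):
    """Return the text content of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)

def scrape_text(url):
    """Download a page and return its text (assumes UTF-8 encoding)."""
    with urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    return html_to_text(html)
```

The string returned by `scrape_text` could then be written to a .txt file with an ordinary `open(path, "w")` call before running any statistics on it.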

My updated milestones are:

  [ ] Read and scrape webpage
  [ ] Store the text of the webpage into a .txt file
  [ ] Calculate average word, sentence, paragraph lengths
  [ ] Find “grade level” writing statistics to compare webpage to
  [ ] Calculate grade level of webpage
  [ ] Do some sort of data visualization with the statistics

Stretch milestones:

  [ ] Find common words/phrases from known examples of hate literature
  [ ] Compare website content to hate speech
  [ ] Generate “hate score” of website

Matt Zimo is an information science grad student at UNC Chapel Hill. Go Vols! Find Matt Zimo on Twitter, GitHub, and on the web.