I want to build a data analysis tool that asks the user for a URL, scrapes the text of that page, and determines the most common words used (interesting words, not “the”, “a”, “an”, etc.). I would also like to see the average word length, and the average sentence and paragraph lengths (in words). Ideally, I could compare these numbers to a standardized formula that will help me determine the grade level of the writing. I am interested in studying hate speech online, and I wondered if I could find a way to help identify it automatically. I could load a dictionary or list of words from known works of hate literature, and then compare the text at the user’s URL to those works. This might be a way to flag heinous online content.
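Here is a minimal sketch of the "most common interesting words" idea, assuming Python and only the standard library. The `STOP_WORDS` set and the `most_common_words` function are names I made up for illustration, and the stop-word list is deliberately tiny; a real one would be much longer.

```python
import re
from collections import Counter

# Hypothetical, deliberately short stop-word list (an assumption, not a standard list).
STOP_WORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
    "that", "on", "for", "with", "as", "was", "at", "by",
}


def most_common_words(text, n=10, stop_words=STOP_WORDS):
    """Return the n most frequent words in text, ignoring stop words."""
    # Lowercase first so "The" and "the" count as the same word.
    # The pattern keeps simple contractions such as "don't" together.
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common(n)


if __name__ == "__main__":
    sample = "The cat and the dog. The cat sat on a mat."
    print(most_common_words(sample, n=3))
```

The pattern only recognizes ASCII letters and straight apostrophes, so accented words and curly apostrophes would need a broader pattern.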
My initial milestones are:
- Read and scrape webpage
- Calculate average word, sentence, paragraph lengths
- Find "grade level" writing statistics to compare webpage to
- Calculate grade level of webpage
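The length and grade-level milestones above could be sketched roughly as follows. This assumes the Flesch-Kincaid grade level as the "standardized formula", which is one common choice and not something I have settled on. The syllable counter is a crude vowel-group heuristic, and all function names are hypothetical.

```python
import re


def split_words(text):
    # Words are runs of ASCII letters, optionally with one inner apostrophe.
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)


def split_sentences(text):
    # Naive split on ., ! and ?; abbreviations like "Dr." will be miscounted.
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def split_paragraphs(text):
    # Paragraphs are assumed to be separated by at least one blank line.
    return [p for p in re.split(r"\n\s*\n", text) if p.strip()]


def count_syllables(word):
    # Heuristic: each run of vowels counts as one syllable, minimum one.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def text_stats(text):
    """Return average lengths and an approximate Flesch-Kincaid grade level."""
    words = split_words(text)
    sentences = split_sentences(text)
    paragraphs = split_paragraphs(text)
    if not words or not sentences or not paragraphs:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    return {
        "avg_word_length": sum(len(w) for w in words) / len(words),  # characters
        "avg_sentence_length": words_per_sentence,                    # words
        "avg_paragraph_length": len(words) / len(paragraphs),         # words
        # Flesch-Kincaid grade level formula.
        "fk_grade": 0.39 * words_per_sentence
        + 11.8 * (syllables / len(words))
        - 15.59,
    }


if __name__ == "__main__":
    sample = "The cat sat. The dog ran fast.\n\nBirds fly."
    for name, value in text_stats(sample).items():
        print(f"{name}: {value:.2f}")
```

Very short, simple text can produce a negative grade level, so the score is only meaningful on longer passages.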
Advanced milestones:
- Find common words/phrases from known examples of hate literature
- Compare website content to hate speech
- Generate "hate score" of website
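One very simple way to approach the comparison and scoring milestones is a term-overlap score: the fraction of words on the page that also appear in a reference word list. This is only a sketch under that assumption. The function name and the placeholder terms are hypothetical, and plain word matching ignores context, so it would flag pages that merely quote or discuss the terms.

```python
import re


def term_overlap_score(text, reference_terms):
    """Return the fraction of words in text that appear in reference_terms."""
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not words:
        return 0.0
    # Normalize the reference list once so the comparison is case-insensitive.
    reference = {term.lower() for term in reference_terms}
    hits = sum(1 for word in words if word in reference)
    return hits / len(words)


if __name__ == "__main__":
    # Placeholder terms only; a real list would come from the reference texts.
    print(term_overlap_score("alpha beta gamma alpha", ["alpha", "delta"]))
```

A score like this only handles single words; matching phrases would need the text split into word pairs or triples first.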
I’m not sure if I bit off more than I can chew with this project. I think I’m going to just have it import .txt files from a webpage, unless I can find an easy tool for scraping the text of a website. It also kind of stinks that I am so busy during the work week, so I don’t have any time to really work on the project until the weekend. I’m honestly really discouraged. I think the fast pace of this course is too much for me.
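For the scraping step, Python's standard library may already be enough, with no extra tools to install. The sketch below uses `urllib.request` to download a page and `html.parser` to keep only the visible text. The `TextExtractor` class and the `html_to_text` and `fetch_text` functions are names I made up, and real pages (JavaScript-heavy sites, odd encodings) may need more than this.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped tag.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, one text chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "\n".join(parser.parts)


def fetch_text(url):
    """Download url and return its visible text (requires network access)."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return html_to_text(html)


if __name__ == "__main__":
    page = "<html><body><h1>Title</h1><p>Some text.</p></body></html>"
    print(html_to_text(page))
```

Each chunk of text ends up on its own line, so paragraph boundaries are lost. Counting paragraph lengths would mean tracking `<p>` tags separately.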
My updated milestones are:
- [ ] Read and scrape webpage
- [ ] Store the text of the webpage into a .txt file
- [ ] Calculate average word, sentence, paragraph lengths
- [ ] Find “grade level” writing statistics to compare webpage to
- [ ] Calculate grade level of webpage
- [ ] Do some sort of data visualization with the statistics
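For the visualization milestone, the simplest thing that could work is a plain-text bar chart printed to the console, which avoids installing a plotting library. This is a sketch under that assumption; `text_bar_chart` is a hypothetical name and the numbers in the usage example are made up.

```python
def text_bar_chart(stats, width=40):
    """Render a {label: value} dict as horizontal bars scaled to the largest value."""
    if not stats:
        return ""
    largest = max(stats.values())
    label_width = max(len(label) for label in stats)
    lines = []
    for label, value in stats.items():
        # Scale each bar relative to the largest value; avoid dividing by zero.
        bar_len = round(width * value / largest) if largest > 0 else 0
        lines.append(f"{label.ljust(label_width)} | {'#' * bar_len} {value:.1f}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up example values, not measurements from a real page.
    print(text_bar_chart({"word length": 4.7, "sentence length": 18.2}))
```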
Stretch milestones:
- [ ] Find common words/phrases from known examples of hate literature
- [ ] Compare website content to hate speech
- [ ] Generate “hate score” of website