Aaron Plocharczyk's project idea and work plan post

by Aaron Plocharczyk

13 Jun 2017

Project Idea:
I had a lot of trouble coming up with something to do for my final project. I actually started on an idea that I don’t find very interesting: parsing the exceptions that PHP writes to error log files and doing analysis on them. I could not come up with anything better at the time, and I had to get a head start to make up for the heavy workload at my job this week. I got relatively far, but I’m not interested in the project, so I’d rather scrap it and put effort into something I care about and will work hard on than make something just to satisfy the requirements. Even though I no longer have the weekend to work and will have less time this coming week, I think it’s the right move. Here’s what I have made and am scrapping:

I ultimately settled on doing sentiment analysis for reviews found online. Sentiment analysis is very useful in the real world because it feeds directly into machine learning: what you learn from a sentiment analysis program like the one I plan to build can be turned into a “model” that a machine uses to predict the sentiment of future reviews. This awesome website gives me a bunch of reviews that have been prelabeled as having either a positive or a negative sentiment. It gives me three files: amazon, yelp, and imdb. Sentiment analysis is much more interesting to me, and there are a lot of different things I can do with it.
It will certainly be useful for the program to analyze each file independently in terms of review length, word count, common words, etc. But it might also be interesting to see whether I can make the files interact, if I have time to do so.
Work Plan:

  • Upload data files
  • Create main.py
  • Create analysis_tools.py
  • Create an iterative loop for the program to run in
  • Allow user to type “Help” at any time to gain insight on what to do next
  • Allow user to specify file name
  • Allow user to access max/min/average review length for all types of sentiment reviews (or for a specific sentiment type) in the file
  • Allow user to access max/min/average word count for all types of sentiment reviews (or for a specific sentiment type) in the file
  • Allow user to access most common words in all types of sentiment reviews (or for a specific sentiment type) in the file
  • Allow user to access least common words in all types of sentiment reviews (or for a specific sentiment type) in the file
  • Allow user to access most correlated words to a specific sentiment type in the file
  • Allow user to set a “threshold” for correlation so that a term must occur at least n times before it is considered
  • Allow user to see correlation of word count/review length to a specific sentiment type in the file
  • Allow user to view terms that are unique to a certain file
  • Allow user to visualize results of these queries
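
A minimal sketch of how the length, word count, and common-word queries in this plan might look inside analysis_tools.py. Everything here is an assumption for illustration: the function names (tokenize, length_stats, common_words) are hypothetical, reviews are assumed to be already loaded as (text, label) pairs, and labels are assumed to be 1 for positive and 0 for negative.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into words (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def length_stats(reviews, sentiment=None):
    """Return min/max/average review length (characters) and word count.

    reviews is a list of (text, label) pairs. If sentiment is given,
    only reviews with that label are included.
    """
    selected = [text for text, label in reviews
                if sentiment is None or label == sentiment]
    if not selected:
        return None  # nothing matched the requested sentiment

    def summarize(values):
        return {
            "min": min(values),
            "max": max(values),
            "average": sum(values) / len(values),
        }

    return {
        "review_length": summarize([len(text) for text in selected]),
        "word_count": summarize([len(tokenize(text)) for text in selected]),
    }


def common_words(reviews, n=10, sentiment=None, least=False):
    """Return the n most (or least) common words as (word, count) pairs."""
    counts = Counter()
    for text, label in reviews:
        if sentiment is None or label == sentiment:
            counts.update(tokenize(text))
    ranked = counts.most_common()  # sorted from most to least frequent
    if least:
        ranked = ranked[::-1]
    return ranked[:n]
```

Passing sentiment=None covers the "all types of sentiment reviews" case from the list, while sentiment=1 or sentiment=0 restricts a query to one sentiment type.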

I’ve decided to shift my project from sentiment analysis to a program that actually trains a machine learning model and tests it. Here is my new work plan:
    By 6/14:
  • program asks user for a training file and creates a “trainingFile” object from it
    By 6/15:
  • program counts total number of occurrences for each term
  • program calculates correlation of terms with a sentiment
    By 6/19:
  • program prints and visualizes correlation and total number of occurrences for each term
  • program allows user to set a minimum threshold for total number of occurrences for each term
  • program allows user to delete particular terms
    By 6/22:
  • program asks user for a testing file and creates a “testingFile” object from it
  • program tests current model on the test file and saves the results locally to refer to in the next iteration of the program
  • program prints a visual representation of the test results as compared to previous tests
  • program allows user to type “help” at any time to learn about what they should do
  • program repeats
    Stretch Goals:
  • program allows user to save model in a comma separated format
  • program allows user to open saved models
  • program allows user to alter saved models
  • program allows user to test with saved models
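
The train-and-test loop in the new work plan can be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's actual code: the function names are hypothetical, labels are assumed to be 1 (positive) and 0 (negative), and "correlation" is stood in for by a simple score, (positive occurrences - negative occurrences) / total occurrences, which ranges from -1 to 1. A real correlation measure could be swapped in without changing the overall shape.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into words (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def train(reviews, min_occurrences=1, deleted_terms=()):
    """Build a model mapping each term to a sentiment score in [-1, 1].

    reviews is a list of (text, label) pairs with label 1 or 0.
    Terms seen fewer than min_occurrences times, or listed in
    deleted_terms, are left out of the model.
    """
    positive, negative = Counter(), Counter()
    for text, label in reviews:
        (positive if label == 1 else negative).update(tokenize(text))

    model = {}
    for term in set(positive) | set(negative):
        total = positive[term] + negative[term]
        if total >= min_occurrences and term not in deleted_terms:
            # Assumed stand-in for "correlation": +1 means the term only
            # appears in positive reviews, -1 only in negative ones.
            model[term] = (positive[term] - negative[term]) / total
    return model


def predict(model, text):
    """Predict 1 (positive) or 0 (negative) by summing term scores."""
    score = sum(model.get(term, 0.0) for term in tokenize(text))
    return 1 if score >= 0 else 0  # ties default to positive (arbitrary choice)


def accuracy(model, reviews):
    """Fraction of test reviews whose predicted label matches the true label."""
    correct = sum(1 for text, label in reviews if predict(model, text) == label)
    return correct / len(reviews)
```

With this shape, the minimum-occurrence threshold and the term deletion step from the plan become arguments to train, and the number returned by accuracy is what would be saved after each run to compare against previous tests.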
I am a senior at UNC Chapel Hill taking my last couple of courses this summer. Find Aaron Plocharczyk on Twitter, GitHub, and on the web.