Aaron Plocharczyk's final project post

by Aaron Plocharczyk

22 Jun 2017

Here’s the trinket:

[embedded trinket]
Reflection:
I originally chose to do a different project altogether. It was going to accept PHP error log files and analyze them for you. I got relatively far into that project before deciding to do something else, because it wasn’t useful or interesting enough. What I actually decided to do was much more useful and interesting to me, and it can be a useful tool for others as well. My program offers a concise way to build and test machine learning models from scratch. All you have to upload is a labeled training file, plus a labeled testing file if you want to test your model on some data.
My main priorities when building this program were simplicity and user-friendliness. I wanted the program to be powerful but simple enough that someone who hasn’t really done much machine learning before could take it out of the box and figure it out on their own.
One of the most useful things for accomplishing this goal was an intro that displays immediately when you run the program and explains, in simple terms, what machine learning basically is and how you can use my program. The intro is solid yet simple, so I tell the user to explore the help menu to learn more.
The second most useful thing that helped me toward my goal of user-friendliness was my help menu. The help menu is constantly available and displays all the commands the user can enter. If the user wants to learn more about a command, all they have to do is type “explore” followed by a space and the number of that command. This is, of course, all explained in the text directly above the help menu. Exploring the commands in the help menu provides thorough documentation of all of the program’s capabilities and is enough to take a beginner from novice to expert with this program.
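To make that concrete, here is a minimal sketch of how a numbered help menu with an “explore” command can be wired up. The command names and descriptions below are placeholders I made up, not the program’s actual text:

```python
# Minimal sketch of a numbered help menu with an "explore" command.
# The command names and descriptions are placeholders, not the
# program's actual text.

COMMANDS = [
    ("view selection", "Display the table of currently selected terms."),
    ("view deselection", "Display the table of deselected terms."),
    ("test on [testing file name]", "Score the current model on a file."),
]

def print_help():
    print('Type "explore" followed by a command number to learn more.')
    for i, (name, _) in enumerate(COMMANDS, start=1):
        print(f"{i}. {name}")

def explore(user_input):
    # Expects input like "explore 2".
    parts = user_input.split()
    if len(parts) == 2 and parts[0] == "explore" and parts[1].isdigit():
        index = int(parts[1]) - 1
        if 0 <= index < len(COMMANDS):
            name, description = COMMANDS[index]
            print(f"{name}: {description}")
            return
    print("Command mistyped. Please try again.")
```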
The final most useful thing that helped me toward my goal of user-friendliness was my willingness to revise, reorganize, and redo code that could be done better with regard to user-side simplicity. I used to make the user type completely different commands to view correlations of ALL terms with sentiments, number of occurrences of ALL terms, correlations of REMOVED terms with sentiments, number of occurrences of REMOVED terms, correlation of ONE SPECIFIC term with sentiments, and number of occurrences of ONE SPECIFIC term. I didn’t even have polarity in the program at the time, but you can see that all of those commands were overwhelming to the user. On top of that, I used to make my user actually type commands to build their model from the information available to them. This was all way too complicated.

So I took a step back and spent a long time thinking about the best way to organize the concepts that machine learning deals with. I came up with the idea of a “selection” and a “deselection” table that each have all of the information about a term that you need. “Selected” terms are automatically used to build a model behind the scenes and test on new data, and deselected terms can be selected again in no time. It was so much easier to deal with the data once I went back through my code and updated its design. It took many revisions before I made my user interface as simple as it is now, and each revision seemed like a complete overhaul.

All you really need to know now is that you can select terms, deselect terms, view your selection table, view your deselection table, and focus on a specific term to check what state it’s in. Additionally, you have the powerful tool of setting upper and lower thresholds to deselect terms based on their number of occurrences, as in the sketch below. That’s way easier to keep up with, and I can’t think of a way to simplify it much further. When you want to test on a new file, you don’t even have to build a model. All the parsing, math, building, and testing is done behind the scenes, and you just get back a number representing prediction accuracy.
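Here is a minimal sketch of that selection/deselection idea, including the occurrence thresholds. The field and method names are my assumptions for illustration, not necessarily the real ones:

```python
# Sketch of the selection/deselection tables and occurrence thresholds.
# Field and method names are illustrative assumptions.

class TermTables:
    def __init__(self, term_stats):
        # term_stats maps term -> {"count": int, "polarity": float}
        self.selected = dict(term_stats)  # used to build the model
        self.deselected = {}              # set aside, instantly restorable

    def deselect(self, term):
        if term in self.selected:
            self.deselected[term] = self.selected.pop(term)

    def select(self, term):
        if term in self.deselected:
            self.selected[term] = self.deselected.pop(term)

    def apply_thresholds(self, lower, upper):
        # Deselect terms whose occurrence count falls outside [lower, upper].
        for term in list(self.selected):
            count = self.selected[term]["count"]
            if count < lower or count > upper:
                self.deselect(term)
```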
Another point of focus for me was cleanliness of code. I really sat down and thought through which classes I should have and how they should all work together. Originally, I just had TrainingFile and TestingFile. Then I realized that all my code would be much cleaner if I added a Review class. When I was ready to start testing my data, I also added a Model class. As I got further into this structure, it became clear that I needed to give TrainingFile and TestingFile a common parent class (File) to make my code simpler. And it became stunningly obvious that way too many calculations were going on in my Model class, so I added the Selection class to be a mediator between the TrainingFile and the Model. This is the system that ended up working best for me.
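In skeleton form, those class relationships look something like this. Only the class names and their relationships come from the project; the method bodies and parsing details are my guesses for illustration:

```python
# Skeleton of the class structure: Review, File (parent of TrainingFile
# and TestingFile), Selection as mediator, and Model. Method bodies are
# illustrative guesses.

class Review:
    """One labeled line of text: the words plus a sentiment label."""
    def __init__(self, text, sentiment):
        self.words = text.split()
        self.sentiment = sentiment

class File:
    """Common parent: parses a labeled file into Review objects."""
    def __init__(self, path):
        self.reviews = []
        with open(path) as f:
            for line in f:
                parts = line.rsplit(None, 1)  # last token is the label
                if len(parts) == 2:
                    self.reviews.append(Review(parts[0], parts[1]))

class TrainingFile(File):
    pass

class TestingFile(File):
    pass

class Selection:
    """Mediator: distills the training file's term statistics for the Model."""
    def __init__(self, training_file):
        self.training_file = training_file

class Model:
    """Built from a Selection; predicts sentiments for a TestingFile."""
    def __init__(self, selection):
        self.selection = selection
```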
As far as non-class modules go, I didn’t want to stuff everything into main.py, so I made intro.py, interpreter.py, and printer.py. Main handles the very big-picture things, like taking in a training file and calling a few very high-level functions (which in turn end up calling a lot of other functions outside of main.py). Intro.py gives an intro to the basics of the program; printer.py can print lines, tables, and graphs beautifully; and interpreter.py converts user-entered command strings into actual function calls. I would say that interpreter.py is where most of the work actually comes together.
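One natural way to convert command strings into function calls, as interpreter.py does, is a dispatch table. The handlers below are stand-ins, not the program’s real functions:

```python
# Sketch of command-string dispatch in the style of interpreter.py.
# The handler functions are stand-ins for the real ones.

def view_selection():
    print("(the selection table would print here)")

def view_deselection():
    print("(the deselection table would print here)")

HANDLERS = {
    "view selection": view_selection,
    "view deselection": view_deselection,
}

def interpret(command):
    handler = HANDLERS.get(command.strip().lower())
    if handler is None:
        print("Command mistyped. Please try again.")
    else:
        handler()
```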
The third point I want to discuss is cramming as much functionality as possible into simple commands. I gave the command “view selection (ordered by [column header])” an optional portion where the user can specify that they want to order the table by a certain column. All the user has to do to specify that column is type the name of that column’s header (they don’t even have to capitalize the first letter), and if they don’t specify anything, it defaults to “Polarity”. Ordering tables by a column is not necessary to make this project complete; it’s just one of the many useful things I threw in to make the user experience as easy as possible. Organizing tables by count, for example, will help you see what number might be a good threshold. You can do this for both selection and deselection tables.

Additionally, when you are testing your selection table on a file, there is often some randomness in your predictions, so it can be useful to test multiple times and find the average of all those tests before saying for sure how accurate your prediction algorithm is. That’s why there’s an optional portion on the “test on [testing file name]” command. You can specify how many times, n, you’d like your algorithm to be tested, and you’ll get a more accurate result when you enter a higher value for n. The value of n defaults to 1, and you get a notice if you try to exceed 100 (because that would take way too long to execute, and 100 times should be enough anyway).

Further, by simply typing “revert to best selection for [testing file name]”, the program will restore your selection table to the exact state it was in when you got the best results you’ve ever gotten for the testing file you specified. This is a super useful tool that lets you focus less on how to undo what you’ve done to get back to a previous state and more on how to improve what you’re currently trying to do.
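Here is a sketch of how those two optional portions might be handled: parsing an optional “ordered by” clause with a default column, and averaging repeated test runs with n capped at 100. The helper names and the stand-in accuracy function are mine, not the program’s:

```python
import random

def parse_view_command(command, default_column="polarity"):
    # "view selection ordered by count" -> ("selection", "count");
    # "view deselection"                -> ("deselection", "polarity").
    words = command.lower().split()
    table = words[1]  # "selection" or "deselection"
    column = default_column
    if "by" in words and words.index("by") + 1 < len(words):
        column = words[words.index("by") + 1]
    return table, column

def average_accuracy(run_test, n=1):
    # Predictions involve some randomness, so average several runs.
    if n > 100:
        print("n capped at 100: more runs would take too long.")
        n = 100
    return sum(run_test() for _ in range(n)) / n

# Stand-in test that returns a noisy accuracy around 0.8:
print(average_accuracy(lambda: 0.8 + random.uniform(-0.05, 0.05), n=10))
```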
My final point of focus with this project was robustness. I wanted the project to be able to handle absolutely anything you could throw at it and respond in a way that makes sense. Because of this, the program recognizes sentiments as words or numbers, alerts you of improperly formatted training or testing files and lets you specify a different file, alerts you if the file you specified does not exist and lets you retry, tells you when your command was mistyped and lets you retry, has defaults for when you don’t specify certain values, won’t let you execute certain commands that would take too long, and, most importantly, recognizes and processes a wide array of training and testing files. I made the program recognize anything without a space as a sentiment if it is the last word on a line in your training/testing file. But even cooler than that, you can have any number of sentiments, as long as you have at least two. This means you don’t just have to analyze positive and negative. You can analyze very positive, positive, somewhat positive, neutral, somewhat negative, negative, and very negative if you really wanted to. Or, if you had a different type of file where you were trying to predict what type of person wrote a sentence, your sentiments could be: scholar, adult, child, monkey. The program will just deal with it automatically! That, to me, is one of the coolest parts of my program. It is completely wired to deal with the files you give it instead of just restricting the types of files you are allowed to give it.
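The parsing rule itself is simple to sketch: the last whitespace-free token on each line is the sentiment, and any label set of two or more sentiments works. The example lines below are invented:

```python
# Sketch of the "last whitespace-free token is the sentiment" rule.
# Any label set of two or more sentiments works: words or numbers.

def parse_labeled_line(line):
    parts = line.rstrip().rsplit(None, 1)  # split off the last token
    if len(parts) != 2:
        raise ValueError("Improperly formatted line: " + repr(line))
    text, sentiment = parts
    return text, sentiment

for line in [
    "this movie was a delight positive",
    "give that man a banana monkey",
    "the methodology is sound scholar",
]:
    text, label = parse_labeled_line(line)
    print(label, "<-", text)
```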

I am a senior at UNC Chapel Hill taking my last couple courses this summer. Find Aaron Plocharczyk on Twitter, Github, and on the web.