Jasmine Plott's Final Project

by Jasmine Plott

29 Apr 2016

Here is the link to my code:

Comparing where I started in this class with what I was able to create for my final project, I find it hard to believe that I actually have the final product that I do. I created my milestones with the mindset that I would not be able to complete all of them, but I did. I even added more milestones as “stretch” goals as I progressed, although I did not explicitly label them as such. Here are the milestones that I completed throughout the project; some are ones I started with, while others were added over time:

  • Have the program ask for user input (so they select from either National or State data)
  • Have the user input use regular expressions
  • Have users type “Help” to be directed to a page that gives them more detail about the program
  • Have users type “Finished” (or other appropriate command) to allow them to exit from the program
  • Figure out how to have the user upload their data that they might want analyzed by my program
  • Produce a graph that visualizes the input the user entered:
      • Use matplotlib to create a bar graph
      • Get the graph to print
      • Spruce up the graph with spacing, labels, and a title
  • After the initial data is printed for the user, allow them to have the opportunity to enter more input (i.e. create a while loop)
  • Produce some statistics from the data that the user has asked for. The baby milestones that build up to this are below:
      • Print the mean. To do this, I’ll need the sum and total count of values
      • Print the median. To do this, I’ll need to use the total count of values and put it in the median formula
      • Print the maximum. To do this, I could use the maximum function
      • Print the minimum. To do this, I could use the minimum function
      • Account for multiple instances of minimums and maximums in my code
      • Create functions for these statistics
  • Guard for user input (i.e. make it accept filenames that are entered as “California.csv” or “california.csv”)
  • Enter a try and except to help ensure random files are not being entered

My data analysis program ended up as one that goes through the most popular baby names for either a specific state or the entire United States in a given year. There are a variety of files that users can select from, or if they would like to bring their own data from a certain year, they can enter it into the program to have it analyzed in the same way. The user input is built around regular expressions, so users can enter a partial or full name in order to see the frequencies of names with similar beginnings; the matches narrow as the input becomes more specific. It’s more interesting not to enter a whole name, since there would only be one frequency for that name throughout the year. After a user enters what they would like to see in a name, the program analyzes the results to show the median and average frequency of the names that match the input, along with the matching names that occur the most and the least. An image of a baby pops up and says “Here are your results,” and below it is a graph of the frequencies for each possible match. The program continues to loop, asking the user to select a file, until they type “Exit.” There is also a “Help” screen that provides detailed instructions about how to use the program.

When I first started working on this final project, I knew that I needed to keep two things in mind: break the milestones into simpler steps and consider the scope of the project. These were two major lessons that I had to learn from the game app project we created for class. I had not adequately thought about these two factors before going into that project, and I stressed myself out a lot more than I needed to. For this last major project in our class, I was determined to remain calm and collected and to use these strategies to my advantage; I’m happy to report that I was successful in this endeavor.

The first goal that I set out to accomplish for this project was getting my data to print appropriately and making sure that I was able to pull out the actual pieces of data that I wanted. The way that I envisioned the project at the beginning had to be adapted as I actually started handling the data and working with the code. Originally, I had wanted to include data for State and National frequencies of baby names in the United States from as far back as 1880. However, when I started testing the files that I had, I quickly realized that the scope of what I had set out to do was much larger than I had intended. The data files that I had entered into Trinket were so large that they never loaded and crashed the program; I believe that the National files alone had millions of lines of data.

Instead of panicking, as I might have before, I decided that I needed to break my data down into chunks. It would be better to focus on the names and frequencies from one year than to have a program that ran slowly in order to hold more data; I discovered that having less data is not necessarily less meaningful if it makes for a better final product. I decided that I would use data files from 2014 for the state of California and the National data to build my program. Making this decision meant that I would not be able to examine the frequency of names across the years, so I decided to integrate a new component to make my program more interesting and unique: I was going to add regular expressions rather than just accepting plain user input. That way users could enter a series of letters that made up a name, such as “J,” “Jo,” “Joe,” etc., to get an idea of variations and trends in certain name types throughout the year.

I had not originally intended to incorporate regular expressions into my final project, since I was a bit intimidated by them. However, I thought that this would be a good unofficial “stretch” goal for me to try to accomplish. Accomplishing this portion of the program required breaking it down into a series of steps and debugging regularly. I knew that the first step I needed to take was getting the portions that I wanted to extract from my practice file to print. I started by creating a list of the lines and then, using what I knew about dictionaries, pulling out the columns that contained the baby names and frequencies and placing these in a dictionary of names as key-value pairs. Once I got this to work, I tested my file using general user input. As soon as I confirmed that this portion of the code worked, I moved on to testing how my program would take regular expression user input. I learned that I needed to use “re.match” in order to ensure that the name in the dictionary was extracted correctly.
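The steps described above can be sketched roughly as follows. This is only an illustration, not the code from my project: the sample lines, the column positions, and the function names build_name_counts and find_matches are all made up for the example. It also assumes the user's letters are treated as a literal beginning of a name (via re.escape); removing that call would let users type real regular expression patterns instead.

```python
import re

# Hypothetical sample lines in a "state,sex,year,name,count" layout.
# The real column order in the data files may differ.
SAMPLE_LINES = [
    "CA,M,2014,Joe,120",
    "CA,M,2014,Joseph,2450",
    "CA,M,2014,John,1980",
    "CA,F,2014,Joan,15",
    "CA,F,2014,Emma,2700",
]


def build_name_counts(lines, name_col=3, count_col=4):
    """Return a dictionary mapping each name to its frequency."""
    counts = {}
    for line in lines:
        columns = line.strip().split(",")
        name = columns[name_col]
        # Add counts together in case a name appears on more than one line.
        counts[name] = counts.get(name, 0) + int(columns[count_col])
    return counts


def find_matches(counts, user_input):
    """Return the entries whose names start with what the user typed."""
    # re.escape treats the input as literal letters. re.match only looks at
    # the start of the name, and IGNORECASE accepts "jo" as well as "Jo".
    pattern = re.compile(re.escape(user_input), re.IGNORECASE)
    return {name: n for name, n in counts.items() if pattern.match(name)}


name_counts = build_name_counts(SAMPLE_LINES)
print(find_matches(name_counts, "Jo"))
```

Because re.match anchors at the beginning of the string, typing “Jo” finds Joe, Joseph, John, and Joan but would not find a name that merely contains “jo” somewhere in the middle.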

This was the first major chunk of my project, since I combined and completed several of the milestones (and some extra ones) that I intended to accomplish from the start. Little did I know how much the steps that I took in this first portion would follow me through the rest of the project. Getting my code to print and making the user input sophisticated enough to accept regular expressions helped me build the step-by-step process that I carried through the rest of my code. I learned that I needed to develop my code bit by bit, and I got in the general habit of printing a variable each time I created it to confirm that it matched what I had in mind. After a whole semester of coding with Python, it was finally starting to sink in that printing was an excellent way of testing and confirming which variables were what. I think that treating coding as a process of building, and then building again better, is what really had lasting benefits for me.

The next portion of the project where I had to keep my problem-solving mentality in mind was when I began to create the visuals and functions associated with my data. This was the part of the project where I think that I truly blossomed, as I was no longer thinking about “just completing the milestones for my final project”; I actually started thinking about the ways that I could develop my program to be more fun and exciting for my users. This was when I added the ASCII image of a baby to print out along with the results. I had originally intended this to be my visualization of data for the project, but it was pointed out to me (rightfully so) that this image of the baby was not really doing anything to reflect the data that I had extracted from my files. I could have reverted to creating a graph that printed out asterisks to reflect the number associated with each name, but the matplotlib library was introduced to us during class as a good, challenging way to represent one’s data. I decided to take on another challenge and use this.

Before actually incorporating matplotlib anywhere into my code, I had to do my research in order to fully understand how to create a graph using outcomes from the user input. Here, I again implemented my step-by-step problem-solving strategy by breaking down these huge milestones into smaller ones to build up to the big accomplishment. I realized that the easiest way to go about this would be to create a second dictionary of results, made up of the entries pulled out of the dictionary that contained all the names and frequencies. Once I had developed my dictionary of results and made sure it was working correctly, I gradually started to piece together my graph with matplotlib. I used a lot of explanations and recommendations from outside sources to help me understand exactly what I was doing as I wrote the code. I did not want to just have code for code’s sake; I actually wanted to understand why I was taking the time to write certain things to get my graph to print.

I continued this gradual approach by first getting my graph to print, then adding labels, and then improving the spacing and layout of the x-axis. I did the best with the space I had for the x-axis, and for most results, this does look like a nice, clear graph. I integrated spacing into the code as much as I could to make the graph as beautiful as possible, but ultimately, I realized that there is no “good” way to space out thousands of x-axis labels at once in a program like this. My experience with using matplotlib to print my graph was a good one in the long run, and perhaps I will come back to it to do more with it in the future. Working step by step and challenging myself to do something that I had initially thought I could not do was an empowering feeling for me.
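One common compromise for crowded axes is to draw every bar but only label some of them. The sketch below shows that idea; it is not the code from my project, and the function names pick_tick_positions and plot_frequencies, the max_labels limit of 20, and the axis label text are all assumptions made for the example.

```python
def pick_tick_positions(n_bars, max_labels=20):
    """Return the bar positions that should receive an x-axis label."""
    if n_bars <= max_labels:
        return list(range(n_bars))
    # Ceiling division: label every "step"-th bar so that no more than
    # max_labels labels are drawn.
    step = -(-n_bars // max_labels)
    return list(range(0, n_bars, step))


def plot_frequencies(results, title="Name frequencies"):
    """Draw a bar graph from a {name: count} dictionary of results."""
    # Imported here so the helper above can be used without matplotlib.
    import matplotlib.pyplot as plt

    names = list(results)
    positions = range(len(names))
    plt.bar(positions, [results[name] for name in names])

    # Only some bars get a label; rotating them keeps long names apart.
    ticks = pick_tick_positions(len(names))
    plt.xticks(ticks, [names[i] for i in ticks], rotation=90)

    plt.xlabel("Name")
    plt.ylabel("Frequency")
    plt.title(title)
    plt.tight_layout()
    plt.show()
```

Calling plot_frequencies({"Joe": 120, "John": 1980}) would label both bars, while a result set with thousands of names would still draw every bar but label only a handful of them. This trades completeness for readability, so it is a workaround rather than a true solution.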

There are two other major pieces of my program that I strove to develop and make nicer. The first involves incorporating the statistical information (i.e. mean, median, maximum, and minimum) into what I was working with. I discovered early on that the maximum and minimum were not as simple as I had made them out to be. For some result sets, more than one name shared the minimum value (this was not typically true for the maximum), and I wanted to account for all of these instances of the minimum in order to provide the best data possible to my user. I ultimately ended up creating a function that extracted the minimum from the set and then looped through the results to find the keys that were associated with this minimum. This was followed by an if statement that checks whether the length of the min_keys list is equal to one, printing one message if so and a different message if the list is longer. Here is what the function ultimately looked like:

# This portion of the code is for the minimum
def smallest(results):
  # This pulls out the minimum value in the dictionary
  min_value = min(results.values())
  # This goes through the dictionary and pulls out the names associated with
  # the minimum value. It then adds them to this list of the minimum keys
  min_keys = [name for name in results if results[name] == min_value]
  # This cleans up the names so that they have no brackets or quotes when I
  # decide to print them
  pretty_min_keys = ", ".join(min_keys)
  # Since there is potential for multiple minimum values, this portion of the
  # code looks at the length of the list and, depending on that length, prints
  # the appropriate option
  if len(min_keys) == 1:
    print("The name that occurs the least is " + pretty_min_keys + ", and it occurs " + str(min_value) + " times.\n")
  else:
    print("The names that occur the least are " + pretty_min_keys + ", and they occur " + str(min_value) + " times each.\n")
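The other statistics can follow the same pattern. The sketch below is a simplified illustration rather than my actual functions: it leans on Python's built-in statistics module instead of working the mean and median out by hand, it returns the values instead of printing them, and the name summarize and the example dictionary are made up.

```python
from statistics import mean, median


def summarize(results):
    """Return the mean, median, and most common name(s) for {name: count}."""
    counts = list(results.values())
    max_value = max(counts)
    # Like the minimum, the maximum could be shared by more than one name.
    max_names = [name for name in results if results[name] == max_value]
    return {
        "mean": mean(counts),
        "median": median(counts),
        "max_value": max_value,
        "max_names": max_names,
    }


# Made-up results for a user who typed "Jo".
example_results = {"Joe": 120, "Joseph": 2450, "John": 1980, "Joan": 15}
print(summarize(example_results))
```

Returning a dictionary instead of printing makes each number easy to check on its own, which fits the habit of confirming every variable as it is created.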

Being able to craft this function inspired me to do some quality assurance on my program so that users wouldn’t be able to break it. This kind of polishing is not something that I would typically do for a program, since in the past it has been all that I could do to create a workable and appealing final product, but I was feeling fierce and encouraged by what I had been able to accomplish so far. I adapted the program to accept both upper and lower case entries from the user and incorporated “try and except” to ensure that random file names were not just being entered. I also followed up with some other refinements, modifying the different levels of acceptance for filechoice and adapting my program to accept user-entered data specifically.
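A guard like this might look something like the sketch below. It is an illustration under assumptions, not my actual code: the function name load_lines, the messages, and the demonstration file are hypothetical; only the variable name filechoice comes from my program.

```python
import os
import tempfile


def load_lines(filechoice, folder="."):
    """Return the lines of the requested file, or None if it cannot be read."""
    wanted = filechoice.strip().lower()
    # Compare lowercase names so "california.csv" still finds "California.csv".
    matches = [f for f in os.listdir(folder) if f.lower() == wanted]
    if not matches:
        print("Sorry, I could not find a file called " + filechoice + ".")
        return None
    try:
        with open(os.path.join(folder, matches[0])) as handle:
            return handle.read().splitlines()
    except OSError:
        # Covers files that exist but cannot be opened or read.
        print("Sorry, I could not open " + matches[0] + ".")
        return None


# Small demonstration using a temporary folder and a made-up file.
with tempfile.TemporaryDirectory() as demo_folder:
    with open(os.path.join(demo_folder, "California.csv"), "w") as demo_file:
        demo_file.write("CA,M,2014,Joe,120\n")
    print(load_lines("california.csv", demo_folder))
    print(load_lines("random.txt", demo_folder))
```

Returning None instead of crashing lets the main loop simply ask the user for another file name.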

If I could change anything about my program, I think that I would play around more with the loop so that if a user enters a name incorrectly, they do not have to go back and start over by entering a file name. I’m also curious to learn more about why my frequency graph prints out under the new section that asks for the next filename the user would like to work with. It does not seem to impede the functionality of the program, and it could just be that the graph is so big that it takes that long to load. I will keep these issues in mind as ways to grow in the future.
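One possible way to restructure the loop is to nest a second loop for names inside the loop for files, so that a mistyped name only repeats the inner question. This is a sketch of the idea, not something my program currently does: the “Back” command, the run_session function, its ask parameter, and the scripted answers are all hypothetical.

```python
def run_session(files, ask=input):
    """Outer loop picks a file; inner loop keeps asking for names.

    files maps a lowercase file name to a list of names in that file.
    ask is the function used to prompt the user (input by default).
    Returns a list of (file name, matching names) for each good search.
    """
    searches = []
    while True:
        filechoice = ask("Which file? (or Exit) ").strip().lower()
        if filechoice == "exit":
            return searches
        if filechoice not in files:
            print("That file was not found. Please try again.")
            continue
        while True:
            typed = ask("Enter a name (or Back to change files) ").strip()
            if typed.lower() == "back":
                break  # Leave the inner loop and ask for a file again.
            matches = [name for name in files[filechoice]
                       if name.lower().startswith(typed.lower())]
            if not matches:
                # Stay in the inner loop: the file choice is kept.
                print("No names matched. Please try another name.")
                continue
            searches.append((filechoice, matches))


# Scripted answers stand in for a real user so the example runs unattended.
scripted = iter(["california.csv", "Zz", "Jo", "back", "exit"])
demo_files = {"california.csv": ["Joe", "John", "Emma"]}
print(run_session(demo_files, ask=lambda prompt: next(scripted)))
```

In the scripted run, the mistyped “Zz” produces a message and another name prompt, and the file only has to be chosen once.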

Looking back on my process, attitudes, and problem-solving strategies over the course of finishing this project, I think that somewhere along the way, something about Python clicked in my brain. I’m not quite sure how to describe exactly what happened or whether I did anything specific to make it happen, but I found myself thinking of new ways to try out my code, playing around with new techniques, and creating fixes for little glitches that I encountered throughout my program. I’m not sure whether it was having a long-term project like this that really allowed me to engage with the material or whether I just finally understood more about the way that Python works. Either way, I think that this project truly helped me grow and bring together all the skills I had learned throughout this course into a fruitful final product.

Jasmine Plott is a first-year master’s student in the School of Information and Library Science at UNC Chapel Hill. She is a librarian in training and slowly developing her programming skills. Find Jasmine Plott on Twitter, GitHub, and on the web.