Becca's Final Project

by Becca Greenstein

28 Apr 2016

Trinket:

Reflection:

When I heard that we had a choice for the final project, my brain immediately jumped to improving my Harry Potter trivia game, which I was excited about. I was significantly more comfortable with the Turtle part of the class than with the data analysis part, so this seemed like a more manageable undertaking. In class, Elliott jokingly mentioned a genome analysis tool and I started nerding out a little, since my background is in Biology and that sounded awesome. This project did seem harder than the Turtle game, and I definitely stepped out of my comfort zone by choosing it. At the same time, however, I would be dealing with strings composed of four possible characters instead of lists of lists composed of many more possible characters, so it was more manageable than some projects. I didn’t include stretch goals in my milestones because I was constantly second-guessing my ability to carry out this project (coding a successful data analysis project was a stretch goal in itself), but it turned out okay, I think.

Two things that were very helpful for this project were having new code and milestones due for each class, and working with the same group every class. I’ve never been a procrastinator since procrastinating makes me nervous, so the few times this semester when I’ve been forced to submit a pull request at 10:00 PM the night before something is due have been stressful for me. Even if we hadn’t had intermediate deliverables for this project, I probably would have worked on it incrementally, just because of its sheer size, but it was nice that everyone was doing it at the same time. I also enjoyed the opportunity to work with the same group of people throughout the project update phase. They all helped with the progression of my project: Jasmine showed me how she moved functions among modules (more on that later), Colin pressed an incorrect key in a menu on Tuesday and my program didn’t let him start over, which reminded me to add user checking, and Yiyang helped me understand GitHub better after I made a mistake with the dates in my recent pull request. I know we talked about how GitHub worked a few months ago, which made sense at the time, but I hadn’t messed up my pull request file names before this incident, so I hadn’t had to deal with this before and was fuzzy on the details.

I knew at the beginning of this project that I wanted to start with writing the code for all the capabilities (translate, reverse complement, count codons, etc.) before putting everything in functions, because the code would be easier to debug if I didn’t have to navigate through menus. Most of the code was relatively doable by myself, but I did utilize Stack Overflow for things I was confused about (making sure I understood why each tidbit worked, of course). Throughout, I tried to use variable names that both made sense (e.g. the list for the transcription function is called mRNA and the string that gets printed is called transcription) and were consistent (e.g. the loop variable is usually called nucleotide or codon, since those are the units one is counting by). I also tried to clearly differentiate between cDNA functions and gDNA functions, and to keep my functions in a generally consistent format. Over the weekend, I had trouble moving functions between different modules (see last reflection). Jasmine helped me troubleshoot moving them around during class on Tuesday, and I got it to work eventually. I’m not exactly sure what I was doing differently, since I didn’t save the wrong versions, but it’s definitely cleaner now than it was before.
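
As a rough illustration of the naming conventions described above, here is a minimal sketch of three of the capabilities. The function names and the `COMPLEMENT` table are hypothetical and not taken from the project. The sketch also assumes uppercase input and that transcription starts from the coding strand, so each T simply becomes a U.

```python
# Hypothetical helper names; the project's real functions may differ.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def transcribe(sequence):
    """Return the mRNA for a coding-strand DNA sequence (T becomes U)."""
    mRNA = []  # list of RNA nucleotides, built one at a time
    for nucleotide in sequence:
        mRNA.append("U" if nucleotide == "T" else nucleotide)
    transcription = "".join(mRNA)  # the string that would be printed
    return transcription


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[nucleotide] for nucleotide in reversed(sequence))


def count_codons(sequence):
    """Count each codon, reading in frame from the first nucleotide."""
    counts = {}
    for start in range(0, len(sequence) - 2, 3):
        codon = sequence[start:start + 3]
        counts[codon] = counts.get(codon, 0) + 1
    return counts


print(transcribe("ATGGCCTAA"))          # AUGGCCUAA
print(reverse_complement("ATGGCCTAA"))  # TTAGGCCAT
print(count_codons("ATGGCCGCCTAA"))     # {'ATG': 1, 'GCC': 2, 'TAA': 1}
```

Keeping the loop variable named after the unit being read (`nucleotide` or `codon`) makes it easier to tell at a glance whether a function walks the sequence one base or three bases at a time.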

Since Tuesday, I have focused on the user by assuming that they don’t know what they are doing, and thus that they will screw up. To this end, I tried to guide the user as much as possible to do the correct thing by writing new help functions and clearer instructions at the beginning. I also broke my program in a variety of ways to ensure that the user is guided if they mess up in different ways. Originally, I had one massive if statement in each cDNA capability function that ensured the gene started with a start codon, ended with a stop codon, and had a length that was a multiple of three before proceeding. When I changed the start codon in the test sequence, it still worked, even though it should have printed “Error”. It took me a long time, and a lot of test print statements for each of the three conditions, to realize that I needed three separate if statements, nested inside one another. I also added an else statement for each one that printed the specific reason why the program failed. I first had these in each capability before realizing I could shorten my code and put the three conditionals in the cDNA options menu, such that it checks your sequence before it lets you analyze anything, and tells you what’s wrong if something is wrong. The if statements are also now if-the-wrong-thing, rather than if-the-right-thing, so I don’t need else statements. I then re-realized that the tool should work only if the characters in the sequences are As, Ts, Cs, and Gs, so I added an if statement for that to both the cDNA and gDNA options menus. I also added a surprise DNA picture when the user exits because style points.
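
The "if-the-wrong-thing" checks described above could look something like the sketch below. This is not the project's actual code: the function name, constants, and messages are made up for illustration, and it assumes the sequence has already been converted to uppercase.

```python
# Hypothetical names and messages; a sketch of the checks, not the real menu code.
START_CODON = "ATG"
STOP_CODONS = ("TAA", "TAG", "TGA")
VALID_NUCLEOTIDES = set("ATCG")


def find_cdna_problem(sequence):
    """Return a message for the first problem found, or None if the sequence passes."""
    # Each check tests for the wrong thing and returns early, so no else is needed.
    if not sequence or set(sequence) - VALID_NUCLEOTIDES:
        return "Sequence must be non-empty and contain only A, T, C, and G."
    if not sequence.startswith(START_CODON):
        return "Sequence must begin with the start codon ATG."
    if not sequence.endswith(STOP_CODONS):
        return "Sequence must end with a stop codon (TAA, TAG, or TGA)."
    if len(sequence) % 3 != 0:
        return "Sequence length must be a multiple of three."
    return None


problem = find_cdna_problem("CTGGCCTAA")
if problem:
    print(problem)  # Sequence must begin with the start codon ATG.
```

Because each condition is tested separately, the user sees the specific reason the sequence was rejected. A gDNA menu could reuse just the first check.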

There are a few things that I wish my program could do, but I’m not sure how to implement them without making it look less pretty or learning a lot more code. I am aware that the user can’t access the main menu from either the gDNA menu or the cDNA menu. This is because the Main Menu function is defined in “main.py” and importing functions in both directions between modules (a circular import) is fragile in Python. I want to keep everything that is currently in “main.py” there, since it all is related to the Main Menu, so I don’t think I can do this with the current layout. I assume that the user would need to write things down or take a break if they wanted to analyze gDNA and cDNA, so this hopefully isn’t a problem. The other thing I wish my program could do is find optimal primers (small pieces of DNA useful for sequencing or amplifying a specific sequence) when given a sequence. I know from ApE (A Plasmid Editor), the (free!) DNA editing tool that I used when I worked as a lab technician, that primers can be specified by length, Tm (melting temperature), GC content, GC clamp (number of guanines or cytosines at the end of a sequence), consecutive bases, and orientation (5’ to 3’ or vice versa). This would be quite a lot of code, and I’m not even sure how to calculate Tm; the formula can be found here, but I can’t understand it, let alone combine all of these functionalities into one coherent capability. I realize that I could have been more creative in my choice of visual element (using matplotlib or a horizontal histogram, for example), but I think it is a clean display. Plus it’s a lot better than the long sentence I had before.
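
On the primer idea, a few of the individual criteria are simple to compute on their own, even if combining them into a full primer finder is not. The sketch below is only a starting point, and all of it is assumed rather than taken from the project or from ApE. The function names are hypothetical, the GC clamp is counted over the last five bases, and Tm is estimated with the Wallace rule, a rough rule of thumb for short primers that is not necessarily the formula linked above.

```python
# Hypothetical helpers; assumes a non-empty, uppercase primer written 5' to 3'.
def gc_content(primer):
    """Return the fraction of bases that are G or C (0.0 to 1.0)."""
    return sum(1 for nucleotide in primer if nucleotide in "GC") / len(primer)


def gc_clamp(primer, window=5):
    """Count the G and C bases in the last `window` bases (the 3' end)."""
    return sum(1 for nucleotide in primer[-window:] if nucleotide in "GC")


def wallace_tm(primer):
    """Estimate Tm in degrees Celsius with the Wallace rule: 2*(A+T) + 4*(G+C)."""
    at = sum(1 for nucleotide in primer if nucleotide in "AT")
    gc = sum(1 for nucleotide in primer if nucleotide in "GC")
    return 2 * at + 4 * gc


primer = "ATGCGTACGTTAGC"
print(gc_content(primer))  # 0.5
print(gc_clamp(primer))    # 2
print(wallace_tm(primer))  # 42
```

A primer finder could slide a window along the sequence and keep only the candidates whose values fall inside ranges the user chooses. The remaining criteria (consecutive bases, orientation) would still need their own checks.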

In my mid-semester reflection, I talked about being nervous to experiment because I was worried I’d screw up and hoping I could get better at this as time goes on. I do think I’ve gotten better at this since writing that (yay!). I’m more comfortable with having a very rough program, with a lot of test print statements and commented-out incorrect code, since getting things wrong is crucial to figuring out how to get them right. I also think that ways to troubleshoot, potential solutions, or alternative ways to code things occur to me faster than they used to.

One thing that made this project hard was the lack of color variety (syntax highlighting) in modules that aren’t “main.py.” To troubleshoot the user checking above, I moved the cDNA translate function from “functions.py” into “main.py” so the colors would show up and I could catch my mistakes faster. Have Python 3 Trinkets always been like this? Another was the difference in error reporting in Python 2 vs Python 3 Trinkets: Python 2 highlights your error on the appropriate line and is a lot clearer. Python 3 traces the problem back through each function call before arriving at the actual error at the end of the traceback, which is less intelligible to me (but it might just be me).

I realize that most librarians don’t have science backgrounds, so I’ve had to explain some basic Biology to every SILS student to whom I’ve shown this program. They all think it’s cool though (not without one peer calling me a nerd). I told a college friend and fellow Biology major about it a few days ago and she got very excited and thought it would be practical, so I do feel like I’ve created something useful. The need to explain my job or assignment for school in layman’s terms is something I’ve experienced while working in labs (“What did you do at work today?” can be answered in many different ways, and “I did minipreps of the TALENs and Cas9/gRNAs for Ant1 in the expression vectors, digested them, and then sent the correct preps out for sequencing and transformed them into GV3101” is significantly less intelligible than “I checked the samples I made yesterday and did the next step for the ones that were correct”), and I think it will be the case as I pursue my hoped-for career as a science librarian.

Relatedly, going forward, I do hope to continue to use these skills so I don’t forget how to code in Python. The job description for my summer internship said that familiarity with a programming language, specifically Python, is preferred. When I wrote the application over winter break, I mentioned that I was taking a Python class in the coming semester, and my supervisor recently suggested brushing up on Python before I start the internship in June. I will definitely need to review dictionaries and regex, since I barely utilized dictionaries in this project and I didn’t even touch regex (at one point, I imported re for the sequence-not-containing-ATCG-user checking, but I ended up doing it a different way). I do feel like I have the resources and strategies to gain a deeper understanding of the language, and other programming languages if I so desire. That’s a good feeling.
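
For review purposes, the regex version of the ATCG check mentioned above might look like the sketch below. The function and pattern names are hypothetical, and this is not the approach the project ended up using. It assumes uppercase input.

```python
import re

# Matches a string made up entirely of one or more A, T, C, or G characters.
DNA_PATTERN = re.compile(r"[ATCG]+")


def is_valid_dna(sequence):
    """Return True if the whole sequence consists only of A, T, C, and G."""
    return DNA_PATTERN.fullmatch(sequence) is not None


print(is_valid_dna("ATGGCCTAA"))  # True
print(is_valid_dna("ATGGCXTAA"))  # False
```

`fullmatch` requires the entire string to fit the pattern, so a single stray character anywhere in the sequence makes the check fail.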

Becca is a second-semester MSLS student in SILS. She likes science, words, the outdoors, and helping people. Find Becca Greenstein on Twitter, GitHub, and on the web.