Thursday, December 23, 2010

Efficiently Reading in and Iterating Through Large Files with Python

One of the challenges of being a bioinformatician is efficiently dealing with large files. Imagine parsing through a text file that is a couple of terabytes--not a feat as simple as one would think. Usually, even though we get mass amounts of data in a text file, we can avoid worrying about the size of the file since we have powerful computing clusters. I know, we are all so spoiled. However, every now and then we have to bite the bullet and not rely on ample resources to make up for our blissful ignorance.

I wrote a Python script that does some plotting for me. However, it takes about a week to run since it has to parse and process about 500 gigabytes of data.

There are two programming amendments that are easy to implement and very helpful in keeping a low memory footprint (and also speeding up the program):

(1) range
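In Python 2, calling range builds the complete list of numbers and stores it in memory. xrange, on the other hand, produces the numbers one at a time as the loop asks for them. (In Python 3, range itself behaves this way and xrange no longer exists.)

In Python 2, range(0,5000) takes up more memory than xrange(0,5000).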

Here is a quick example of what I mean:

range.py:
import sys

for i in range(0,int(sys.argv[1])):
        pass

print " "


xrange.py:
import sys

for i in xrange(0,int(sys.argv[1])):
        pass

print " "



[$] time python range.py 10000000

real    0m1.139s
user    0m0.911s
sys     0m0.228s

[$] time python xrange.py 10000000

real    0m0.506s
user    0m0.502s
sys     0m0.004s
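
The timings above show the speed difference; the memory difference can be checked directly as well. The following is only a rough sketch, and it assumes Python 3, where range is lazy in the same way xrange is in Python 2, so list(range(n)) stands in for the old list-building range. Note that sys.getsizeof reports the size of the container only, not of the integers it holds, so the real gap is larger than what is printed.

```python
import sys

n = 5000

# Assumption: Python 3, where range() is lazy (like Python 2's xrange).
lazy = range(n)

# list(range(n)) materializes every number, like Python 2's range() did.
eager = list(range(n))

# getsizeof counts the container itself, not the int objects it refers to.
lazy_bytes = sys.getsizeof(lazy)
eager_bytes = sys.getsizeof(eager)

print("lazy range object: %d bytes" % lazy_bytes)
print("materialized list: %d bytes" % eager_bytes)
```

The lazy object stays the same size no matter how large n gets, while the list grows with n.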


(2) readlines
When reading in a file, I usually write something along the lines of:

file = open('somefile.txt', 'r')
for i in file.readlines():  # readlines() loads every line into a list first
    pass  # operation
file.close()


This is great. It is clear, concise, and elegant. However, readlines() reads the entire file into memory as a list of lines before the loop even starts. If the text file is gigantic, this may not be a very good idea. Luckily, there are several ways around this:

Method 1:
with open('somefile.txt', 'r') as FILE:
    for i in FILE:  # the file object yields one line at a time
        pass  # operation


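Please note that the "with" construct is only available by default from Python 2.6 onwards (in Python 2.5 it has to be enabled with "from __future__ import with_statement"). It also closes the file automatically when the block ends.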

Method 2:
import fileinput
for i in fileinput.input('somefile.txt'):  # reads line by line
    pass  # operation


The fileinput module is great for folks who like to use as many pre-built modules as possible rather than re-inventing the wheel.

Method 3:
BUFFER = int(10E6)  # read roughly 10 megabytes of lines at a time
file = open('somefile.txt', 'r')
text = file.readlines(BUFFER)
while text != []:
    for t in text:
        pass  # operation
    text = file.readlines(BUFFER)
file.close()


Although this method is the messiest of the three, it also provides the most control over how much memory the program can suck up.
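
One way to tidy up Method 3 is to hide the buffering inside a generator, so the calling code looks as simple as Method 1. This is just a sketch of the idea: the function name iter_lines_buffered and the temporary demo file are made up for illustration, and the size passed to readlines is only a hint, so each batch may be slightly larger than requested.

```python
import os
import tempfile


def iter_lines_buffered(path, buffer_bytes=10 * 1000 * 1000):
    """Yield lines from path, reading roughly buffer_bytes at a time."""
    with open(path, 'r') as handle:
        lines = handle.readlines(buffer_bytes)
        while lines:
            for line in lines:
                yield line
            # Fetch the next batch; an empty list means end of file.
            lines = handle.readlines(buffer_bytes)


# Demo on a small temporary file (hypothetical data, not a real dataset).
fd, demo_path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'w') as out:
    for n in range(1000):
        out.write('line %d\n' % n)

# A tiny buffer forces many batches, which exercises the loop.
line_count = sum(1 for _ in iter_lines_buffered(demo_path, buffer_bytes=256))
os.remove(demo_path)

print(line_count)
```

The buffer size stays tunable, but the messy while loop lives in one place instead of in every script.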

Even though it might be more memory efficient to not load the entire text file into memory, the program may end up being slower if one is not careful. It is a balancing act and there is no single correct way to go about something like this. One just needs to weigh the options and do what seems best. In my experience, the three methods show minimal variance in execution time (no matter how large the file), so it really comes down to a matter of personal preference.
I've used all three of the suggested methods at one time or another. My personal preferences are methods one and three.
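
Rather than taking my word on the execution times, it is easy to measure the three methods on your own data. Below is a minimal timing sketch, assuming Python 3 (for time.perf_counter); the helper names and the small temporary file are invented for the example, and a file this small says little about a 500 gigabyte one, so point it at a realistic file before drawing conclusions.

```python
import fileinput
import os
import tempfile
import time


def count_with_iteration(path):
    # Method 1: iterate over the file object directly.
    with open(path, 'r') as handle:
        return sum(1 for _ in handle)


def count_with_fileinput(path):
    # Method 2: the fileinput module.
    count = 0
    try:
        for _ in fileinput.input(files=(path,)):
            count += 1
    finally:
        fileinput.close()  # reset the module's global state
    return count


def count_with_readlines(path, buffer_bytes=10 * 1000 * 1000):
    # Method 3: readlines() with a size hint.
    count = 0
    with open(path, 'r') as handle:
        lines = handle.readlines(buffer_bytes)
        while lines:
            count += len(lines)
            lines = handle.readlines(buffer_bytes)
    return count


# Build a throwaway test file (hypothetical data).
fd, demo_path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'w') as out:
    for n in range(20000):
        out.write('record %d\n' % n)

results = {}
for method in (count_with_iteration, count_with_fileinput, count_with_readlines):
    start = time.perf_counter()
    lines_seen = method(demo_path)
    elapsed = time.perf_counter() - start
    results[method.__name__] = lines_seen
    print("%-24s %d lines in %.4f s" % (method.__name__, lines_seen, elapsed))

os.remove(demo_path)
```

Each method should report the same number of lines; only the timings differ.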

Tuesday, December 7, 2010

Being Productive and Remaining Stress Free

Over the last week, I had to investigate, prepare, and present an important scientific study, as well as write a final paper for a class I was taking. Needless to say, it was not easy to remain stress free. I may have appeared calm and collected on the outside, but on the inside I was a total train wreck. It is a well documented fact that stress can be a killer—so cutting as much out as possible is definitely something worth striving for.

Productivity and stress don't really have much to do with each other. They may be correlated in some ways, but for the most part they are independent.

It is possible to be:
  • productive and not stressed
  • stressed and not productive
  • unproductive and unstressed
  • productive and stressed.

I've found that there are 3 factors that contribute to making a task stressful:
  • poor planning
  • unreasonable expectations
  • poor time management

When there is no plan, there really isn't any course of action to follow. Setting up a plan for action is essential. This will ensure a smooth flow of events when completing task milestones. There shouldn't be any moment when a milestone is complete and one is left to wonder about what to do next.

Being ambitious is one thing, setting high standards is another, but setting unreachable goals is just a waste of time and energy. Not only is this demotivating, but the task will never get finished. Nobody likes to fail, so why set yourself up for failure? Ultimately, it does more harm than good.

Taking long breaks is completely understandable. In fact, I encourage it! A fresh mind is a more productive mind. However, there is a point where it is necessary to buckle down and churn out some machine-like work. Human beings are very capable of working on a single task for a few hours straight. Besides taking it easy on the breaks, it is crucial to spend time on the right parts of the task! Remember the 80/20 rule. 20% of the work takes 80% of the time, 80% of the work takes 20% of the time. Get that 80% done. Finish the easy tasks first and then go for the gold.

In retrospect, I'm sure that if I had followed my own advice I wouldn't have been so stressed out. Hopefully, anybody else who reads this takes away something useful.