Some Python notes

Yes, I love python. However, I am no expert. Most of the stuff I write in python is dirty and ugly code. Sometimes I figure out new things (things that programmers already know) and sometimes I forget these things. So, here are a few tips and tricks that I use from time to time. Really, I am writing this for future Rhett when he does a google search for "how do you save data to a file in python".

Saving data to a file

Suppose I am modeling a basketball falling through the air. I want to plot this data, so I save position and time data in lists. For example: Here is the info from scipy.org that covers this same stuff.

x = [0.1, 0.2, 0.3, 0.41, 0.52]

t=[1, 2, 3, 4, 5]

In this case, there really isn't a problem. However, what if I have 1000 data points in each list and I want to use it again? If the program only takes a second to run, I will just re-create that data. Sometimes this is not feasible. What if I want to model throwing a basketball 1000 times and each time it takes about 1 second to calculate. 1000 seconds is not too long to wait, not too long unless you are impatient like me. Ok, how do you save this sucker as a file?

Here is a super-simple sample:

i-3556d4ee7a86fc9cfec0f0e5c256c5ff-2010-09-02_python_sample.jpg

Of course, the key here is the "savetxt" function. If you don't import the pylab module, it might not work. I think the savetxt function is part of numpy (which is in pylab) - just load pylab to be safe. Also, the file is saved to the same folder as your python script is in. The default settings for the savetxt function makes a space delimited file - you can change that if you want a comma delimiter, just add this:

i-76c92bbe9881b71cc9bb8688c19e3fa4-2010-09-02_delimiter.jpg

One more note: Python is ruthlessly obedient. If you just spent 20 hours running a program and you save the file as temp_data.txt. Then later you run something and save it as temp_data.txt - done. It just overwrote your previous data. You can't blame python, it is just doing what you told it to do.

What about getting the data back? Here is another sample that takes the data from the file created by the previous program and plots it.

i-3a78e9482df9ac79b9da56f0a625aa2c-2010-09-02_get_data.jpg

The genfromtxt function basically is the inverse function of savetxt function. Notice that I had a list of lists for my data - I separated this into two separate lists for plotting (gotta keep em separated).

Functions

Everyone uses functions. But what if you want to make one yourself? It really isn't too difficult. Let me start with a sample program. Suppose I want to use the same program as above, but I want to see what happens as I change the acceleration. One simple (and ugly) way would be to just keep copying the code from above how ever many times you want to change the acceleration. Another way would be to make a function that takes an acceleration and creates the data. Then you could just call this function as many times as you like.

Just to make this simple, suppose I want to look at the distance the object travels after 3 seconds as a function of acceleration.

i-2365b04562b642770a0a1555cdb3585d-2010-09-02_sample_function.jpg

Here, I created a function calc_dist(). It takes one parameter - the acceleration (which I labeled as ac so it wouldn't get messed up with the variable a). The function is essentially the same as the previous program - but the key is the return. At the end of the function, what do you want it to give back? For this case, I have it just returning the final distance. It is possible to return any number of things. Suppose I wanted to also return the time (even though I already know that). At the end of the function, I would just put:

i-c36dd1572507dd5466aab337bbbcf000-2010-09-02_return.jpg

When I call this function, I could say result=calc(a), but result would be a list such that result[0] would be the position and result[1] would be the time.

The key thing to remember is to not make variables that appear both in the function and outside the function. This is bad. This is like crossing the streams from the Ghostbusters' proton packs. Why are they called proton packs anyway?

Going through a list

I actually already used this idea above. What if I have a list of positions, and I want to print them:

i-87198883a6317c4c73cbe5a1d2a4e8b0-2010-09-02_golist.jpg

The old Rhett would have made a variable n = 0 and then said something like "while n < 10:" and then in the loop have n = n +1. This python way is much easier. The 'len' is the length of the list and range says how high n should go. You don't have to increment n yourself, python does it for you.

Oh, notice that I reference data[n][0]? Since data is a list of list - actually it looks like this:

i-ef5c4e26d725219eb3b29a909a84a141-2010-09-02_lists.jpg

The element data[3][0] would be the 4th in the list with the first element in that 4th thing - remember python starts with 0 being the first element.

Histograms and Random Numbers

Histograms are useful - and super easy to make with MatPlotLib (which is in the pylab module). Here is a sample program:

i-96f4a91f70f247127803cdfd9f5df113-2010-09-02_normalhisto.jpg

Here is the output from that program:

i-09508e7e374b097f5b0fd45dfb55b25c-2010-09-02_figure_1.jpg

This really shows two useful functions:

  • hist(): this is just like the plot function (and it is in pylab). Just pass the list of data to the function and it will make a histogram. Boom. That simple. If you like, you can tell it how many bins to make (in this case I said 10 - which happens to be the default). I also made the bars green (I don't know why, but blue always shows up funny on the webpage). Also, I turned on the grid. This is the matplotlib documentation on histograms.
  • nomralvariate(): This function produces a random number from the normal distribution. If I just used the "random()" function, the histogram would be boring and flat. The normalvariate function is not in pylab, so you have to load the random module. The main arguments it takes are the average and the standard deviation of the distribution you want. Here is a description of other functions in the random module.

Oh, and you are welcome future Rhett. I hope you still have at least a little bit of hair.

More like this

Going through a list
--------------------

This is even simpler:

for d in data:
print d[0]

Oops, that didn't preserve indentation. Note that the "print d[0]" above should be indented.

Oh those happy days of fixing code written by engineers and scientists (gritting teeth here)! Oh yes, I could tell you some war stories.

Please guys, when your program starts to get beyond a few pages, drag in a programmer. I know we don't really understand your subject-matter area the way you do, but we kind of don't need to. Being able to attend to code structure without needing to know to much about what it does or means is part of what we do.

If you don't, the risk is that you'll wind up depending on a monster that you utterly need to do your work but which cannot be fixed or modified. If you depended on weird glassware, after a point you'd call in a glassblower, right? Much the same thing. Before you start spending more time and head-space programming than doing actual science, get one of us in to refactor your code. It could save you weeks of effort.

Paul, I'm sure you are right, but ... as someone who drives programmers crazy, for the reasons you suggest, here a few considerations:

1) Python is a scripting, interactive language. Therefore, you can think of it like a macro language. Macros are for anyone and everyone.

2) Sometimes the structure of the program IS the point, in scientific research. I've done projects where the final outcome is a set of statistics and a graph that could have been derived from a formula. But the formula is hard to understand for the relevant audience. As simulation is more understandable, so I've written simulations. Somewhere between a long and drawn out simulation, where no programming efficiencies are considered (well, almost none) and no short cuts are taken, and a formula, are a lot of other versions of the process that could have been done and that, likely, a real programmer would have come up with, and that would be understandable to a programmer.

But non-programmers as well as those who could not really get it with the formula either can look at the program and with a little guidance totally get it.

On the other hand, I remember once writing a procedure in MS Access to do a major operation on a database .... when I ran it it took five or six hours to run. I had a question about a bit of the code, so I posted the question on a BBS. A programmer posted a SQL one-liner that replaced my procedure and ran in a few seconds and produce the same exact result.

At no point did I understand the SQL statement, but it worked. I love the fact that it worked (and I used it) but he fact that I didn't understand how it worked meant that if it was a computational process (instead of mere database cleanup) I could not have used it in research.

Though not all programmers are good programmers and everyone can benefit from learning for themselves how to program well anyway. It's enlightening, empowering and surprisingly easy too, and I was absolutely furious when I later discovered that the dreary and pointless âhow to use a spreadsheet and word processorâ kind of rubbish I was forced to endure on my degree course could've been something like this:

http://groups.csail.mit.edu/mac/classes/6.001/abelson-sussman-lectures/

instead.

Often you have more than one list of the same size and one index is enough. Then I like the following trick:

for i, item in enumerate(list):
print item, listTwo[i], listThree[i]

Really, I am writing this for future Rhett when he does a google search for "how do you save data to a file in python".

This is the real reason for putting comments in your code: so that when you return to it a few months or a few years later you have some clue what the #&^% you were thinking at the time. It applies to any programming language which supports comments, and future-me has benefitted multiple times from reading past-me's comments.

By Eric Lund (not verified) on 03 Sep 2010 #permalink

There's also the numpy.loadtxt function for loading data from a file. I haven't used the genfromtxt function, so maybe it's similar, but my favorite thing about loadtxt is the skiprows=n option, where n=integer (i.e. I usually set n=1 to skip the column titles).

Besides the nomralvariate() function in random, there are a couple options in scipy, namely the convenience function scipy.randn(d0,d1,...) which returns an array of shape d0,d1,... of normally distributed random numbers. (This is complement to scipy.rand(d0,d1...), which returns uniformly variate values)

Additionally, there is a whole lot of other random possibilities in scipy.random, which has the benefit over the builtin random of returning arrays rather than lists.

Bob Day already mentioned the for loop simplification. (For loops in python can loop over any list/sequence/iteratable, not just the result of range())

I'll also note that you don't need pylab/numpy to save python objects - stock python comes with the "pickle" module which can do it for you.

--

import pickle

output = open("filename","wb") #"b" means binary - there's an option to specify text-mode pickling if you're concerned about this.
pickle.dump(object, output)
output.close()

input = open("filename","rb")
object = pickle.load(input)
input.close()

--

You can only save a single object this way, but you can get around that by wrapping a bunch of separate objects in a tuple or dictionary.

(Dollars to doughnuts, pylab's savetxt is actually using the pickle module under the hood.)

This is a slightly more advanced topic, but list comprehensions can make some common operations more, well, comprehensible. Instead of (_ used instead of spaces at the beginning of lines):

data = []
for n in range(1000):
__data = data + [normalvariabe(mu, sigma)]

You can do:

data = [normalvariate(mu, sigma) for _ in range(1000)]

That might be a bit confusing, since we're not actually using the iteration variable (which is why I called it "_"). Here's another example:

xplot = []
tplot = []
for n in range(len(data)):
__xplot = xplot + [data[n][0]]
__tplot = tplot + [data[n][1]]

Becomes:

xplot = [row[0] for row in data]
yplot = [row[1] for row in data]

Actually, you could also do this:

xplot, yplot = zip(*data)

But this is a bit more intricate. This is mixing together zip with the * operator to expand a collection into distinct function arguments and a tuple unpack. You can think of the idiom zip(*x) as "transpose x", and "a, b = y" as giving the two rows of y the names "a" and "b".

range says how high n should go

To be nitty, this isn't what range does. It's not incrementing a single variable each time through the loop. It creates a list of all the variables initially and increments the position in the list each time through the loop. This is mostly okay, but if N is very large it can use a small but non-negligible amount of memory.