The OpenOffice challenge: can you do what needs to be done?

Okay OpenOffice fans, show me what you can do.

Earlier this week, I wrote about my challenges with a bug in Microsoft Excel that only appears on Windows computers. Since I use a Mac, I didn't know about the bug when I wrote the assignment and I only found out about it after all but one of my students turned in assignment results with nonsensical pie graphs.

So, I asked what other instructors do with software that behaves differently on different computing platforms. I never did hear from any other instructors, but I did hear from lots of Linux fans. And, lots of other people kindly informed me that I could use OpenOffice and that it runs and behaves exactly the same on Macs, Linux, Windows XP, and VISTA.

One commenter, Chris Miller even went the extra mile and made a very nice screencast demonstrating how to use OpenOffice to count unique data. The screencast didn't show the things that I need students to do, but it did give me an idea.

The OpenOffice Challenge

I'm really not opposed to using OpenOffice. I'm just not convinced that it can do the tasks that I need to have done. After 23 years of using Microsoft Word and Excel; I'm not terribly motivated to switch to something if I suspect that it has fewer features and is harder to use.

I really do want what's best for the students, but while, rants in the comment section are kind of amusing, they're not convincing.

I'm a scientist, okay, I need evidence, not anecdotes. Don't give me testimonials, give me data!

So, here's your chance to show me that OpenOffice has the features that I need. If OpenOffice will do what I want, I'll be happy to give it a try the next time I teach this class. If the results look okay and OpenOffice works okay, without too many complications, I'll even make my own screencast of OpenOffice and post it here with instructions.

This assignment takes me about 20 minutes to do with Microsoft Excel 2004 on my Mac. (I guess I'll be using this version for quite awhile, thanks some news from Andy! (commenter 26)).

Now, what do you do?

The contest rules
1. Get a copy of our data here and save the data file as text (okay, you probably already knew that).

2. Parse the third column so that the culture state, biomes, and bacterial genera and species are separated into four (or more columns).

3. Clean up the data.

4. Make a pivot table (or the equivalent) to count the numbers of each genus found in each of the two biomes. You don't need to count the bacteria that live in unknown biomes, we're only interested in the bacteria that were found in the forest and creek.

5. You will probably have to clean up the data again and redo the pivot tables. (This is real life kinds of stuff that we're doing here after all.)

The final table should look like this:

i-37c0e9f311e0b024e5aa6bc6184a7074-table.png

6. Make two pie charts from your pivot table so that we can compare the bacteria that live in the different biomes. Warning: this was the part that almost everyone got wrong.

7. Take screen shots of your pivot table and your two pie graphs.

8. Share the images with me and the rest of the world. You can put them on-line somewhere, like your blog or flickr or something and add the link here in the comment section.

Or, if you have a gmail account, you can open up Google Documents, make a presentation, and insert your images. Then click the Publish tab and put the link to your file in the comment section.

Game on!

Categories

More like this

Just out of curiosity, is this something you have your students do? Based on your time, I would expect this to take them 40 minutes to multiple hours (depending on how well versed they are in Excel). Is that your experience?

(just trying to get another data point on how computer literate students are)

Yes, indeed. This is one of the assignments that they do, as part of a long project working with 16S rRNA sequences from bacteria on the JHU campus. We get new data sets every year, and we do lots of different things to analyze it and see how the results change between biomes and over time.

I don't know exactly how long it takes the students, but I'm pretty sure from their comments that it takes them more than 40 minutes and only one of them figured out how to work around the bug in Excel. I don't think they automatically know how to use Find and Replace or how to edit information. They were also surprised to find that you can't edit a pivot table because they didn't quite understand that a pivot table, like a graph, is a kind of report that's generated from primary data, and not data itself.

That's why I ended up making the movies and writing new instructions with a work around for the Windows XP Excel bug. Unfortunately, I'm still not sure what to do about Windows VISTA and Excel 2007.

Maybe OpenOffice will be the answer, but so far, the jury's still out.

Ron - about your other question - how computer literate are the students?

My students are pretty mixed. A few are scientists in biotech companies. Some are students in biotechnology education programs. A few come from programming backgrounds.

The few with programming backgrounds are usually pretty literate although they annoyed sometimes by the biology parts of our course. They don't understand the biology terms and that bothers them.

The biotech scientists are usually kind of amazed that all these things exist and they didn't know it.

The biotech students are quite variable as far as skill. They range from a skill level where they can't find a file if they download it from the internet (especially Windows users) to students who decide they want to go on and learn SQL and Perl.

I do find that they need instructions for almost all Excel types of activities, though, because they've rarely done more than make one graph in one or two classes. And those graphs hardly ever include things like error bars, multiple series, or both axes like the graphs that I make them do.

I seriously doubt anyone is going to take part in your contest. If you are having problems with microsoft software and want an alternative, do what the rest of us do, and do your own homework.

I also doubt any open office (or any other open source application) user or programmer cares if you decide to switch away from microsoft, I know I don't. If you want to, it is available for free, but usually the rule is that if you are going to use it, you need to do the research and learning yourself before going to other people, or stick with the crappier commercial version.

This is a rule you are breaking. let me save you some time and encourage you to stay with microsoft, the community needs more people contributing on their own, not looking for other people to do it for them.

Thats what I like about the Linux/OSS community (of which I have been a member since 1996), its kindness and understanding of others and willingness to help (as exemplified by centrex).

Ms. Porter doesn't "care" if OO can do this or not, nor should she have to invest her own time to find out. If someone out there is an OO expert and can answer the question quickly, why not do so?

Helping others out _is_ the cornerstone of Linux/FOSS, and unless centrex manufactured his/her PC from ore and sand, and keyed in the machine code for his/her original, newly written OS bit by bit, he or she is depending on the efforts of others as well.

Hmph!

It will be interesting to try this our myself, since while OO has many features MSOffice does not, it is not a complete superset, and I will be intrigued to find out if OO can do pivot tables.

TomP.

My biggest complaint with the Open Office spreadsheet program is the inability to format graphs/charts the way I want. Mind you, Excel isn't exactly user-friendly in this regard, but at least I can get it to do what I want.

I do like Open Office as a document editor, though. Unlike Word, it doesn't keep trying to anticipate (incorrectly) what I want to do.

Just a note for Mac users. You may want to compare how Open Office (which only runs on Macs using the X-11 extension that basically runs it in Linux) versus in NeoOffice (a port of Open Office to the Mac GUI). The problem being that NeoOffice is usually a few months behind the latest version of Open Office).

In your blog photo you're all smiley and nice, but when I got into today's assignment I saw your *other* side. ;)

I doubt I would've gotten a decent grade on my sasignment but at least I can turn something in, unlike some of the previous commenters. Heh. Read about my troubles at http://www.triopticon.com/Members/kewball/posts/open-office-challenge

By Matt Platte (not verified) on 12 Mar 2008 #permalink

This was a quick and nasty run (~20 minutes) mostly to procrastinate. I probably messed up quite badly (a lot of cleaning was done without much care). I would normally spend more time on colour, and exclude categories that had 0 in both charts, but never mind.

http://img399.imageshack.us/my.php?image=openofficedemonstrationaq0.png

Oh, the data was very messy, but open office was able to parse it pretty nicely just through pasting. I had to delete about four cells after pasting to remove extra columns it added - problems with the data formatting, and combined a couple that got split (although I think most of what was moved into the next column wasn't needed anyway...).

I'm quite experienced with Calc, but only the simple functions. Someone who is a more advanced user will probably be better at this than me.

By Paul Schofield (not verified) on 12 Mar 2008 #permalink

I'm not an expert at Calc, but I am willing to jump in and try. However, I am having a little problem with steps 2 and 3.

Not being familiar with "culture state, biomes, and bacterial genera and species" from step 2, or what they are supposed to be, cleaning up the data in step 3 is kinda difficult, although it looks like the data definitely needs some cleanup because the data appears to be either in the wrong order, or some have info that the others don't (Note the line with label 208.ab1, which has the comment field starting with a number... which is the first field that starts that way, but not the last. Are all supposed to start with a number, or is the number in the wrong place in that one, or is something else at work that I don't know about?).

Assuming that you are testing OpenOffice.org Calc's ability to handle the data and not the general populace's Biology knowledge, at the very least, an example of how it is supposed to be formatted, even using different data, would be helpful for those of us who aren't as familiar with Biology as your students. Keep in mind, I haven't been in a Biology class since Biology II in high school back in 1990-1991 (I didn't take any Biology classes in college).

Corfy;

There were a few samples that had extra values - these were either copies of what I assume were sample labels or whatever that other number was (Q20_over_len) which you don't need for the actual data. The biomes are just Forest, Creek or Unknown, and various misspellings of each. The various genera are easily worked out with a couple minutes google/wiki work.

Also, the formatting is a bit off, but copy into notepad and then into calc. That gets rid of a lot of the problems (I had half the data doing very strange things).

I've only been out of biology for five years myself, but the physics has easily overwritten all those modules already.

By Paul Schofield (not verified) on 12 Mar 2008 #permalink

I've never played with the "Data Pilot" (OO's name for what Microsoft calls "Pivot Tables") features at all before, so I may play with this if I get a chance. It's a worthy cause, and would probably be good practice. Besides, I need something useful to do between beer-microbiology posts.

A question about step 1, though: Is that really the original data - that is, was the original data REALLY entered as an HTML table? Or is the original data available in some other format? (Or can it be exported from Google Docs as .odf or other format?) If so, that might be a lot simpler than trying to "screen scrape" the data by copying and pasting out of the HTML table...(No, I've not used Google Docs before at all).

Here is my try...not so difficult...other than I knew not what I was dealing with at any point:

I really like what the fellow did above with somehow harmonizing the colours so one legend is available for two pie charts.
I commented here on the OO.org forum:
http://www.oooforum.org/forum/viewtopic.phtml?t=69471&highlight=

Cheers

I won't have time to look at any results until tonight, but I can tell you quickly where the data came from.

The original data were entered by the students in my on-line bioinformatics course at Austin Community College. The students used FinchTV to select the good parts of their sequences, connect to blastn, and identify the bacterial species with the best match.

Then the students entered their results in the iFinch database as free-form text by using the FinchTV comment window. You can see this here.

Once the data are stored in the iFinch database, we can retrieve them pretty easily with SQL. I wrote an SQL statement that would extract that data and stored my query in iFinch. The students then, log into iFinch, run the query, and click a button to download an Excel-compatible spreadsheet with the data. I uploaded a copy of that file to Google Docs.

I made a movie about this part of the project and posted it here.

My students are pretty mixed. A few are scientists in biotech companies. Some are students in biotechnology education programs. A few come from programming backgrounds.

So most of your students should have had at least some experience with computers (assuming they are second year students), and some should be fairly proficient.

The few with programming backgrounds are usually pretty literate although they annoyed sometimes by the biology parts of our course. They don't understand the biology terms and that bothers them.

I know what you mean. A friend of mine is a computer scientist who works for a biotech company. Our jargon confuses him as much as his jargon confuses me. :-)

The biotech scientists are usually kind of amazed that all these things exist and they didn't know it.

Once you get set in your ways.... That's one of the reasons I hang out and read your blog. I get to learn different ways to do things.

The biotech students are quite variable as far as skill. They range from a skill level where they can't find a file if they download it from the internet (especially Windows users) to students who decide they want to go on and learn SQL and Perl.

That's the variability I've seen, too. Which is the basis for my comment umpteen posts ago about a computer skills course the students could test out of.

I do find that they need instructions for almost all Excel types of activities, though, because they've rarely done more than make one graph in one or two classes. And those graphs hardly ever include things like error bars, multiple series, or both axes like the graphs that I make them do.

Not to mention that Excel is a business application that we "make do" with for science graphs because it is ubiquitous on the machines the students can access. Don't know how many times I have to tell the students that they shouldn't use the "line" graph for their data.

Hi Sandra,

I posted your question on OOoForum.org here :

http://www.oooforum.org/forum/viewtopic.phtml?p=277432#277432

Hopefully some of the info from the people over there can be helpful.
I would also encourage you to participate in this thread.

At least one person got a result :

http://i201.photobucket.com/albums/aa272/swingkyd/challengout.png

Biggest problem for us IT geeks is what exactly you mean with things like biomes, culture state, and 'clean up the data' ;-)

Hello, I don't know why I took the challenge, but I guess that my own research is kind of frustrating at this time and so I was happy to work on a problem that I could actually solve (I was also curious how a research in other fields of study might look like). But here is what I found:
Let me first state that I know next to nothing about biology (apart from what I learn by reading scienceblogs ;-)) my field of study is IT. So if you find any errors in my solution, this is most likely the cause (for example I do not know whether arthropobacter is a differrent genus than anthropobacter or whether it is just a typo).
The hardest part of the challenge was understanding the contest rules, for example when you wrote "Parse the third column" I have found this to be quite a challenge since the data is such a horrible mess that resists any kind of attempt with automatic parsing. I could not figure out how to do it in excel so i have written a perl script, but I still had to correct some errors by hand. I guess that either your knowledge of excel if far better than mine or that when you wrote parse, you meant do it by hand.
Next thing the pivot tables: I never really used them much before and found the work with them in excel quite clumsy. For example I could not figure out how to put biome in both the data part and the column header of the pivot table. I have got around this by introducing a new column full of ones into the data and using this as the data part of the pivot table.
The graphs are also difficult, because if if I try to make a graph from a pivot table, excel makes a special kind of pivot table graph which changes every time the pivot table changes (this means there is no easy way to display pie charts for creek and forest at the same time).
Now let's move to openoffice.org: I have actually never used it before I have always worked with Excel or alternativelly with GNUPlot. The first thing I have tried were the pivot tables and I must say that they are much better implemented in Openoffice. The creation of the table was very fast and intuitive by something they call "Data Pilot". It took me less than 20 seconds and I was able to use biome in both the column header and the data part with no problems.
The graphs, they were kind of problematic because they are done quite differently than in excel. For example it took me quite some time to figure out how to add a title to the graph. Another issue (as you can see from the pictures) is that the number of colors for cathegories is limited and after a few cathegories they start to repeat which is quite annoying.
Anyway I have uploaded the pictures of the pivot tables and the graphs here: http://sweb.cz/tomas1p/openoffice/ I hope I got it at least partially right.
So I my advice about open office is: try it ;-) Be prepared that it will have both the features that you will like as well as those that you might hate. If you give it some time you will figure out which ones prevail and if you are lucky it might even solve your problem.

I also doubt any open office (or any other open source application) user or programmer cares if you decide to switch away from microsoft, I know I don't. If you want to, it is available for free, but usually the rule is that if you are going to use it, you need to do the research and learning yourself before going to other people, or stick with the crappier commercial version.

This actually is part of why adoption rates are lower than you'd expect for free and open source software - there's little support infrastructure or incentive to make one, so you have to rely on the enthusiasm of the user community and developers to help you, which depending on your needs and knowledge level, may or may not be good enough. As a commerical software developer, I'm quite strongly motivated to have people use the software I make, as are my coworkers. If people stop buying it, then my company (which is not Microsoft or linked to Microsoft on any level other than using their products) lays us off. Thus, we go to great lengths to make sure the software is meeting customer needs. If one of our customers needs to know how to do something, they can call us 24/7 and get not only a customer support rep, but one who is a programmer and capable of investigating and fixing bugs if necessary.

Here's a quick hint: you can use Google to figure out the correct spelling for the names of the bacterial genera.

And Google will even give you the definition of a biome. It's a place where things live. The creek is one biome, the forest is the other.

A perhaps not hugely helpful comment, but: is it really necessary to require people to all use the same tool? I mean, being able to understand the material and the needed process to get the data is the important part after all, and if someone uses Excel, another Perl and ImageMagick, or Matlab or Octave, or R ... it is still all solving the same problem.

If anything, a plurality of solutions would highlight to the computationally less well versed students that yes, there really are a number of possible tools and there is no one single best way you have to do a particular task. You would still want to give a more detailed crib sheet for some method like Excel of course, for those students who find themselves completely at sea when it comes to data processing. But require the same tool for everyone sounds a bit counterproductive.

Hi Janne,

I don't mind the comment. If my students wanted to use something else, other than Excel it would be okay. The problem is that I can only write so many sets of instructions.
And, since I have Excel, and all of the students have access to Excel, I write instructions for Excel.

My goal, really, is that the students learn how to analyze the data. I require that they make the tables and pie graphs properly so that they can answer the question we're asking and discuss their results. We're trying to find out something about the metagenomics of two different biomes. We want to know how what kinds of bacteria live near the creek, what kinds of bacteria live in the forest, and what do the distributions of those bacteria look like? Are there some bacteria that are found in both places? Are there some bacteria that are only found in one place or the other? We can't answer those questions without the proper graphs and we can't make the proper graphs without the proper tables tallying the data.

But, this is a community college course and we don't really have any prerequisites. Consequently, none of my current students know how to use anything other than Excel and even using Excel is pretty hard for many of them.

So, it falls to me to write good, easy to follow instructions and make movies when those instructions aren't enough.

Having people use the same tool makes it easier to teach them how they can analyze the data. Most of the students are learning the software while they're trying to learn the biology. It is much easier on the instructor to teach using one tool. (in class demos or tutorials posted on-line)

That being said, I'm almost 100% sure the Sandra hasn't told the students they have to use excel. I know when I teach, I don't require the students to use a particular tool as long as they can produce the desired result. However, I do warn them that computer support is going to be tougher if they don't know how to use their tool of choice.

Looking at Sandra's post above, I realize I never really answered her initial question about how to deal with software. If the student has a laptop, I ask them to bring it in to class or office hours so I can try to help them figure things out. I warn them ahead of time that my expertise is pretty much Mac-centric. If they can't bring it in, I suggest other tools (like google docs) that don't cost them anything or I suggest they use the computer labs on campus. (Dealing with on-line only classes adds issues I haven't had to deal with yet.)

BTW - all you who worked on the problem, all I can say is WOW!

And I must have been nuts to sign on for grading more homework ;-)

I'll put a response up tomorrow.

But thanks!!! I do appreciate it and I'll give it a try.

Sandra was obviously writing faster than I did....

Well Ron, you've certainly given me inspiration for another post. Is it better to try and teach topics (i.e. computing and biology or math and biology) separately or to integrate the topics, as we see them in real life?

Two comments:

1) I have heard that OpenOffice products are very RAM-hogging in comparison to MS Office products, and in some cases they take *minutes* to load up documents or spreadsheets. That has been a dealbreaker for me to switch to Linux for now. Might want to get more info on how that might play into your consideration about switching to OpenOffice.

2) I noticed that you are using pie charts with quite a few categories. Have a look at this, which I found a pretty thought-provoking essay on why pie charts are not very good ways to display data. I think I'm going to swear off the pie.

I have heard that OpenOffice products are very RAM-hogging in comparison to MS Office products, and in some cases they take *minutes* to load up documents or spreadsheets.

This is only true to a limited extent. I certainly wouldn't load 50MB tab-delimited files into OOcalc for analysis, as it'd be somewhat painful (yes, more so than excel).

I'd argue, though, that you shouldn't be loading a data set of that size into a spreadsheet anyway. There are better ways to deal with large data sets. Use the right tool for the right job.

It seems that others have the challenge covered at this point, but if it turns out that their answers are unsatisfactory, I'll look into doing it later.

Sandra, you may want to change your link to point to the permalink of my screencast, so that you don't lose it when I post new entries.

Sandra, sorry; I misread your posts to mean that you required the use of Excel specifically.

Separate from this specific discussion, if you happen to be using Linux (or Windows; it's available there too), then Gnumeric (http://www.gnome.org/projects/gnumeric/) is actually a better replacement for Excel than OOCalc. Quite a bit faster, deals with larger data sets, and is more compatible with Excel spreadsheets in general (it emulates a lot of Excel bugs when needed in order to have spreadsheets behave as expected, for instance).

Thanks Chris, I replaced the link with permalink.

Janne: when I use UNIX, it's usually BSD, but Gnumeric sounds interesting, I may check it out when I get some time. The name is sure interesting, if it follows the GNU convention (GNU is NOT UNIX), does that mean the Gnumeric is not numeric?

CM: That looks like an interesting article. I think for our analysis, though, pies are still the best way. That may be more clear when I post some pictures of what they're supposed to look like.

Here's my shot, using OpenOffice on Linux.

I'm also not fond of pie charts, especially for a study like this where you want to compare the bacteria found in the two locations. A bar chart, something like this makes that more clear: biome_lab.png

That's a rather frustrating data table. Shouldn't the students learn to build clean tables first ? You know, one info per column, using codes (C for creek, F for forest), that kind of thing. And one special column for unstructured remarks and comments. Wouldn't that avoid loss of time and help improving the quality of analysis ?

A little trick I use rather often : when you have stuff in one column, separated by commas (for instance), and you want to split it between several columns, you can save the file as delimited text, then import it again, but this time, specify that both the semicolon (usual delimiter) and the comma are delimiters.

By Christophe Thill (not verified) on 14 Mar 2008 #permalink