
I'm a PhD student in Electrical Engineering. I'm currently working on a Monte Carlo-type simulation of the underwater light field for underwater optical communication (no sharks!). I'm doing the development in MATLAB, and I recently put all my code up on GitHub (https://github.com/gallamine/Photonator) to help avoid some of these problems (lack of transparency). Even if nobody ever looks at or uses the code, I know that every time I commit there's a chance someone MIGHT, and I think that helps me write better code.

The problem with doing science via models/simulation is that there just isn't a good way of knowing when it's "right" (well, at least in a lot of cases), so testing and verification are imperative. I can't tell you how many times I've laid awake at night wondering if my code has a bug in it that I can't find and will taint my research results.
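One way to gain some confidence is to check the simulation against a case with a known analytic answer. This is not the author's code; it's a minimal Python sketch (the actual simulator is MATLAB) of the kind of sanity check involved: unscattered transmission through an attenuating medium, where the Monte Carlo estimate must match the Beer-Lambert law.

```python
import math
import random

def mc_transmittance(c, L, n_photons=200_000, seed=1):
    """Monte Carlo estimate of the fraction of photons that travel a
    distance L through a medium with attenuation coefficient c without
    interacting.  Free path lengths are drawn from an exponential
    distribution, as in standard radiative-transfer Monte Carlo."""
    rng = random.Random(seed)
    survived = 0
    for _ in range(n_photons):
        # Sample a free path length: s = -ln(U) / c, with U in (0, 1]
        s = -math.log(1.0 - rng.random()) / c
        if s >= L:
            survived += 1
    return survived / n_photons

# The estimate should agree with the analytic Beer-Lambert result
c, L = 0.5, 2.0
estimate = mc_transmittance(c, L)
exact = math.exp(-c * L)
```

A handful of closed-form checks like this won't prove the full simulation correct, but they will catch whole classes of sampling bugs.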

I suspect another big problem is that one student writes the code, graduates, then leaves it to future students, or worse, their professor, to figure out what they wrote. Passing on that knowledge takes a heck of a lot of time, especially when you're pressed to graduate and get a paycheck.

There's got to be a market in this somewhere, even if it's just a volunteer service of "real" programmers who help scientists out. I spent weeks trying to get my code running on AWS, which probably would have taken someone who knew what they were doing a few hours. I also suspect that someone with practice could make my simulations run at twice the speed, which really adds up when you're doing hundreds of them and they take hours each.
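The "twice the speed" kind of win often comes from replacing per-photon loops with array operations. As an illustration only (in Python/NumPy rather than the author's MATLAB, and with hypothetical function names), here are two equivalent ways to apply a 2-D scattering rotation to a batch of photon directions:

```python
import numpy as np

def scatter_loop(directions, angles):
    """Per-photon loop: rotate each 2-D direction vector by its angle."""
    out = np.empty_like(directions)
    for i, (d, a) in enumerate(zip(directions, angles)):
        c, s = np.cos(a), np.sin(a)
        out[i] = [c * d[0] - s * d[1], s * d[0] + c * d[1]]
    return out

def scatter_vectorized(directions, angles):
    """Same computation on all photons at once with array operations."""
    c, s = np.cos(angles), np.sin(angles)
    x, y = directions[:, 0], directions[:, 1]
    return np.stack([c * x - s * y, s * x + c * y], axis=1)
```

The two produce identical results, but the vectorized version does the work in a few bulk array operations instead of an interpreted loop, which is exactly the sort of restructuring an experienced programmer would spot quickly (MATLAB rewards the same pattern).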



I'm a M.S. student in mechanical engineering facing a similar situation, except I haven't put any code on Github (my advisor wants to keep it proprietary, but I probably would not bother putting it up even if he were ok with it).

I've written around 15,000 lines of MATLAB for my research, and only a handful of people will ever need to see it. Some of it is well-structured and nicely commented, but other parts are incomprehensible and were written under severe time constraints. My advisor is not much of a programmer and will not be able to figure it out, and I feel bad for leaving a pile of crappy code to the person who inevitably follows in my footsteps. But I ultimately face a choice: write fully commented, well-tested, and well-structured code and graduate a semester late (at a cost of several thousand dollars to myself), or write code that's "just good enough" to get results on time. This is a solo project (there is no money for a CS student to intern), and unlike a professional programmer I'm not being paid to write code, so every second I spend improving it beyond the bare minimum costs me time and money.

Even if I were able to tidy up and publish all of my code, most mechanical engineers would not be able to understand it, because most can't write code. Those who can mostly use FORTRAN, although C is becoming more common. Even those who could understand my code would have little incentive to read through 15,000+ lines of it.

Unfortunately, as far as research code is concerned, a lot of trust is still required on the part of the reader of the publication. I agree that the transfer of knowledge should be handled differently, but until there is a strong incentive for researchers to write good code it will continue to be bad. Especially when many research projects only require the code to demonstrate something, after which it can be put in the closet.


"my advisor wants to keep it proprietary"

This concerns me. Is this kind of thinking pervasive in public academic institutions? Setting aside the copyright ownership issues that tend to accompany such discussions, would it not be better to be more open with the code in an attempt to gain peer review?

I understand your personal motivations about not publishing, but the statement about your advisor is what I'm worried about.


Yes it is. Often it's not for nefarious reasons - it happens a lot where I work because we use data from collaborators that is unpublished, and it's considered unethical to jump over them by releasing code or results based on it.

Of course, the problem is that it can sometimes take years to get large datasets published, and the code gathers dust and gets forgotten in the meantime. The papers and results, by contrast, aren't forgotten, because those are the things by which academic careers are measured.

I would personally support a wholesale change of culture in this area. Code and data/results/conclusions are not as separable as most scientists would like to believe, and often should be published as a unit. There has been a push in this direction for a while in the engineering sciences, but informatics-heavy disciplines like biology lag badly in this respect.


The ethical considerations with regard to "jumping" collaborators indeed make sense.

As to the last point, perhaps it's time the scientific community took software into consideration along with the data and its resulting papers. At the least, acknowledge the problem. At best, decide where (alongside the data? with the paper in progress?) the software should be stored.


I wonder whether our priorities for research are misguided. Isn't research about extending the knowledge of humanity? Writing and passing on readable code would probably advance us further in total than having everyone start basically from scratch.

(I'm not faulting you, you just react to the incentives.)


Well, I went to a lecture by one of the most prominent scientists here in Brazil, in which he explicitly said that the answer to your question is NO. Research as it stands today exists to feed the system. According to him, you:

* Publish, so you can get grants;

* Use those grants so you can publish more;

* Get more grants;

* Get tenure somewhere in the middle.

I have to confess I was very disgusted by him saying that in front of such a large audience of scientists and graduate students.



I agree it's a problem, but I think you have to fix the incentives to make meaningful change. When people are thrown into a cut-throat competitive environment, with tenure clocks, multiple junior professors per tenure slot, requirement to bring in grants to fund your research or you get shut down, etc., it doesn't encourage people to be altruistic and sharing.


I think the problem is fundamentally one of economics. Research is good, but you have to decide how much money to allocate to it. In order to decide, you need a metric for performance. Really, only scientists are qualified to judge whether the results of other scientists are worth anything, so currently the only metric we really have is publishing in peer-reviewed journals. Ultimately, therefore, that's where the incentives end up.

When a more appropriate way of quantifying research output and its benefits is found, hopefully a beneficial change in culture will trickle down into the academic trenches.


How about trying to fix the current system by making somebody else using your software count as a "super citation"? (It could even arguably count as much as co-authorship.)


I think this is an excellent idea. If published software could be tagged via a unique identifier (like the DOI of a paper), then it could be cited by that tag just like a paper. Well written software might even get cited more than the paper it was published in.


It's not his fault that the system is set up in such a way that some random bureaucrat who's not close to the project can make a department unemployed with a wave of his hand. Aiming for the next grant is how you survive; it's not trivial for academics (or anyone) to move cities every year or so to follow wherever grants might land.

Until there is that job security, knowing that as long as you keep working you're not going to be randomly turfed out, this phenomenon will be a fundamental part of the academic career.


Well, yes, it is about extending the knowledge of humanity, but it doesn't happen in a vacuum, and it is subject to a lot of the same constraints as any other human activity. And, as you say, there are incentives at work. If john_b's advisor had written "release usable MATLAB toolkit for $doing_whatever_john_b's_thesis_does" into his grant as a deliverable, you can bet that both john_b and the advisor would have made sure it was in a releasable state, and that the advisor would have had funds available to pay john_b to clean it up and get it ready to go; they would have been specifically allocated for that purpose in the grant's budget.


> graduating a semester late (at the cost of several thousand dollars to myself)

Really? Your funding isn't guaranteed?

When I did my Master's, I was funded as an RA without my advisor/lab having to tap her particular grants. Grants and fellowships were usually seen as something "extra" for master's and pre-quals PhD students, not their main source of funding. I find it surprising that your school or department seems to (or is forced to) think differently.


PhD student funding is guaranteed in my department, but funding for M.S. students is not. Initially I did have an RA, supplied by my advisor's start-up funding (he was fairly new at the time). But when his grant applications were rejected and his start-up funding ran out, I had to take a TA-ship instead. TA-ships aren't guaranteed, though, and at my school if you stay on one too long they take you off of it to make sure other students have a chance at funding too. My advisor did ultimately get some grant money in, but by that point there were other students who needed it more than I did.

It might have something to do with recent state budget cuts (it's a state-funded university). My department has also grown dramatically over the past few years, both in terms of faculty and students, so the graduate student funding will probably lag behind for a few years more.


There is a market, and it's called libraries. Eventually you will use a language where software carpentry and code reuse are core features, and tested, modular libraries for not only core algorithms but also deployment and dev-ops tasks (like managing a compute cluster in the cloud) will have standard approaches.

This is starting to shape up on the Python side of things, but it has stagnated a little bit. People who can and do write the foundational code are oftentimes too focused on making the code work, and not at all focused on improving the quality of the ecosystem that their code is part of. Open Source is a great mechanism for many things, but polishing up the last 20% is not one of them.


"Eventually you will use a language where software carpentry and code reuse is a core feature."

Well, to some extent products like MATLAB solve this problem. For better or worse, I trust MATLAB's ability to generate a (pseudo)random number, parallel process my functions, invert matrices, etc., etc.

On a broader level, thanks to the specialization of academia, chances are that the code I want to write isn't duplicated by others. Even if it is, I still have to trust them to have written it well - which is the whole problem here.

I guess I don't have as much hope as you do.


The people at Willow Garage think there is a market for this [1].

http://www.willowgarage.com/blog/2010/04/27/reinventing-whee...


Huh. Cool. Are you familiar with Metropolis light-transport? http://en.wikipedia.org/wiki/Metropolis_light_transport

I've read about it in the context of speeding up global-illumination path-tracing for computer graphics.

I think it's based on work that was originally done for neutron scattering.
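For anyone curious, the core of Metropolis light transport is the Metropolis acceptance rule, applied to mutations of whole light paths. Here is a toy Python sketch of that rule on a scalar state (unrelated to the author's simulator; the function names are made up for illustration):

```python
import math
import random

def metropolis_sample(log_p, x0, n_steps, step=0.5, seed=7):
    """Minimal random-walk Metropolis sampler over a scalar state.
    A move x -> x' is accepted with probability min(1, p(x')/p(x));
    MLT applies this same rule to mutations of entire light paths."""
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(n_steps):
        x_new = x + rng.gauss(0.0, step)
        # Accept when log U < log p(x') - log p(x), with U in (0, 1]
        if math.log(1.0 - rng.random()) < log_p(x_new) - log_p(x):
            x = x_new
        samples.append(x)
    return samples

# Target: an (unnormalized) standard normal; the chain's sample mean
# should settle near 0
samples = metropolis_sample(lambda x: -0.5 * x * x, 0.0, 50_000)
mean = sum(samples) / len(samples)
```

The appeal for rendering (and, plausibly, for scattering problems like the underwater one) is that the sampler concentrates effort on high-contribution paths without needing the normalization constant.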


I believe I came across it at some point or another, but I haven't looked closely at it. For better or worse, it doesn't seem to be employed by others doing light field simulations underwater ... not sure why. I'd have to poke into it further. I was, however, seriously thinking about writing a plugin for Blender that would utilize my specific scattering phase functions in their volumetric ray tracing renderer ... no time though.


"Essentially, all models are wrong, but some are useful."

[ http://en.wikiquote.org/wiki/George_E._P._Box ]



