"Deliberately straightforward" -- agreed. But "highly proficient" after writing one or two scripts? That's quite a stretch.
For instance, one of the questions I give in phone screens is for the candidate to write a program to count the number of occurrences of unique words in a text file. The "after writing one or two Python scripts" approach is something like this:
counts = {}
f = open('test.txt')
lines = f.read().split('\n')
for line in lines:
    for word in line.split(' '):
        if word:
            word = word.lower()
            if word in counts.keys():
                counts[word] += 1
            else:
                counts[word] = 1
f.close()
count_items = [(count, word) for word, count in counts.items()]
count_items.sort()
for count, word in reversed(count_items):
    print word, count
Whereas the "highly proficient" (and much simpler and more Pythonic) approach might look something like this:
import collections
counts = collections.Counter()
with open('test.txt') as f:
    for line in f:
        for word in line.lower().split():
            counts[word] += 1
for word, count in counts.most_common():
    print word, count
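For what it's worth, Counter will also count any iterable handed to its constructor, so the explicit loop can fold away entirely (at the cost of reading the whole file into memory). A sketch, shown on an inline string so it's self-contained; with a file you'd pass `open('test.txt').read().lower().split()` instead:

```python
from collections import Counter

# Counter counts an iterable directly, collapsing the counting loop
# into the constructor call.
text = "The quick brown fox jumps over the lazy dog the end"
counts = Counter(text.lower().split())

for word, count in counts.most_common():
    print(word, count)
```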
lines = [line for line in open("bible.txt")]
words = [word for line in lines for word in line.split()]
counts = {word:0 for word in words}
for word in words:
    counts[word] += 1
No imports needed. Linear time. A bit inefficient in the dictionary comprehension, but easy to read.
The "lines=" and "words=" can be compressed into one line, but I figure this is a bit easier to read for people who aren't familiar with nested list comprehensions.
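For the curious, the compressed version would look something like this (a sketch, with `io.StringIO` standing in for the file so the snippet is self-contained):

```python
import io

# The two comprehensions fused into one: the nested "for" clauses
# read left to right, outer loop first.
f = io.StringIO("in the beginning\nthe word\n")
words = [word for line in f for word in line.split()]
counts = {word: 0 for word in words}
for word in words:
    counts[word] += 1
```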
I'm not too familiar with Python, but this is case-sensitive, unlike benhoyt's examples. Would this be case-insensitive?
lines = [line for line in open("bible.txt")]
words = [word.lower() for line in lines for word in line.split()]
counts = {word:0 for word in words}
for word in words:
    counts[word] += 1
Yeah, those are nice -- and may actually be more efficient on smaller files, as you're only doing the lower() once on a big string. However, for big files you don't necessarily want to read the whole thing in at once.
One nitpick: it's Pythonic (I think) to just name the list of words "words" rather than "word_list".
Yes, that's a classic tradeoff; a proficient programmer will have to pick one.
Personally, I always read entire files into memory first unless I have reason to believe memory will be an issue, or I need to program defensively against malicious/careless input. The code is always much cleaner and easier to read, and if you need to do a second pass on the data you don't need to re-read it from disk.
Here's mine, for what it's worth, since this is one of the Google Python assignments. Admittedly, I'm not very proficient at all.
def get_file_words(filename):
    file = open(filename, 'rU')
    words = {}
    for line in file:
        for word in [word.lower() for word in line.split()]:
            if not word in words.keys():
                words[word] = 1
            else:
                words[word] += 1
    return words

def print_words(filename):
    wordcount = get_file_words(filename)
    for word in wordcount:
        print word, wordcount[word]
I'd like to jump in with a little R here -- it's not all that difficult in "that" either!
open(con <- file('text.txt'))
text = readLines(con, n= -1L) # n is number of lines to read, -1L means read all of it
words = strsplit(text,split = " ")
counts = table(unlist(words))
I put this in because the good thing about R is that it provides functions for many such mathematical operations. And along with this, I'll say something any self-respecting Pythoner will know: less is better than more.
Smashing tons of crap together isn't necessarily "highly proficient". In some cases it makes things harder to read and/or maintain, and many times certainly harder to edit.
Except that's wrong: all you're doing is counting the number of unique words, and you didn't even consider "Foo" and "foo" as the same in your example. Part of proficiency is understanding the problem.