I think all projects should have sample datasets. It simplifies a lot of things, and in this case stops hundreds of geeks burning through bandwidth before they realize they don't have a clue what they are going to do with the data.
We hear you. Could you define some criteria for the type and size of sample data you would like to see? We are working on producing more targeted/limited collections, such as all recently published blog posts, etc.
Perhaps two sets: one that's just a few hundred kilobytes, containing a few sample .arc files to test against the format, and a larger 'training' set that's small enough to work with offline (maybe ~100MB?) but large enough to contain a good sample of the possible content.
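To illustrate what "testing against the format" would look like with a small sample: below is a minimal, hand-rolled sketch of reading ARC v1 records. The field layout (a one-line record header of `URL IP-address Archive-date Content-type Archive-length`, followed by `Archive-length` bytes of payload and a blank-line separator) is assumed from the Internet Archive's ARC 1.0 spec; the sketch skips the file's version block and error handling, and a real tool should use a maintained library (e.g. warcio) rather than this.

```python
import io

def read_arc_records(stream):
    """Yield (header_fields, payload_bytes) for each ARC v1 record.

    Simplified sketch: assumes well-formed records and ignores the
    leading filedesc:// version block a real .arc file starts with.
    """
    while True:
        line = stream.readline()
        if not line:
            break
        line = line.strip()
        if not line:
            continue  # skip the blank separator between records
        fields = line.split(b' ')
        length = int(fields[-1])   # last header field is the payload length
        payload = stream.read(length)
        yield fields, payload

# Build a tiny in-memory "sample" with one record so the sketch is runnable.
payload = b"hello from a crawled page"
header = (b"http://example.com/ 93.184.216.34 20120101000000 text/html %d\n"
          % len(payload))
sample = io.BytesIO(header + payload + b"\n")

for fields, body in read_arc_records(sample):
    print(fields[0].decode(), len(body))  # prints: http://example.com/ 25
```

Even a few-hundred-kilobyte sample would let people validate a reader like this (or their warcio/Hadoop setup) before committing to a multi-terabyte download.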
Concur with this comment. It might also help the community give feedback on the structure and on ways to segment the data, so there can be more directed efforts to consume small parts of the crawl for processing.