I think all projects should have sample datasets. It simplifies a lot of things, and in this case stops hundreds of geeks burning through bandwidth before they realize they don't have a clue what they are going to do with the data.
We hear you. Could you define some criteria for the type and size of sample data you would like to see? We are working on producing more targeted/limited collections, such as all recently published blog posts, etc.
Perhaps two sets: one that's just a few hundred kilobytes, containing a few sample .arc files to test against the format, and a larger 'training' set that's small enough to work with offline (maybe ~100MB?) but large enough to contain a good sample of the possible content.
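To illustrate what "testing against the format" would look like with a small sample: below is a minimal, hand-rolled sketch of reading ARC v1 records. The field layout (a one-line record header of `URL IP-address Archive-date Content-type Archive-length`, followed by `Archive-length` bytes of payload and a blank-line separator) is assumed from the Internet Archive's ARC 1.0 spec; the sketch skips the file's version block and error handling, and a real tool should use a maintained library (e.g. warcio) rather than this.

```python
import io

def read_arc_records(stream):
    """Yield (header_fields, payload_bytes) for each ARC v1 record.

    Simplified sketch: assumes well-formed records and ignores the
    leading filedesc:// version block a real .arc file starts with.
    """
    while True:
        line = stream.readline()
        if not line:
            break
        line = line.strip()
        if not line:
            continue  # skip the blank separator between records
        fields = line.split(b' ')
        length = int(fields[-1])   # last header field is the payload length
        payload = stream.read(length)
        yield fields, payload

# Build a tiny in-memory "sample" with one record so the sketch is runnable.
payload = b"hello from a crawled page"
header = (b"http://example.com/ 93.184.216.34 20120101000000 text/html %d\n"
          % len(payload))
sample = io.BytesIO(header + payload + b"\n")

for fields, body in read_arc_records(sample):
    print(fields[0].decode(), len(body))  # prints: http://example.com/ 25
```

Even a few-hundred-kilobyte sample would let people validate a reader like this (or their warcio/Hadoop setup) before committing to a multi-terabyte download.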
Concur with this comment. It might also help the community give feedback on the structure and on ways to segment the data, so there can be more directed efforts to consume small parts of the crawl for processing.