
As an NLP researcher, this is interesting as a sort of summarization data set.

The thing is though, summarizing news articles is best done by just reading the first paragraph of the article. News articles are intentionally written this way, and it's a very difficult baseline to beat in automatic summarization.

Still nice site though.
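For reference, the first-paragraph baseline mentioned above takes only a few lines. This is a minimal sketch, assuming the article is plain text with paragraphs separated by blank lines; the function name `lead_paragraph_summary` is made up for illustration.

```python
import re


def lead_paragraph_summary(article: str) -> str:
    """Return the first non-empty paragraph of an article as its summary."""
    # Assumption: paragraphs are separated by one or more blank lines.
    for paragraph in re.split(r"\n\s*\n", article):
        # Collapse internal line breaks and runs of whitespace.
        text = " ".join(paragraph.split())
        if text:
            return text
    return ""
```

Any learned summarizer would be scored against what this returns, which is why the baseline is hard to beat on hard news.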



No, the first paragraph of a news article is designed to give the reader just enough information to get the gist of the article, but tease the reader into continuing to read along.

On web sites, the goal is to get the user to click the "more" or "details" link to get the whole article and thus display more ads.

The reality is that most of the "good stuff" for a news article could be summarized in a paragraph that would satisfy 90% of the need to read the full article - but that would defeat the business model of most sites/media dispensing news.

I like the idea of this site.


What you're saying may be true of new media sites and investigative / gonzo / entertainment journalism, but old school journalism 101 says the first paragraph should be a summary for hard news stories. In fact, hard news should be written such that you can chop off paragraphs in reverse order and still have a sensible article.


100% agreed - that's what I learned to do in high school journalism class as well. But there's lots of stuff on the web, even stuff that's considered news, that doesn't follow this format. Here are some first sentences from other Times articles:

"The flesh is weak but the spirit of commerce is willing." (op-ed)

"Last weekend, my family and I packed our car full of supplies and drove to a fire station in New Jersey to deliver goods to an area that had been hit hard by Hurricane Sandy." (small business)

"Last April 28, a splendid spring Saturday that fairly begged you to be outdoors, I spent all afternoon in front of my living-room TV, anxiously watching the last day of the annual N.F.L. draft, live from Radio City Music Hall." (NY Times Magazine)

"The relocation of Albert C. Barnes’s great polyglot art collection to central Philadelphia was opposed by many and dreaded by most." (A&E)

I guess my point is that journalism styles vary even within a publication. Therefore, any automated attempt to simply use the first paragraph as a summary is bound to be wrong a lot of the time. It would however be interesting to use a human-generated summary dataset as the training data for a "buries the lead" classifier. I'll bet you could do it with a bag-of-words feature pretty easily, and that the most important words would be personal pronouns.
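A rough sketch of the feature side of that idea, not a trained classifier: it builds a bag of words for an opening sentence and measures what share of the words are personal pronouns. The pronoun list is my own assumption (first and second person only), and the names `bag_of_words` and `pronoun_share` are hypothetical.

```python
import re
from collections import Counter

# Assumed list for illustration: first- and second-person pronouns only.
PERSONAL_PRONOUNS = {
    "i", "me", "my", "mine", "we", "us", "our", "ours",
    "you", "your", "yours",
}


def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; word order is discarded."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def pronoun_share(text: str) -> float:
    """Fraction of words in the text that are in PERSONAL_PRONOUNS."""
    bag = bag_of_words(text)
    total = sum(bag.values())
    if total == 0:
        return 0.0
    hits = sum(count for word, count in bag.items() if word in PERSONAL_PRONOUNS)
    return hits / total
```

On the small business opener quoted above ("Last weekend, my family and I packed our car...") the share is well above zero, while a wire-style opener with no first-person words scores 0.0. Whether that separates buried leads in practice would need the labeled data.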


You're right, many "news" articles don't follow that traditional format. The difference is that your examples are not the primary news article about an important event. They are either commentary / color on an important event, an opinion piece, or a human interest piece about something interesting (but not really "news"-worthy).

Of course, much of what is in the paper is not "hard news" so some sort of automatic summarization could be useful for those pieces.


Militants took their fight with Israel into the heart of the country Wednesday, exploding a bomb on a public transport bus in Tel Aviv. The attack is likely to complicate already tenuous efforts to achieve a cease-fire.

http://www.cnn.com/2012/11/21/world/meast/gaza-israel-strike...

American efforts to help negotiate a cease-fire between Israel and Palestinian militants in the week-old Gaza rocket battle faced a new obstacle on Wednesday when the first bus bombing in years traumatized Tel Aviv, raising the prospect of a new Israeli retaliation just as Secretary of State Hillary Rodham Clinton was working to achieve even a brief pause in the fighting.

http://www.nytimes.com/2012/11/22/world/middleeast/israel-ga...

At least 24 people have been injured in an explosion on a bus in Israel's commercial capital, Tel Aviv, in what police described as a "terrorist attack".

http://www.aljazeera.com/news/middleeast/2012/11/20121121101...

None of these sites teased me into clicking through; they presented their stories in full text. I pay for NYT access, but their paywall is unrelated to your theory of PV corralling. These are all decent, concise treatments of the story.

What does "90% of the need to read the full article" mean to you? It seems indefinable to me.



