bot: roadmap
DrugSense Newsbot: Overview
The DrugSense newsbot ("newsbot", "bot", etc.) is a software system that efficiently gathers news from online sources on a given topic area. The prototype implementation discussed here uses "drug related" news articles as its example topic area.
http://drugpolicycentral.com/bot
DrugSense News Bot
The system can gather the very latest news on a topic area, and it does this from any online html news source. Articles of interest are discovered by very limited spidering of news sites, following links of topical interest and ignoring other links. Articles are automatically categorized and summarized according to selected concept terms. Once properly configured, the system runs continuously and robustly to gather and (sub)categorize the latest news on the desired topic. No human intervention is then needed.
Unlike news "aggregators", the newsbot does not need or use other sites' RSS XML as input. Instead, the newsbot scrapes online papers' html for its input and produces RSS (etc.) as output. Other RSS aggregators can then use the newsbot's output.
So, while the articles that the newsbot scans and classifies are arbitrary html, the newsbot's output is RSS XML (and html, javascript, etc.). The topic area and the configuration (the sites spidered, and the way they are spidered) are all defined in simple XML formats.
Concept-based News Bot
Concept Dictionary
Thesaurus-like
Central to the operation of the newsbot is the "concept dictionary." The concept dictionary is a type of thesaurus, where keywords are grouped together to form "concepts." Concepts, basically, are named lists of keywords.
The concept dictionary the newsbot uses is a type of lightweight keyword ontology, a conceptual schema or a taxonomy, for a given topic area.
http://drugpolicycentral.com/bot/pg/news/dictionary1.htm
Newsbot's Concept Dictionary
Concepts Drive Site Content, Organization
The concepts drive the content of the system. The robot uses the concepts to guide the article finding process. The concepts are used to categorize and rank articles by relevance to the desired topic area. The concepts also are reflected as the "topics" hierarchy. Concepts are used to summarize articles. The concepts are central to the operation of the entire newsbot system and are crafted to let the system quickly sort items of topical interest. The concepts are made to let the newsbot efficiently categorize news articles; they are not intended to be a rigorous ontology, nor are they intended to be completely exhaustive.
Concept-based Spidering, Text Mining
The robot/spider portion of the system uses the concept dictionary (the concepts' keywords) to guide the new article discovery process. Online newspaper front pages and index pages are scanned for concepts. Links are followed when they look interesting, that is, when they contain certain (user specified) concepts.
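The link-selection step described above can be sketched roughly as follows. This is a minimal illustration in Python (the newsbot itself is written in Perl), and it assumes that "interesting" means the link's anchor text matches a concept pattern; the `INTERESTING` pattern, the `LinkCollector` class and the `interesting_links` function are hypothetical names, not part of the newsbot.

```python
import re
from html.parser import HTMLParser

# Hypothetical pattern standing in for the user-specified concepts.
INTERESTING = re.compile(r"\b(drug|cannabis|marijuana|heroin)s?\b", re.IGNORECASE)


class LinkCollector(HTMLParser):
    """Collect (href, anchor text) pairs from an html page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def interesting_links(html):
    """Return hrefs whose anchor text matches the concept pattern."""
    collector = LinkCollector()
    collector.feed(html)
    return [href for href, text in collector.links if INTERESTING.search(text)]
```

Given a front page with three links, only those whose text mentions a concept keyword would be queued for fetching; all other links are ignored, which keeps the spidering very limited.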
The technique that the newsbot uses in 'extracting interesting and non-trivial information and knowledge from unstructured text' is often called text mining.
Concept Relationships Reflected As Topics
Concepts may be linked to other concepts, as sub-concepts. (This is similar to broader term/narrower term thesauri). Concepts can be linked together to form hierarchies. The (possibly tangled) hierarchies formed by the concepts and their related sub-concepts are reflected in the topic/sub-topic organization in the newsbot site. Automatically generated "topic" pages are maintained by the system. These pages list recent articles that contain the concept/topic. So topics reflect the organization of the concepts; for each concept a corresponding topic page is maintained by the bot.
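One way to picture the concept/sub-concept relationship is as a small graph that is walked to build each topic page. The sketch below is in Python and rests on assumptions: the `SUB_CONCEPTS` data is invented, and it assumes a topic page also lists articles tagged with narrower concepts, which the overview does not state.

```python
# Hypothetical concept hierarchy: each concept maps to its sub-concepts.
# A concept may sit under more than one parent (a "tangled" hierarchy).
SUB_CONCEPTS = {
    "drugs": ["opioids", "cannabis"],
    "opioids": ["oxycodone", "heroin"],
    "prescription": ["oxycodone"],
    "cannabis": [],
    "oxycodone": [],
    "heroin": [],
}


def descendants(concept, hierarchy=SUB_CONCEPTS):
    """Return the concept plus every narrower concept reachable from it."""
    seen = []
    stack = [concept]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited via another parent
        seen.append(current)
        stack.extend(hierarchy.get(current, []))
    return sorted(seen)


def topic_page(concept, tagged_articles):
    """List titles of articles tagged with the concept or any sub-concept."""
    wanted = set(descendants(concept))
    return [title for title, tags in tagged_articles if wanted & set(tags)]
```

Here `tagged_articles` is a list of `(title, concept_tags)` pairs, standing in for the cached articles and their stored categorization.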
http://drugpolicycentral.com/bot/topics
Newsbot's list of topics, automatically made from concept dictionary
Concept Keywords
Concept keywords (called "terms") are implemented as regular expressions for some flexibility.
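As an illustration of concepts as named lists of regular-expression terms, here is a minimal Python sketch (the newsbot itself is written in Perl). The concept names and patterns in `CONCEPTS` are made up for the example, and `categorize` is a hypothetical helper rather than the bot's actual routine.

```python
import re

# Hypothetical concept dictionary: each concept is a named list of
# regular-expression terms. Names and patterns are illustrative only.
CONCEPTS = {
    "cannabis": [r"\bcannabis\b", r"\bmarijuana\b", r"\bhemp\b"],
    "oxycodone": [r"\boxycodone\b", r"\boxycontin\b"],
}

# Compile each term once, case-insensitively.
COMPILED = {
    name: [re.compile(term, re.IGNORECASE) for term in terms]
    for name, terms in CONCEPTS.items()
}


def categorize(text):
    """Return the sorted names of concepts with at least one matching term."""
    return sorted(
        name
        for name, patterns in COMPILED.items()
        if any(p.search(text) for p in patterns)
    )
```

Using regular expressions rather than literal strings lets one term cover spelling variants or word boundaries, which is the flexibility mentioned above.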
The Content: Automatic Discovery of Breaking News Articles
As mentioned before, the system uses the concepts to spider news sites for interesting articles. Articles discovered by the system are cached on the server, and their categorization (the list of concepts applied to the given article) is also stored.
Richly categorizing news articles with a set of concept tags allows for concept-based searches later. Searches can be done via a search form that appears on every page of the newsbot site.
It is possible to write simple (cgi, shell, etc.) scripts to create/cache a series of complicated searches. For example, one script caches a "breaking UK drug news" page every 15 minutes. This page is created by a trivial sh(1) script that uses the bot's command-line shell script modes to make a page that embodies five searches.
http://drugpolicycentral.com/bot/uk
Newsbot page built with simple script
Newsbot: Open Service
Like Meerkat-style servers, the newsbot also lets users specify searches as parameters or options in urls. The newsbot is an open service. Its API is open and documented (and trivially simple).
http://drugpolicycentral.com/bot/index.cgi?q=mexico
Newsbot search, specified in the url.
Point a browser at that url, and the results of a search for items containing "Mexico" are returned, as html.
One may search for specific concepts, also.
http://drugpolicycentral.com/bot/index.cgi?concept=oxycodone
search for the oxycodone concept
Boolean expressions are permitted, too.
http://drugpolicycentral.com/bot/index.cgi?q=texas+and+not+cannabis
search for "Texas" articles that don't mention cannabis
http://drugpolicycentral.com/bot/index.cgi?q=thailand+or+philippines etc.
search articles that mention "Thailand" or "Philippines"
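The search urls above can also be assembled programmatically. The following Python sketch only builds the url string and does not contact the server; `search_url` is a hypothetical helper, and only the `q` and `concept` parameter names are taken from the examples above.

```python
from urllib.parse import urlencode

BASE = "http://drugpolicycentral.com/bot/index.cgi"


def search_url(q=None, concept=None, **options):
    """Build a newsbot search url.

    q is a keyword or boolean expression, concept is a concept name, and
    options carries any further url parameters the service documents.
    """
    params = {}
    if q is not None:
        params["q"] = q
    if concept is not None:
        params["concept"] = concept
    params.update(options)
    # urlencode turns spaces into "+", matching the urls shown above.
    return BASE + "?" + urlencode(params)
```

For example, `search_url(q="texas and not cannabis")` yields the same url as the boolean example above.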
XML, RSS and Javascript Output
RSS (Really Simple Syndication) allows anyone to syndicate unique content. Others can subscribe to one's own syndicated output. Users can make the newsbot output RSS, as an option.
http://drugpolicycentral.com/bot/index.cgi?q=mexico&xml=1
search for "Mexico" articles, output it in RSS
http://news-feeds.net/feed.php/12598/
smoking-glass.ca -> News
rss-verzeichnis.de
examples of sites using this newsbot's RSS feeds
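A site consuming such a feed might pull out headlines roughly as sketched below, in Python. The sketch assumes the feed is RSS 1.0 (RDF) using the standard `http://purl.org/rss/1.0/` namespace; `SAMPLE` is a simplified, invented document and not actual newsbot output.

```python
import xml.etree.ElementTree as ET

# Namespace used by RSS 1.0 elements (an assumption about the feed format).
RSS1 = "{http://purl.org/rss/1.0/}"

# Hypothetical, simplified sample feed for illustration only.
SAMPLE = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <channel rdf:about="http://example.org/feed">
    <title>Example feed</title>
    <link>http://example.org/</link>
    <description>Hypothetical sample</description>
  </channel>
  <item rdf:about="http://example.org/a1">
    <title>First headline</title>
    <link>http://example.org/a1</link>
    <description>Short summary.</description>
  </item>
</rdf:RDF>
"""


def headlines(rss_text):
    """Return (title, link) pairs for every item in an RSS 1.0 document."""
    root = ET.fromstring(rss_text)
    return [
        (item.findtext(RSS1 + "title"), item.findtext(RSS1 + "link"))
        for item in root.iter(RSS1 + "item")
    ]
```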
Other users may prefer javascript. A similar option outputs data as javascript. This makes the system's categorized stream of articles even more accessible - anyone can add a few lines of javascript to their web page and get headlines.
http://drugpolicycentral.com/bot/index.cgi?q=mexico&javascript=1
search for "Mexico" articles, output in javascript
http://www.opiates.com/
http://www.oxycontin-addiction-treatment.com/
http://www.geocities.com/the_pot_revolution_420/news.html
examples of sites using this newsbot's javascript feeds
User Accounts, Profiles, News Analyst Features
The newsbot supports user registration and login. Once registered and logged in, users may create concept and keyword based interest profiles. Once a profile is created, users then have the option of having "interesting" articles mailed to them.
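A concept and keyword based interest profile could be matched against incoming articles along these lines. This Python sketch is an assumption about how such matching might work; the `PROFILE` structure and the `is_interesting` function are hypothetical.

```python
# Hypothetical interest profile: concept names plus free-text keywords.
PROFILE = {"concepts": {"oxycodone"}, "keywords": {"mexico"}}


def is_interesting(profile, article_concepts, article_text):
    """True if the article shares a concept with the profile or mentions a keyword."""
    if profile["concepts"] & set(article_concepts):
        return True
    text = article_text.lower()
    return any(keyword.lower() in text for keyword in profile["keywords"])
```

Articles for which `is_interesting` returns true would be the candidates for mailing to that user.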
Other (login-based) features are designed to mesh with MAPInc operations. For example, the system attempts to determine if articles cached by the newsbot have been submitted to MAPInc's drug policy article clipping service already. If so, users are alerted to that article's submission status.
Recap: Newsbot Output
The newsbot can output html (the default), RSS (XML RDF), XFML, and javascript.
A simple, open, url interface lets users take advantage of the newsbot's richly categorized database of breaking news articles.
http://drugpolicycentral.com/bot/pg/news/feedinfo.htm
the simple API for news feeds, searches
Newsbot Design
The concept dictionary is stored as XML. For the current prototype system that deals with the topic area of "drug related" articles, the concept dictionary was hand-written in XML. The newsbot itself is written in perl.
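The actual XML schema of the concept dictionary is not shown in this overview. Purely as an illustration, the Python sketch below invents a small format (the `concepts`, `concept`, `term` and `sub` element names are assumptions) and loads it into a dictionary of terms and sub-concepts.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout; the real newsbot format may differ.
SAMPLE_DICTIONARY = """
<concepts>
  <concept name="opioids">
    <term>opioids?</term>
    <term>opiates?</term>
    <sub>oxycodone</sub>
  </concept>
  <concept name="oxycodone">
    <term>oxycodone</term>
    <term>oxycontin</term>
  </concept>
</concepts>
"""


def load_concepts(xml_text):
    """Parse the sample format into {name: {"terms": [...], "subs": [...]}}."""
    root = ET.fromstring(xml_text.strip())
    concepts = {}
    for node in root.findall("concept"):
        concepts[node.get("name")] = {
            "terms": [t.text for t in node.findall("term")],
            "subs": [s.text for s in node.findall("sub")],
        }
    return concepts
```

Keeping the dictionary in a hand-editable file like this means the topic area can be changed without touching the program itself.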
The DrugSense Newsbot: Some Statistics
site | http://drugpolicycentral.com/bot/
number of online papers spidered daily | about 650 sites
html pages examined per site | 7 to 15 (Configurable. Now configured to look at at least 7 pages, and, if interesting pages are found, will spider up to 15.)
minimum number of web pages spidered per day | over 4550 html pages
maximum number of web pages spidered per day | 9750 html pages
interesting articles found per day | about 400 articles per day
articles cached for | 5 days back
running since | January 2003 on Baremetal servers
concept dictionary, number of "concepts" | about 50 concepts
number of keywords per concept | from 2 to around 70 keywords
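The daily page counts follow directly from the number of sites and the pages examined per site. A short Python check of that arithmetic, using only the approximate figures quoted in the statistics:

```python
SITES_PER_DAY = 650        # "about 650 sites"
MIN_PAGES_PER_SITE = 7     # at least 7 pages are examined per site
MAX_PAGES_PER_SITE = 15    # up to 15 if interesting pages are found

min_pages_per_day = SITES_PER_DAY * MIN_PAGES_PER_SITE   # 4550
max_pages_per_day = SITES_PER_DAY * MAX_PAGES_PER_SITE   # 9750

# With about 400 interesting articles per day, the share of examined
# pages that turn out to be interesting lies between these two bounds.
ARTICLES_PER_DAY = 400
low_share = ARTICLES_PER_DAY / max_pages_per_day
high_share = ARTICLES_PER_DAY / min_pages_per_day
```

So, on these rounded figures, roughly 4% to 9% of the pages examined each day end up as cached articles.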
Newsbot Operation: Bot Makes RSS News feeds from HTML-Only News Sites
Unlike Meerkat, the newsbot examines the html article text, so sources don't need RSS/XML themselves. The newsbot's output, however, can be RSS XML, so the newsbot makes RSS news feeds from sites that don't, themselves, have feeds. (Meerkat uses existing RSS feeds as input; the newsbot can make use of any site or text source, regardless of whether the site had an RSS feed to begin with.) The newsbot automatically generates summaries for interesting articles. These summaries are used for descriptions in the output.
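The overview does not say how summaries are produced beyond noting that concepts are used to summarize articles. One simple approach, shown here in Python as an assumption and not as the newsbot's method, is to keep the first few sentences that contain a concept term; `summarize` is a hypothetical function.

```python
import re


def summarize(text, patterns, max_sentences=2):
    """Join the first max_sentences sentences that match any concept pattern."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    hits = [s for s in sentences if any(p.search(s) for p in patterns)]
    return " ".join(hits[:max_sentences])
```

A summary built this way could serve as the description of an item in the RSS output.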
Server Usage
The newsbot's spidering and article discovery are somewhat CPU intensive. To balance system load against the speed of article discovery, the newsbot automatically slows down, using less CPU, when CPU usage becomes too high. This feature is configurable.
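A load-sensitive slowdown of this kind can be sketched as a pause inserted between page fetches. The Python below is an assumption about one way to do it: it uses the Unix one-minute load average as the measure of CPU usage, and the thresholds and function names are hypothetical.

```python
import os
import time


def throttle_delay(load, max_load=2.0, base_delay=1.0, max_delay=30.0):
    """Return seconds to pause before the next fetch, given the current load."""
    if load <= max_load:
        return 0.0
    # Pause longer the further the load is above the threshold, up to a cap.
    return min(max_delay, base_delay * (load / max_load))


def pause_if_busy():
    """Sleep when the machine is busy; returns the delay that was applied."""
    load_1min = os.getloadavg()[0]  # Unix only
    delay = throttle_delay(load_1min)
    if delay:
        time.sleep(delay)
    return delay
```

Calling `pause_if_busy()` between fetches trades article-discovery speed for lower load; the `max_load`, `base_delay` and `max_delay` parameters stand in for whatever settings make the feature configurable.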
Competitive Concept-based products
Where we are: Summary
That's a brief overview of the newsbot, and how it uses the concept dictionary.
The concept dictionary is central to the newsbot's operation. The concepts are used to automatically categorize articles. The concepts determine the topic area of interest, and the granularity at which articles are grouped. The dictionary organizes lists of keywords (regular expression patterns) so that they can efficiently guide the system as it spiders online news sites. The concept dictionary is implemented as an XML configuration file.
The newsbot runs 24/7 without human intervention, continuously gathering breaking news relating to the desired topic area. A prototype of the newsbot exists, and has been running for over a year.
No other site delivers more recent breaking drug war and drug policy news.
The output of the system is a stream of breaking news articles of interest. It is available both as html (cached pages on the site, searches on demand) and as XML RSS news feeds. Other output types, like javascript or XFML, are also available. Because search requests can be specified as url parameters, and search output is available as RSS XML news feeds, people can easily use the newsbot as an open resource. When someone needs news feeds for their site (related to the general topic area, in our example, "drug related" news items), they can easily get them from the newsbot site. For example, PHPnuke-based sites can use the RSS output of the newsbot directly.