bot: roadmap - Drug News Bot

fast-breaking news analysis about drug policy and illegal drugs

DrugSense Newsbot: Overview

The DrugSense newsbot ("newsbot", "bot", etc.) is a software system that efficiently gathers news from online sources on a given topic area. The prototype implementation discussed here uses "drug related" news articles as its example topic area.

http://drugpolicycentral.com/bot
DrugSense News Bot

The system can gather the very latest news on a topic area, and it does this from any online html news source. Articles of interest are discovered by very limited spidering of news sites, following links of topical interest and ignoring other links. Articles are automatically categorized and summarized according to selected concept terms. Once properly configured, the system runs continuously and robustly to gather and (sub)categorize the latest news on the desired topic. No human intervention is then needed.

Unlike news "aggregators", the newsbot does not need or use RSS XML from other sites as input. Instead, it scrapes the html of online papers. The newsbot produces RSS (and other formats) as output, so other RSS aggregators can use the newsbot's output.

So, while the articles that the newsbot scans and classifies are arbitrary html, the newsbot's output is RSS XML (and html, javascript, etc.). The topic area, as well as the configuration (sites spidered, and the way they are spidered) are all defined in simple XML formats.

Concept-based News Bot

Concept Dictionary

Thesaurus-like

Central to the operation of the newsbot is the "concept dictionary." The concept dictionary is a type of thesaurus, where keywords are grouped together to form "concepts." Concepts, basically, are named lists of keywords. The concept dictionary the newsbot uses is a type of lightweight keyword ontology, a conceptual schema or a taxonomy, for a given topic area.
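As a rough illustration of "named lists of keywords", the sketch below models a concept dictionary as a plain mapping. The concept names, the keywords, and the helper concepts_for_keyword are all assumptions made for this example; they are not taken from the newsbot's real dictionary.

```python
# Hypothetical concept dictionary: concept name -> list of keywords.
# Names and keywords are illustrative only, not the bot's real data.
CONCEPTS = {
    "cannabis": ["cannabis", "marijuana", "hashish"],
    "oxycodone": ["oxycodone", "oxycontin"],
    "prison": ["prison", "jail", "incarceration"],
}


def concepts_for_keyword(keyword, concepts=CONCEPTS):
    """Return the names of all concepts whose keyword list contains `keyword`."""
    needle = keyword.lower()
    return sorted(name for name, words in concepts.items() if needle in words)
```

Looking up "Marijuana" in this toy dictionary yields the single concept "cannabis"; a word that appears in no keyword list yields an empty list.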

http://drugpolicycentral.com/bot/pg/news/dictionary1.htm
Newsbot's Concept Dictionary

Concepts Drive Site Content, Organization

The concepts drive the content of the system. The robot uses the concepts to guide the article finding process. The concepts are used to categorize and rank articles by relevance to the desired topic area. The concepts also are reflected as the "topics" hierarchy. Concepts are used to summarize articles. The concepts are central to the operation of the entire newsbot system and are crafted to let the system quickly sort items of topical interest. The concepts are made to let the newsbot efficiently categorize news articles; they are not intended to be a rigorous ontology, nor are they intended to be completely exhaustive.

Concept-based Spidering, Text Mining

The robot/spider portion of the system uses the concept dictionary (the concepts' keywords) to guide the new article discovery process. Online newspaper front pages and index pages are scanned for concepts. Links are followed when they look interesting, that is, when they contain certain (user specified) concepts.
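The link-following rule can be sketched as below. This is only an assumption about how such a filter might look: the newsbot itself is written in perl, and the LinkCollector class, the INTERESTING pattern, and the choice to test only the link text are invented for this example.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Anchors without an href leave _href as None and are skipped.
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical concept terms; a real configuration would come from the
# concept dictionary.
INTERESTING = re.compile(r"\b(?:cannabis|marijuana|heroin|drugs?)\b", re.IGNORECASE)


def interesting_links(html):
    """Return the hrefs whose link text mentions at least one concept term."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return [href for href, text in parser.links if INTERESTING.search(text)]
```

Fed a front page with one link reading "Senate debates marijuana bill" and another reading "Local team wins", the function returns only the first link's href.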

The technique that the newsbot uses in 'extracting interesting and non-trivial information and knowledge from unstructured text' is often called text mining.

Concept Relationships Reflected As Topics

Concepts may be linked to other concepts, as sub-concepts. (This is similar to the broader-term/narrower-term relations found in thesauri.) Concepts can be linked together to form hierarchies. The (possibly tangled) hierarchies formed by the concepts and their related sub-concepts are reflected in the topic/sub-topic organization of the newsbot site. Automatically generated "topic" pages are maintained by the system. These pages list recent articles that contain the concept/topic. So topics reflect the organization of the concepts; for each concept a corresponding topic page is maintained by the bot.
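A "tangled" hierarchy is one where a sub-concept can sit under more than one parent. The sketch below walks such a structure without visiting any concept twice. The SUB_CONCEPTS mapping and the descendants helper are hypothetical; the real links are defined in the bot's concept dictionary.

```python
# Hypothetical parent -> sub-concept links. "oxycodone" has two parents,
# which is what makes the hierarchy "tangled".
SUB_CONCEPTS = {
    "drugs": ["opioids", "cannabis"],
    "opioids": ["oxycodone", "heroin"],
    "prescription": ["oxycodone"],
}


def descendants(concept, sub_concepts=SUB_CONCEPTS):
    """Return every sub-concept reachable from `concept`, each listed once."""
    seen = []
    stack = list(sub_concepts.get(concept, []))
    while stack:
        name = stack.pop()
        if name not in seen:  # guards against shared children and cycles
            seen.append(name)
            stack.extend(sub_concepts.get(name, []))
    return sorted(seen)
```

A topic page generator could use such a walk to list the sub-topics under each topic.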

http://drugpolicycentral.com/bot/topics
Newsbot's list of topics, automatically made from concept dictionary

Concept Keywords

Concept keywords (called "terms") are implemented as regular expressions for some flexibility.
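To show what regular-expression terms buy over plain keywords, here is a minimal sketch of concept-based categorization. The patterns, the CONCEPT_TERMS table, and the hit-count scoring are assumptions for illustration; the bot's actual terms and ranking method are not described here.

```python
import re

# Hypothetical concepts whose terms are regular expressions.
CONCEPT_TERMS = {
    "cannabis": [r"\bcannabis\b", r"\bmarijuana\b", r"\bpot\s+(?:smok|grow)\w*"],
    "oxycodone": [r"\boxycodone\b", r"\boxycontin\b"],
}

# Compile each term once, case-insensitively.
COMPILED = {
    name: [re.compile(term, re.IGNORECASE) for term in terms]
    for name, terms in CONCEPT_TERMS.items()
}


def categorize(text):
    """Return {concept: hit count} for every concept matching at least once."""
    hits = {}
    for name, patterns in COMPILED.items():
        count = sum(len(p.findall(text)) for p in patterns)
        if count:
            hits[name] = count
    return hits
```

A regular expression such as the third "cannabis" term matches "pot growers" or "pot smoking" while ignoring "pot roast", which a bare keyword "pot" could not do.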

The Content: Automatic Discovery of Breaking News Articles

As mentioned before, the system uses the concepts to spider news sites for interesting articles. Articles discovered by the system are cached on the server, and their categorization (a list of concepts applied to the given article) is also stored.

Richly categorizing news articles with a set of concept tags allows for concept-based searches later. Searches are done via a search form that appears on every page of the newsbot site.

It is possible to write simple (cgi, shell, etc.) scripts to create/cache a series of complicated searches. For example, one script caches a "breaking UK drug news" page every 15 minutes. This page is created by a trivial sh(1) script that uses the bot's command-line shell script modes to make a page that embodies five searches.
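The real page is built by an sh(1) script using the bot's command-line modes, which are not documented here. As a stand-in, the sketch below assembles a page from several searches through the url interface instead. The functions fetch_html and build_page are hypothetical, and the base url is assumed from the search examples on this page.

```python
import html
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed from the search url examples on this page.
BASE_URL = "http://drugpolicycentral.com/bot/index.cgi"


def fetch_html(query):
    """Fetch the html results of one keyword search (performs a network call)."""
    with urlopen(BASE_URL + "?" + urlencode({"q": query}), timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")


def build_page(title, queries, fetch=fetch_html):
    """Combine the results of several searches into a single html page."""
    sections = [f"<h2>{html.escape(q)}</h2>\n{fetch(q)}" for q in queries]
    head = f"<html><head><title>{html.escape(title)}</title></head><body>\n"
    return head + "\n".join(sections) + "\n</body></html>\n"
```

A scheduler such as cron could run a script like this every 15 minutes and write the result to a static file. The five searches behind the UK page are not listed here, so any query list passed in is the caller's choice.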

http://drugpolicycentral.com/bot/uk
Newsbot page built with simple script

Newsbot: Open Service

The newsbot, similar to Meerkat-like servers, lets users also specify searches as parameters or options, in urls. The newsbot is an open service. Its API is open and documented (and trivially simple).

http://drugpolicycentral.com/bot/index.cgi?q=mexico
Newsbot search, specified in the url.

Point a browser at that url, and the results of a search for items containing "Mexico" are returned, as html.

One may search for specific concepts, also.

http://drugpolicycentral.com/bot/index.cgi?concept=oxycodone
search for the oxycodone concept

Boolean expressions are permitted, too.

http://drugpolicycentral.com/bot/index.cgi?q=texas+and+not+cannabis
search for "Texas" articles that don't mention cannabis

http://drugpolicycentral.com/bot/index.cgi?q=thailand+or+philippines etc.
search articles that mention "Thailand" or "Philippines"
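The example urls above can be generated programmatically. The sketch below uses only the "q" and "concept" parameters shown in those examples; the search_url helper is hypothetical, and since it is not stated whether the two parameters may be combined, the helper accepts exactly one of them.

```python
from urllib.parse import urlencode

# Taken from the example urls above.
BASE_URL = "http://drugpolicycentral.com/bot/index.cgi"


def search_url(query=None, concept=None):
    """Build a search url from a keyword/boolean query or a concept name."""
    if (query is None) == (concept is None):
        raise ValueError("give exactly one of query or concept")
    params = {"q": query} if query is not None else {"concept": concept}
    # urlencode turns spaces into '+', matching the example urls.
    return BASE_URL + "?" + urlencode(params)
```

For instance, search_url(query="texas and not cannabis") reproduces the boolean example url above.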

XML, RSS and Javascript Output

RSS (Really Simple Syndication) allows anyone to syndicate unique content. Others can subscribe to one's own syndicated output. Users can make the newsbot output RSS, as an option.

http://drugpolicycentral.com/bot/index.cgi?q=mexico&xml=1
search for "Mexico" articles, output it in RSS

http://news-feeds.net/feed.php/12598/
smoking-glass.ca -> News
rss-verzeichnis.de
examples of sites using this newsbot's RSS feeds

Other users may prefer javascript. A similar option outputs data as javascript. This makes the system's categorized stream of articles even more accessible - anyone can add a few lines of javascript to their web page and get headlines.

http://drugpolicycentral.com/bot/index.cgi?q=mexico&javascript=1
search for "Mexico" articles, output in javascript

http://www.opiates.com/
http://www.oxycontin-addiction-treatment.com/
http://www.geocities.com/the_pot_revolution_420/news.html
examples of sites using this newsbot's javascript feeds

User Accounts, Profiles, News Analyst Features

The newsbot supports user registration and login. Once registered and logged in, users may create concept and keyword based interest profiles. Once a profile is created, users then have the option of having "interesting" articles mailed to them.
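How a profile decides that an article is "interesting" is not spelled out here. One plausible reading, sketched below, is that an article qualifies when it carries any of the profile's concepts or mentions any of its keywords. The Profile class and the is_interesting function are assumptions, not the bot's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class Profile:
    """Hypothetical interest profile: concept names plus free keywords."""
    email: str
    concepts: set = field(default_factory=set)
    keywords: set = field(default_factory=set)


def is_interesting(profile, article_concepts, article_text):
    """True if the article shares a concept with, or mentions a keyword of, the profile."""
    if profile.concepts & set(article_concepts):
        return True
    text = article_text.lower()
    return any(keyword.lower() in text for keyword in profile.keywords)
```

A mailer could run each newly cached article through such a check for every registered profile and queue the matches for delivery.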

Other (login-based) features are designed to mesh with MAPInc operations. For example, the system attempts to determine if articles cached by the newsbot have been submitted to MAPInc's drug policy article clipping service already. If so, users are alerted to that article's submission status.

Recap: Newsbot Output

The newsbot can output html (the default), RSS (XML RDF), XFML, and javascript. A simple, open url interface lets users take advantage of the newsbot's richly categorized database of breaking news articles.

http://drugpolicycentral.com/bot/pg/news/feedinfo.htm
the simple API for news feeds, searches
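Switching output formats amounts to adding one flag to a search url. The sketch below uses only the "xml=1" and "javascript=1" flags that appear in the examples on this page; the with_output_format helper is hypothetical, and XFML is left out because its url flag is not shown here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Format name -> url flag, as seen in the example urls. html is the default
# and needs no flag.
FORMAT_FLAGS = {"html": None, "rss": "xml", "javascript": "javascript"}


def with_output_format(url, fmt):
    """Return `url` adjusted to request the given output format."""
    if fmt not in FORMAT_FLAGS:
        raise ValueError(f"unknown format: {fmt!r}")
    parts = urlsplit(url)
    # Drop any format flag already present, keep the search parameters.
    params = [(k, v) for k, v in parse_qsl(parts.query)
              if k not in ("xml", "javascript")]
    flag = FORMAT_FLAGS[fmt]
    if flag:
        params.append((flag, "1"))
    return urlunsplit(parts._replace(query=urlencode(params)))
```

Applied to the "mexico" search url with fmt="rss", this yields the same url as the RSS example given earlier.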

Newsbot Design

The concept dictionary is stored as XML. For the current prototype system that deals with the topic area of "drug related" articles, the concept dictionary was hand-written in XML. The newsbot itself is written in perl.
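The XML schema of the concept dictionary is not shown here, so the element and attribute names below ("dictionary", "concept", "term", "sub", "name") are invented purely to illustrate how named term lists and sub-concept links could be read from such a file.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout; the real dictionary's schema may differ entirely.
SAMPLE_XML = r"""
<dictionary>
  <concept name="opioids">
    <term>\bopioids?\b</term>
    <sub>oxycodone</sub>
  </concept>
  <concept name="oxycodone">
    <term>\boxycodone\b</term>
    <term>\boxycontin\b</term>
  </concept>
</dictionary>
"""


def load_concepts(xml_text):
    """Parse the sample layout into {name: {"terms": [...], "subs": [...]}}."""
    root = ET.fromstring(xml_text.strip())
    concepts = {}
    for node in root.findall("concept"):
        concepts[node.get("name")] = {
            "terms": [t.text for t in node.findall("term")],
            "subs": [s.text for s in node.findall("sub")],
        }
    return concepts
```

Keeping the dictionary in a hand-editable file like this means a new topic area needs a new configuration rather than new code.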

The DrugSense Newsbot: Some Statistics

site: http://drugpolicycentral.com/bot/

number of online papers spidered daily: about 650 sites

html pages examined per site: 7 to 15 (configurable; currently set to look at a minimum of 7 pages per site and, if interesting pages are found, to spider up to 15)

minimum number of web pages spidered per day: over 4550 html pages

maximum number of web pages spidered per day: 9750 html pages

interesting articles found per day: about 400 articles

articles cached for: 5 days back

running since: January 2003, on Baremetal servers

number of "concepts" in the concept dictionary: about 50 concepts

number of keywords per concept: from 2 to around 70 keywords
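The daily page counts follow directly from the site count and the per-site page limits. The short calculation below reproduces them, and also derives a rough cache size, which is an extrapolation that assumes the "about 400 articles per day" rate holds across the whole 5-day cache window.

```python
SITES = 650                # online papers spidered daily
MIN_PAGES_PER_SITE = 7     # pages always examined per site
MAX_PAGES_PER_SITE = 15    # upper limit when interesting pages are found
ARTICLES_PER_DAY = 400     # interesting articles found per day (approximate)
CACHE_DAYS = 5             # articles are cached for 5 days back

min_pages_per_day = SITES * MIN_PAGES_PER_SITE   # 4550
max_pages_per_day = SITES * MAX_PAGES_PER_SITE   # 9750

# Extrapolation, not a published figure: steady-state size of the cache.
approx_cached_articles = ARTICLES_PER_DAY * CACHE_DAYS  # 2000
```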

Newsbot Operation: Bot Makes RSS News feeds from HTML-Only News Sites

Unlike Meerkat, the newsbot examines the html article text, so sources don't need RSS/XML themselves. However, the newsbot's output can be RSS XML. So the newsbot makes RSS news feeds from sites that don't, themselves, have feeds. (Meerkat uses existing RSS feeds as input; the newsbot can make use of any site or text source, regardless of whether the site had a feed to begin with.) The newsbot automatically generates summaries for interesting articles. These summaries are used for descriptions in the output.
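The summarization method itself is not described here beyond being concept-driven. One simple technique that fits that description is extractive: keep the sentences with the most concept-term matches. The sketch below shows that technique as an assumption, not as the bot's actual algorithm; the summarize function and its parameters are hypothetical.

```python
import re


def summarize(text, patterns, max_sentences=2):
    """Keep the sentences with the most term matches, in original order."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    scored = [
        (sum(len(p.findall(s)) for p in patterns), index, s)
        for index, s in enumerate(sentences)
    ]
    matching = [item for item in scored if item[0] > 0]
    # Highest score first; earlier sentences win ties.
    top = sorted(matching, key=lambda item: (-item[0], item[1]))[:max_sentences]
    return " ".join(s for _, _, s in sorted(top, key=lambda item: item[1]))
```

The resulting text could serve as the description of an item in the RSS output.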

Server Usage

The newsbot's spidering and article discovery are somewhat CPU intensive. To balance system load against the speed of article discovery, the newsbot automatically slows down, using less CPU, when CPU usage becomes too high. This feature is configurable.
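A load-based slowdown of this kind can be sketched as a pause between page fetches that grows with the system load. The threshold, the delays, the quadratic back-off, and the function names below are all assumptions for illustration; the bot's actual throttling rule and settings are not given here.

```python
import os
import time


def throttle_delay(load, threshold=2.0, base_delay=0.5, max_delay=30.0):
    """Seconds to pause between fetches, given the current load average."""
    if load <= threshold:
        return base_delay
    # Back off quadratically above the threshold, capped at max_delay.
    return min(max_delay, base_delay * (load / threshold) ** 2)


def polite_pause():
    """Sleep according to the 1-minute load average (Unix-only call)."""
    load = os.getloadavg()[0]
    time.sleep(throttle_delay(load))
```

Calling polite_pause() between page fetches lets the spider run at full speed on an idle server and yield the CPU when the machine is busy.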

Competing Concept-based Products

Apelon's Content Tagging Tool apelon.com

DCARS (Document Content Analysis and Retrieval System) DCARS description

AmikaNow! Content Analysis Toolkit amikanow.com

TEMIS - Text Mining Solutions temis-group.com

Autonomy Corp. autonomy.com

NetOwl (SRA, Inc.) netowl.com

Leximancer leximancer.com

Mesa Dynamics "theConcept" Text Mining Software

IBM Masala text mining web crawler

USAF / Cymfony Inc. Dashboard product, text mining

Where we are: Summary

That's a brief overview of the newsbot, and how it uses the concept dictionary. The concept dictionary is central to the newsbot's operation. The concepts are used to automatically categorize articles. The concepts determine the topic area of interest, and the granularity at which articles are grouped. The dictionary organizes lists of keywords (regular expression patterns) so that they can efficiently guide the system as it spiders online news sites. The concept dictionary is implemented as an XML configuration file.

The newsbot runs 24/7 without human intervention, continuously gathering breaking news relating to the desired topic area. A prototype of the newsbot exists, and has been running for over a year. No other site delivers more recent breaking drug war and drug policy news.

The output of the system is a stream of breaking news articles of interest, delivered both as html (cached pages on the site, searches on demand) and as XML RSS news feeds. Other output types, like javascript or XFML, are also available. By making the search requests specifiable as url parameters, and by making the search output available as RSS XML news feeds, people can easily use the newsbot as an open resource. When someone needs news feeds for their site (related to the general topic area, in our example, of "drug related" news items), they are able to easily get them from the newsbot site. For example, PHPnuke-based sites can use the RSS output of the newsbot directly.
