Friday, April 4, 2014

How to find, detect and show Wikipedia links for the concepts and named entities (such as a person, place, organization or any other meaningful noun or keyword) present in or related to any article or text?

People who read the news daily to expand their knowledge commonly search Wikipedia (or any other online or offline encyclopedia) for the new name of a person, place, organization, event or any other meaningful thing that they have just come across in an article, blog or any visual artifact. We usually call such a thing a name, concept, topic, noun or named entity.

Here, by an article I mean any kind of audio-visual media element consisting of text, images, video, sound clips, songs etc.

It would be very convenient for these people if links to an online encyclopedia like Wikipedia, for the concepts or important keywords related to the subject matter of that artifact, were shown near the article, so that the viewer could just click on a concept's link to learn more about it.

To explain this use case and ideas on how to implement it with state-of-the-art technologies, I will take the example of a story about Amazon launching FireTV as a next-generation TV and gaming console.

For example, here is some text taken from an article on CNET. I have shown how some concepts mentioned in the article, or related to the matter it discusses, can be shown near and/or in the article and linked to the appropriate pages on Wikipedia:

...

Games like Badland are eye-popping and sharp, and even games like The Walking Dead, while clearly a big step down in graphics detail from the PC version, animates well and looks very watchable. Content looks upscaled, even when not in 1080p, to look like, to a casual observer, it's coming from an Xbox or PlayStation.

...

Learn more about related concepts on Wikipedia:


In addition to the actual content publisher, news aggregators like CheckDeck, Google News etc. can also provide this feature along with whatever short title, description or feed content they show for each article.

Here I show the steps that a programmer can implement to provide this Wikipedia concept-linking functionality and enhance the user's knowledge-gaining experience.

Assume that you have the text of the article that you want to enhance by annotating words that are present in the article and/or somewhat related to the subject it discusses.

The text is, for example:
  
Games like Badland are eye-popping and sharp, and even games like The Walking Dead, while clearly a big step down in graphics detail from the PC version, animates well and looks very watchable. Content looks upscaled, even when not in 1080p, to look like, to a casual observer, it's coming from an Xbox or PlayStation.

Step 1. Finding concepts which are present in some component of the article


Annotate the text with an open source software library such as DBPedia Spotlight. To keep the explanation simple, I give an example that uses its free web service endpoint (as explained here) with the Linux command line tool curl. In your program, this can be implemented with the HTTP client libraries available for various programming languages. In Java, for example, you can use Apache HttpComponents.

curl http://spotlight.dbpedia.org/rest/annotate \
  --data-urlencode "text=Games like Badland are eye-popping and sharp, and even games like The Walking Dead, while clearly a big step down in graphics detail from the PC version, animates well and looks very watchable. Content looks upscaled, even when not in 1080p, to look like, to a casual observer, it's coming from an Xbox or PlayStation." \
  --data "confidence=0.2" \
  --data "support=20" \
  -H "Accept: application/json"
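As a rough sketch, the same request could be made from Python with only the standard library. The endpoint URL is copied from the curl command above and may no longer be live; the function names `build_annotate_request` and `annotate` are made up for this illustration.

```python
import json
import urllib.parse
import urllib.request

# Endpoint copied from the curl example above; it may have moved since.
SPOTLIGHT_URL = "http://spotlight.dbpedia.org/rest/annotate"


def build_annotate_request(text, confidence=0.2, support=20):
    """Build (but do not send) the POST request for the annotate endpoint."""
    # urlencode plays the role of curl's --data-urlencode / --data options.
    form = urllib.parse.urlencode(
        {"text": text, "confidence": confidence, "support": support}
    ).encode("utf-8")
    return urllib.request.Request(
        SPOTLIGHT_URL,
        data=form,
        headers={"Accept": "application/json"},
        method="POST",
    )


def annotate(text, confidence=0.2, support=20, timeout=10):
    """Send the request and return the decoded JSON response as a dict."""
    request = build_annotate_request(text, confidence, support)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping the request construction separate from the network call makes it possible to inspect or test the request without contacting the service.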

The response of the command from web service looks like:

{
  "@text": "Games like Badland are eye-popping and sharp, and even games like The Walking Dead, while clearly a big step down in graphics detail from the PC version, animates well and looks very watchable. Content looks upscaled, even when not in 1080p, to look like, to a casual observer, it's coming from an Xbox or PlayStation.",
  "@confidence": "0.2",
  "@support": "20",
  "@types": "",
  "@sparql": "",
  "@policy": "whitelist",
  "Resources":   [
        {
      "@URI": "http://dbpedia.org/resource/The_Walking_Dead",
      "@support": "108",
      "@types": "Freebase:/comic_books/comic_book_story,Freebase:/comic_books,Freebase:/comic_books/comic_book_series,Freebase:/fictional_universe/work_of_fiction,Freebase:/fictional_universe",
      "@surfaceForm": "Walking Dead",
      "@offset": "70",
      "@similarityScore": "0.09552688896656036",
      "@percentageOfSecondRank": "0.6057624587389117"
    },
    {
      "@URI": "http://dbpedia.org/resource/PlayStation",
      "@support": "2228",
      "@types": "Freebase:/business/product_line,Freebase:/business,Freebase:/exhibitions/exhibition_sponsor,Freebase:/exhibitions",
      "@surfaceForm": "PlayStation",
      "@offset": "306",
      "@similarityScore": "0.1963232457637787",
      "@percentageOfSecondRank": "0.8996430368659503"
    } ,
...
]
}

The response is in JSON format, and you can parse it in your favorite programming language with a parser library, e.g. Gson for Java.
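For illustration, here is a minimal Python sketch that parses such a response with the standard json module. The sample is a trimmed copy of the response shown above; the helper name `extract_entities` is hypothetical, and the handling of a missing Resources field is an assumption about how the service behaves when nothing is detected.

```python
import json

# Trimmed copy of the response shown above (two resources only).
SAMPLE_RESPONSE = """
{
  "@confidence": "0.2",
  "Resources": [
    {"@URI": "http://dbpedia.org/resource/The_Walking_Dead",
     "@surfaceForm": "Walking Dead", "@offset": "70"},
    {"@URI": "http://dbpedia.org/resource/PlayStation",
     "@surfaceForm": "PlayStation", "@offset": "306"}
  ]
}
"""


def extract_entities(response_json):
    """Return (uri, surface_form, offset) tuples from a Spotlight response."""
    data = json.loads(response_json)
    entities = []
    # Assumption: "Resources" may be absent when nothing was detected.
    for resource in data.get("Resources", []):
        # The service returns the offset as a string, so convert it to int.
        entities.append(
            (resource["@URI"], resource["@surfaceForm"], int(resource["@offset"]))
        )
    return entities


for uri, surface_form, offset in extract_entities(SAMPLE_RESPONSE):
    print(offset, surface_form, uri)
```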

The most important field in this JSON object is Resources. Resources is an array of objects, each of which represents a named entity (or concept) detected in the text supplied to the DBPedia Spotlight service through the HTTP request explained above (with the curl tool). Each object contains two important fields: @URI and @offset. The @URI refers to the concept in the DBPedia dataset, which is essentially a structured version of the human-readable Wikipedia pages. The @offset is the zero-based character position, in the text passed as input, of the word (like PlayStation) for which the matching semantic concept (like http://dbpedia.org/resource/PlayStation) was derived from DBPedia, so @offset can be used for underlining or highlighting the words in the article that we aim to link with the appropriate Wikipedia pages.

DBPedia provides a programmatic interface to query the database through the SPARQL query language, whose syntax resembles SQL. Our goal is to get the link of the Wikipedia page (like http://en.wikipedia.org/wiki/Playstation) from which this concept (like http://dbpedia.org/resource/PlayStation) was derived in DBPedia. This can be done by running the following SPARQL query for the given @URI against DBPedia's publicly accessible SPARQL endpoint, which offers an interactive as well as a programmatic interface.

select distinct ?WikipediaLink 
from <http://dbpedia.org>
where {
<http://dbpedia.org/resource/PlayStation> <http://www.w3.org/ns/prov#wasDerivedFrom> ?WikipediaLink
}

The result is a table containing a single field, WikipediaLink, with a value that looks like: http://en.wikipedia.org/wiki/PlayStation?oldid=548612253

Yay! We got a Wikipedia page's link to a concept called PlayStation present in our article!

Replace the http://dbpedia.org/resource/PlayStation part of the SPARQL query above with another DBPedia concept, like http://dbpedia.org/resource/Xbox, and the response will contain the Wikipedia page link corresponding to that concept, like http://wikipedia.org/wiki/Xbox.

To programmatically query a SPARQL endpoint like http://dbpedia.org/sparql, or a semantic web database like Virtuoso (which allows you to load and query the DBPedia dataset), you can use a library or framework that acts as a query engine interface between the endpoint and your application program. In Java you have the Apache Jena framework with its ARQ module for processing SPARQL queries, much as a JDBC driver lets you query an RDBMS through SQL. You can easily find tutorials on getting started with SPARQL and Apache Jena online.
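If you would rather not pull in a framework, a plain HTTP call can also work, since a SPARQL endpoint accepts the query as a `query` URL parameter and can return results in the standard SPARQL JSON results format. The Python sketch below assumes the http://dbpedia.org/sparql endpoint named above is reachable and honours the `Accept` header; all function names are invented for this example, and stripping the `?oldid=...` part is an optional design choice, not something the endpoint requires.

```python
import json
import urllib.parse
import urllib.request

SPARQL_ENDPOINT = "http://dbpedia.org/sparql"  # endpoint named in the text


def build_query(dbpedia_uri):
    """Fill a DBPedia resource URI into the query shown earlier."""
    # Note: the URI is inserted as-is, so only pass URIs you trust.
    return (
        "select distinct ?WikipediaLink "
        "from <http://dbpedia.org> "
        "where { <%s> <http://www.w3.org/ns/prov#wasDerivedFrom> ?WikipediaLink }"
        % dbpedia_uri
    )


def parse_links(results_json):
    """Pull the ?WikipediaLink values out of a SPARQL JSON results document."""
    bindings = json.loads(results_json)["results"]["bindings"]
    return [row["WikipediaLink"]["value"] for row in bindings]


def strip_revision(link):
    """Drop the ?oldid=... query string so the link points at the live page."""
    return urllib.parse.urlsplit(link)._replace(query="").geturl()


def wikipedia_links(dbpedia_uri, timeout=10):
    """Query the endpoint over HTTP and return cleaned Wikipedia links."""
    params = urllib.parse.urlencode({"query": build_query(dbpedia_uri)})
    request = urllib.request.Request(
        SPARQL_ENDPOINT + "?" + params,
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = response.read().decode("utf-8")
    return [strip_revision(link) for link in parse_links(body)]
```

Calling `wikipedia_links("http://dbpedia.org/resource/Xbox")` would then run the query for the Xbox concept instead of PlayStation.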

We have to repeat this process of retrieving the Wikipedia page link from the DBPedia @URI for each of the objects in the Resources array of the JSON response received from the DBPedia Spotlight service. Remember that each object in the Resources array also has a useful @offset field, which is used to highlight or underline the appropriate word in the article text and link it to its corresponding page on Wikipedia.
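One possible way to turn offsets into links is sketched below in Python. It assumes that the offset counts characters from 0 in the original text (which matches the offset 70 for "Walking Dead" in the sample response), that the output is HTML, and that overlapping or mismatching annotations should simply be skipped. The function name `link_entities` and the tuple layout are choices made for this example.

```python
import html


def link_entities(text, annotations):
    """Wrap each annotated span of `text` in an HTML link.

    `annotations` is a list of (offset, surface_form, link) tuples.
    """
    pieces = []
    cursor = 0  # position in `text` up to which output has been produced
    for offset, surface_form, link in sorted(annotations):
        end = offset + len(surface_form)
        # Skip spans that overlap an earlier one or do not match the text.
        if offset < cursor or text[offset:end] != surface_form:
            continue
        # Escape the plain text between the previous link and this one.
        pieces.append(html.escape(text[cursor:offset]))
        pieces.append(
            '<a href="%s">%s</a>'
            % (html.escape(link, quote=True), html.escape(surface_form))
        )
        cursor = end
    pieces.append(html.escape(text[cursor:]))
    return "".join(pieces)


sentence = "coming from an Xbox or PlayStation."
print(link_entities(sentence, [
    (sentence.index("Xbox"), "Xbox", "http://en.wikipedia.org/wiki/Xbox"),
    (sentence.index("PlayStation"), "PlayStation",
     "http://en.wikipedia.org/wiki/PlayStation"),
]))
```

Walking through the text once, in offset order, avoids the problem of earlier insertions shifting the positions of later ones.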

It's just a matter of choice whether to show the concepts linked to Wikipedia pages by underlining them as words in the text itself, or by displaying them separately as related tags (or words, concepts, topics, nouns, whatever you call them!) near the article's content such as its title, description, image or feed content.

It's also a matter of choice whether to actually link these extracted concepts to Wikipedia, or to use them as a mechanism that helps the user navigate through the content of your application or site, by linking them to internal pages of your application or to any application or site other than Wikipedia.


Step 2. Finding and showing concepts (with links to corresponding pages on Wikipedia) that are not present in some component of given article but which might be related to the subject matter being discussed by the article

If you are a news aggregator, you probably have access to 'similar' articles that talk about the same subject matter as the given article; in our case, e.g., Amazon launching FireTV. Suppose you want to show links to Wikipedia pages whose corresponding concepts or words are not explicitly mentioned in the article (and therefore have almost no chance of being detected in Step 1) but which might be related to its subject matter. You can then take the concepts with Wikipedia links derived from the articles similar to the given article, and include some of them with this article.

For example, here is an article from BBC (let's call it Article no. 2) which covers the same story but discusses a different aspect of it. Let's say that, as a news aggregator, you have also processed Article no. 2 and retrieved Wikipedia page links from it, one of which is Netflix. If your system somehow believes that this article is similar to the earlier one from CNET (let's call it Article no. 1), then you can show the concepts with Wikipedia links derived from Article no. 2 alongside Article no. 1, and vice versa. If you have a pool of articles similar to an article, you can maintain a pool of concepts with Wikipedia links derived from those articles. You can then choose from that pool the concepts with the highest frequency of appearance/detection across all these similar articles, and show them with the given article as related concepts with Wikipedia links.
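The frequency-based selection described above could look roughly like the following Python sketch. It assumes you already have, for each article, the list of concepts detected in Step 1. How similar articles are found is outside its scope, and the concept names other than PlayStation, Xbox and Netflix are invented sample data. Counting a concept once per article and breaking ties alphabetically are design choices of this example.

```python
from collections import Counter


def related_concepts(own_concepts, similar_articles_concepts, top_n=5):
    """Rank concepts from similar articles that the article itself lacks.

    `own_concepts` lists the concepts found in the given article;
    `similar_articles_concepts` holds one list of concepts per similar article.
    """
    own = set(own_concepts)
    counts = Counter()
    for concepts in similar_articles_concepts:
        # Count each concept once per article, ignoring ones already shown.
        counts.update(set(concepts) - own)
    # Highest frequency first; ties are broken alphabetically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [concept for concept, _ in ranked[:top_n]]


article_1 = ["PlayStation", "Xbox"]
similar = [
    ["Netflix", "Xbox", "Amazon.com"],  # e.g. Article no. 2
    ["Netflix", "Hulu"],
    ["Netflix", "Amazon.com"],
]
print(related_concepts(article_1, similar, top_n=2))
```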



I hope I have made it understandable enough for any programmer with some introductory knowledge of the semantic web, databases and RESTful HTTP services to annotate articles with the Wikipedia links of explicitly present and related concepts, making the knowledge-gaining experience more convenient for the user.


