Skip to content Skip to sidebar Skip to footer

Python Frameworks For Nlp?

I am working on a project wherein I have to extract the following information from a set of articles (the articles could be on anything): People Find the names of any people prese

Solution 1:

@sel and @3kt really good answers. OP you are looking for Entity Extraction, commonly referred to as Named entity recognition. There exist many APIs to perform this. But the first question you need to ask yourself is

What is the structure of my DATA? or rather,

Are my sentences good English sentences?

In the sense of figuring out whether the data you are working with is consistently grammatically correct, well capitalized and is well structured. These factors are paramount when it comes to extracting entities. The data I worked with were tweets. ABSOLUTE NIGHTMARE!! I performed a detailed analysis on the performance of various APIs on entity extraction and I shall share with you what I found.

Here's APIs that perform fabulous entity extraction-

NLTK has a handy reference book which talks in-depth about its functions with multiple examples. NLTK does not perform well on noisy data(tweets) because it has been trained on structured data.NLTK is absolute garbage for badly capitalized words(Eg, DUCK, Verb, CHAIR). Moreover, it is slightly less precise when compared to other APIs. It is great for structured data or curated data from News articles and Scholarly reports. It is a great learning tool for beginners.

Alchemy is simpler to implement and performs very well in categorizing the named entities.It has great precision when compared to the APIs I have mentioned.However, it has a certain transaction cost. You can only perform 1000 queries in a day! It identifies twitter-handles and can handle awkward capitalization.

IMHO sPacy API is probably the best. It's open source. It outperforms the Alchemy API but is not as precise. Categorizes entities almost as well Alchemy.

Choosing which API should be a simple problem for you now that you know how each API is likely to behave according to the data you have.

EXTRA -POLYGLOT is yet another API.

Here is a blog post that performs entity extraction in NLTK.

There is a beautiful paper by Alan Ritter that might go over your head. But it is the standard for entity extraction(particularly in noisy data) at a professional level. You could refer to it every now and then to understand complex concepts like LDA or SVM for capitalisation.

Solution 2:

What you are actually looking for is called in literature 'Named entity Recognition' or NER. You might like to take a look at this tutorial:

http://textminingonline.com/how-to-use-stanford-named-entity-recognizer-ner-in-python-nltk-and-other-programming-languages

One easy way of solving this problem partially this problem is using regular expressions to extract words having the patterns that you can find in this paper to extract peoples names. This of course might lead to extracting all the categories you are looking for i.e. the topics and the campanies names as well.

There is also an API that you can use, that actually gives the same results you are looking for, which is called Alchemy. Unfortunatelly no documentation is available to explain the method they use to extract the topics nor the people's names.

Hope this helps.

Solution 3:

You should take a look at NLTK.

Finding names and companies can be achieved by tagging the recovered text, and extracting proper nouns (tagged NNP). Finding the topic is a bit more tricky, and may require some machine learning on a given set of article.

Also, since we're talking about articles, I recommend the newspaper module, that can recover those from their URLs, and do some basic nlp operations (summary, keywords).

Post a Comment for "Python Frameworks For Nlp?"