Challenges on Text Processing/Analysis

Text Analysis 

I like this definition of text mining from Oxford: “the process or practice of examining large collections of written resources in order to generate new information.” The goal of text mining is to discover relevant information in text by transforming the text into data that can be used for further analysis.

Challenges on Text Processing/Analysis

1. Character Encoding: Computers store text using many different character encodings. The programmer has to know or detect the encoding of each input and decode it correctly, otherwise the text comes out garbled.

Ex: ASCII, UTF-8, UTF-16, Latin-1, etc.
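
To make the encoding problem concrete, here is a minimal Python sketch. The sample word and the helper name decode_with_fallback are my own choices for illustration, not part of any library. It shows the same word taking different byte forms, what happens when bytes are decoded with the wrong encoding, and a simple fallback strategy.

```python
# The same text produces different bytes under different encodings.
text = "café"
encoded = {name: text.encode(name) for name in ("utf-8", "utf-16", "latin-1")}
for name, data in encoded.items():
    print(name, len(data), data)

# Decoding UTF-8 bytes as Latin-1 does not fail; it silently gives wrong text.
garbled = encoded["utf-8"].decode("latin-1")
print(garbled)


def decode_with_fallback(data, encodings=("utf-8", "latin-1")):
    """Try each encoding in order and return (text, encoding_used).

    Hypothetical helper for illustration. Latin-1 maps every byte to a
    character, so it never fails and must come last.
    """
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the given encodings could decode the data")


print(decode_with_fallback(encoded["utf-8"]))
print(decode_with_fallback(encoded["latin-1"]))
```

The fallback only guesses. When the source of the text declares its encoding (a file header, an HTTP header), that declaration is the better thing to trust.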

2. Text also contains punctuation marks and numbers, and different applications need to treat them differently.
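
As a rough sketch of what "different treatment" can mean, the Python snippet below shows two options: removing punctuation, and replacing numbers with a placeholder. The function names and the NUM placeholder are assumptions made for this example. Which option is right depends on the application.

```python
import re
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(text):
    """Remove ASCII punctuation, e.g. before building a bag of words."""
    return text.translate(_PUNCT_TABLE)


def mask_numbers(text, placeholder="NUM"):
    """Replace integers and simple decimals with a placeholder token."""
    return re.sub(r"\d+(?:\.\d+)?", placeholder, text)


sentence = "The price rose 3.5% to $42!"
print(strip_punctuation(sentence))                # The price rose 35 to 42
print(mask_numbers(sentence))                     # The price rose NUM% to $NUM!
print(strip_punctuation(mask_numbers(sentence)))  # The price rose NUM to NUM
```

Note that stripping punctuation first turns 3.5 into 35, so the order of the steps matters when the numbers themselves carry meaning.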

3. Word Segmentation: Text analysis needs the individual words of a text. In languages such as Hindi and English, words are separated by whitespace characters, so they are easy to extract. It is much harder for languages that do not use spaces as word boundaries, such as Chinese and Japanese.
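
The difference is easy to see in a small Python sketch. Splitting on whitespace works for the English and Hindi samples but returns the whole Chinese sentence as one token. For the Chinese case I use forward maximum matching, one simple dictionary-based approach, with a tiny hand-made vocabulary that is purely hypothetical. Real segmenters rely on much larger dictionaries or statistical models.

```python
def whitespace_tokens(text):
    """Split on runs of whitespace; enough for space-delimited languages."""
    return text.split()


def max_match(text, vocab, max_len=4):
    """Forward maximum matching: at each position take the longest
    substring found in vocab, falling back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


print(whitespace_tokens("I like mangoes"))   # 3 tokens
print(whitespace_tokens("मुझे आम पसंद है"))     # 4 tokens
print(whitespace_tokens("我喜欢吃芒果"))        # 1 token: no spaces to split on

# Hypothetical toy vocabulary, just large enough for the sample sentence.
toy_vocab = {"我", "喜欢", "吃", "芒果"}
print(max_match("我喜欢吃芒果", toy_vocab))
```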

4. Identifying the synonyms of a word helps in search, because a query can then match text that uses a different word for the same thing.

5. Abbreviations, acronyms, and spelling variations play an important role in understanding a word.

6. Sentence boundary detection is important for text analysis.

7. Word Sense Disambiguation (WSD): Words often have multiple meanings, and the meaning of a word depends on the context in which it appears.

Ex: "Bank" can mean a financial institution or the side of a river.
Choosing the correct sense automatically is difficult.
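
One classic idea for WSD is to compare the words around the ambiguous word with a short description (gloss) of each sense and pick the sense with the largest overlap. The Python sketch below follows that idea, in the spirit of the simplified Lesk algorithm. The glosses, the stopword list, and the function name are all hand-written assumptions for this example, not taken from a real dictionary or library.

```python
import re

# Hypothetical hand-written glosses for the two senses of "bank".
SENSES = {
    "bank": {
        "financial institution":
            "an institution that accepts deposits and lends money to customers",
        "side of a river":
            "the sloping land along the edge of a river or stream of water",
    }
}

# Tiny illustrative stopword list so common words do not count as overlap.
STOPWORDS = {"a", "an", "the", "of", "at", "to", "and", "or", "that", "he", "they", "from"}


def words(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS


def disambiguate(word, sentence):
    """Return the sense whose gloss shares the most words with the sentence.
    On a tie, the sense listed first wins."""
    context = words(sentence) - {word}
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(context & words(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


print(disambiguate("bank", "He deposited money at the bank"))
print(disambiguate("bank", "They fished from the bank of the river"))
```

With glosses this short, a sentence often shares no words with any sense, which is one reason the problem is hard in practice.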

8. Coreference Resolution: Consider the sentence
Ram likes to eat mango but he doesn't like banana.
In this example, "he" refers to Ram. Coreference resolution is the task of finding such links automatically.