Tuesday, April 06, 2021

Text Processing in Python using NLTK and spaCy

  


The Internet has connected the world, and social media platforms such as Facebook, Twitter and Reddit give people a place to express their opinions and feelings about a topic. The proliferation of smartphones has directly increased the use of these platforms. For instance, 96% of active Facebook users, or 2,240 million people, access Facebook through smartphones and tablets [1].


The growth in social media usage has increased the volume of text data and boosted research in Natural Language Processing (NLP), for example in Information Retrieval and Sentiment Analysis. Most of the time, the documents or text files to be analyzed are huge and contain a lot of noise, so using the raw text directly for analysis is impractical. Hence, text processing is essential to provide clean input for modelling and analysis.


Text processing consists of two main phases: tokenization and normalization [2]. Tokenization is the process of splitting a longer string of text into smaller pieces, or tokens [3]. Normalization refers to converting numbers to their word equivalents, removing punctuation, converting all text to the same case, removing stopwords, removing noise, lemmatizing and stemming.


Stemming: removing affixes (suffixes, prefixes, infixes, circumfixes). For example, “running” becomes “run”.

Lemmatization: capturing the canonical form based on a word’s lemma. For example, “better” becomes “good” [4].
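
As a rough illustration of the two phases, here is a minimal sketch that tokenizes a sentence with a regular expression and then normalizes it by lowercasing, dropping punctuation and stopwords, and stemming. The `tokenize` and `normalize` helpers and the tiny `STOP_WORDS` set are hypothetical names made up for this example. Only `PorterStemmer` comes from NLTK, and it needs no corpus download.

```python
import re
import string

from nltk.stem import PorterStemmer

# Hypothetical mini stopword list, for illustration only.
STOP_WORDS = {"the", "is", "a", "an", "and", "in", "on", "to", "of", "are"}
PUNCTUATION = set(string.punctuation)


def tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def normalize(tokens):
    """Lowercase, drop punctuation and stopwords, then stem each token."""
    stemmer = PorterStemmer()
    lowered = [t.lower() for t in tokens]
    no_punct = [t for t in lowered if t not in PUNCTUATION]
    no_stop = [t for t in no_punct if t not in STOP_WORDS]
    return [stemmer.stem(t) for t in no_stop]


tokens = tokenize("The dogs are running in the park!")
stems = normalize(tokens)
print(tokens)
print(stems)
```

A stemmer only strips affixes by rule, so it cannot map “better” to “good”. That mapping needs a lemmatizer backed by a dictionary.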


Text Processing with NLTK

0. Import all needed libraries

1. Tokenization

2. Normalization

a. Removing Stopwords

b. Lemmatization

3. Obtain the Cleaned Tokens


Text Processing with spaCy

1. Tokenization + Lemmatization

2. Normalization

a. Removing Noise

b. Removing Stopwords

3. Obtain the Cleaned Tokens
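
A comparable sketch for the spaCy steps is shown below. It is not the code from the full article. It assumes the small English pipeline `en_core_web_sm` is installed (`python -m spacy download en_core_web_sm`), and the `clean_text` helper and the choice of which token attributes count as "noise" are assumptions made for illustration. If the model is missing, the sketch falls back to a blank tokenizer-only pipeline, which yields no lemmas.

```python
import spacy

# Assumption: "en_core_web_sm" is installed. If it is not, fall back to a
# blank English pipeline so the sketch still runs (tokenizer only, no lemmas).
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    nlp = spacy.blank("en")


def clean_text(text):
    """Hypothetical helper: tokenize + lemmatize, then drop noise and stopwords."""
    # 1. Tokenization + lemmatization happen in a single pipeline call
    doc = nlp(text)

    cleaned = []
    for token in doc:
        # 2a. Removing noise: punctuation, whitespace, URLs, number-like tokens
        if token.is_punct or token.is_space or token.like_url or token.like_num:
            continue
        # 2b. Removing stopwords
        if token.is_stop:
            continue
        # Use the lemma when the pipeline provides one, else the raw text
        lemma = token.lemma_ or token.text
        cleaned.append(lemma.lower())

    # 3. Obtain the cleaned tokens
    return cleaned


cleaned = clean_text("In 2021, the striped bats are hanging on their feet!")
print(cleaned)
```

Unlike the NLTK sketch, spaCy tags each token's part of speech before lemmatizing when a trained pipeline is loaded, so verb forms are usually reduced as well. The exact lemmas depend on the model version.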


Full article:

https://towardsdatascience.com/text-processing-in-python-29e86ea4114c


References:

[1] M. Iqbal, “Facebook Revenue and Usage Statistics (2020),” 8 March 2021. [Online]. Available: https://www.businessofapps.com/data/facebook-statistics/.

[2] M. Mayo, “A General Approach to Preprocessing Text Data,” 2017. [Online]. Available: https://www.kdnuggets.com/2017/12/general-approach-preprocessing-text-data.html. [Accessed 12 June 2020].

[3] D. Subramanian, “Text Mining in Python: Steps and Examples,” 22 August 2019. [Online]. Available: https://medium.com/towards-artificial-intelligence/text-mining-in-python-steps-and-examples-78b3f8fd913b. [Accessed 12 June 2020].

[4] M. Mayo, “Natural Language Processing Key Terms, Explained,” 2017. [Online]. Available: https://www.kdnuggets.com/2017/02/natural-language-processing-key-terms-explained.html.

[5] “Natural Language Processing In Julia (Text Analysis),” JCharisTech, 1 May 2018. [Online]. Available: https://jcharistech.wordpress.com/2018/05/01/natural-language-processing-in-julia-text-analysis/.

[6] D. Jurafsky and J. H. Martin, “Speech and Language Processing,” 3 December 2020. [Online]. Available: https://web.stanford.edu/~jurafsky/slp3/.

[7] M.F. Goh, “Text Normalization with spaCy and NLTK,” 29 November 2020. [Online]. Available: https://towardsdatascience.com/text-normalization-with-spacy-and-nltk-1302ff430119.

