An Introduction to NLP

Blake Tolman
5 min read · Mar 17, 2021

With the advancement of technology, people have started using voice commands and automated text responses to interact over the internet. The most common examples of this are Amazon Alexa, Apple’s Siri, and automated chatbots. All of this is founded upon Natural Language Processing, or NLP for short: the study of how computers can interact with humans through their native language. This post introduces some foundational concepts in Natural Language Processing to explain how it works.

Text Data:

Working with text data comes with a unique set of problems and solutions that other types of datasets don’t have. Text data often requires more cleaning and preprocessing than other data in order to get it into a format where statistical methods or machine learning models can be applied. Common problems when working with text include punctuation and capitalization, which cause trouble even for a seemingly simple task like counting words in a sentence. Consider the following sentence:

“Apple shareholders have had a great year. Apple’s stock price has gone steadily upwards — Apple even broke a trillion-dollar valuation, continuing the dominance of this tech stock.”

If you try to count how many times the word “Apple” appears, the logical answer would be three. However, a naive Python script would report that “Apple” only appears twice, because it views “Apple” and “Apple’s” as two separate words. Capitalization and punctuation cause similar trouble: “Apple”, “apple”, and “apple.” (note the period after the last one) are all different strings to Python.
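A quick sketch using only the standard library (and a naive whitespace split, which is exactly the behavior described above) makes the problem visible:

```python
from collections import Counter

text = ("Apple shareholders have had a great year. Apple's stock price has gone "
        "steadily upwards — Apple even broke a trillion-dollar valuation, "
        "continuing the dominance of this tech stock.")

# Split on whitespace and count the raw tokens
counts = Counter(text.split())

print(counts["Apple"])    # 2 -- "Apple's" is treated as a different token
print(counts["Apple's"])  # 1
```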

This is where the first step of data cleaning comes in. Text data is typically lowercased and stripped of punctuation, with the goal of producing word tokens. The sentence “Where did you get those coconuts?”, when cleaned and tokenized, would look more like [‘where’, ‘did’, ‘you’, ‘get’, ‘those’, ‘coconuts’] (a minimal version of this cleanup is sketched below). Even then, there is still more work to do: words like “run”, “ran”, and “running” are still viewed as different tokens, and depending on your end goal they might need to be grouped together.
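Here is that minimal cleanup using only Python’s built-in string tools; a real project would more likely reach for a proper tokenizer such as NLTK’s word_tokenize:

```python
import string

sentence = "Where did you get those coconuts?"

# Lowercase, strip punctuation, then split into word tokens
cleaned = sentence.lower().translate(str.maketrans("", "", string.punctuation))
tokens = cleaned.split()

print(tokens)  # ['where', 'did', 'you', 'get', 'those', 'coconuts']
```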

Stemming and Lemmatization:

Depending on the project, it is sometimes best to leave “run” and “runs” as different tokens. Often, though, it is not. NLP strategies such as stemming and lemmatization are used to deal with the problems of plurality and tense. With the example of “run”, “runs”, and “ran”, all forms of the word would be reduced down to just “run”.

Stemming accomplishes this by removing the ends of words where the ending signals some sort of derivational change. For instance, adding an ‘s’ to the end of a word usually makes it plural, so a stemming algorithm given the word “cats” would return “cat”. Note that stems do not have to make sense as actual English words. For example, “ponies” would be reduced to “poni”, not “pony”. Stemming is a crude, heuristic process built on rule sets that tell the algorithm how to strip each word down to its stem. It is cruder than lemmatization, but it is also easier to implement.
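As a quick sketch (assuming NLTK is installed), the classic Porter stemmer shows both the useful and the crude side of this approach:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

for word in ["cats", "ponies", "running", "ran"]:
    print(word, "->", stemmer.stem(word))

# cats    -> cat
# ponies  -> poni   (not a real English word)
# running -> run
# ran     -> ran    (irregular forms are left untouched)
```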

Lemmatization accomplishes much the same thing as stemming, but does it in a more sophisticated way, by examining the morphology of words and attempting to reduce each word to its most basic dictionary form, or lemma. Note that the results often end up a bit different from stemming’s. The example below shows some of the differences in results:
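A rough comparison with NLTK’s WordNet lemmatizer (which needs a one-time nltk.download(‘wordnet’) and benefits from a part-of-speech hint) makes the difference concrete:

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()  # requires the WordNet corpus

print(stemmer.stem("ponies"))                # 'poni' -- not a real word
print(lemmatizer.lemmatize("ponies"))        # 'pony' -- a dictionary form
print(stemmer.stem("ran"))                   # 'ran'  -- the suffix rules don't catch it
print(lemmatizer.lemmatize("ran", pos="v"))  # 'run'  -- the POS hint resolves the verb
```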

Finally, you may have intuited that many words in a text are pretty much useless and contain little to no actual information, for instance words such as “the” and “of”. These are called stop words, and they are often removed after tokenization is complete in order to reduce the dimensionality of the corpus down to only the words that carry important information. Popular NLP frameworks and toolkits such as NLTK contain stop word lists for many languages, which allows us to easily loop through a tokenized corpus and remove any stop words we find.
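For example, using NLTK’s English list (available after a one-time nltk.download(‘stopwords’)), removal is a simple comprehension:

```python
from nltk.corpus import stopwords

# NLTK ships stop word lists for many languages
stop_words = set(stopwords.words("english"))

tokens = ["the", "price", "of", "the", "stock", "rose"]
filtered = [token for token in tokens if token not in stop_words]

print(filtered)  # ['price', 'stock', 'rose']
```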

Vectorization:

Once the data is cleaned and tokenized, the text is ready to be converted into vectors. One of the most basic but useful ways of vectorizing text data is to simply count the number of times each word appears in the corpus. When working with a single document, a single vector is made where each element corresponds to a single word. When working with multiple documents, the vectors can be stored in a DataFrame, where each column represents a word and each row represents a document.
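A sketch of that layout, assuming pandas and a recent version of scikit-learn (whose CountVectorizer handles the tokenizing and counting for us):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the stock rose again today",
    "the stock fell and the market fell with it",
]

# Learn the vocabulary and count word occurrences per document
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

# One row per document, one column per word
df = pd.DataFrame(counts.toarray(), columns=vectorizer.get_feature_names_out())
print(df)
```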

TF-IDF:

TF-IDF stands for Term Frequency-Inverse Document Frequency. It combines two individual metrics, the TF and the IDF, and is used when we have multiple documents. It is based on the idea that rare words contain more information about the content of a document than words that are used many times throughout all the documents. For instance, if we treated every article in a newspaper as a separate document, the number of times the word “he” or “she” is used probably doesn’t tell us much about what a given article is about, but the number of times “touchdown” appears is a good signal that the article is probably about sports.

Term Frequency:
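Exact formulations vary between libraries, but a common definition is the raw count of a term in a document, normalized by the document’s length:

$$ \mathrm{tf}(t, d) = \frac{\text{number of times } t \text{ appears in } d}{\text{total number of terms in } d} $$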

Inverse Document Frequency:
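A common definition (implementations differ in the details; scikit-learn, for instance, adds smoothing terms) is the log of the total number of documents N divided by the number of documents containing the term, and the final score is the product of the two parts:

$$ \mathrm{idf}(t, D) = \log\frac{N}{|\{d \in D : t \in d\}|}, \qquad \text{tf-idf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D) $$

In practice you rarely compute this by hand; a sketch with scikit-learn’s TfidfVectorizer (again assuming a recent version) looks like this:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "he threw a touchdown in the final seconds",
    "she said the weather would improve this week",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

# The shared word "the" gets a lower weight within each document than
# words unique to one document, such as "touchdown" or "weather"
df = pd.DataFrame(tfidf.toarray(), columns=vectorizer.get_feature_names_out())
print(df.round(2))
```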

With all of this completed, you can begin the process of putting words into their respective categories and working with them. These are just some of the foundational steps of Natural Language Processing, and things get more in-depth as task requirements become more complicated.
