NLP
Text Normalisation
- Tokenisation
    - Splitting text into tokens
    - Usually words
    - Can be multiple words, e.g. proper nouns
        - New York
    - Emoticons
    - Hashtags
    - May need some named entity recognition
    - Penn Treebank tokenisation standard
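A toy regex tokeniser sketch for the cases above (the pattern and the emoticon set are illustrative assumptions, not the Penn Treebank rules):

```python
import re

# Toy tokeniser sketch: keeps hashtags and a few common emoticons as
# single tokens; everything here is an illustrative assumption.
TOKEN_RE = re.compile(r"""
    \#\w+                 # hashtags, e.g. #nlp
  | [:;][-~]?[)(DPp]      # a few emoticons, e.g. :-) ;P
  | \w+(?:'\w+)?          # words, keeping simple clitics like don't
  | [^\w\s]               # any other single punctuation mark
""", re.VERBOSE)

def tokenise(text):
    return TOKEN_RE.findall(text)

print(tokenise("Loving New York :-) #nlp"))
# → ['Loving', 'New', 'York', ':-)', '#nlp']
```

Note that "New York" still comes out as two tokens; merging multi-word proper nouns is where named entity recognition comes in.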
- Byte-pair encoding
    - Word-level standards can't handle unseen words
    - Encode as subwords, e.g. -est, -er
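A minimal sketch of learning BPE merges (the toy corpus and merge count are assumptions; each step merges the most frequent adjacent symbol pair into a new subword symbol):

```python
from collections import Counter

def get_pair_counts(vocab):
    # Count adjacent symbol pairs, weighted by word frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Rewrite every word, fusing occurrences of `pair` into one symbol.
    merged = {}
    for word, freq in vocab.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy word frequencies with an end-of-word marker.
vocab = {
    ("l", "o", "w", "</w>"): 5,
    ("l", "o", "w", "e", "r", "</w>"): 2,
    ("n", "e", "w", "e", "s", "t", "</w>"): 6,
    ("w", "i", "d", "e", "s", "t", "</w>"): 3,
}

for _ in range(4):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print("merged:", best)
```

On this corpus the learned subwords include "est", which is exactly how an unseen word like "widest" gets segmented into known pieces.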
- Lemmatisation
    - Determining the roots of words
    - Verb infinitives
    - Find the lemma; derived word-forms are inflections (inflected forms)
    - Critical for morphologically complex languages, e.g. Arabic
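A toy lookup-based lemmatiser sketch (the table below is purely illustrative; real lemmatisers use large lexicons plus morphological analysis):

```python
# Hand-written map from inflected word-forms to their lemma.
# Every entry here is an illustrative assumption, not a real lexicon.
LEMMAS = {
    "am": "be", "is": "be", "are": "be", "was": "be", "were": "be",
    "ran": "run", "running": "run",
    "better": "good", "best": "good",
    "mice": "mouse",
}

def lemmatise(word):
    word = word.lower()
    return LEMMAS.get(word, word)  # fall back to the word itself

print([lemmatise(w) for w in ["The", "mice", "were", "running"]])
# → ['the', 'mouse', 'be', 'run']
```

The irregular forms ("were" → "be", "mice" → "mouse") show why lemmatisation needs a lexicon rather than suffix rules alone.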
- Stemming
    - Simpler than lemmatisation
    - Just removes suffixes
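A crude suffix-stripping stemmer sketch (the suffix list and minimum-stem-length guard are assumptions; real stemmers such as Porter's apply ordered rewrite rules):

```python
# Longest suffixes first, so "-est" wins over "-s". Illustrative only.
SUFFIXES = ["ation", "ing", "est", "ers", "er", "ed", "ly", "s"]

def stem(word):
    for suffix in SUFFIXES:
        # Strip the first matching suffix, keeping at least 3 characters.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([stem(w) for w in ["jumping", "fastest", "normalisation", "cats"]])
# → ['jump', 'fast', 'normalis', 'cat']
```

"normalis" is not a dictionary word, which is the usual trade-off: stems only need to be consistent, not valid lemmas.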
- Normalising word formats
- Segmenting sentences
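A naive sentence-segmentation sketch: split after `.`, `!`, or `?` when followed by whitespace and a capital letter (the regex is an assumption and will mis-split abbreviations like "Dr."; real segmenters use abbreviation lists or trained models):

```python
import re

# Split point: sentence-final punctuation, whitespace, then a capital.
SENT_RE = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def split_sentences(text):
    return SENT_RE.split(text.strip())

print(split_sentences("Tokenise first. Then normalise! Ready?"))
# → ['Tokenise first.', 'Then normalise!', 'Ready?']
```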