NLP

Text Normalisation

  • Tokenisation
    • Splitting running text into tokens
    • Usually words
    • Can be multi-word or non-word units
      • Proper nouns, e.g. New York
      • Emoticons
      • Hashtags
    • May need some named entity recognition
    • Penn Treebank standard
    • Byte-pair encoding
      • Word-level tokenisers can’t handle unseen words
      • Encode words as subword units
        • e.g. -est, -er
  • Lemmatisation
    • Determining roots of words
    • Verb infinitives
    • Find lemma
      • Derived forms are inflections, or inflected word-forms
    • Critical for morphologically complex languages
      • Arabic
  • Stemming
    • Simpler than lemmatisation
    • Just removing suffixes
  • Normalising word formats
  • Segmenting sentences
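
The steps above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not a reference implementation: the regex in `tokenize` is a made-up pattern (not the Penn Treebank rules), `learn_bpe` is a minimal version of the byte-pair merge loop on a toy word-count table, and `stem` is a naive suffix stripper with a hypothetical suffix list (not the Porter algorithm). All three function names are chosen here for illustration.

```python
import re
from collections import Counter

# Illustrative pattern (an assumption, not the Penn Treebank rules):
# hashtags, a few simple emoticons, words with internal apostrophes or
# hyphens, then any other single non-space character.
TOKEN_RE = re.compile(r"#\w+|[:;=][-']?[)(DPp]|\w+(?:['’-]\w+)*|\S")


def tokenize(text):
    """Split text into tokens, keeping hashtags and emoticons whole."""
    return TOKEN_RE.findall(text)


def learn_bpe(word_counts, num_merges):
    """Learn byte-pair merges from a {word: frequency} table."""
    # Each word starts as a tuple of characters plus an end-of-word marker.
    vocab = {tuple(w) + ("</w>",): c for w, c in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the most frequent pair fused into one symbol.
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + count
        vocab = new_vocab
    return merges


def stem(word, suffixes=("est", "ing", "er", "ed", "s")):
    """Naive stemmer: strip the first matching suffix, keeping >= 3 letters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    print(tokenize("Loving New York :) #nyc!"))
    print(learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 3))
    print([stem(w) for w in ["lowest", "walking", "newer", "best"]])
```

Note that `tokenize` still splits "New York" into two tokens; grouping it into one unit is where named entity recognition would come in. On the toy counts, the merges build up the suffix "est" as a single subword, which is how an unseen word such as "lowest" could still be encoded from known pieces.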