AUTOMATIC STEMMING AND LEMMATIZATION PROCESS FOR SINDHI TEXT

Mazhar Ali Dootio and Asim Imdad Wagan

ABSTRACT

Stemming and lemmatization are preprocessing steps in NLP for language modeling and information retrieval systems. Search, information retrieval and other text analysis problems for Sindhi text grow day by day as the volume of Sindhi data on the internet increases. Many research studies have addressed NLP problems for languages other than Sindhi; the resulting lack of NLP resources for Sindhi lemmatization and stemming motivates a research methodology for solving Sindhi computational linguistics problems. Little research has been done on Sindhi stemming and lemmatization, so this work adds to that small body of work. Sindhi is one of the significant languages of the subcontinent, with a lexicon rich in inflections, diacritics and morphological structures. The development of computational linguistics and NLP resources for Sindhi text therefore plays a significant role in solving computational, NLP, information retrieval, machine translation and other text analysis problems of the Sindhi language. The research methodology structures the research problem of this study and addresses the lemmatization and stemming problems of Sindhi by proposing novel algorithms. The algorithms identify Sindhi lexicons in Sindhi text and perform lemmatization and stemming. NLP tools are developed on the basis of the proposed algorithms for Sindhi text lemmatization and word stemming: Algorithm-1 is proposed for the lemmatization of Sindhi lexicons and Algorithm-2 for the stemming of Sindhi lexicons.
This research may benefit linguistics research, search engines, information retrieval systems and machine translation, as well as computational linguists working on Sindhi corpus analysis. However, more research is required on Word2Vec for Sindhi text, topic modeling, sentiment and semantic analysis, and feature distribution for information retrieval and language variation analysis.

KEYWORDS: Sindhi NLP, Lemma, Stemming, computational linguistics, lexicon