Spell checking in Python

Even a highly literate person can mistype a word or make a silly mistake, and such slips are not always caught on a re-read. Specialized tools can check texts for correctness without direct human involvement.

Let's look at how the Python module pyenchant can detect misspelled words and suggest corrections.

When preparing documentation, contracts, reports and similar texts, correct spelling matters. Current software, MS Office Word in particular, highlights misspelled words, which is convenient and, just as importantly, easy to see.

But we may need to automate error detection when such software is unavailable, or, when it is available, to do it without opening the document (or many documents). The text may also simply be very long, so checking it by hand would take a lot of time.

This is where Python and the pyenchant module come in: the module not only checks the spelling of words but also suggests corrections.

The module is installed with the standard command:

pip install pyenchant

The code for checking the spelling of a word is quite simple:

import enchant  # the import name is enchant, not pyenchant
dictionary = enchant.Dict("en_US")
print(dictionary.check("driver"))

Output: True

Now let's deliberately misspell the word:

print(dictionary.check("draiver"))

Output: False

We can also print the list of suggested corrections:

print(dictionary.suggest("draiver"))

Output: ['driver', 'drainer', 'Rivera']

Readers will probably wonder whether the module can check Russian words, and the answer is yes. It is not available by default, though: a dictionary is needed. One can be found, for example, in the LibreOffice installation directory:

…\LibreOffice\share\extensions\dict-ru

We need two files from there: ru_RU.aff and ru_RU.dic. Place them in the enchant module's folder where the dictionaries for other languages are stored:

C:\…\Python\Python36\site-packages\enchant\data\mingw64\share\enchant\hunspell

Now it is enough to pass the string "ru_RU" when creating the Dict object, and we can work with Russian words.

Let's return to our example with the misspelled word driver. The suggest() method gave us a list of possible corrections, and by hand we could of course easily pick the right one.

But what if we want to automate this step as well?

Let's use Python's difflib module, which compares sequences such as strings. We will try to pick the word "driver" from the list:

import enchant
import difflib

woi = "draiver"  # the word of interest
sim = dict()

dictionary = enchant.Dict("en_US")
suggestions = set(dictionary.suggest(woi))

for word in suggestions:
    measure = difflib.SequenceMatcher(None, woi, word).ratio()
    sim[measure] = word

print("Correct word is:", sim[max(sim.keys())])

A few comments on the code. The sim dictionary stores the similarity scores (in the range 0 to 1) between the word being checked ("draiver") and each word proposed by the suggest() method of the Dict class. We obtain the scores in the loop by calling the ratio() method of the SequenceMatcher class and store them in the dictionary, using the score as the key. At the end we take the word that is closest to the one being checked.

Output: Correct word is: driver
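
Because the loop above uses the similarity score as the dictionary key, two suggestions with the same score would overwrite each other. A minimal alternative sketch, assuming we only need the single best match, uses max() with a key function instead. The helper name closest_suggestion is hypothetical, and the suggestion list is copied from the suggest() output shown earlier so that the example runs without pyenchant installed.

```python
import difflib


def closest_suggestion(misspelled, suggestions):
    """Return the suggestion with the highest SequenceMatcher ratio."""
    return max(
        suggestions,
        key=lambda candidate: difflib.SequenceMatcher(
            None, misspelled, candidate
        ).ratio(),
    )


# Suggestions copied from the suggest() output shown earlier in the article.
best = closest_suggestion("draiver", ["driver", "drainer", "Rivera"])
print("Correct word is:", best)
```

If several candidates tie, max() returns the first one it encounters, so the order of the suggestion list acts as the tie-breaker.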

So far we have worked with individual words, but it is useful to know how to handle whole blocks of text. The SpellChecker class is meant for this task:

from enchant.checker import SpellChecker

checker = SpellChecker("en_US")
checker.set_text("I have got a new kar and it is ameizing.")
print([i.word for i in checker])

Output: ['kar', 'ameizing']

As you can see, this is no harder than working with individual words. In addition, SpellChecker supports filters that skip special sequences which are not errors, such as e-mail addresses. To use them, import the filter class (or classes, if there are several) and pass a list of filters to the filters parameter of SpellChecker:

from enchant.checker import SpellChecker
from enchant.tokenize import EmailFilter, URLFilter

checker_with_filters = SpellChecker("en_US", filters=[EmailFilter])
checker_with_filters.set_text("Hi! My neim is John and thiz is my email: [email protected]")
print([i.word for i in checker_with_filters])

Output: ['neim', 'thiz']

As you can see, the e-mail address was not reported as a sequence containing spelling errors.
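
To illustrate the idea behind such a filter, here is a rough standard-library sketch. It is not how pyenchant implements EmailFilter: the tiny KNOWN_WORDS set stands in for a real dictionary, the regular expression is a deliberately simple approximation of an e-mail address, and john@example.com is a placeholder address.

```python
import re

# Hypothetical stand-in for a real dictionary: anything outside it is reported.
KNOWN_WORDS = {"hi", "my", "is", "john", "and", "email", "name", "this"}

# Deliberately simple approximation of an e-mail address.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def misspelled_words(text):
    """Return unknown words, skipping anything that looks like an e-mail."""
    without_emails = EMAIL_PATTERN.sub(" ", text)
    words = re.findall(r"[A-Za-z']+", without_emails)
    return [word for word in words if word.lower() not in KNOWN_WORDS]


sample = "Hi! My neim is John and thiz is my email: john@example.com"
found = misspelled_words(sample)
print(found)
```

Without the substitution step, the pieces of the address ("example", "com") would be tokenized as ordinary words and checked against the dictionary.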

Thus, by combining the enchant and difflib modules we get a genuinely powerful tool that not only detects errors but also picks corrections with fairly high accuracy and can apply them to the text.


Pure Python Spell Checking based on Peter
Norvig’s blog post on setting
up a simple spell checking algorithm.

It uses a Levenshtein Distance
algorithm to find permutations within an edit distance of 2 from the
original word. It then compares all permutations (insertions, deletions,
replacements, and transpositions) to known words in a word frequency
list. Those words that are found more often in the frequency list are
more likely the correct results.
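
The approach described above can be sketched in a few lines, following the structure of Peter Norvig's post. This is only an illustration, not pyspellchecker's actual code: WORD_COUNTS is a toy frequency list invented for the example, and the alphabet is assumed to be lowercase English.

```python
from collections import Counter

# Toy frequency list; a real one would be built from a large corpus.
WORD_COUNTS = Counter(
    {"spelling": 5, "happening": 3, "something": 10, "is": 50, "here": 20}
)
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings one insertion, deletion, replacement or transposition away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def edits2(word):
    """All strings two edits away."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}


def known(words):
    """Keep only the words present in the frequency list."""
    return {w for w in words if w in WORD_COUNTS}


def correction(word):
    """Prefer known words at the smallest distance, then the most frequent."""
    candidates = (
        known([word]) or known(edits1(word)) or known(edits2(word)) or {word}
    )
    return max(candidates, key=lambda w: WORD_COUNTS[w])


fixed_one = correction("speling")    # one edit away from "spelling"
fixed_two = correction("hapenning")  # two edits away from "happening"
print(fixed_one, fixed_two)
```

The number of generated strings grows quickly with word length at distance 2, which is why a smaller distance is attractive for long words.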

pyspellchecker supports multiple languages including English, Spanish,
German, French, Portuguese, Arabic and Basque. For information on how the dictionaries were
created and how they can be updated and improved, please see the
Dictionary Creation and Updating section of the readme!

pyspellchecker supports Python 3.

pyspellchecker allows for the setting of the Levenshtein Distance (up to two) to check.
For longer words, it is highly recommended to use a distance of 1 and not the
default 2. See the quickstart to find how one can change the distance parameter.

Installation

The easiest method to install is using pip:

pip install pyspellchecker

To build from source:

git clone https://github.com/barrust/pyspellchecker.git
cd pyspellchecker
python -m build

For python 2.7 support, install release 0.5.6
but note that no future updates will support python 2.

pip install pyspellchecker==0.5.6

Quickstart

After installation, using pyspellchecker should be fairly straightforward:

from spellchecker import SpellChecker

spell = SpellChecker()

# find those words that may be misspelled
misspelled = spell.unknown(['something', 'is', 'hapenning', 'here'])

for word in misspelled:
    # Get the one `most likely` answer
    print(spell.correction(word))

    # Get a list of `likely` options
    print(spell.candidates(word))

If the Word Frequency list is not to your liking, you can add additional
text to generate a more appropriate list for your use case.

from spellchecker import SpellChecker

spell = SpellChecker()  # loads default word frequency list
spell.word_frequency.load_text_file('./my_free_text_doc.txt')

# if I just want to make sure some words are not flagged as misspelled
spell.word_frequency.load_words(['microsoft', 'apple', 'google'])
spell.known(['microsoft', 'google'])  # will return both now!
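
For readers unfamiliar with the term, a word frequency list is just a mapping from words to how often they occur in some text. The sketch below builds one with the standard library. It is an illustration of the concept under a simple assumption (words are runs of lowercase letters and apostrophes), not the loader pyspellchecker uses, and word_frequencies is a hypothetical helper name.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count lowercase word occurrences in a piece of free text."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


corpus = "The parser reads the file. The parser then validates the file header."
frequencies = word_frequencies(corpus)
print(frequencies.most_common(3))
```

Domain-specific text fed into such a list makes domain terms (product names, jargon) count as known words instead of being flagged.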

If the words that you wish to check are long, it is recommended to reduce the
distance to 1. This can be accomplished either when initializing the spell
check class or after the fact.

from spellchecker import SpellChecker

spell = SpellChecker(distance=1)  # set at initialization

# do some work on longer words

spell.distance = 2  # set the distance parameter back to the default

Non-English Dictionaries

pyspellchecker supports several default dictionaries as part of the default
package. Each is simple to use when initializing the dictionary:

from spellchecker import SpellChecker

english = SpellChecker()  # the default is English (language='en')
spanish = SpellChecker(language='es')  # use the Spanish Dictionary
russian = SpellChecker(language='ru')  # use the Russian Dictionary
arabic = SpellChecker(language='ar')   # use the Arabic Dictionary

The currently supported dictionaries are:

  • English — ‘en’

  • Spanish — ‘es’

  • French — ‘fr’

  • Portuguese — ‘pt’

  • German — ‘de’

  • Russian — ‘ru’

  • Arabic — ‘ar’

  • Basque — ‘eu’

  • Latvian — ‘lv’

Dictionary Creation and Updating

The creation of the dictionaries is, unfortunately, not an exact science. I have provided a script that, given a text file of sentences (in this case from OpenSubtitles), generates a word frequency list based on the words found within the text. The script then attempts to *clean up* the word frequency list by, for example, removing words with invalid characters (usually from other languages), removing low-count terms (misspellings?), and enforcing rules where available (no more than one accent per word in Spanish). It then removes words found in a list of known words to exclude, and finally adds words that are known to be missing or that were removed for being too low frequency.

The script can be found here: `scripts/build_dictionary.py`. The original word frequency list parsed from OpenSubtitles can be found in the `scripts/data/` folder along with each language’s include and exclude text files.
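
The clean-up steps described above can be sketched roughly as follows. This is not the code in `scripts/build_dictionary.py`: the function name, its parameters, the lowercase-English validity rule, and the choice to give included words a count equal to the threshold are all assumptions made for illustration.

```python
import re


def clean_frequencies(counts, min_count, include, exclude):
    """Drop invalid, rare and excluded words, then add the include list."""
    valid_word = re.compile(r"[a-z]+")  # assumed rule: lowercase English only
    cleaned = {
        word: count
        for word, count in counts.items()
        if count >= min_count
        and valid_word.fullmatch(word)
        and word not in exclude
    }
    for word in include:
        # Assumption: included words get the minimum count if absent.
        cleaned.setdefault(word, min_count)
    return cleaned


raw = {"hello": 120, "helo": 2, "привет": 15, "damn": 30, "pyspellchecker": 1}
result = clean_frequencies(
    raw, min_count=5, include={"pyspellchecker"}, exclude={"damn"}
)
print(result)
```

Here "helo" is dropped for being rare, "привет" for containing characters outside the assumed alphabet, "damn" for being on the exclude list, and "pyspellchecker" is restored by the include list even though its raw count was too low.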

Any help in updating and maintaining the dictionaries would be greatly appreciated. To contribute, start a discussion on GitHub or open a pull request that updates the include and exclude files.

Additional Methods

On-line documentation is available; below is the cliff-notes version of some of the available functions:

correction(word): Returns the most probable result for the
misspelled word

candidates(word): Returns a set of possible candidates for the
misspelled word

known([words]): Returns those words that are in the word frequency
list

unknown([words]): Returns those words that are not in the frequency
list

word_probability(word): The frequency of the given word out of all
words in the frequency list

The following are less likely to be needed by the user but are available:

edit_distance_1(word): Returns a set of all strings at a Levenshtein
Distance of one based on the alphabet of the selected language

edit_distance_2(word): Returns a set of all strings at a Levenshtein
Distance of two based on the alphabet of the selected language

Credits

  • Peter Norvig blog post on setting up a simple spell checking algorithm

  • P Lison and J Tiedemann, 2016, OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016)

For any kind of text processing or analysis, checking the spelling of words is a basic requirement. This article discusses several ways to check the spelling of words and to correct misspelled ones.

Using textblob library

First, install the textblob library using pip from the command prompt.

pip install textblob

You can also install this library in Jupyter Notebook as:

import sys
!{sys.executable} -m pip install textblob

Program for spelling checker:

from textblob import TextBlob

a = "cmputr"
print("original text: " + str(a))

b = TextBlob(a)
print("corrected text: " + str(b.correct()))

Output:

original text: cmputr
corrected text: computer


Using pyspellchecker library

You can install this library as below.
Using pip:

pip install pyspellchecker

In Jupyter Notebook:

import sys
!{sys.executable} -m pip install pyspellchecker

Spelling checker program using pyspellchecker:

from spellchecker import SpellChecker

spell = SpellChecker()

misspelled = spell.unknown(["cmputr", "watr", "study", "wrte"])

for word in misspelled:
    print(spell.correction(word))
    print(spell.candidates(word))

Output:

computer
{'caput', 'caputs', 'compute', 'computor', 'impute', 'computer'}
water
{'water', 'watt', 'warr', 'wart', 'war', 'wath', 'wat'}
write
{'wroe', 'arte', 'wre', 'rte', 'wrote', 'write'}


Using JamSpell

To achieve the best quality when correcting spelling, dictionary-based methods are not enough: you need to consider the words around the error. JamSpell is a Python spell checking library based on a language model. It makes different corrections depending on the context.
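
To make the idea of context-dependent correction concrete, here is a toy sketch that picks among candidate words using bigram counts. It is not how JamSpell works internally (its language model is far more sophisticated). The BIGRAMS table and the pick_by_context helper are invented for this illustration.

```python
from collections import Counter

# Toy bigram counts standing in for a trained language model.
BIGRAMS = Counter(
    {
        ("the", "best"): 40,
        ("the", "beat"): 5,
        ("the", "belt"): 3,
        ("drum", "beat"): 30,
        ("drum", "best"): 1,
    }
)


def pick_by_context(previous_word, candidates):
    """Choose the candidate that most often follows previous_word."""
    return max(candidates, key=lambda word: BIGRAMS[(previous_word, word)])


candidates = ["best", "beat", "belt"]  # plausible fixes for "begt"
after_the = pick_by_context("the", candidates)
after_drum = pick_by_context("drum", candidates)
print(after_the, after_drum)
```

The same misspelling is resolved differently depending on the preceding word, which a purely dictionary-based method cannot do.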

1) Install swig3

apt-get install swig3.0   # for linux
brew install swig@3       # for mac

2) Install jamspell

pip install jamspell

3) Download a language model for your language

import jamspell

corrector = jamspell.TSpellCorrector()
corrector.LoadLangModel('Downloads/en_model.bin')

print(corrector.FixFragment('I am the begt spell cherken!'))
print(corrector.GetCandidates(['i', 'am', 'the', 'begt', 'spell', 'cherken'], 3))
print(corrector.GetCandidates(['i', 'am', 'the', 'begt', 'spell', 'cherken'], 5))

Output:

u'I am the best spell checker!'
(u'best', u'beat', u'belt', u'bet', u'bent')
(u'checker', u'chicken', u'checked', u'wherein', u'coherent', ...)

Last Updated: 07 Oct, 2020

Autocorrect


Spelling corrector in python. Currently supports English, Polish, Turkish, Russian, Ukrainian, Czech, Portuguese, Greek, Italian, Vietnamese, French and Spanish, but you can easily add new languages.

Based on: https://github.com/phatpiglet/autocorrect and Peter Norvig’s spelling corrector.

Installation

pip install autocorrect

Examples

>>> from autocorrect import Speller
>>> spell = Speller()
>>> spell("I'm not sleapy and tehre is no place I'm giong to.")
"I'm not sleepy and there is no place I'm going to."

>>> spell = Speller('pl')
>>> spell('ptaaki latatją kluczmm')
'ptaki latają kluczem'

Speed

%timeit spell("I'm not sleapy and tehre is no place I'm giong to.")
373 µs ± 2.09 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit spell("There is no comin to consiousnes without pain.")
150 ms ± 2.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

As you see, for some words correction can take ~200ms. If speed is important for your use case (e.g. chatbot) you may want to use option ‘fast’:

spell = Speller(fast=True)
%timeit spell("There is no comin to consiousnes without pain.")
344 µs ± 2.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Now, the correction should always work in microseconds, but words with double typos (like ‘consiousnes’) won’t be corrected.

OCR

When cleaning up OCR, replacements are the large majority of errors. If this is the case, you may want to use the option ‘only_replacements’:

spell = Speller(only_replacements=True)
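
To show what restricting the search to replacements means, the sketch below generates only the single-character substitutions of a word. It is a hypothetical illustration, not autocorrect's implementation. The function name and the lowercase English alphabet are assumptions.

```python
import string


def replacement_edits(word, alphabet=string.ascii_lowercase):
    """All strings that differ from word by exactly one substituted letter."""
    return {
        word[:i] + letter + word[i + 1:]
        for i in range(len(word))
        for letter in alphabet
        if letter != word[i]
    }


variants = replacement_edits("cat")
print(len(variants))
```

Insertions, deletions and transpositions are never generated, so the candidate set is much smaller than with the full set of edits, and every candidate has the same length as the input.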

Custom word sets

If you wish to use your own set of words for autocorrection, you can pass an nlp_data argument:

spell = Speller(nlp_data=your_word_frequency_dict)

Where your_word_frequency_dict is a dictionary which maps words to their average frequencies in your text. If you want to change the default word set only a bit, you can just edit spell.nlp_data parameter, after spell was initialized.

Adding new languages

First, define special letters, by adding entries in word_regexes and alphabets dicts in autocorrect/constants.py.

Now, you need a bunch of text. Easiest way is to download wikipedia.
For example for Russian you would go to:
https://dumps.wikimedia.org/ruwiki/latest/
and download ruwiki-latest-pages-articles.xml.bz2

bzip2 -d ruwiki-latest-pages-articles.xml.bz2

After that:

First, edit the autocorrect.constants dictionaries in order to accommodate regexes and dictionaries for your language.

Then:

>>> from autocorrect.word_count import count_words
>>> count_words('ruwiki-latest-pages-articles.xml', 'ru')
tar -zcvf autocorrect/data/ru.tar.gz word_count.json

For the correction to work well, you need to cut out rarely used words. First, in test_all.py, write test words for your language, and add them to optional_language_tests the same way as it’s done for other languages. It’s good to have at least 30 words. Now run:

python test_all.py find_threshold ru

and see which threshold value has the fewest badly corrected words. After that, manually delete all the words with fewer occurrences than the threshold value you found from the file in ru.tar.gz (it’s already sorted, so it should be easy).

To distribute this language support to others, you will need to upload your tar.gz file to IPFS (for example with Pinata, which will pin this file so it doesn’t disappear), and then add its path to ipfs_paths in constants.py. (Tip: first put this file inside a folder, and upload the folder to IPFS, so that the downloaded file has the correct filename.)

Good luck!

I’m fairly new to Python and NLTK. I am busy with an application that can perform spell checks (replaces an incorrectly spelled word with the correct one).
I’m currently using the Enchant library on Python 2.7, PyEnchant and the NLTK library. The code below is a class that handles the correction/replacement.

import enchant
from nltk.metrics import edit_distance

class SpellingReplacer:
    def __init__(self, dict_name='en_GB', max_dist=2):
        self.spell_dict = enchant.Dict(dict_name)
        self.max_dist = max_dist  # use the argument instead of a hard-coded 2

    def replace(self, word):
        # correctly spelled words are returned unchanged
        if self.spell_dict.check(word):
            return word
        suggestions = self.spell_dict.suggest(word)

        # accept the top suggestion only if it is close enough to the input
        if suggestions and edit_distance(word, suggestions[0]) <= self.max_dist:
            return suggestions[0]
        else:
            return word

I have written a function that takes in a list of words and executes replace() on each word and then returns a list of those words, but spelled correctly.

def spell_check(word_list):
    checked_list = []
    for item in word_list:
        replacer = SpellingReplacer()
        r = replacer.replace(item)
        checked_list.append(r)
    return checked_list

>>> word_list = ['car', 'colour']
>>> spell_check(word_list)
['car', 'color']

Now, I don’t really like this because it isn’t very accurate, and I’m looking for a better way to check and replace misspelled words. I also need something that can pick up spelling mistakes like "caaaar". Are there better ways to perform spelling checks out there? If so, what are they? How does Google do it? Their spelling suggester is very good.

Any suggestions?

asked Dec 18, 2012 at 7:18 by Mike Barnes

You can use the autocorrect lib to spell check in python.
Example Usage:

from autocorrect import Speller

spell = Speller(lang='en')

print(spell('caaaar'))
print(spell('mussage'))
print(spell('survice'))
print(spell('hte'))

Result:

caesar
message
service
the

answered Jan 16, 2018 at 11:48 by Rakesh

I’d recommend starting by carefully reading this post by Peter Norvig. (I had to do something similar and I found it extremely useful.)

The following function, in particular has the ideas that you now need to make your spell checker more sophisticated: splitting, deleting, transposing, and inserting the irregular words to ‘correct’ them.

alphabet = 'abcdefghijklmnopqrstuvwxyz'  # letters tried for replaces and inserts

def edits1(word):
    # every way of cutting the word into a left and a right part
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b     for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

Note: The above is one snippet from Norvig’s spelling corrector

And the good news is that you can incrementally add to and keep improving your spell-checker.

Hope that helps.

answered Dec 18, 2012 at 17:13 by Ram Narasimhan

The best ways to do spell checking in Python are SymSpell, BK-Tree, or Peter Norvig’s method.

The fastest one is SymSpell.

Method 1: pyspellchecker (reference link)

This library is based on Peter Norvig’s implementation.

pip install pyspellchecker

from spellchecker import SpellChecker

spell = SpellChecker()

# find those words that may be misspelled
misspelled = spell.unknown(['something', 'is', 'hapenning', 'here'])

for word in misspelled:
    # Get the one `most likely` answer
    print(spell.correction(word))

    # Get a list of `likely` options
    print(spell.candidates(word))

Method 2: SymSpell Python

pip install -U symspellpy

answered Feb 17, 2019 at 18:24 by Shaurya Uppal

Maybe it is too late, but I am answering for future searches.
To correct spelling mistakes, you first need to make sure the word is not absurd or slang with repeated letters, such as caaaar or amazzzing. So we first need to get rid of the extra letters. English words usually have at most two identical letters in a row (e.g., hello), so we remove the extra repetitions first and then check the spelling.
To remove the extra letters, you can use Python's regular expression module.
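
A minimal sketch of that pre-processing step, using only the standard re module, might look like this. The function name squeeze_repeats is hypothetical, and the rule (collapse any run of three or more identical characters to two) is just the heuristic described above, so the result still has to go through a spell checker.

```python
import re


def squeeze_repeats(word):
    """Collapse any run of three or more identical characters down to two."""
    return re.sub(r"(.)\1{2,}", r"\1\1", word)


squeezed = [squeeze_repeats(w) for w in ["caaaar", "amazzzing", "hello"]]
print(squeezed)
```

Note that "caaaar" becomes "caar" rather than "car": the regular expression only normalizes the repetition, and the remaining error is left for the spell checker to fix.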

Once this is done, use the pyspellchecker library to correct the spelling.

For implementation visit this link: https://rustyonrampage.github.io/text-mining/2017/11/28/spelling-correction-with-python-and-nltk.html

answered Apr 3, 2019 at 10:10 by Rishabh Sahrawat

Try jamspell — it works pretty well for automatic spelling correction:

import jamspell

corrector = jamspell.TSpellCorrector()
corrector.LoadLangModel('en.bin')

corrector.FixFragment('Some sentnec with error')
# u'Some sentence with error'

corrector.GetCandidates(['Some', 'sentnec', 'with', 'error'], 1)
# ('sentence', 'senate', 'scented', 'sentinel')

answered Aug 28, 2020 at 23:00 by Fippo

In terminal:

pip install gingerit

Code:

from gingerit.gingerit import GingerIt

text = input("Enter text to be corrected: ")
result = GingerIt().parse(text)
corrections = result['corrections']
correct_text = result['result']

print("Correct Text:", correct_text)
print()
print("CORRECTIONS")
for d in corrections:
    print("________________")
    print("Previous:", d['text'])
    print("Correction:", d['correct'])
    print("Definition:", d['definition'])
 

answered Mar 28, 2021 at 16:21 by pouya barari

You can also try:

pip install textblob

from textblob import TextBlob
txt="machne learnig"
b = TextBlob(txt)
print("after spell correction: "+str(b.correct()))

after spell correction: machine learning

answered Nov 30, 2021 at 2:47 by Mayur Patil

Spell corrector:

You need a corpus (a word list) saved on your desktop. If you store it elsewhere, change the path in the code. I have added a small GUI using tkinter. Note that this only tackles non-word errors!

from tkinter import Tk, Label, Entry, Button


def min_edit_dist(word1, word2):
    len_1 = len(word1)
    len_2 = len(word2)
    # x[i][j] is the edit distance between word1[:i] and word2[:j]
    x = [[0] * (len_2 + 1) for _ in range(len_1 + 1)]
    # base cases: distance to or from the empty string
    for i in range(len_1 + 1):
        x[i][0] = i
    for j in range(len_2 + 1):
        x[0][j] = j
    for i in range(1, len_1 + 1):
        for j in range(1, len_2 + 1):
            if word1[i - 1] == word2[j - 1]:
                x[i][j] = x[i - 1][j - 1]
            else:
                x[i][j] = min(x[i][j - 1], x[i - 1][j], x[i - 1][j - 1]) + 1
    # the last cell holds the distance between the full words
    return x[len_1][len_2]


def retrieve_text():
    word1 = app_entry.get()
    path = r"C:\Documents and Settings\Owner\Desktop\Dictionary.txt"
    with open(path, "r") as ffile:
        # strip the newline so it does not count towards the distance
        words = [line.strip() for line in ffile]
    print("Suggestions coming right up, count till 10")
    for candidate in words:
        if min_edit_dist(word1, candidate) <= 2:
            print(candidate)


if __name__ == "__main__":
    app_win = Tk()
    app_win.title("spell")
    app_label = Label(app_win, text="Enter the incorrect word")
    app_label.pack()
    app_entry = Entry(app_win)
    app_entry.pack()
    app_button = Button(app_win, text="Get Suggestions", command=retrieve_text)
    app_button.pack()
    # Initialize GUI loop
    app_win.mainloop()

pyspellchecker is one of the best solutions for this problem. The pyspellchecker library is based on Peter Norvig’s blog post.
It uses a Levenshtein Distance algorithm to find permutations within an edit distance of 2 from the original word.
There are two ways to install this library. The official documentation highly recommends using the pipenv package.

  • install using pip
pip install pyspellchecker
  • install from source
git clone https://github.com/barrust/pyspellchecker.git
cd pyspellchecker
python setup.py install

The following code is the example provided in the documentation:

from spellchecker import SpellChecker

spell = SpellChecker()

# find those words that may be misspelled
misspelled = spell.unknown(['something', 'is', 'hapenning', 'here'])

for word in misspelled:
    # Get the one `most likely` answer
    print(spell.correction(word))

    # Get a list of `likely` options
    print(spell.candidates(word))

answered Sep 10, 2020 at 15:30 by Sabesan

You can use the spell function from autocorrect. You need to install the package first (Anaconda is preferred). It only works on single words, not sentences, so that is a limitation you will face.

from autocorrect import spell
print(spell('intrerpreter'))
# output: interpreter

answered Dec 28, 2018 at 11:17 by Saurabh Tripathi

pip install scuse

from scuse import scuse

obj = scuse()

checkedspell = obj.wordf("spelling you want to check")

print(checkedspell)

answered Sep 7, 2022 at 7:05 by mrithul e

answered Mar 12, 2020 at 14:02 by Nabin
