
Cspell performances with large (7k) words list in cspell configuration #2360

Closed
3 of 18 tasks
nvuillam opened this issue Jan 30, 2022 · 9 comments

Comments

@nvuillam
Contributor

Info

Kind of Issue

  • runtime - command-line tools
  • building / compiling
  • security
  • change in behavior
  • crash / error

Which Tool or library

  • cspell -- the command-line spelling tool
  • cspell-tools -- used for building dictionary files
  • cspell-lib -- library that does the actual spell checking.
  • cspell-trie -- tool for working with trie files.

Which Version

Version: 5.16.0

Issue with supporting library?

  • No
  • cspell-glob -- library for matching glob patterns
  • cspell-io -- thin file i/o library
  • cspell-trie-lib - trie lib
  • cspell-trie2-lib - trie lib alternate format

OS:

  • Macos
  • Linux
  • Windows
  • Other

version:

python:3.9.7-alpine3.13

Bug Description

Describe the bug

[screenshot of the cspell run output]

I have a repo with 1500+ files to check for spelling mistakes.
Before I added a lot of words to .cspell.json, performance was almost acceptable,
but now that I have added a large .cspell.json, a run takes 250 seconds.
Notes:

  • the 1500 files are passed as cspell arguments: cspell file1 file2 file3 ...
  • my dockerfile already contains ENV NODE_OPTIONS="--max-old-space-size=8192"

Is this expected performance, or are there ways to improve it?

Thanks for your tool and your answer :)

@Jason3S
Collaborator

Jason3S commented Jan 30, 2022

@nvuillam,

Checking 1500 files should not be an issue.

It is most likely 1 or 2 files causing it to slow down.

To give the spell checker a list of files to check, use: --file-list. It takes a path to a file or can read from stdin.

Scan files in current directory:

ls -1 | cspell --file-list stdin

I suggest using a custom dictionary to avoid having a large .cspell.json: Custom Dictionaries - CSpell
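A custom dictionary file is plain text with one word per line. A minimal sketch of what such a file could contain (these words are placeholders, not from the actual repo):

```text
megalinter
dockerfile
cspell
```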

@Jason3S
Collaborator

Jason3S commented Jan 30, 2022

I took a look at your PR.

Something to try:

  1. Move the words into a dictionary file:
     jq -r ".words | .[]" .cspell.json > .cspell-words.txt

  2. Change your .cspell.json to be:

{
  "version": "0.2",
  "ignorePaths": [
    "**/node_modules/**",
    "**/vscode-extension/**",
    "**/.git/**",
    ".vscode",
    "megalinter",
    "package-lock.json",
    "report"
  ],
  "language": "en",
  "dictionaryDefinitions": [
    {
      "name": "custom-dictionary",
      "path": "./.cspell-words.txt",
      "addWords": true
    }
  ],
  "dictionaries": [
    "custom-dictionary"
  ],
  "words": []
}
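For environments without jq, the same extraction can be sketched in plain Python (the function name `extract_words` is mine, not part of cspell):

```python
# Move the "words" array out of .cspell.json into a one-word-per-line
# dictionary file, equivalent to the jq one-liner above.
import json


def extract_words(config_path=".cspell.json", dict_path=".cspell-words.txt"):
    with open(config_path) as f:
        config = json.load(f)
    words = config.get("words", [])
    with open(dict_path, "w") as f:
        f.write("\n".join(words) + "\n")
    return len(words)
```

After running it, empty the "words" array in .cspell.json and reference the dictionary file via dictionaryDefinitions as shown above.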

@Jason3S
Collaborator

Jason3S commented Jan 30, 2022

@nvuillam,

I was looking at MegaLinter to see how it called cspell. I even created an issue: oxsecurity/megalinter#1220 . Then I realized you were the maintainer.

@Jason3S
Collaborator

Jason3S commented Jan 30, 2022

@nvuillam,

I compared the difference between the two configurations:

  1. 7k words in .cspell.json
    time cspell  "**"
    CSpell: Files checked: 2493, Issues found: 2565 in 129 files
    cspell "**"  319.47s user 9.57s system 127% cpu 4:18.59 total
    
  2. 7k words in custom dictionary.
    time cspell  "**"
    CSpell: Files checked: 2493, Issues found: 2565 in 129 files
    cspell "**"  50.56s user 2.70s system 116% cpu 45.862 total. 
    

There is a clear speed improvement. I would have to look into the reason, but it could be related to:

  1. Building the internal dictionary from "words" for each file checked (this is very likely).
  2. Scanning for config files for each file checked; this also happens, but can be turned off with "noConfigSearch": true.
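Point 1 can be illustrated with a toy model (pure Python, no cspell internals; the counts are stand-ins for the 7k words and the files being checked):

```python
import time

WORDS = [f"word{i}" for i in range(7000)]  # stand-in for a 7k "words" list
FILES = 100                                # stand-in for the files to check


def check_rebuild_each_file():
    # mimics rebuilding the in-memory dictionary from "words" for every file
    for _ in range(FILES):
        dictionary = set(WORDS)
        assert "word42" in dictionary


def check_build_once():
    # mimics a custom dictionary loaded a single time up front
    dictionary = set(WORDS)
    for _ in range(FILES):
        assert "word42" in dictionary


t0 = time.perf_counter(); check_rebuild_each_file(); rebuild = time.perf_counter() - t0
t0 = time.perf_counter(); check_build_once(); once = time.perf_counter() - t0
print(f"rebuild per file: {rebuild:.3f}s, build once: {once:.3f}s")
```

The per-file rebuild scales with files × words, while the build-once variant pays the construction cost a single time, which matches the benchmark above.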

@Jason3S Jason3S removed the new issue label Jan 30, 2022
@Jason3S Jason3S changed the title Cspell performances on large number of files Cspell performances with large (7k) words list in cspell configuration Jan 30, 2022
@Jason3S
Collaborator

Jason3S commented Jan 30, 2022

@nvuillam,

I was able to speed it up a bit by caching some of the internal word lists. It is still 2x slower than using a custom dictionary.

You can try it out: npx cspell@next.

I'll release 5.18.0 tomorrow.

@nvuillam
Contributor Author

@Jason3S that's great, thanks :)
What technically prevents putting the words into some "virtual dictionary" that would be created at cspell startup?
That would avoid more complex configuration, with the same capabilities ^^

@Jason3S
Collaborator

Jason3S commented Jan 31, 2022

5.18.0 has been published.

What technically prevents putting the words into some "virtual dictionary" that would be created at cspell startup? That would avoid more complex configuration, with the same capabilities ^^

It is possible, but not necessarily desirable.

Every word in a document is checked against all the dictionaries. The size of a dictionary doesn't matter; the lookup cost is based upon the length of the word. Lookups are cached, so looking up the same word again is cheaper.
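That length-based lookup cost is characteristic of a trie, the structure cspell's dictionaries are built on. A minimal sketch (not cspell-trie's actual format):

```python
class Trie:
    """Toy trie: lookup walks at most len(word) nodes, regardless of
    how many words are stored."""

    def __init__(self):
        self.root = {}

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-word marker

    def contains(self, word):
        node = self.root
        for ch in word:
            if ch not in node:
                return False
            node = node[ch]
        return "$" in node


trie = Trie()
for w in ("spell", "spelling", "checker"):
    trie.add(w)
```

Adding thousands more words deepens or widens the tree but leaves the cost of a single lookup bounded by the word's length.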

The configuration acts like a tree. Each configuration is merged, including any cspell directives found in the document, to create the final configuration used to check the document. words, ignoreWords, and flagWords are also merged to create temporary dictionaries.

The idea here is to keep the number of dictionaries low enough for performance.
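The merge described above can be sketched roughly as follows (a simplification of cspell's actual config resolution; the field names come from the comment, the function is mine):

```python
def merge_configs(base, override):
    # scalar settings in the override win; word lists concatenate
    merged = {**base, **override}
    for key in ("words", "ignoreWords", "flagWords"):
        merged[key] = base.get(key, []) + override.get(key, [])
    return merged


# e.g. a root config merged with an in-document cspell directive
root = {"language": "en", "words": ["megalinter"]}
directive = {"words": ["nvuillam"], "flagWords": ["teh"]}
final = merge_configs(root, directive)
```

Each merged word list becomes a temporary dictionary, which is why a large "words" array is rebuilt repeatedly while a file-based custom dictionary is loaded once.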

@Jason3S
Collaborator

Jason3S commented Jan 31, 2022

I'm going to close this for now, since it is now 4-5x faster than 5.17.

@Jason3S Jason3S closed this as completed Jan 31, 2022
@github-actions
Contributor

github-actions bot commented Mar 3, 2022

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 3, 2022