Test: Preprocessing directly with ElasticSearch Analyzers¶
Intro¶
Currently, we are preprocessing the text we ingest into ElasticSearch (ES) with the following function:
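def preprocess_text(text: str) -> str:
    """
    Tokenize, lemmatize, lowercase and remove stop words.
    :param text: raw text to preprocess
    :return: preprocessed text, or the input unchanged if NLP is not loaded
    """
    # NLP is the module-level spaCy pipeline; it is falsy if it could not be loaded
    if not NLP:
        print('Warning: NLP not loaded, text will not be preprocessed')
        return text
    doc = NLP(text)
    text = " ".join(t.lemma_.lower() for t in doc if not t.is_stop).replace("\n", " ")
    return text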
Problem¶
The preprocessing function needs to be run both while indexing the documents and while dealing with queries.
To facilitate the installation of our pipeline, our current solution stays as close to the original Haystack development as possible. This means that in production no other Python code is run: we use the Haystack library as it is in its own master repo. Thus, the function above cannot be run without creating a wrapper that would complicate our production environment. The same is true while querying: this preprocessing would need to run either client-side (hard) or in yet another wrapper around our system that adds this function.
Solution¶
To solve the issue, @Guillim got an insight from the data scientists working on the Code du Travail Numerique project: use the preprocessors (called analyzers) that already exist in the ElasticSearch stack.
In this way, everything would be self-contained within ES.
Methodo¶
Analyzers are defined by passing a JSON object to our running ES instance (a more French-specific discussion can be found here). Once this config is part of the ES instance, we must specify the analyzer to be used (if any) on the relevant property of our custom mapping.
Important: we are now using the icu_tokenizer, so we need to install the analysis-icu ES plugin. This can be done in the ES Dockerfile like so:
RUN bin/elasticsearch-plugin install --silent analysis-icu
The resulting PR containing these modifications is here.
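As a rough illustration of the kind of configuration described above, the sketch below builds the index settings and mapping as a Python dictionary and prints them as JSON. The filter definitions follow the way Elasticsearch documents the building blocks of its built-in `french` analyzer. The analyzer name `french_custom` and the field name `text` are hypothetical, and the actual configuration in the PR may differ.

```python
import json

# Hypothetical names, chosen for illustration only.
ANALYZER_NAME = "french_custom"
TEXT_FIELD = "text"

index_body = {
    "settings": {
        "analysis": {
            "filter": {
                # Strips elided articles such as l', d', qu' from tokens.
                "french_elision": {
                    "type": "elision",
                    "articles_case": True,
                    "articles": ["l", "m", "t", "qu", "n", "s", "j", "d", "c",
                                 "jusqu", "quoiqu", "lorsqu", "puisqu"],
                },
                # Removes French stop words using the built-in list.
                "french_stop": {"type": "stop", "stopwords": "_french_"},
                # Light French stemmer.
                "french_stemmer": {"type": "stemmer", "language": "light_french"},
            },
            "analyzer": {
                ANALYZER_NAME: {
                    "type": "custom",
                    # Requires the analysis-icu plugin to be installed.
                    "tokenizer": "icu_tokenizer",
                    "filter": [
                        "french_elision",
                        "lowercase",
                        "french_stop",
                        "french_stemmer",
                    ],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            # The analyzer is attached to the text property of the mapping.
            TEXT_FIELD: {"type": "text", "analyzer": ANALYZER_NAME}
        }
    },
}

print(json.dumps(index_body, indent=2, ensure_ascii=False))
```

Unless a separate search analyzer is configured, ES applies the analyzer set on a field both when indexing and when querying that field, which is what removes the need for Python-side preprocessing.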
Results¶
Python Pre-Processing¶
The previous simple processing yields the following improvement:
BM25, v12, 406 QA dataset, no filtering, with no preprocessing:
Mean_precision 0.21. Time per ES query (ms): 16.721
BM25, v12, 406 QA dataset, no filtering, with preprocessing:
Mean_precision 0.34. Time per ES query (ms): 19.722
In this case, the improvement is about 13 percentage points (from 0.21 to 0.34). Still, running this function adds about 3 ms to each query (from 16.7 ms to 19.7 ms).
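The comparison can be double-checked with a few lines of arithmetic. The snippet below only uses the four figures reported above; the variable names are ours.

```python
# Figures copied from the two BM25 runs reported above.
precision_raw, precision_pre = 0.21, 0.34
time_raw_ms, time_pre_ms = 16.721, 19.722

# Absolute gain in percentage points, and gain relative to the baseline.
precision_gain_points = (precision_pre - precision_raw) * 100
precision_gain_relative = (precision_pre - precision_raw) / precision_raw * 100

# Extra time per query, in ms and relative to the baseline.
extra_time_ms = time_pre_ms - time_raw_ms
extra_time_relative = extra_time_ms / time_raw_ms * 100

print(f"Precision gain: {precision_gain_points:.0f} points "
      f"({precision_gain_relative:.0f}% relative)")
print(f"Extra time per query: {extra_time_ms:.1f} ms "
      f"({extra_time_relative:.0f}% relative)")
```

In other words, precision goes up by 13 points (about 62% relative to the baseline) while each query takes roughly 3 ms longer (about 18% more).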
With these results in mind, we can compare the performance of the ES preprocessor below.
ES Analyzer¶
We have four analyzer filters:
"filter": [
    "french_elision",
    "lowercase",
    "french_stop",
    "french_stemmer"
]
Let’s do a quick ablation test (check the perf offered by each filter). More info about each of these filters can be found by following the links above.
Using the same settings as before (BM25, v12, 406 QA dataset, no filtering):
Only french_elision¶
Mean_precision 0.24. Time per ES query (ms): 20.607
Only lowercase¶
Mean_precision 0.21. Time per ES query (ms): 19.473
Only french_stop¶
Mean_precision 0.30. Time per ES query (ms): 10.056
Only french_stemmer¶
Mean_precision 0.22. Time per ES query (ms): 23.966
All of them: french_elision, lowercase, french_stop, french_stemmer¶
Mean_precision 0.34. Time per ES query (ms): 11.148
All of them, with the standard tokenizer: french_elision, lowercase, french_stop, french_stemmer¶
Same as before, but using the standard tokenizer instead of the recommended icu_tokenizer.
Mean_precision 0.34. Time per ES query (ms): 9.000
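One way to generate the analyzer variants for an ablation like this one is sketched below. `ablation_analyzers` is a hypothetical helper, not part of Haystack or ES. It only builds the `analyzer` section of the index settings and assumes the French filters are defined elsewhere in those settings.

```python
from typing import Dict, List

FILTERS: List[str] = ["french_elision", "lowercase", "french_stop", "french_stemmer"]


def ablation_analyzers(filters: List[str], tokenizer: str = "icu_tokenizer") -> Dict[str, dict]:
    """Return one custom analyzer per single filter, plus one using all filters."""
    variants = {
        f"only_{name}": {"type": "custom", "tokenizer": tokenizer, "filter": [name]}
        for name in filters
    }
    variants["all_filters"] = {
        "type": "custom",
        "tokenizer": tokenizer,
        "filter": list(filters),
    }
    return variants


# List every variant for both tokenizers compared in this experiment.
for tokenizer in ("icu_tokenizer", "standard"):
    for name, config in ablation_analyzers(FILTERS, tokenizer).items():
        print(tokenizer, name, config["filter"])
```

Each variant would then be registered in the index settings and evaluated with the same dataset and retriever settings, so that only the analyzer changes between runs.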
Conclusion¶
Regarding the four filters, using all of them together gives the best performance (MAP: 0.34). The Python (spaCy-based) preprocessing function gives at most a negligible improvement over the ES analyzer (both reach 0.34 here), so the decision is simple. The advantages of a self-contained ES analyzer are multiple: same performance, faster queries, included in ES, and easier to deploy.
For this experiment, the icu_tokenizer does not seem to offer an advantage over the standard tokenizer. If it is not possible to add the plugin to the Haystack Dockerfile in master, we can probably use the standard tokenizer instead without losing performance.
TODO¶
Test “default” as analyzer name. Check it is taken into account.
Investigate why using the four filters together is faster than using each of them independently.
Merge the proposed PR into master and add the required plugin line to Haystack’s ES Dockerfile.