Unverified Commit deeedd00 by malachaux Committed by GitHub

Update README.md

correct error in readme : test_dataset.py -> test_preprocess.py
parent a8bc1433
@@ -96,7 +96,7 @@ NB : In our case, the training data is too large to fit on a single machine. Thu
To get the monolingual data, first download the cpp / java / python source code from Google BigQuery (https://cloud.google.com/blog/products/gcp/github-on-bigquery-analyze-all-the-open-source-code). To run our preprocessing pipeline, you need to download the raw source code onto your machine in JSON format, and put each programming language in a dedicated folder. A sample is given in data/test_dataset. The pipeline extracts source code from the JSON, tokenizes it, extracts functions, applies BPE, binarizes the data, and creates symlinks with appropriate names to be used directly in XLM. The folder that ends with .XLM-syml is the data path you give for XLM training. You will have to add the test and valid parallel data we provide in "Run an evaluation" to that folder.
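The per-language layout described above can be sketched as follows. This is an illustrative assumption only: the folder names cpp / java / python are hypothetical, and only data/test_dataset is known to ship with the repo.

```shell
# Illustrative sketch: one dedicated folder per programming language,
# each meant to hold the raw JSON dumps downloaded from BigQuery
# (folder names here are hypothetical, not mandated by the pipeline).
mkdir -p data/cpp data/java data/python

# The repo ships a small sample under data/test_dataset for testing.
ls data
```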
- To test the pipeline run ```pytest preprocessing/test_dataset.py```; you will see the pipeline output in the data/test_dataset folder.
+ To test the pipeline run ```pytest preprocessing/test_preprocess.py```; you will see the pipeline output in the data/test_dataset folder.
To run the pipeline (either locally or on a remote machine), here is an example command:
......