Update README.md

Unverified Commit e9c051c9 authored Aug 20, 2020 by malachaux Committed by GitHub Aug 20, 2020

Showing with 35 additions and 0 deletions

README.md
+35 -0

--- a/README.md
+++ b/README.md
@@ -264,6 +264,41 @@ If the script is missing, it means there was an issue with our automatically cre

 The code generated by your model can be tested by injecting it where the `TO_FILL` comment is in the test script.

+## Short guide to downloading GitHub data from Google BigQuery
+
+Here is a short guide:
+
+- Create a Google Cloud Platform account (you will get around $300 of free credit, which is sufficient for the GitHub data).
+- Create a Google BigQuery project here.
+- In this project, create a dataset.
+- In this dataset, create one table per programming language. The results of each SQL query (one per language) will be stored in these tables.
+- Before running your SQL query, change the query settings to save the query results in the dedicated table (More -> Query Settings -> Destination -> table for query results -> enter the table name).
+- Run your SQL query (one per language, and don't forget to change the destination table for each query).
+- Export your results to Google Cloud Storage:
+  - In Google Cloud Storage, create a bucket, and inside it one folder per language.
+  - Export your table to this bucket (EXPORT -> Export to GCS -> export format JSON, compression GZIP).
+- To download the bucket to your machine, use the `gsutil` command-line tool:
+  - `pip install gsutil`
+  - `gsutil config` -> configures gsutil with your Google account
+  - `gsutil -m cp -r gs://name_of_bucket/name_of_folder .` -> copies the folder from your bucket to your machine
+
+Example query for Python:
+```
+SELECT
+  f.repo_name,
+  f.ref,
+  f.path,
+  c.copies,
+  c.content
+FROM `bigquery-public-data.github_repos.files` AS f
+  JOIN `bigquery-public-data.github_repos.contents` AS c ON f.id = c.id
+WHERE
+  NOT c.binary
+  AND f.path LIKE '%.py'
+```
+
+See the Google documentation for more information here
+
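Once the export has been copied locally with `gsutil`, the files still need to be read back. The following is a minimal sketch of one way to do that, not part of the repository's own tooling. It assumes the JSON export is newline-delimited (one JSON object per line, which is how BigQuery writes JSON exports) and gzip-compressed as selected in the export step. The helper names `is_gzip` and `iter_exported_rows` are hypothetical, and the row keys are the column names from the example query.

```python
import gzip
import json
from pathlib import Path
from typing import Dict, Iterator


def is_gzip(path: Path) -> bool:
    """Return True if the file starts with the two gzip magic bytes."""
    with path.open("rb") as handle:
        return handle.read(2) == b"\x1f\x8b"


def iter_exported_rows(export_dir: str) -> Iterator[Dict]:
    """Yield one dict per exported row from every gzip file under export_dir.

    Assumption: each file is gzip-compressed, newline-delimited JSON.
    Files that are not gzip-compressed are skipped.
    """
    for path in sorted(Path(export_dir).rglob("*")):
        if not path.is_file() or not is_gzip(path):
            continue
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:  # ignore blank lines
                    yield json.loads(line)
```

For example, `for row in iter_exported_rows("name_of_folder"): print(row["repo_name"], row["path"])` would list the repository and path of every exported file, with the source text available in `row["content"]`. Reading rows lazily avoids loading a whole language export into memory at once.
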
 ## References
 This Code was used to train and evaluate the TransCoder model in: