Machine Learning Topics
- 1: CUDA
- 2: Dense Passage Retrieval (DPR)
- 3: Dimensionality Reduction
- 4: Experiment Documentation
- 5: German Electra Training
- 6: Graph Database
- 7: Graph Neural Network
- 8: Hugging Face - Datasets
- 9: Hugging Face - Transformers
- 10: LightGBM
- 11: Machine Learning at AWS
- 12: NLP Datasets
- 13: Optuna
- 14: Paraphrase Mining
- 15: Seldon
- 16: T5 and MT5 Models
1 - CUDA
- disable GPU usage:
export CUDA_VISIBLE_DEVICES=""
- enable only GPU 0:
export CUDA_VISIBLE_DEVICES="0"
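Whether the setting took effect can be verified from Python. A minimal sketch, assuming PyTorch is the framework in use (the variable must be set before the first CUDA call):
import os

# must be set before the first CUDA call - safest is before importing torch
os.environ["CUDA_VISIBLE_DEVICES"] = ""  # or "0" to expose only GPU 0

import torch

print(torch.cuda.is_available())   # False when no GPU is visible
print(torch.cuda.device_count())   # number of visible GPUs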
2 - Dense Passage Retrieval (DPR)
Links
- Training
- SBERT:
- Beir
- Facebook-Research: https://github.com/facebookresearch/DPR
- Metrics
- BM25
- Videos
- [Dense Retrieval - Knowledge Distillation (Sebastian Hofstätter)](https://www.youtube.com/watch?v=EJ_7Gx6amt8)
- [Crash Course IR - Evaluation](https://www.youtube.com/watch?v=EiDltQZ713I)
3 - Dimensionality Reduction
Linear Projection
- principal component analysis (PCA)
- singular value decomposition
- random projection
Manifold Learning
This is also known as nonlinear dimensionality reduction.
- isomap
- multidimensional scaling (MDS) - good for visualization
- locally linear embedding (LLE)
- t-distributed stochastic neighbor embedding (t-SNE)
- sklearn
- good for visualization
- dictionary learning
- random trees embedding
- independent component analysis
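A small, hedged scikit-learn sketch contrasting a linear projection (PCA) with a manifold method (t-SNE); the digits dataset is only an example:
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

x, y = load_digits(return_X_y=True)  # shape (1797, 64)

# linear projection: keep the first two principal components
x_pca = PCA(n_components=2).fit_transform(x)

# manifold learning: t-SNE embedding, mainly useful for visualization
x_tsne = TSNE(n_components=2, init="pca", random_state=42).fit_transform(x)

print(x_pca.shape, x_tsne.shape)  # (1797, 2) (1797, 2)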
4 - Experiment Documentation
You should document what you did and why. The purpose is that you can repeat everything later (same data, same code, same modules) and lose no insights. Experience shows that it is tempting to just do a quick experiment, but two days later you will have forgotten all your valuable insights.
This is a documentation template:
Identifier: <unique name or id to be able to reference the experiment>
Date / Period: <can also be a range for long running experiments>
Goal: <What is the goal of this experiment? What is the hypothesis? Why do we do this experiment? What do you want to show?>
Data Source: <The data you used. Source, link, filename, histogram, GIT commit id, ...>
Base Model / Model Architecture: <The base model in case of transfer learning, architecture of the model, ...>
Data Preprocessing: <Preprocessing steps, filename, histogram, GIT commit id, ...>
Code: <Which script / notebook is used? GIT commit id>
Output: <Where do you store the output, filename, GIT commit id>
Hardware: <Which and how many GPUs / CPUs / RAM - everything that could influence the results>
Modules: <which pip modules with version - GIT commit id of the requirements.txt>
Termination criterion: <how many trials do you want to train, how long do you want to try>
Results: <only the numbers>
Insights: <The interpretation of the results. What did we learn? Did we reach our goal?>
Next Steps: <Follow up experiments, ideas>
Open Questions / Ideas: <Open and unanswered questions, ideas, notes, things for later>
5 - German Electra Training
Steps for training
Step 1 - generate Vocab
import os
from pathlib import Path

from tokenizers import BertWordPieceTokenizer

save_dir = "./<vocab_save_dir>"
paths = [str(x) for x in Path("<text_corpus_dir>").glob("*.txt")]
print('text corpus files:', paths)

vocab_size = 32_767  # 2^15 - 1
min_frequency = 2

os.makedirs(save_dir, exist_ok=True)

special_tokens = [
    "[PAD]",
    "[UNK]",
    "[CLS]",
    "[SEP]",
    "[MASK]",
]
for i in range(767 - 5):
    special_tokens.append('[unused{}]'.format(i))

tokenizer = BertWordPieceTokenizer(strip_accents=False)
tokenizer.train(
    files=paths,
    vocab_size=vocab_size,
    min_frequency=min_frequency,
    special_tokens=special_tokens,
)
tokenizer.save_model(save_dir)
tokenizer.save(save_dir + "/tokenizer.json")
Step 2 - get modified code to disable strip_accents
Use the modified fork https://github.com/PhilipMay/electra/tree/no_strip_accents to disable strip_accents. Also see this PR: https://github.com/google-research/electra/pull/88
git clone -b no_strip_accents https://github.com/PhilipMay/electra.git
cd electra
Step 3 - create TF datasets
python3 build_pretraining_dataset.py --corpus-dir ~/data/nlp/corpus/ready/ --vocab-file ~/dev/git/german-transformer-training/src/vocab_no_strip_accents/vocab.txt --output-dir ./tf_data --max-seq-length 512 --num-processes 8 --do-lower-case --no-strip-accents
Step 4 - training
This needs to be done on a machine with strong GPUs or, even better, on a TPU.
python3 run_pretraining.py --data-dir gs://<dir> --model-name 02_Electra_Checkpoints_32k_766k_Combined --hparams '{"pretrain_tfrecords": "gs://<dir>/*" , "model_size": "base", "vocab_file": "gs://<dir>/german_electra_uncased_no_strip_accents_vocab.txt", "num_train_steps": 766000, "max_seq_length": 512, "learning_rate": 2e-4, "embedding_size" : 768, "generator_hidden_size": 0.33333, "vocab_size": 32767, "keep_checkpoint_max": 0, "save_checkpoints_steps": 5000, "train_batch_size": 256, "use_tpu": true, "num_tpu_cores": 8, "tpu_name": "electrav5"}'
Config:
debug False
disallow_correct False
disc_weight 50.0
do_eval False
do_lower_case True
do_train True
electra_objective True
embedding_size 768
eval_batch_size 128
gcp_project None
gen_weight 1.0
generator_hidden_size 0.33333
generator_layers 1.0
iterations_per_loop 200
keep_checkpoint_max 0
learning_rate 0.0002
lr_decay_power 1.0
mask_prob 0.15
max_predictions_per_seq 79
max_seq_length 512
model_dir gs://<dir>
model_hparam_overrides {}
model_name 02_Electra_Checkpoints_32k_766k_Combined
model_size base
num_eval_steps 100
num_tpu_cores 8
num_train_steps 766000
num_warmup_steps 10000
pretrain_tfrecords gs://<dir>/*
results_pkl gs://<dir>/results/unsup_results.pkl
results_txt gs://<dir>/results/unsup_results.txt
save_checkpoints_steps 5000
temperature 1.0
tpu_job_name None
tpu_name electrav5
tpu_zone None
train_batch_size 256
uniform_generator False
untied_generator True
untied_generator_embeddings False
use_tpu True
vocab_file gs://<dir>/german_electra_uncased_no_strip_accents_vocab.txt
vocab_size 32767
weight_decay_rate 0.01
Convert Model to PyTorch
python /home/phmay/miniconda3/envs/farm-git/lib/python3.7/site-packages/transformers/convert_electra_original_tf_checkpoint_to_pytorch.py --tf_checkpoint_path ./model.ckpt-40000 --config_file ./config.json --pytorch_dump_path ./pytorch_model.bin --discriminator_or_generator='discriminator'
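After conversion, a quick sanity check could load the model with transformers. This is only a sketch - the placeholder directory, the discriminator head (ElectraForPreTraining) and the .logits attribute of recent transformers versions are assumptions:
from transformers import ElectraForPreTraining, ElectraTokenizerFast

# <model_dir> is a placeholder for a directory that contains config.json,
# pytorch_model.bin and the vocab.txt from step 1
model = ElectraForPreTraining.from_pretrained("<model_dir>")
tokenizer = ElectraTokenizerFast.from_pretrained("<model_dir>", strip_accents=False)

inputs = tokenizer("Das ist ein Test.", return_tensors="pt")
logits = model(**inputs).logits  # per-token replaced-token-detection logits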
More info on conversion:
7 - Graph Neural Network
Links
Know-how
Video
- Stanford CS224W: Machine Learning with Graphs (Jure Leskovec, playlist)
- Stanford Graph Learning Workshop 2022 (7:57 h) - individual videos with slides
- Petar Veličković
- Graph Neural Networks
- Hussain Kara Fallah
- Machine Learning TV - not only but also GNN topics
- Hannes Stärk
- Antonio Longa - PyG tutorials
Tools & Libraries
- PyG (PyTorch Geometric) - GitHub - see the minimal sketch after this list
- currentness: well maintained
- technology: PyTorch
- GraphGym - uses PyG internally
- DGL (Deep Graph Library) - GitHub
- currentness: well maintained
- technology: PyTorch, TensorFlow or Apache MXNet
- Spektral - GitHub
- currentness: maintained
- technology: Keras & TensorFlow 2
- Graph Nets - GitHub
- currentness: outdated, last commit Dec. 2020
- technology: TensorFlow
- arXiv paper
- Jraph - GitHub
- currentness: well maintained
- technology: JAX
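As referenced above, a minimal PyG sketch to show the basic Data / GCNConv workflow; the toy graph and layer sizes are made up:
import torch
import torch.nn.functional as F
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv

# toy graph: 3 nodes with one feature each, 2 undirected edges (stored in both directions)
edge_index = torch.tensor([[0, 1, 1, 2],
                           [1, 0, 2, 1]], dtype=torch.long)
x = torch.tensor([[1.0], [2.0], [3.0]])
data = Data(x=x, edge_index=edge_index)

class GCN(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels):
        super().__init__()
        self.conv1 = GCNConv(in_channels, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, out_channels)

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        return self.conv2(x, edge_index)

model = GCN(in_channels=1, hidden_channels=16, out_channels=2)
out = model(data.x, data.edge_index)  # per-node logits, shape [3, 2]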
Datasets
- Open Graph Benchmark (OGB)
- PyG (PyTorch Geometric) Datasets - Dataset Cheatsheet
- TUDatasets
- Benchmarking Graph Neural Networks - GitHub
Recommender Systems at Spotify
- Speaker: Andreas Damianou
- Direct Link: Stanford Graph Learning Workshop 2022
Papers mentioned
- mentioned at 2:43:40
- mentioned at 2:57:47
- Ranking systems for RecSys
- GNN architectures
- Graph-based approaches for RecSys
- A Survey on Knowledge Graph-Based Recommender Systems (2020)
- Graph Learning based Recommender Systems: A Review (2021)
- Graph Convolutional Neural Networks for Web-Scale Recommender Systems (2018)
- Graph Convolutional Matrix Completion (2017)
- Graph Neural Networks for Social Recommendation (2019)
- Trajectory Based Podcast Recommendation (2020)
- RecWalk: Nearly Uncoupled Random Walks for Top-N Recommendation (2019)
- LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation (2020)
- KDD 2022 tutorial by El-Kishky et al. - slides
8 - Hugging Face - Datasets
Links
- Doc: https://huggingface.co/docs/datasets/
- GitHub: https://github.com/huggingface/datasets
- Writing a dataset loading script
Dataset Class
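A minimal, hedged usage sketch of the Dataset class (the GLUE/MRPC dataset is only an example):
from datasets import load_dataset

# loads a dataset from the Hugging Face Hub and returns a DatasetDict with its splits
dataset = load_dataset("glue", "mrpc")

print(dataset["train"][0])        # a single example as a dict
print(dataset["train"].features)  # the schema of the dataset
print(len(dataset["train"]))      # number of training examples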
9 - Hugging Face - Transformers
Links
- Transformers Doc - Examples
- Transformers GitHub
- Messageboard
- HF @ Medium: https://medium.com/huggingface
- Model sharing and uploading: https://huggingface.co/transformers/model_sharing.html
Model
transformers.AutoConfig
- doc - impl
- Save a pretrained model: https://huggingface.co/transformers/main_classes/model.html?highlight=save_pretrained#transformers.PreTrainedModel.save_pretrained
Training
transformers.TrainingArguments
- doc
transformers.Trainer
- doc - impl
- Metrics (scroll down): https://huggingface.co/transformers/training.html#trainer
- Learning Rate Schedules: https://huggingface.co/transformers/main_classes/optimizer_schedules.html?highlight=warm%20restart#learning-rate-schedules-pytorch
- Freezing the encoder: https://huggingface.co/transformers/training.html?highlight=freezing#freezing-the-encoder
- Early Stopping
- Custom Loss Function to add class weights for unbalanced datasets: subclass Trainer and override the compute_loss method (see example here and the sketch below).
- Multi Class (multi head) classification: https://discuss.huggingface.co/t/how-do-i-do-multi-class-multi-head-classification/1140
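A hedged sketch of the custom-loss approach mentioned above. The class name and weight values are made up, and the exact compute_loss signature can differ slightly between transformers versions:
import torch
from torch import nn
from transformers import Trainer

class WeightedLossTrainer(Trainer):
    """Trainer subclass that applies class weights to the cross-entropy loss
    (useful for unbalanced classification datasets)."""

    def __init__(self, *args, class_weights=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.class_weights = class_weights  # e.g. torch.tensor([1.0, 3.0])

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        weight = self.class_weights.to(logits.device) if self.class_weights is not None else None
        loss_fct = nn.CrossEntropyLoss(weight=weight)
        loss = loss_fct(logits.view(-1, logits.size(-1)), labels.view(-1))
        return (loss, outputs) if return_outputs else loss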
Tokenizer
tokenizers.BertWordPieceTokenizer
- for vocab generation - impl
tokenizers.normalizers.BertNormalizer
- impl
transformers.tokenization_bert.BertTokenizer
- tokenizer for normal usage - impl
transformers.tokenization_bert.BertTokenizerFast
- fast tokenizer for normal usage - impl
transformers.AutoTokenizer
- doc - impl
PreTrainedTokenizerBase.__call__
- doc
Pipelines
- GitHub: https://github.com/huggingface/transformers/blob/master/src/transformers/pipelines.py
TextClassificationPipeline -> Pipeline -> _ScikitCompat
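A minimal usage sketch of the pipeline API (the default model is auto-selected for the task; the output in the comment is only an example):
from transformers import pipeline

# a TextClassificationPipeline is created under the hood for this task
classifier = pipeline("text-classification")
print(classifier("This library is really easy to use."))
# e.g. [{'label': 'POSITIVE', 'score': 0.999...}]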
Data Handling
- https://github.com/huggingface/nlp
- Usage Example: https://colab.research.google.com/drive/1-JIJlao4dI-Ilww_NnTc0rxtp-ymgDgM
- Fine-tuning with custom datasets: https://huggingface.co/transformers/master/custom_datasets.html#fine-tuning-with-custom-datasets
Important Torch Classes
Links & Know-how
10 - LightGBM
Advantages
LightGBM loads the best model after early stopping (in contrast to XGBoost). See here: https://lightgbm.readthedocs.io/en/latest/Python-Intro.html#early-stopping
Hyperparameter
Useful hyperopt search space:
from hyperopt import hp

space = {
    'num_leaves': hp.quniform('num_leaves', 100, 600, 10),
    'min_data_in_leaf': hp.quniform('min_data_in_leaf', 10, 30, 1),
    'max_bin': hp.quniform('max_bin', 200, 2000, 10),
    'bagging_fraction': hp.uniform('bagging_fraction', 0.01, 1.0),
    'bagging_freq': hp.quniform('bagging_freq', 0, 10, 1),
    'feature_fraction': hp.uniform('feature_fraction', 0.5, 1.0),
    'lambda_l2': hp.uniform('lambda_l2', 0.0, 70.0),
    'min_gain_to_split': hp.uniform('min_gain_to_split', 0.0, 2.0),
}
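One way to use this space is an objective function like the following sketch. It assumes x_train / y_train / x_val / y_val already exist, picks a binary objective with AUC as the metric just as an example, and uses the early-stopping callback of recent LightGBM versions:
import lightgbm as lgb
from hyperopt import Trials, fmin, tpe

# x_train, y_train, x_val, y_val are assumed to exist

def objective(hps):
    # hp.quniform returns floats, so integer parameters must be cast
    params = {
        'objective': 'binary',
        'metric': 'auc',
        'num_leaves': int(hps['num_leaves']),
        'min_data_in_leaf': int(hps['min_data_in_leaf']),
        'bagging_fraction': hps['bagging_fraction'],
        'bagging_freq': int(hps['bagging_freq']),
        'feature_fraction': hps['feature_fraction'],
        'lambda_l2': hps['lambda_l2'],
        'min_gain_to_split': hps['min_gain_to_split'],
    }
    # max_bin is a dataset parameter, so the datasets are rebuilt per trial
    train_set = lgb.Dataset(x_train, label=y_train, params={'max_bin': int(hps['max_bin'])})
    val_set = lgb.Dataset(x_val, label=y_val, reference=train_set)
    booster = lgb.train(
        params,
        train_set,
        num_boost_round=1000,
        valid_sets=[val_set],
        callbacks=[lgb.early_stopping(stopping_rounds=50)],
    )
    # hyperopt minimizes, so return the negative best validation AUC
    return -booster.best_score['valid_0']['auc']

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=100, trials=Trials())
print(best)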
Links
- Root Documentation:
https://lightgbm.readthedocs.io/en/latest/index.html
- Python API: https://lightgbm.readthedocs.io/en/latest/Python-API.html
- Parameters: https://lightgbm.readthedocs.io/en/latest/Parameters.html
- Parameters Tuning: https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html
- Early Stopping: https://lightgbm.readthedocs.io/en/latest/Python-Intro.html#early-stopping
- Load and save the booster: https://lightgbm.readthedocs.io/en/latest/Python-Intro.html#training
- Regression Example: https://github.com/Microsoft/LightGBM/tree/master/examples/regression
- Binary Classification Example: https://github.com/Microsoft/LightGBM/tree/master/examples/binary_classification
- Multiclass Classification Example: https://github.com/Microsoft/LightGBM/tree/master/examples/multiclass_classification
- GPU Tutorial: https://lightgbm.readthedocs.io/en/latest/GPU-Tutorial.html
- Parameter overview with filter: https://sites.google.com/view/lauraepp/parameters
GPU Support
- Build with GPU Support:
python setup.py install --gpu --opencl-include-dir=/usr/local/cuda/include/ --opencl-library=/usr/local/cuda/lib64/libOpenCL.so
- Install with GPU Support:
pip install lightgbm --install-option=--gpu --install-option=--opencl-include-dir=/usr/local/cuda/include/ --install-option=--opencl-library=/usr/local/cuda/lib64/libOpenCL.so
To use the GPU you have to set 'device_type': 'gpu' in the parameters.
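A minimal parameter sketch (the optional platform / device ids are only needed when several OpenCL devices are present):
params = {
    'objective': 'binary',
    'device_type': 'gpu',
    # optionally select a specific device:
    # 'gpu_platform_id': 0,
    # 'gpu_device_id': 0,
}
booster = lgb.train(params, train_set, num_boost_round=100)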
11 - Machine Learning at AWS
AWS Service Overview
General Purpose Machine Learning Services
- Amazon Elastic Inference - attach GPU acceleration to Amazon instances: https://aws.amazon.com/machine-learning/elastic-inference/
- Amazon SageMaker - general purpose machine learning: https://aws.amazon.com/sagemaker/
- AWS Deep Learning AMI (DLAMI) - Linux with Deep Learning Hardware (AMI = Amazon Machine Image)
Text and Speech Services
- Amazon Polly - text to speech service (blackbox): https://aws.amazon.com/polly/
- Amazon Comprehend - natural language processing (NLP) service (blackbox): https://aws.amazon.com/comprehend/
- Amazon Lex - Chat like Alexa (speech recognition and natural language understanding): https://aws.amazon.com/lex/
- Amazon Textract - extract text and data from scanned documents (blackbox): https://aws.amazon.com/textract/
- Amazon Transcribe - speech to text (blackbox): https://aws.amazon.com/transcribe/
- Amazon Translate - neural machine translation service (blackbox): https://aws.amazon.com/translate/
Image and Video Services
- Amazon Rekognition - image and video analysis (blackbox): https://aws.amazon.com/rekognition/
Toy Services
- AWS DeepLens - deep learning video camera (toy): https://aws.amazon.com/deeplens/
- AWS DeepRacer - autonomous deep learning 1/18th scale car (toy): https://aws.amazon.com/deepracer/
Other Services
- Amazon Forecast - time-series forecasting service (blackbox): https://aws.amazon.com/forecast/
- Amazon Personalize - personalization and recommendation (blackbox): https://aws.amazon.com/personalize/
- Amazon SageMaker Ground Truth - manual labeling tools: https://aws.amazon.com/sagemaker/groundtruth/
- Amazon SageMaker Neo - compile models for speed and low memory footprint: https://aws.amazon.com/sagemaker/neo/
SageMaker Links
- Amazon SageMaker 10-Minute Tutorial
- Amazon SageMaker Technical deep-dive - learn fundamentals and deep dive into Amazon SageMaker algorithms
- SageMaker - Developer Guide - Get Started
Python API
Services
- P3 (V100 GPU): https://aws.amazon.com/de/ec2/instance-types/p3/
- P2 (K80 GPU): https://aws.amazon.com/de/ec2/instance-types/p2/
- G3 (M60 GPU): https://aws.amazon.com/de/ec2/instance-types/g3/
Performance Experiment with EC2 and LightGBM
Given:
- LightGBM Classification
- x_train.shape (13788428, 30)
- x_val.shape (2433253, 30)
- y_train.shape (13788428,)
- y_val.shape (2433253,)
- num_classes: 401
- num_boost_round: 4
Results:
- c5.24xlarge, $4.656/h, 96 vCPUs, 2nd Gen Intel Xeon Platinum 8275CL, 3.6 GHz, 192 GiB RAM, time: 512 s
- m5.8xlarge, $1.84/h, 32 vCPUs, Intel Xeon Platinum 8175, 3.1 GHz, 128 GiB RAM, time: 574 s
- r5.4xlarge, $1.216/h, 16 vCPUs, Intel Xeon Platinum 8175, 3.1 GHz, 128 GiB RAM, time: 763 s
- other server, 4 cores (8 threads), Xeon W-2123, 3.6 GHz, 128 GiB RAM, time: 856 s
Connect
12 - NLP Datasets
Links
- XGLUE: https://microsoft.github.io/XGLUE/
- License: non-commercial research purposes only
- XNLI: https://github.com/facebookresearch/XNLI
- XNLI 1.0
- multi lang
- 2490 dev per lang., 5010 test per lang.
- XNLI-15way: 10,000 multilingual parallel sentences without label
- License: non-commercial
- Stanford Natural Language Inference (SNLI) Corpus: https://nlp.stanford.edu/projects/snli/
- English language
- 3 classes: neutral, contradiction, entailment
- The Multi-Genre NLI Corpus: https://cims.nyu.edu/~sbowman/multinli/
- English language
- 3 classes: neutral, contradiction, entailment
- PAWS: https://github.com/google-research-datasets/paws
- PAWS-X: https://github.com/google-research-datasets/paws/tree/master/pawsx
- lang: de, en, es, fr, ja, ko, zh
- 29401 train, 2000 dev, 2000 test - size can be slightly different between languages
- sentence1, sentence2, label
- dev & test sets overlap!
- label: binary (paraphrase or not paraphrase)
- STS benchmark
- Original dataset: https://ixa2.si.ehu.eus/stswiki/index.php/STSbenchmark
- STSb Multi MT: https://github.com/PhilipMay/stsb-multi-mt
- Paper: https://arxiv.org/abs/1708.00055
- CORD19STS
German Sentiment
- Model: https://huggingface.co/oliverguhr/german-sentiment-bert
- Paper: http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.202.pdf
- GitHub: https://github.com/oliverguhr/german-sentiment
- Data: https://github.com/oliverguhr/german-sentiment/issues/3#issuecomment-700942718
- More Data: https://github.com/WladimirSidorenko/CGSA
More
13 - Optuna
Links
- Add additional hyperparameter to existing study (workaround)
Transfer Trials to other DB
import optuna

local_study = optuna.load_study(study_name="foo", storage="SQLITE URL")
remote_study = optuna.create_study(study_name="foo", storage="REMOTE SQL DB URL")
for trial in local_study.trials:
    remote_study.add_trial(trial)
Save Plot to Disk
fig = optuna.visualization.plot_slice(study)
fig.write_image("<filename>.png")  # save as image (requires the kaleido package)
fig.write_html("<filename>.html")  # save as HTML
Extend HP-space
It might be that the HP space was chosen too narrow and we want to extend it during a study to continue the HP search with a larger HP space. This is a bit cumbersome in Optuna.
This is known to the Optuna maintainers but needs someone to write a pull request. Details here: https://github.com/optuna/optuna/issues/4037
A workaround is as follows: let's say we had a hyperparameter space with num_epochs from 2 to 4. Now we did some trials for this study but want to extend num_epochs to the range 2 to 7. This is done like this:
import optuna

# load old study
study_name = "my_study_01"
storage = 'sqlite:///optuna.db'
study = optuna.create_study(
    study_name=study_name,
    storage=storage,
    load_if_exists=True,
    direction='maximize',
)

# iterate trials from old study and modify (extend) the distributions
new_trials = []
for trial in study.trials:
    # this is only valid for int distributions
    # if it is a float distribution this needs to be changed
    trial.distributions["num_epochs"] = optuna.distributions.IntDistribution(2, 7)
    new_trials.append(trial)

# delete old study from memory
# for some reason I always do this because there was some trouble in the past
# but I do not remember why - maybe because there is a DB connection behind it
del study

# create new study
study_name_new = "my_study_02"
new_study = optuna.create_study(
    study_name=study_name_new,
    storage=storage,
    load_if_exists=True,
    direction='maximize',
)

# add old trials to the new study
for trial in new_trials:
    new_study.add_trial(trial)

# now also modify your training code
# this would be the following change for example
# change from
# num_epochs = trial.suggest_int("num_epochs", 2, 4)
# to
# num_epochs = trial.suggest_int("num_epochs", 2, 7)
# now you can continue training on "study_name_new" with the old knowledge but extended HP space
More Topics / To-do
- choose HP space
- when to use category
- when to use step
- how many?
- monitor HP space / slice plot
- monitor training
- only train for x non-failed iterations
- xval vs. not xval
14 - Paraphrase Mining
Data
- German Wikipedia
- https://dumps.wikimedia.org/dewiki/
- currently:
dewiki-20220101-pages-articles.xml.bz2
- Quora for duplicate questions
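A hedged sketch of mining paraphrases from such data with sentence-transformers; the model name and example sentences are assumptions:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

sentences = [
    "Der Hund spielt im Garten.",
    "Im Garten spielt ein Hund.",
    "Heute scheint die Sonne.",
]

# returns a list of [score, index_1, index_2] triples, sorted by score
pairs = util.paraphrase_mining(model, sentences, top_k=10)
for score, i, j in pairs:
    print(f"{score:.3f}  {sentences[i]}  <->  {sentences[j]}")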
Paper and Links
15 - Seldon
Links
- Seldon https://www.seldon.io/
- GitHub https://github.com/SeldonIO/seldon-core
- Doc https://docs.seldon.io/projects/seldon-core/en/latest/
- Installation
Install
- Use k8s 1.17.0: minikube start -p aged --kubernetes-version=v1.17.0
- the current version line 1.18.x does not work (at the moment)
- when you want to start k8s again you have to provide this command again - with the version number
- maybe delete all k8s before: minikube delete --all
- maybe with --purge
- follow this guide to install istio: https://github.com/SeldonIO/seldon-core/tree/master/examples/auth
- let's use version 1.5 like the guide says:
curl -L https://istio.io/downloadIstio | ISTIO_VERSION=1.5.0 TARGET_ARCH=x86_64 sh -
- add export PATH="$PATH:/home/mike/bin/istio-1.5.0/bin" to .bashrc
istioctl manifest apply --set profile=demo
- install the rest
- checkout seldon-core from git: https://github.com/SeldonIO/seldon-core
- change to directory
examples/auth
- follow the rest of the guide: https://github.com/SeldonIO/seldon-core/blob/master/examples/auth/README.md
- Uninstall:
helm uninstall seldon-core --namespace seldon-system
Using and Testing with Python
- Usage with Python:
https://docs.seldon.io/projects/seldon-core/en/v1.1.0/python/index.html
- Seldon Core Python Package: https://docs.seldon.io/projects/seldon-core/en/v1.1.0/python/python_module.html
- Seldon Python Component: https://docs.seldon.io/projects/seldon-core/en/v1.1.0/python/python_component.html
- Seldon Python Client: https://docs.seldon.io/projects/seldon-core/en/v1.1.0/python/seldon_client.html
- Python API: https://docs.seldon.io/projects/seldon-core/en/v1.1.0/python/api/modules.html
- Class name and script name have to be exactly the same!
- Testing - without Kubernetes - just locally
- start API to test
export PREDICTIVE_UNIT_PARAMETERS='[{"name":"model_uri","value":"<uri>","type":"STRING"}]'
- example:
export PREDICTIVE_UNIT_PARAMETERS='[{"name":"model_uri","value":"file:///home/myuser/models/german_sentiment_electra","type":"STRING"}]'
- get model from cloud storage:
model_file = os.path.join(seldon_core.Storage.download(self.model_uri))
seldon-core-microservice <script_name_without_.py> REST
curl -X POST -H 'Content-Type: application/json' -d '{"data": { "ndarray": ["data1", "data2"]}}' http://127.0.0.1:5000/api/v1.0/predictions
- see here: https://docs.seldon.io/projects/seldon-core/en/v0.3.0/workflow/api-testing.html
- "Alternatively, if your component is a Python module you can run it directly from python using the core tool seldon-core-microservice [...]"
- Testing Your Model Endpoints: https://docs.seldon.io/projects/seldon-core/en/v1.1.0/workflow/serving.html
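For the local testing steps above, a minimal hedged sketch of such a Python component. The file and the class are both assumed to be named MyModel (they have to match, see the note above), and the echo prediction is a placeholder:
# MyModel.py - the file name and the class name must be identical
import os

import seldon_core

class MyModel:
    def __init__(self, model_uri: str = None):
        self.model_uri = model_uri
        if self.model_uri is not None:
            # fetch the model files from cloud storage (same call as in the note above)
            model_file = os.path.join(seldon_core.Storage.download(self.model_uri))
            # ... load the actual model from model_file here ...

    def predict(self, X, features_names=None):
        # X holds the "ndarray" payload of the request
        return X  # replace this echo with real inference
It can then be started locally with seldon-core-microservice MyModel REST and queried with the curl command shown above.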
16 - T5 and MT5 Models
Links
- T5
- MT5