1 - Dimensionality Reduction

Linear Projection

  • principal component analysis (PCA)
  • singular value decomposition
  • random projection
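
The first two entries are closely related: PCA can be computed from the SVD of the mean-centered data matrix. A minimal sketch, assuming NumPy is available (these notes do not name a library); the function name `pca_via_svd` and the random demo data are illustrative only:

```python
import numpy as np


def pca_via_svd(x, n_components):
    """Project the rows of x onto the top principal components."""
    # PCA works on mean-centered data.
    x_centered = x - x.mean(axis=0)
    # Economy SVD: x_centered = u @ diag(s) @ vt
    u, s, vt = np.linalg.svd(x_centered, full_matrices=False)
    # Rows of vt are the principal directions, sorted by singular value.
    components = vt[:n_components]
    projected = x_centered @ components.T
    # Variance captured along each kept direction.
    explained_variance = (s[:n_components] ** 2) / (x.shape[0] - 1)
    return projected, components, explained_variance


rng = np.random.default_rng(0)
x = rng.normal(size=(100, 5))  # illustrative random data
projected, components, explained_variance = pca_via_svd(x, 2)
print(projected.shape)
```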

Manifold Learning

This is also known as nonlinear dimensionality reduction.

  • isomap
  • multidimensional scaling (MDS) - good for visualization
  • locally linear embedding (LLE)
  • t-distributed stochastic neighbor embedding (t-SNE)
  • dictionary learning
  • random trees embedding
  • independent component analysis

2 - Experiment Documentation

You should document what you did and why. The purpose is to be able to repeat everything later (same data, same code, same modules) and to lose no insights. Experience shows that it is tempting to run just a quick experiment, but two days later you have forgotten all your valuable insights.

This is a documentation template:

Identifier: <unique name or id to be able to reference the experiment>
Date / Period: <can also be a range for long running experiments>
Goal: <What is the goal of this experiment? What is the hypothesis? Why do we do this experiment? What do you want to show?>
Data Source: <The data you used. Source, link, filename, histogram, GIT commit id, ...>
Base Model / Model Architecture: <The base model in case of transfer learning, architecture of the model, ...>
Data Preprocessing: <Preprocessing steps, filename, histogram, GIT commit id, ...>
Code: <Which script / notebook is used? GIT commit id>
Output: <Where do you store the output, filename, GIT commit id>
Hardware: <Which and how many GPUs / CPUs / RAM - everything that could influence the results>
Modules: <which pip modules with version - GIT commit id of the requirements.txt>
Termination criterion: <how many trials do you want to train, how long do you want to try>
Results: <only the numbers>
Insights: <The interpretation of the results. What did we learn? Did we reach our goal?>
Next Steps: <Follow up experiments, ideas>
Open Questions / Ideas: <Open and unanswered questions, ideas, notes, things for later>
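
If the template is filled in from a script, a small helper can make sure no field is silently dropped. This is only a sketch: the function name `render_experiment` and the example values are hypothetical, and the field labels are copied from the template above.

```python
# Field labels copied from the template, in the same order.
TEMPLATE_FIELDS = [
    "Identifier", "Date / Period", "Goal", "Data Source",
    "Base Model / Model Architecture", "Data Preprocessing", "Code",
    "Output", "Hardware", "Modules", "Termination criterion", "Results",
    "Insights", "Next Steps", "Open Questions / Ideas",
]


def render_experiment(values):
    """Render a filled-in template; missing fields are marked as TODO."""
    unknown = set(values) - set(TEMPLATE_FIELDS)
    if unknown:
        # Catch typos in field names early instead of dropping the entry.
        raise KeyError("unknown template fields: {}".format(sorted(unknown)))
    return "\n".join(
        "{}: {}".format(name, values.get(name, "TODO"))
        for name in TEMPLATE_FIELDS
    )


# Hypothetical example values.
record = render_experiment({
    "Identifier": "exp-001",
    "Goal": "Check whether a larger vocab improves downstream accuracy",
})
print(record)
```

Fields that were not provided show up as TODO, so an incomplete record is visible at a glance.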

3 - German Electra Training

Steps for training

Step 1 - generate Vocab

import os
from tokenizers import BertWordPieceTokenizer
from pathlib import Path

save_dir = "./<vocab_save_dir>"
paths = [str(x) for x in Path("<text_corpus_dir>").glob("*.txt")]
print('text corpus files:', paths)
vocab_size = 32_767  # 2^15-1
min_frequency = 2

os.makedirs(save_dir, exist_ok=True)

special_tokens = [
    "[PAD]",
    "[UNK]",
    "[CLS]",
    "[SEP]",
    "[MASK]"]

for i in range(767-5):
    special_tokens.append('[unused{}]'.format(i))

tokenizer = BertWordPieceTokenizer(strip_accents=False)
tokenizer.train(files=paths,
                vocab_size=vocab_size,
                min_frequency=min_frequency,
                special_tokens=special_tokens,
                )

tokenizer.save_model(save_dir)
tokenizer.save(save_dir + "/tokenizer.json")
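
The numbers in the script fit together as follows: 5 named special tokens plus 767-5 = 762 unused placeholders give 767 special tokens. Assuming the trainer counts special tokens toward vocab_size, this leaves at most 32,000 entries for the learned alphabet and word pieces. A quick check of that arithmetic:

```python
vocab_size = 32_767  # 2^15 - 1, as in the script above

named_tokens = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
unused_tokens = ["[unused{}]".format(i) for i in range(767 - 5)]
special_tokens = named_tokens + unused_tokens

# Upper bound for vocabulary entries that are not special tokens
# (assumption: special tokens count toward vocab_size).
learned_budget = vocab_size - len(special_tokens)

print(len(special_tokens), learned_budget)
```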

Step 2 - get modified code to disable strip_accents

Use the modified fork to disable strip_accents. Also see: https://github.com/PhilipMay/electra/tree/no_strip_accents and this PR: https://github.com/google-research/electra/pull/88

git clone -b no_strip_accents https://github.com/PhilipMay/electra.git
cd electra

Step 3 - create TF datasets

python3 build_pretraining_dataset.py --corpus-dir ~/data/nlp/corpus/ready/ --vocab-file ~/dev/git/german-transformer-training/src/vocab_no_strip_accents/vocab.txt --output-dir ./tf_data --max-seq-length 512 --num-processes 8 --do-lower-case --no-strip-accents

Step 4 - training

This needs to be done on a machine with a strong GPU or, even better, a TPU.

python3 run_pretraining.py --data-dir gs://<dir> --model-name 02_Electra_Checkpoints_32k_766k_Combined --hparams '{"pretrain_tfrecords": "gs://<dir>/*" , "model_size": "base", "vocab_file": "gs://<dir>/german_electra_uncased_no_strip_accents_vocab.txt", "num_train_steps": 766000, "max_seq_length": 512, "learning_rate": 2e-4, "embedding_size" : 768, "generator_hidden_size": 0.33333, "vocab_size": 32767, "keep_checkpoint_max": 0, "save_checkpoints_steps": 5000, "train_batch_size": 256, "use_tpu": true, "num_tpu_cores": 8, "tpu_name": "electrav5"}'
    Config:
    debug False
    disallow_correct False
    disc_weight 50.0
    do_eval False
    do_lower_case True
    do_train True
    electra_objective True
    embedding_size 768
    eval_batch_size 128
    gcp_project None
    gen_weight 1.0
    generator_hidden_size 0.33333
    generator_layers 1.0
    iterations_per_loop 200
    keep_checkpoint_max 0
    learning_rate 0.0002
    lr_decay_power 1.0
    mask_prob 0.15
    max_predictions_per_seq 79
    max_seq_length 512
    model_dir gs://<dir>
    model_hparam_overrides {}
    model_name 02_Electra_Checkpoints_32k_766k_Combined
    model_size base
    num_eval_steps 100
    num_tpu_cores 8
    num_train_steps 766000
    num_warmup_steps 10000
    pretrain_tfrecords gs://<dir>/*
    results_pkl gs://<dir>/results/unsup_results.pkl
    results_txt gs://<dir>/results/unsup_results.txt
    save_checkpoints_steps 5000
    temperature 1.0
    tpu_job_name None
    tpu_name electrav5
    tpu_zone None
    train_batch_size 256
    uniform_generator False
    untied_generator True
    untied_generator_embeddings False
    use_tpu True
    vocab_file gs://<dir>/german_electra_uncased_no_strip_accents_vocab.txt
    vocab_size 32767
    weight_decay_rate 0.01
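
Writing the --hparams JSON by hand inside shell quotes is error-prone. One option is to build the command from a Python dict. This is only a sketch: the values are copied from the command above, the gs://<dir> placeholders stay unfilled, and the use of json and shlex is a suggestion, not part of the ELECTRA tooling.

```python
import json
import shlex

# Values copied from the training command above; <dir> is still a placeholder.
hparams = {
    "pretrain_tfrecords": "gs://<dir>/*",
    "model_size": "base",
    "vocab_file": "gs://<dir>/german_electra_uncased_no_strip_accents_vocab.txt",
    "num_train_steps": 766000,
    "max_seq_length": 512,
    "learning_rate": 2e-4,
    "embedding_size": 768,
    "generator_hidden_size": 0.33333,
    "vocab_size": 32767,
    "keep_checkpoint_max": 0,
    "save_checkpoints_steps": 5000,
    "train_batch_size": 256,
    "use_tpu": True,
    "num_tpu_cores": 8,
    "tpu_name": "electrav5",
}

command = " ".join([
    "python3", "run_pretraining.py",
    "--data-dir", shlex.quote("gs://<dir>"),
    "--model-name", "02_Electra_Checkpoints_32k_766k_Combined",
    # json.dumps writes true/false and numbers in valid JSON form,
    # shlex.quote protects the string from the shell.
    "--hparams", shlex.quote(json.dumps(hparams)),
])
print(command)
```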

Convert Model to PyTorch

 python /home/phmay/miniconda3/envs/farm-git/lib/python3.7/site-packages/transformers/convert_electra_original_tf_checkpoint_to_pytorch.py --tf_checkpoint_path ./model.ckpt-40000 --config_file ./config.json --pytorch_dump_path ./pytorch_model.bin --discriminator_or_generator='discriminator'

More info on conversion:

4 - Graph Neural Network

Know-how

Video

Tools & Libraries

Datasets

6 - Hugging Face - Transformers

Model

Training

Tokenizer

  • tokenizers.BertWordPieceTokenizer - for vocab generation - impl
  • tokenizers.normalizers.BertNormalizer - impl
  • transformers.tokenization_bert.BertTokenizer - tokenizer for normal usage - impl
  • transformers.tokenization_bert.BertTokenizerFast - fast tokenizer for normal usage - impl
  • transformers.AutoTokenizer - doc - impl
  • PreTrainedTokenizerBase.__call__ - doc

Pipelines

Data Handling

Important Torch Classes

7 - LightGBM

Advantages

With early stopping, LightGBM's train() returns the model from the best iteration, whereas XGBoost returns the model from the last iteration. See here: https://lightgbm.readthedocs.io/en/latest/Python-Intro.html#early-stopping

Hyperparameter

Useful hyperopt search space:

from hyperopt import hp

# Note: hp.quniform samples floats (e.g. 120.0); cast integer parameters with int() before passing them to LightGBM.
space = {
    'num_leaves': hp.quniform('num_leaves', 100, 600, 10),
    'min_data_in_leaf': hp.quniform('min_data_in_leaf', 10, 30, 1),
    'max_bin': hp.quniform('max_bin', 200, 2000, 10),
    'bagging_fraction': hp.uniform('bagging_fraction', 0.01, 1.0),
    'bagging_freq': hp.quniform('bagging_freq', 0, 10, 1),
    'feature_fraction': hp.uniform('feature_fraction', 0.5, 1.0),
    'lambda_l2': hp.uniform('lambda_l2', 0.0, 70.0),
    'min_gain_to_split': hp.uniform('min_gain_to_split', 0.0, 2.0),
}
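
hp.quniform returns floats such as 120.0, while LightGBM expects integers for parameters like num_leaves. A minimal sketch of a conversion step to run on each sampled point before training; the helper name `to_lightgbm_params` and the sample values are hypothetical:

```python
# Parameters from the search space that LightGBM expects as integers.
INTEGER_PARAMS = ("num_leaves", "min_data_in_leaf", "max_bin", "bagging_freq")


def to_lightgbm_params(sample):
    """Copy a sampled point and cast the integer-valued parameters."""
    params = dict(sample)
    for name in INTEGER_PARAMS:
        if name in params:
            params[name] = int(params[name])
    return params


# Hypothetical sample as hyperopt would pass it to the objective function.
sample = {"num_leaves": 120.0, "max_bin": 510.0, "bagging_fraction": 0.8}
params = to_lightgbm_params(sample)
print(params)
```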

GPU Support

  • Build with GPU Support: python setup.py install --gpu --opencl-include-dir=/usr/local/cuda/include/ --opencl-library=/usr/local/cuda/lib64/libOpenCL.so
  • Install with GPU Support: pip install lightgbm --install-option=--gpu --install-option=--opencl-include-dir=/usr/local/cuda/include/ --install-option=--opencl-library=/usr/local/cuda/lib64/libOpenCL.so

To use the GPU, add 'device_type': 'gpu' to the parameters.

8 - Machine Learning at AWS

AWS Service Overview

General Purpose Machine Learning Services

Text and Speech Services

Image and Video Services

Toy Services

Other Services

Python API

Services

Performance Experiment with EC2 and LightGBM

Given:

  • LightGBM Classification
  • x_train.shape (13788428, 30)
  • x_val.shape (2433253, 30)
  • y_train.shape (13788428,)
  • y_val.shape (2433253,)
  • num_classes: 401
  • num_boost_round: 4

Results:

  • c5.24xlarge, 4.656 $/h, 96 vCPUs, 2nd Gen Intel Xeon Platinum 8275CL, 3.6 GHz, 192 GiB RAM, time: 512 s
  • m5.8xlarge, 1.84 $/h, 32 vCPUs, Intel Xeon Platinum 8175, 3.1 GHz, 128 GiB RAM, time: 574 s
  • r5.4xlarge, 1.216 $/h, 16 vCPUs, Intel Xeon Platinum 8175, 3.1 GHz, 128 GiB RAM, time: 763 s
  • other server, 4 (8) CPUs, Xeon W-2123, 3.6 GHz, 128 GiB RAM, time: 856 s
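
Runtime alone does not show which instance is the most economical. A rough cost per run can be derived as price per hour times runtime divided by 3600, reading the listed prices as decimal numbers in $/h (4.656, 1.84, 1.216). The sketch below ignores startup time and billing granularity, and no price is given for the other server:

```python
# (instance, price in $/h, runtime in s) taken from the results above.
runs = [
    ("c5.24xlarge", 4.656, 512),
    ("m5.8xlarge", 1.84, 574),
    ("r5.4xlarge", 1.216, 763),
]


def cost_per_run(price_per_hour, seconds):
    """Cost of one training run in $, billed proportionally to runtime."""
    return price_per_hour * seconds / 3600


costs = {name: cost_per_run(price, seconds) for name, price, seconds in runs}
for name, cost in costs.items():
    print("{}: {:.3f} $ per run".format(name, cost))

cheapest = min(costs, key=costs.get)
print("cheapest:", cheapest)
```

Under these assumptions the largest instance is the fastest but also the most expensive per run, while the smallest of the three is the cheapest.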

Connect

10 - Optuna

Transfer Trials to other DB

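import optuna

# Load the existing local study and create an empty study in the target DB.
local_study = optuna.load_study(study_name="foo", storage="SQLITE URL")
remote_study = optuna.create_study(study_name="foo", storage="REMOTE SQL DB URL")

# Copy every trial (parameters, values, state) to the remote study.
for trial in local_study.trials:
    remote_study.add_trial(trial)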

Save Plot to Disk

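import optuna

# `study` is an existing optuna study; plot_slice returns a plotly figure.
fig = optuna.visualization.plot_slice(study)
fig.write_image("<filename>.png")  # save to image (plotly needs an image export package such as kaleido)
fig.write_html("<filename>.html")  # save to html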

11 - Paraphrase Mining

Data

12 - Seldon

Install

  • Use k8s 1.17.0: minikube start -p aged --kubernetes-version=v1.17.0
    • current version line 1.18.x does not work (at the moment)
    • when you want to start k8s again you have to provide this command again - with version number
    • maybe delete all k8s before: minikube delete --all - maybe with --purge
  • follow this guide to install istio: https://github.com/SeldonIO/seldon-core/tree/master/examples/auth
    • let’s use version 1.5 like the guide says
    • curl -L https://istio.io/downloadIstio | ISTIO_VERSION=1.5.0 TARGET_ARCH=x86_64 sh -
    • add export PATH="$PATH:/home/mike/bin/istio-1.5.0/bin" to .bashrc
    • istioctl manifest apply --set profile=demo
  • install the rest

Using and Testing with Python