2 - Colab

Shortcuts

  • Show shortcuts: ctrl + m, h
  • Insert code cell above: ctrl + m, a
  • Insert code cell below: ctrl + m, b
  • Delete cell/selection: ctrl + m, d
  • Convert to code cell: ctrl + m, y
  • Convert to markdown cell: ctrl + m, m

More

  • render pandas dataframes into interactive tables: %load_ext google.colab.data_table

Download Files from Colab

from google.colab import files
files.download('<file_to_download>')

Upload Files to Colab

from google.colab import files
uploaded = files.upload()

Mount Google Drive

from google.colab import drive
drive.mount('/gdrive')
!ls -la /gdrive

3 - Conda

Manage Environments

  • create environment: conda create --name <new_env_name>
  • create environment with python 3.9: conda create --name <new_env_name> python=3.9
  • activate environment: conda activate <env_name>
  • deactivate (leave) environment: conda deactivate
  • list available environments: conda info --envs
  • remove environment: conda remove --name <env_name> --all

Updates

Other Commands

Rename Conda Environment

Rename <src_env> to <target_env>:

conda create --name <target_env> --clone <src_env>
conda remove --name <src_env> --all

Installation

Conda installation on Linux

  • download Miniconda (not Anaconda): https://conda.io/en/latest/miniconda.html
  • download the 64 bit Miniconda3 for the highest Python version of your architecture
  • change install file to executable: chmod +x Miniconda3-latest-Linux-x86_64.sh
  • start installation: ./Miniconda3-latest-Linux-x86_64.sh
  • use default settings
  • log out and back in to activate new settings

Windows Install

  • download Miniconda (not Anaconda): https://conda.io/en/latest/miniconda.html#windows-installers
  • download Miniconda3 for the highest Python version
  • preferably the 64 bit version
  • proxy setup
    • add the following content to the .condarc file
    • located at C:\Users\<User>
    • <user> and <pass> are optional
    • some proxies expect an http:// URL (not https://) for the https entry as well
proxy_servers:
  http: http://[<user>:<pass>@]corp.com:8080
  https: https://[<user>:<pass>@]corp.com:8080

5 - Docstrings

Description

Python docstrings can be written in many different formats. An overview of different methods can be found in the following Stack Overflow entry: What is the standard Python docstring format?

It seems sensible to use the docstring format of NumPy and SciPy, since these packages are very popular. A guide to this format is the numpydoc docstring guide, and there is also a reStructuredText (reST) syntax cheat sheet for Sphinx.
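A minimal sketch of a numpydoc-style docstring. The function scale and its parameters are hypothetical and only serve to show the section layout (Parameters, Returns, Raises, Examples):

```python
def scale(values, factor=2.0):
    """Multiply every value by a constant factor.

    Parameters
    ----------
    values : list of float
        Numbers to scale.
    factor : float, optional
        Multiplier applied to each value. Default is 2.0.

    Returns
    -------
    list of float
        The scaled values, in the same order as the input.

    Raises
    ------
    ValueError
        If `values` is empty.

    Examples
    --------
    >>> scale([1.0, 2.0], factor=3.0)
    [3.0, 6.0]
    """
    if not values:
        raise ValueError("values must not be empty")
    return [value * factor for value in values]
```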

6 - Exceptions

Most Important Exceptions

  • TypeError: Raised when an operation or function is applied to an object of inappropriate type. The associated value is a string giving details about the type mismatch.
  • NotImplementedError: In user defined base classes, abstract methods should raise this exception when they require derived classes to override the method, or while the class is being developed to indicate that the real implementation still needs to be added.
  • ValueError: Raised when an operation or function receives an argument that has the right type but an inappropriate value, and the situation is not described by a more precise exception such as IndexError.
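The three exceptions above can be sketched together in one small example. The classes Shape and Square are hypothetical and exist only for illustration:

```python
class Shape:
    """Hypothetical base class; subclasses must override area()."""

    def area(self):
        # Abstract method: derived classes are required to override it.
        raise NotImplementedError("subclasses must implement area()")


class Square(Shape):
    def __init__(self, side):
        if not isinstance(side, (int, float)):
            # Wrong type: the argument is not a number at all.
            raise TypeError(f"side must be a number, got {type(side).__name__}")
        if side <= 0:
            # Right type, but an inappropriate value.
            raise ValueError(f"side must be positive, got {side}")
        self.side = side

    def area(self):
        return self.side ** 2
```

Square("3") raises TypeError, Square(-1) raises ValueError, and Shape().area() raises NotImplementedError.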

Logging Exceptions

try:
    something()
except Exception:
    logger.error("something bad happened", exc_info=True)

also see https://www.loggly.com/blog/exceptional-logging-of-exceptions-in-python/
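As a sketch of an alternative: inside an except block, logger.exception(...) logs at ERROR level and attaches the traceback, just like logger.error(..., exc_info=True). The logger name and the in-memory buffer below are assumptions made so the output can be inspected:

```python
import io
import logging

# Hypothetical logger that writes to an in-memory buffer instead of stderr.
buffer = io.StringIO()
logger = logging.getLogger("exception_demo")
logger.addHandler(logging.StreamHandler(buffer))
logger.propagate = False  # keep the demo output out of the root logger

try:
    1 / 0
except Exception:
    # Logs the message at ERROR level together with the current traceback.
    logger.exception("something bad happened")

log_output = buffer.getvalue()
```

After this runs, log_output contains the message followed by the full ZeroDivisionError traceback.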

7 - Filesystem

Path

Create Path

from pathlib import Path
Path("/my/pythondirectory").mkdir(parents=True, exist_ok=True)
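A small sketch of what the two arguments do, run inside a temporary directory so nothing is left behind. The directory names are placeholders:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "my" / "pythondirectory"

    # parents=True also creates the missing intermediate directory "my".
    target.mkdir(parents=True, exist_ok=True)
    # exist_ok=True makes a repeated call a no-op instead of an error.
    target.mkdir(parents=True, exist_ok=True)
    created = target.is_dir()

    # Without exist_ok=True, an existing directory raises FileExistsError.
    try:
        target.mkdir(parents=True)
        raised = False
    except FileExistsError:
        raised = True
```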

8 - Iterate

  • break iterable into lists of length n: list(more_itertools.chunked(iterable, n))
  • flatten a list of lists: list(itertools.chain.from_iterable(list_of_lists))
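If more_itertools is not installed, chunking can be sketched with the standard library alone. The helper chunked below is a hypothetical stand-in, not the more_itertools implementation:

```python
from itertools import chain, islice


def chunked(iterable, n):
    """Yield lists of length n; the last list may be shorter."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, n))
        if not chunk:
            return
        yield chunk


chunks = list(chunked(range(7), 3))       # [[0, 1, 2], [3, 4, 5], [6]]
flat = list(chain.from_iterable(chunks))  # [0, 1, 2, 3, 4, 5, 6]
```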

Dict

Iterate keys of dict

d = {'x': 1, 'y': 2, 'z': 3}
for key in d:
    print(key, 'corresponds to', d[key])

Iterate keys and values of dict:

d = {'x': 1, 'y': 2, 'z': 3}
for key, value in d.items():
    print(key, 'corresponds to', value)

9 - Joblib

Commands

10 - JSON

11 - Jupyter & JupyterLab

Install JupyterLab

  • conda install jupyterlab nb_conda_kernels

View Jupyter Notebook online

This can be done here: https://nbviewer.jupyter.org/

Add Conda Environment to Jupyter Lab

python -m ipykernel install --user --name <conda_env_name> --display-name "<conda_env_name>"

Environment settings for Jupyter

Environment settings for Jupyter are not read from .bashrc. You have to specify them in a .py file at ~/.ipython/profile_default/startup/

For example:

echo -e "import os\n\nos.environ[\"SOME_URL\"] = \"http://mlflow.company.tld:5000\"" > ~/.ipython/profile_default/startup/set_env.py

Install Jupyter on Server for Remote Access

Clean the Trash

When you use the “File Browser” of Jupyter Lab to delete files, they are not deleted but moved to ~/.local/share/Trash. Empty that folder to delete them permanently.

13 - Logging

Module-Level Logger

A convention is to use a module-level logger as follows:

import logging
_logger = logging.getLogger(__name__)

Log variable Data

Logging uses the old “%-style” of string formatting. Pass the values as separate arguments instead of formatting the message yourself.

Example:

_logger.warning("%s before you %s", "Look", "leap!")
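A runnable sketch of %-style logging. The values are passed as separate arguments and merged into the message only when the record is actually emitted. The logger name and the in-memory buffer are assumptions made for the demo:

```python
import io
import logging

buffer = io.StringIO()
_logger = logging.getLogger("format_demo")
_logger.setLevel(logging.WARNING)
_logger.addHandler(logging.StreamHandler(buffer))
_logger.propagate = False  # keep the demo output out of the root logger

# Emitted: the arguments are interpolated into the format string.
_logger.warning("%s before you %s", "Look", "leap!")
# Not emitted (below WARNING), so the message is never formatted.
_logger.debug("not shown: %s", "some value")

output = buffer.getvalue()  # "Look before you leap!\n"
```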

Root Logger Configuration

The easiest way to configure the root logger works like this:

import logging
logging.getLogger().setLevel(logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler())
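As an alternative sketch, logging.basicConfig sets the level, the format, and a stream handler on the root logger in one call. The format string, the logger name, and the in-memory stream (standing in for the default stderr) are assumptions:

```python
import io
import logging

stream = io.StringIO()  # stand-in for sys.stderr so the output can be inspected

logging.basicConfig(
    level=logging.INFO,
    format="%(levelname)s:%(name)s:%(message)s",
    stream=stream,
    force=True,  # Python 3.8+: replace handlers already attached to the root logger
)

logging.getLogger("config_demo").info("root logger configured")
output = stream.getvalue()  # "INFO:config_demo:root logger configured\n"
```

Without force=True, basicConfig does nothing when the root logger already has handlers.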

14 - Mock

Asserts

Assert Methods

Assert only some Arguments

Assert not all exact arguments but just some of them.

Example:

from unittest.mock import Mock, ANY
mock = Mock(return_value=None)
mock("foo", bar="something_I_do_not_care_about")
mock.assert_called_once_with("foo", bar=ANY)

Spy

If you do not want to replace a function with a mock, but want to observe the parameters passed to the function, a so-called spy can be used.

Spy on a Function

To do this we use a combination of patch and the wraps argument. Note that the first argument of patch is a string.

Example:

from unittest.mock import patch

def my_function(x):
    return x + 1

# "__main__" assumes this code runs as a script or in a notebook
with patch("__main__.my_function", wraps=my_function) as wrapped_function:
    result = my_function(5)
    assert result == 6
    wrapped_function.assert_called_once_with(5)

Spy on a Method

To do this we use a combination of patch.object and the wraps argument. Note that the first argument of patch.object is an object (instance).

Example:

from unittest.mock import patch

class MyClass:
    def my_function(self, x):
        return x + 1

my_instance = MyClass()
with patch.object(my_instance, 'my_function', wraps=my_instance.my_function) as wrapped_method:
    result = my_instance.my_function(5)
    assert result == 6
    wrapped_method.assert_called_once_with(5)

Seal a Mock

Seal disables the automatic creation of mocks when accessing an attribute of the mock. Also see: https://docs.python.org/3/library/unittest.mock.html#unittest.mock.seal

One problem with normal mocks is that they auto-create attributes on demand. If you misspell an attribute or method name, the access silently returns a new child mock instead of failing. Sealing the mock after configuration avoids this problem.

Example:

from unittest.mock import Mock, seal
mock = Mock()
# more configuration here
seal(mock)
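A sketch of the effect of sealing. The attribute names compute and comptue (a deliberate typo) are hypothetical:

```python
from unittest.mock import Mock, seal

mock = Mock()
mock.compute.return_value = 42  # configure the mock before sealing it
seal(mock)

result = mock.compute()  # attributes configured before sealing still work

try:
    mock.comptue()  # misspelled name: a sealed mock refuses to create it
    sealed_error = None
except AttributeError as error:
    sealed_error = error
```

On an unsealed mock, the misspelled call would have returned a new Mock without any error.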

15 - Pandas

Create Dataframe

import pandas as pd

data = {"col1": [1, 2], "col2": [3, 4]}
df = pd.DataFrame(data=data)

Load and Save CSV

  • save to CSV: df.to_csv("path_or_buffer")
  • save to CSV (without row names / index): df.to_csv("path_or_buffer", index=False)
  • load from CSV (csv.QUOTE_NONE requires import csv):
df = pd.read_csv(
    "path_or_buffer",
    sep=";",
    encoding="us-ascii",
    usecols=col_list,
    nrows=number_of_rows_to_read,
    low_memory=False,
    quoting=csv.QUOTE_NONE,
)
  • load csv without header: df = pd.read_csv("path_or_buffer", names=["column_name_1", "column_name_2"], header=None)

Load and Save Parquet

  • save to parquet: df.to_parquet("<file_name>.parquet.gz", compression="gzip")
  • load from parquet: df = pd.read_parquet("<file_name>.parquet.gz")

Display Data

  • count values in column (without NaN values): df["col_name"].value_counts()
  • count values in column (with NaN values): df["col_name"].value_counts(dropna=False)
  • duplicates
    • display duplicate rows: df[df.duplicated(keep=False)]
    • display duplicate entries in column: df[df["column_name"].duplicated(keep=False)]

Delete Data

  • delete column inline
    • df.drop("column_name", axis=1, inplace=True)
    • column_name can also be a list of str
  • remove rows on condition: df.drop(df[df["col_name"] == condition].index, inplace=True)
  • remove duplicates
    • keep first (inplace): df.drop_duplicates(inplace=True, keep="first")
    • only consider certain columns to identify duplicates, keep first (inplace): df.drop_duplicates(list_of_cols, inplace=True, keep="first")

Modify Data

  • sort
    • low to high values: df.sort_values("column_name", inplace=True)
    • high to low values: df.sort_values("column_name", ascending=False, inplace=True)
    • high to low values & NaN values on top: df.sort_values("column_name", ascending=False, na_position="first")
  • shuffle: df = df.sample(frac=1).reset_index(drop=True)

Combine Data

Stack two Dataframes

Always pass ignore_index=True. Otherwise the result has duplicate index values, which can cause problems later!

df = pd.concat([df_01, df_02], ignore_index=True)
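A small sketch of the difference, assuming pandas is installed. The two frames and their column name are made up for the demo:

```python
import pandas as pd

df_01 = pd.DataFrame({"col1": [1, 2]})
df_02 = pd.DataFrame({"col1": [3, 4]})

kept = pd.concat([df_01, df_02])                      # original labels are kept
fresh = pd.concat([df_01, df_02], ignore_index=True)  # labels are renumbered

kept_index = list(kept.index)    # [0, 1, 0, 1]
fresh_index = list(fresh.index)  # [0, 1, 2, 3]

# With duplicate labels, selecting label 0 returns two rows instead of one.
rows_at_label_zero = len(kept.loc[[0]])
```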

Display Settings

Examples for display settings:

pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)

# display long sentences in multiple rows (Jupyter)
pd.set_option("display.max_colwidth", None)

Filter NaN Values

NaN == NaN is always False, so == cannot be used to check for NaN values. Use pd.isnull(obj), where obj is a scalar or array-like, or the isnull() method instead. Examples:

df.loc[pd.isnull(df["col"])]
df[df["col"].isnull()]
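The comparison rule can be checked with the standard library alone. In this sketch math.isnan plays the role that pd.isnull plays for pandas objects:

```python
import math

nan = float("nan")
values = [1.0, nan, 3.0]

equal_to_itself = nan == nan  # False: NaN never compares equal, not even to itself

# An equality filter finds nothing ...
found_by_equality = [v for v in values if v == nan]
# ... while a dedicated NaN check finds the entry.
found_by_isnan = [v for v in values if math.isnan(v)]
```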

Other

  • rename columns: df.rename(columns={"a": "x"}, inplace=True)

16 - PIP

Install Packages

List Packages

  • list outdated packages: pip list -o
  • list packages in requirements.txt format: pip list --format freeze

Other Commands

  • delete package cache: pip cache purge

Install and update Packages from a File

For pip you can create so-called requirements files. These files list one package per line. Packages from such a file can be installed with pip install -r <requirements.txt> and updated with pip install -r <requirements.txt> -U. The update only makes sense when you do not pin a version number for the package.

Since pip does not support an “update all” mechanism this is a good way to install and update the needed packages.

To add a package from Git, just add git+<https_git_clone_link> instead of the normal package name.

Build PyPI Packages

18 - PyCharm

Hotkeys

  • bookmarks
    • add anonymous bookmark: F3
    • view bookmarks: command + 2 (Mac)

Settings

  • use git credential manager: Settings -> Version Control -> Git -> activate “Use Credential helper”

19 - pyenv

Commands

  • list available Python versions: pyenv install -l
  • install new Python version: pyenv install <version>
  • update with brew: brew upgrade pyenv

20 - Python Naming

This is about how things are named in Python.

Classes

  • class object: the raw class itself (not instantiated)
  • instance object or instance of a class: an instantiated class like x = MyClass()
    an instance object has two kinds of valid attribute names:
    • data attributes: variables belonging to the instance; like local variables, they need not be declared and spring into existence when they are first assigned to
    • methods: a method is a function that “belongs to” an object
  • class variables: attributes (and methods) shared by all instances of a class
  • instance variables: data unique to each instance
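The terms above, sketched on a hypothetical class Dog:

```python
class Dog:
    kind = "canine"  # class variable: shared by all instances

    def __init__(self, name):
        self.name = name  # instance variable: unique to each instance

    def speak(self):  # method: a function that belongs to the object
        return f"{self.name} says woof"


# Dog itself is the class object; fido and rex are instance objects.
fido = Dog("Fido")
rex = Dog("Rex")

# A data attribute springs into existence on first assignment.
fido.nickname = "Fi"
```

Here fido.kind and rex.kind refer to the same class variable, while nickname exists only on fido.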

classmethod

see https://www.geeksforgeeks.org/class-method-vs-static-method-python/

staticmethod

see https://www.geeksforgeeks.org/class-method-vs-static-method-python/
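A minimal sketch of both decorators on a hypothetical class Temperature. A classmethod receives the class as its first argument (often used for alternative constructors), while a staticmethod receives neither the instance nor the class:

```python
class Temperature:
    def __init__(self, celsius):
        self.celsius = celsius

    @classmethod
    def from_fahrenheit(cls, fahrenheit):
        # cls is the class itself, so subclasses get instances of their own type.
        return cls(cls.fahrenheit_to_celsius(fahrenheit))

    @staticmethod
    def fahrenheit_to_celsius(fahrenheit):
        # No self or cls: a plain function that lives in the class namespace.
        return (fahrenheit - 32) * 5 / 9


boiling = Temperature.from_fahrenheit(212)
```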

constructor vs. initializer vs. allocator (__init__)

see https://stackoverflow.com/a/6578504
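A short sketch of the order in which Python calls the two special methods: __new__ creates and returns the instance, then __init__ initializes it. The class Point and the calls list are hypothetical and only record what happens:

```python
calls = []


class Point:
    def __new__(cls, x, y):
        # Runs first: creates and returns the new, still empty instance.
        calls.append("__new__")
        return super().__new__(cls)

    def __init__(self, x, y):
        # Runs second: fills the already existing instance with data.
        calls.append("__init__")
        self.x = x
        self.y = y


point = Point(1, 2)
```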

22 - tqdm

Usage

Simple usage:

from tqdm import tqdm

for i in tqdm(range(10000)):
    ...  # loop body goes here

Manual usage:

from time import sleep
from tqdm import tqdm

with tqdm(total=100) as pbar:
    for i in range(10):
        sleep(0.1)
        pbar.update(10)
  • automatically choose between console or notebook versions: from tqdm.auto import tqdm

23 - Typing

Types

  • Any: Special type indicating an unconstrained type.
  • Optional: Optional type.
  • Dict
    • Dict[str, int]
    • Dict[str, List[int]]
  • Callable
    • Callable[[str], Dict[str, List[int]]]

Example

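    # requires: from typing import Callable, Dict, List, Optional
    def __init__(
        self,
        tokenizer_func: Callable[[str], Dict[str, List[int]]],
        augmentation_func: Callable[[str], str],
        # Optional[...] because the default value is None
        train_data_sampling_callback: Optional[Callable[[List[str]], List[int]]] = None,
    ):
        ...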
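A self-contained sketch that uses the types listed above. The functions toy_tokenizer and encode are hypothetical and exist only to show the annotations in a runnable context:

```python
from typing import Any, Callable, Dict, List, Optional


def toy_tokenizer(text: str) -> Dict[str, List[int]]:
    """Hypothetical tokenizer: maps each character to its code point."""
    return {"input_ids": [ord(char) for char in text]}


def encode(
    text: str,
    tokenizer_func: Callable[[str], Dict[str, List[int]]],
    max_length: Optional[int] = None,  # Optional because None is allowed
) -> Dict[str, Any]:
    ids = tokenizer_func(text)["input_ids"]
    if max_length is not None:
        ids = ids[:max_length]
    return {"text": text, "input_ids": ids}


encoded = encode("abc", toy_tokenizer, max_length=2)
```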