Python Topics
- 1: Beautiful Soup
- 2: Colab
- 3: Conda
- 4: Context Manager
- 5: Docstrings
- 6: Exceptions
- 7: Filesystem
- 8: Iterate
- 9: Joblib
- 10: JSON
- 11: Jupyter & JupyterLab
- 12: Linter
- 13: Logging
- 14: Mock
- 15: Pandas
- 16: PIP
- 17: Poetry
- 18: PyCharm
- 19: pyenv
- 20: Python Naming
- 21: REST API with Python
- 22: tqdm
- 23: Typing
1 - Beautiful Soup
Links
2 - Colab
Links
- Colab: https://colab.research.google.com/
- Colab on GitHub: https://github.com/googlecolab/colabtools
Shortcuts
- Show shortcuts:
ctrl + m, h
- Insert code cell above:
ctrl + m, a
- Insert code cell below:
ctrl + m, b
- Delete cell/selection:
ctrl + m, d
- Convert to code cell:
ctrl + m, y
- Convert to markdown cell:
ctrl + m, m
More
- render pandas dataframes into interactive tables:
%load_ext google.colab.data_table
Download Files from Colab
from google.colab import files
files.download('<file_to_download>')
Upload Files to Colab
from google.colab import files
uploaded = files.upload()
Mount Google Drive
from google.colab import drive
drive.mount('/gdrive')
!ls -la /gdrive
3 - Conda
Manage Environments
- create environment:
conda create --name <new_env_name>
- create environment with python 3.9:
conda create --name <new_env_name> python=3.9
- activate environment:
conda activate <env_name>
- deactivate (leave) environment:
conda deactivate
- list available environments:
conda info --envs
- remove environment:
conda remove --name <env_name> --all
Updates
- update conda:
conda update -n base -c defaults conda
- update all conda packages in current environment:
conda update --all
- also see: https://docs.conda.io/projects/conda/en/latest/commands/update.html
Other Commands
- remove unused cached packages:
conda clean -a
- also see: https://docs.conda.io/projects/conda/en/latest/commands/clean.html
- disable automatic base activation:
conda config --set auto_activate_base false
- also see: https://stackoverflow.com/a/54560785/271118
Rename Conda Environment
Rename <src_env> to <target_env>:
conda create --name <target_env> --clone <src_env>
conda remove --name <src_env> --all
Installation
Conda installation on Linux
- download Miniconda (not Anaconda): https://conda.io/en/latest/miniconda.html
- download the 64 bit Miniconda3 installer for the highest Python version available for your architecture
- change install file to executable:
chmod +x Miniconda3-latest-Linux-x86_64.sh
- start installation:
./Miniconda3-latest-Linux-x86_64.sh
- use default settings
- log out and back in to activate new settings
Windows Install
- download Miniconda (not Anaconda): https://conda.io/en/latest/miniconda.html#windows-installers
- download Miniconda3 for the highest Python version
- preferably the 64 bit version
- proxy setup
- add the following content to the .condarc file, located at C:\Users\<User>
- <user> and <pass> are optional
- some https settings use the http protocol and not https
proxy_servers:
  http: http://[<user>:<pass>@]corp.com:8080
  https: https://[<user>:<pass>@]corp.com:8080
4 - Context Manager
Links
5 - Docstrings
Description
Python docstrings can be written in many different formats. An overview of different methods can be found in the following Stack Overflow entry: What is the standard Python docstring format?
It is a good idea to use the docstring format of scipy and numpy, since these packages are very popular. A guide to the numpy docstring format is the numpydoc docstring guide, and there is also a reStructuredText (reST) syntax cheat sheet for Sphinx.
Links
- PEP 8: https://www.python.org/dev/peps/pep-0008/
- PEP 257: https://www.python.org/dev/peps/pep-0257/
- PEP 287 - reStructuredText Docstring Format: https://www.python.org/dev/peps/pep-0287/
- Google Python Style Guide: https://github.com/google/styleguide/blob/gh-pages/pyguide.md
- Example Google Style Python Docstrings: https://www.sphinx-doc.org/en/master/usage/extensions/example_google.html
6 - Exceptions
Links
- Built-in Exceptions: https://docs.python.org/3/library/exceptions.html
Most Important Exceptions
- TypeError: Raised when an operation or function is applied to an object of inappropriate type. The associated value is a string giving details about the type mismatch.
- NotImplementedError: In user-defined base classes, abstract methods should raise this exception when they require derived classes to override the method, or while the class is being developed to indicate that the real implementation still needs to be added.
- ValueError: Raised when an operation or function receives an argument that has the right type but an inappropriate value, and the situation is not described by a more precise exception such as IndexError.
Logging Exceptions
try:
something()
except Exception:
logger.error("something bad happened", exc_info=True)
also see https://www.loggly.com/blog/exceptional-logging-of-exceptions-in-python/
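A self-contained sketch of the pattern above (the logger name and failing operation are illustrative); with exc_info=True the full traceback is written to the log:

```python
import io
import logging

# send log output to a string buffer so we can inspect it afterwards
buffer = io.StringIO()
logger = logging.getLogger("exc_demo")
logger.addHandler(logging.StreamHandler(buffer))

try:
    1 / 0
except Exception:
    # exc_info=True appends the full traceback to the log record
    logger.error("something bad happened", exc_info=True)
```

Inside an except block, `logger.exception("something bad happened")` is equivalent shorthand for `logger.error(..., exc_info=True)`.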
7 - Filesystem
Path
Create Path
from pathlib import Path
Path("/my/pythondirectory").mkdir(parents=True, exist_ok=True)
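A self-contained variant of the snippet above, using a temporary directory so it can run anywhere (the subdirectory names are illustrative):

```python
import tempfile
from pathlib import Path

# build a nested directory tree inside a fresh temporary directory
base = Path(tempfile.mkdtemp())
target = base / "my" / "pythondirectory"
target.mkdir(parents=True, exist_ok=True)

# calling it again is safe: exist_ok=True suppresses FileExistsError
target.mkdir(parents=True, exist_ok=True)
```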
8 - Iterate
- break iterable into lists of length n:
list(more_itertools.chunked(iterable, n))
- flatten a list of lists:
list(itertools.chain.from_iterable(list_of_lists))
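Note that more_itertools is a third-party package; a stdlib-only sketch of both operations (the chunked helper is a hand-rolled stand-in, not the more_itertools implementation):

```python
import itertools

def chunked(iterable, n):
    """Yield successive lists of length n (last chunk may be shorter)."""
    it = iter(iterable)
    while chunk := list(itertools.islice(it, n)):
        yield chunk

chunks = list(chunked(range(7), 3))
flat = list(itertools.chain.from_iterable(chunks))
```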
Dict
Iterate keys of dict
d = {'x': 1, 'y': 2, 'z': 3}
for key in d:
print(key, 'corresponds to', d[key])
Iterate keys and values of dict:
d = {'x': 1, 'y': 2, 'z': 3}
for key, value in d.items():
print(key, 'corresponds to', value)
9 - Joblib
Links
Commands
- pickle compressed data to disk
joblib.dump(data, '<the_file>.pkl.gz', compress=('gzip', 3))
- also see https://joblib.readthedocs.io/en/latest/generated/joblib.dump.html
- read compressed pickled data from disk
data = joblib.load('<the_file>.pkl.gz')
- also see https://joblib.readthedocs.io/en/latest/generated/joblib.load.html
11 - Jupyter & JupyterLab
Links
nb_conda_kernels
: https://github.com/Anaconda-Platform/nb_conda_kernels
Install JupyterLab
conda install jupyterlab nb_conda_kernels
View Jupyter Notebook online
This can be done here: https://nbviewer.jupyter.org/
Add Conda Environment to Jupyter Lab
python -m ipykernel install --user --name <conda_env_name> --display-name "<conda_env_name>"
Environment settings for Jupyter
Environment settings for Jupyter are not read from .bashrc. You have to specify them in a .py file at ~/.ipython/profile_default/startup/
For example:
echo -e "import os\n\nos.environ[\"SOME_URL\"] = \"http://mlflow.company.tld:5000\"" > ~/.ipython/profile_default/startup/set_env.py
Install Jupyter on Server for Remote Access
- create blank config:
jupyter notebook --generate-config
- Guide: https://jupyter-notebook.readthedocs.io/en/stable/public_server.html#running-a-public-notebook-server
- Password hash generation: https://jupyter-notebook.readthedocs.io/en/stable/public_server.html#preparing-a-hashed-password
- Certificate generation: https://jupyter-notebook.readthedocs.io/en/stable/public_server.html#using-ssl-for-encrypted-communication
Clean the Trash
When you use the “File Browser” of Jupyter Lab to delete files, they are not deleted but moved to ~/.local/share/Trash. Clean that folder to delete them.
12 - Linter
Black
- see https://black.readthedocs.io/
- ignore formatting in one line:
# fmt: skip
- also see: https://black.readthedocs.io/en/stable/the_black_code_style/current_style.html#code-style
13 - Logging
Links
Module-Level Logger
A convention is to use a module-level logger as follows:
import logging
_logger = logging.getLogger(__name__)
Log variable Data
Logging uses the old “%-style” of string formatting.
Example:
_logger.warning("%s before you %s", "Look", "leap!")
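The arguments are passed separately and only interpolated if the record is actually emitted (lazy formatting). A runnable sketch, with the log output captured in a buffer for inspection (logger name is illustrative):

```python
import io
import logging

buffer = io.StringIO()
logger = logging.getLogger("fmt_demo")
logger.addHandler(logging.StreamHandler(buffer))

# "%s" placeholders are filled in lazily by the logging framework
logger.warning("%s before you %s", "Look", "leap!")
```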
Root Logger Configuration
The easiest way to configure the root logger works like this:
import logging
logging.getLogger().setLevel(logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler())
14 - Mock
Links
Asserts
Assert Methods
- assert_called()
- assert_called_once()
- assert_called_with(*args, **kwargs) - only passes if the call is the most recent one
- assert_called_once_with(*args, **kwargs) - passes if the mock was called exactly once with the specified arguments
- assert_any_call(*args, **kwargs) - passes if the mock has ever been called with the specified arguments
- assert_has_calls(calls, any_order=False)
- assert_not_called()
Assert only some Arguments
Assert not all exact arguments but just some of them.
Example:
from unittest.mock import Mock, ANY
mock = Mock(return_value=None)
mock("foo", bar="something_I_do_not_care_about")
mock.assert_called_once_with("foo", bar=ANY)
Spy
If you do not want to replace a function with a mock, but want to observe the parameters passed to the function, a so-called spy can be used.
Spy on a Function
To do this we use a combination of patch and the wraps argument. Note that the first argument of patch is a string.
Example:
from unittest.mock import patch

def my_function(x):
    return x + 1

with patch("__main__.my_function", wraps=my_function) as wrapped_function:
    result = my_function(5)
    assert result == 6
    wrapped_function.assert_called_once_with(5)
Spy on a Method
To do this we use a combination of patch.object and the wraps argument. Note that the first argument of patch.object is an object (instance).
Example:
from unittest.mock import patch

class MyClass:
    def my_function(self, x):
        return x + 1

my_instance = MyClass()

with patch.object(my_instance, "my_function", wraps=my_instance.my_function) as wrapped_method:
    result = my_instance.my_function(5)
    assert result == 6
    wrapped_method.assert_called_once_with(5)
Seal a Mock
Seal disables the automatic creation of mocks when accessing an attribute of the mock. Also see: https://docs.python.org/3/library/unittest.mock.html#unittest.mock.seal
One problem with normal mocks is that they auto-create attributes on demand. If you misspell one of the assert methods, your assertion is simply swallowed and ignored. Sealing the mock after configuration avoids this problem.
Example:
from unittest.mock import Mock, seal
mock = Mock()
# more configuration here
seal(mock)
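A runnable sketch of the misspelling problem and how seal catches it (attribute names are illustrative):

```python
from unittest.mock import Mock, seal

mock = Mock()
mock.do_work.return_value = 42  # configure everything before sealing
seal(mock)

value = mock.do_work()  # attributes configured before sealing still work

# accessing a new, unconfigured attribute now raises AttributeError;
# note the typo "asert" - on an unsealed mock this would silently pass
try:
    mock.asert_called_once()
    raised = False
except AttributeError:
    raised = True
```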
15 - Pandas
Links
- indexing and selecting data: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
- using iloc, loc, & ix: https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/
- iterating dataframes
- iterrows: https://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.iterrows.html
- itertuples: https://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.itertuples.html
- iteritems: https://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.iteritems.html
Create Dataframe
data = {"col1": [1, 2], "col2": [3, 4]}
df = pd.DataFrame(data=data)
Load and Save CSV
- save to CSV:
df.to_csv("path_or_buffer")
- save to CSV (without row names / index):
df.to_csv("path_or_buffer", index=False)
- load from CSV:
df = pd.read_csv(
"path_or_buffer",
sep=";",
encoding="us-ascii",
usecols=col_list,
nrows=number_of_rows_to_read,
low_memory=False,
quoting=csv.QUOTE_NONE,
)
- load csv without header:
df = pd.read_csv("path_or_buffer", names=["column_name_1", "column_name_2"], header=None)
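read_csv also accepts any file-like object, which makes the header-less form easy to try out. A runnable sketch using io.StringIO (the data and column names are made up):

```python
import io

import pandas as pd

csv_data = "1;Alice\n2;Bob\n"  # two rows, no header line
df = pd.read_csv(
    io.StringIO(csv_data),
    sep=";",
    names=["id", "name"],  # supply the column names ourselves
    header=None,           # the file itself has no header row
)
```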
Load and Save Parquet
- save to parquet:
df.to_parquet("<file_name>.parquet.gz", compression="gzip")
- load from parquet:
df = pd.read_parquet("<file_name>.parquet.gz")
Display Data
- count values in column (without NaN values):
df["col_name"].value_counts()
- count values in column (with NaN values):
df["col_name"].value_counts(dropna=False)
- duplicates
- display duplicate rows:
df[df.duplicated(keep=False)]
- display duplicate entries in column:
df[df["column_name"].duplicated(keep=False)]
Delete Data
- delete column (inplace):
df.drop("column_name", axis=1, inplace=True)
- column_name can also be a list of str
- remove rows on condition:
df.drop(df[df["col_name"] == condition].index, inplace=True)
- remove duplicates
- keep first (inplace):
df.drop_duplicates(inplace=True, keep="first")
- only consider certain columns to identify duplicates, keep first (inplace):
df.drop_duplicates(list_of_cols, inplace=True, keep="first")
Modify Data
- sort
- low to high values:
df.sort_values("column_name", inplace=True)
- high to low values:
df.sort_values("column_name", ascending=False, inplace=True)
- high to low values & NaN values on top:
df.sort_values("column_name", ascending=False, na_position="first")
- shuffle:
df = df.sample(frac=1).reset_index(drop=True)
Combine Data
Stack two Dataframes
Never forget to set ignore_index=True, or you will end up with duplicate index values and bad things might happen later!
df = pd.concat([df_01, df_02], ignore_index=True)
Display Settings
Examples for display settings:
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)
# display long sentences in multiple rows (Jupyter)
pd.set_option("display.max_colwidth", None)
Filter NaN Values
nan == nan is always False. That is why we cannot use == to check for NaN values. Use pd.isnull(obj: scalar or array-like) or .isnull() instead. Examples:
df.loc[pd.isnull(df["col"])]
df[df["col"].isnull()]
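A runnable sketch of both forms on a toy dataframe (column name and values are made up); .notnull() selects the complement:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col": [1.0, np.nan, 3.0]})

# both forms select the rows where "col" is NaN
missing = df[df["col"].isnull()]
# .notnull() selects the rows where "col" is NOT NaN
present = df[df["col"].notnull()]
```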
Other
- rename columns:
df.rename(columns={"a": "x"}, inplace=True)
16 - PIP
Install Packages
- install GIT development code:
pip install git+<https_git_clone_link>
- install GIT development code from branch:
pip install git+<https_git_clone_link>@<branch_name>
- install editable local Projects:
pip install -e .
- also see https://pip.pypa.io/en/stable/reference/pip_install/#editable-installs
List Packages
- list outdated packages:
pip list -o
- list packages in requirements.txt format:
pip list --format freeze
Other Commands
- delete package cache:
pip cache purge
Install and update Packages from a File
For pip you can create so-called requirements files. These files just list one package per line. Packages from this file can be installed with pip install -r <requirements.txt> and updated with pip install -r <requirements.txt> -U. The update only makes sense when you do not specify a version number with the package.
Since pip does not support an “update all” mechanism, this is a good way to install and update the needed packages.
To add a package from GIT just add git+<https_git_clone_link> instead of the normal package name.
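A minimal requirements file might look like this (package names, pins, and the GIT URL are purely illustrative):

```
# requirements.txt - one package per line
# unpinned packages can be updated with: pip install -r requirements.txt -U
pandas
tqdm
# a version specifier limits which releases may be installed
requests>=2.28
# install development code straight from GIT
git+https://github.com/psf/requests.git
```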
Build PyPI Packages
- Python Packaging User Guide: https://packaging.python.org/
- Packaging Python Projects: https://packaging.python.org/tutorials/packaging-projects/
- Writing the Setup Script: https://docs.python.org/3.6/distutils/setupscript.html
- Classifiers: https://pypi.org/classifiers/
- PEP 440 - Version Identification and Dependency Specification: https://www.python.org/dev/peps/pep-0440/
- twine’s documentation: https://twine.readthedocs.io/en/latest/
- install_requires: https://setuptools.readthedocs.io/en/latest/setuptools.html?highlight=python_requires#declaring-dependencies
- python_requires: https://setuptools.readthedocs.io/en/latest/setuptools.html?highlight=python_requires#new-and-changed-setup-keywords
- setup.py examples:
- quick guide to creating a pip package and uploading it to PyPI:
python3 setup.py sdist bdist_wheel
twine upload dist/*
18 - PyCharm
Hotkeys
- bookmarks
- add anonymous bookmark: F3
- view bookmarks: command + 2 (Mac)
Settings
- use git credential manager: Settings -> Version Control -> Git -> activate “Use Credential helper”
19 - pyenv
Links
Commands
- list available Python versions:
pyenv install -l
- install new Python version:
pyenv install <version>
- update with brew:
brew upgrade pyenv
20 - Python Naming
This is about how things are named in Python.
Classes
- class object: the raw class itself (not instantiated) (see here)
- instance object or instance of a class: an instantiated class like x = MyClass() (see here)
- an instance object has two kinds of valid attribute names:
- data attributes: variables belonging to the instance; they need not be declared - like local variables, they spring into existence when they are first assigned to
- methods: a method is a function that “belongs to” an object
- class variables: for attributes (and methods) shared by all instances of a class (see here)
- instance variables: for data unique to each instance (see here)
- classmethod: see https://www.geeksforgeeks.org/class-method-vs-static-method-python/
- staticmethod: see https://www.geeksforgeeks.org/class-method-vs-static-method-python/
- constructor vs. initializer vs. allocator (__init__)
21 - REST API with Python
Links
- Flask: https://flask.palletsprojects.com/
- Deploy to Production: https://flask.palletsprojects.com/en/1.1.x/tutorial/deploy/
- Celery Background Tasks: https://flask.palletsprojects.com/en/1.1.x/patterns/celery/
- Multi-Processing in Flask: http://wiki.glitchdata.com/index.php/Flask:_Multi-Processing_in_Flask
- Thread Locals: https://flask.palletsprojects.com/en/1.1.x/design/#thread-locals
- Thread-Locals in Flask: https://flask.palletsprojects.com/en/1.1.x/advanced_foreword/#thread-locals-in-flask
- huey: https://huey.readthedocs.io/en/latest/index.html
- Mini-Huey: https://huey.readthedocs.io/en/latest/mini.html
- Huey Extensions: https://huey.readthedocs.io/en/latest/contrib.html
- greenlet: https://greenlet.readthedocs.io/
- multiprocessing: https://docs.python.org/3.6/library/multiprocessing.html
- What is the Python Global Interpreter Lock (GIL)? https://realpython.com/python-gil/
How-tos and Questions
- How can I listen a message queue in flask microservices: https://stackoverflow.com/questions/53901699/how-can-i-listen-a-message-queue-in-flask-microservices
- A scalable Keras + deep learning REST API: https://www.pyimagesearch.com/2018/01/29/scalable-keras-deep-learning-rest-api/
- Waitress Board: https://groups.google.com/forum/#!forum/pylons-discuss
22 - tqdm
Links
Usage
Simple usage:
from tqdm import tqdm

for i in tqdm(range(10000)):
    pass  # do something here
Manual usage:
from time import sleep
from tqdm import tqdm

with tqdm(total=100) as pbar:
    for i in range(10):
        sleep(0.1)
        pbar.update(10)
- automatically choose between console or notebook versions:
from tqdm.auto import tqdm
23 - Typing
Links
Types
- Any: Special type indicating an unconstrained type.
- Optional: Optional type; Optional[X] is equivalent to Union[X, None].
- Dict: e.g. Dict[str, int] or Dict[str, List[int]]
- Callable: e.g. Callable[[str], Dict[str, List[int]]]
Example
from typing import Callable, Dict, List, Optional

def __init__(
    self,
    tokenizer_func: Callable[[str], Dict[str, List[int]]],
    augmentation_func: Callable[[str], str],
    train_data_sampling_callback: Optional[Callable[[List[str]], List[int]]] = None,
):
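A self-contained sketch of the Callable annotation in action, with a toy function matching the tokenizer_func signature (all names and the tokenization scheme are invented for illustration):

```python
from typing import Callable, Dict, List

def toy_tokenizer(text: str) -> Dict[str, List[int]]:
    # toy stand-in: map the text to the list of its word lengths
    return {"lengths": [len(word) for word in text.split()]}

def apply_tokenizer(
    text: str,
    tokenizer_func: Callable[[str], Dict[str, List[int]]],
) -> Dict[str, List[int]]:
    # the annotation documents that tokenizer_func takes a str
    # and returns a dict of str -> list of int
    return tokenizer_func(text)

result = apply_tokenizer("hello typed world", toy_tokenizer)
```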