Pandas

Create Dataframe

data = {"col1": [1, 2], "col2": [3, 4]}
df = pd.DataFrame(data=data)

Load and Save

  • save to CSV: df.to_csv("path_or_buffer")

  • save to CSV (without row names / index): df.to_csv("path_or_buffer", index=False)

  • load from CSV:

df = pd.read_csv(
    "path_or_buffer",
    sep=";",
    encoding="us-ascii",
    usecols=col_list,
    nrows=number_of_rows_to_read,
    low_memory=False,
    quoting=csv.QUOTE_NONE,
)

Display Data

  • count values in column (without NaN values): df["col_name"].value_counts()

  • count values in column (with NaN values): df["col_name"].value_counts(dropna=False)

Delete Data

  • delete column inline

    • df.drop("column_name", axis=1, inplace=True)

    • column_name can also be a list of str

  • remove rows on condition: df.drop(df[df["col_name"] == condition].index, inplace=True)

  • remove duplicates

    • keep first (inplace): df.drop_duplicates(inplace=True, keep="first")

    • only consider certain columns to identify duplicates, keep first (inplace): df.drop_duplicates(list_of_cols, inplace=True, keep="first")

Modify Data

  • sort Data

    • low to high values: df.sort_values("column_name", inplace=True)

    • high to low values: df.sort_values("column_name", ascending=False, inplace=True)

Combine Data

Stack two Dataframes

Never forget to ignore_index or you have duplicate index values and bad things might happen later!

df = pd.concat([df_01, df_02], ignore_index=True)

Display Settings

Examples for display settings:

pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", None)

Filter nan Values

nan == nan is always false. That is why we can not use == to check for nan-values. Use pd.isnull(obj : scalar or array-like) instead or isnull(). Examples:

df.loc[pd.isnull(df["col"])]
df[df["col"].isnull()]

Other

  • rename columns: df.rename(columns={"a": "x"}, inplace=True)