mds_2025_helper_functions.dataset_summary
=========================================

.. py:module:: mds_2025_helper_functions.dataset_summary


Functions
---------

.. autoapisummary::

   mds_2025_helper_functions.dataset_summary.dataset_summary


Module Contents
---------------

.. py:function:: dataset_summary(data)

   Generates a comprehensive summary of a dataset.

   This function analyzes the input DataFrame and provides insights, including:

   - Missing value statistics for each column
   - Counts of numerical and categorical features
   - Number of duplicate rows
   - Descriptive statistics for numerical features
   - Unique value counts for categorical features

   :param data: The dataset to analyze. Must be a pandas DataFrame.
   :type data: pandas.DataFrame

   :returns: A dictionary containing the following keys:

             - 'missing_values' (pd.DataFrame): Summary of missing values, including counts
               and percentages for each column.
             - 'feature_types' (dict): Counts of numerical and categorical features in the
               dataset. Format: {'numerical_features': int, 'categorical_features': int}.
             - 'duplicates' (int): The number of duplicate rows in the dataset.
             - 'numerical_summary' (pd.DataFrame): Descriptive statistics for numerical
               columns.
             - 'categorical_summary' (pd.DataFrame): Unique value counts for categorical
               columns.
   :rtype: dict

   :raises TypeError: If the input is not a pandas DataFrame.
   :raises ValueError: If the DataFrame is empty or contains unsupported data types.

   .. rubric:: Example

   >>> import pandas as pd
   >>> from mds_2025_helper_functions.dataset_summary import dataset_summary
   >>>
   >>> # Example dataset
   >>> data = {
   ...     'Name': ['Alice', 'Bob', 'Charlie', 'Alice', None],
   ...     'Age': [25, 32, 47, None, 29],
   ...     'Salary': [50000, 60000, 120000, None, 80000],
   ...     'Department': ['HR', 'Finance', 'IT', 'HR', 'Finance']
   ... }
   >>> df = pd.DataFrame(data)
   >>>
   >>> # Generate summary
   >>> summary = dataset_summary(df)
   >>>
   >>> # Access individual components of the summary
   >>> print(summary['missing_values'])       # Missing values per column
   >>> print(summary['feature_types'])        # Count of numerical and categorical features
   >>> print(summary['duplicates'])           # Number of duplicate rows
   >>> print(summary['numerical_summary'])    # Descriptive statistics for numerical columns
   >>> print(summary['categorical_summary'])  # Unique values for categorical columns
   >>>
   >>> # Interpretation for this example:
   >>> # 'missing_values' contains:
   >>> #        column  missing_count  missing_percentage
   >>> # 0        Name              1                20.0
   >>> # 1         Age              1                20.0
   >>> # 2      Salary              1                20.0
   >>> # 3  Department              0                 0.0
   >>> #
   >>> # 'feature_types' looks like:
   >>> # {'numerical_features': 2, 'categorical_features': 2}
   >>> #
   >>> # 'duplicates':
   >>> # 0 (no two rows in this data are identical in every column)
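The returned keys can be approximated with plain pandas calls. The following is a minimal sketch only, not the package's implementation: ``sketch_summary`` is a hypothetical name, the ``unique_values`` column label is invented for illustration, and treating every non-numeric column as categorical is an assumption.

```python
import pandas as pd


def sketch_summary(df):
    """Hypothetical stand-in that mirrors the documented return keys."""
    if not isinstance(df, pd.DataFrame):
        raise TypeError("Input must be a pandas DataFrame.")
    if df.empty:
        raise ValueError("The DataFrame is empty.")

    # Missing values: count and percentage per column.
    missing_count = df.isna().sum()
    missing = pd.DataFrame({
        "column": df.columns,
        "missing_count": missing_count.to_numpy(),
        "missing_percentage": (missing_count * 100 / len(df)).to_numpy(),
    })

    # Assumption: numeric dtypes are "numerical", everything else "categorical".
    numerical = df.select_dtypes(include="number")
    categorical = df.select_dtypes(exclude="number")

    # describe() raises on a frame without columns, so guard that case.
    if numerical.shape[1] > 0:
        numerical_summary = numerical.describe()
    else:
        numerical_summary = pd.DataFrame()

    return {
        "missing_values": missing,
        "feature_types": {
            "numerical_features": numerical.shape[1],
            "categorical_features": categorical.shape[1],
        },
        # duplicated() flags rows identical to an earlier row in every column.
        "duplicates": int(df.duplicated().sum()),
        "numerical_summary": numerical_summary,
        # "unique_values" is an invented column label.
        "categorical_summary": categorical.nunique().to_frame("unique_values"),
    }


example = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Alice", None],
    "Age": [25, 32, 47, None, 29],
    "Salary": [50000, 60000, 120000, None, 80000],
    "Department": ["HR", "Finance", "IT", "HR", "Finance"],
})
sketch = sketch_summary(example)
print(sketch["feature_types"])
print(sketch["duplicates"])
```

Running the sketch on the example data is a quick way to sanity-check the counts reported by ``dataset_summary`` itself, since each value comes from a single well-known pandas call.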