    Building a Stable Fable 5 Traces Workflow in Colab: Parsing Tool Calls, Auditing Data, and Training Baselines

    June 28, 2026 · 5 Mins Read


    rprint(Panel.fit("[bold]Baseline 1: Predict output_type from context using pure Python Naive Bayes[/bold]"))
    model_artifacts = {}
    classifier_df = df.dropna(subset=["output_type"]).copy()
    classifier_df = classifier_df[
        classifier_df["output_type"].astype(str).str.len() > 0
    ].copy()
    if classifier_df["output_type"].nunique() >= 2 and len(classifier_df) >= 30:
        X_text = (
            classifier_df["context"]
            .fillna("")
            .astype(str)
            .map(lambda text: text[:12000])
            .tolist()
        )
        y = classifier_df["output_type"].astype(str).tolist()
        train_indices, test_indices = stratified_train_test_indices(y, test_size=0.2, seed=SEED)
        X_train = [X_text[i] for i in train_indices]
        y_train = [y[i] for i in train_indices]
        X_test = [X_text[i] for i in test_indices]
        y_test = [y[i] for i in test_indices]
        output_type_classifier = PureMultinomialNB(
            max_features=20000,
            min_df=2,
            alpha=1.0,
        )
        output_type_classifier.fit(X_train, y_train)
        predictions = output_type_classifier.predict(X_test)
        output_type_metrics, output_report_df = evaluate_predictions(y_test, predictions)
        output_matrix_df = confusion_matrix_df(y_test, predictions)
        output_type_metrics["train_rows"] = len(X_train)
        output_type_metrics["test_rows"] = len(X_test)
        output_type_metrics["vocab_size"] = len(output_type_classifier.vocab)
        rprint("[bold]Output type classifier report:[/bold]")
        display(output_report_df)
        display(output_matrix_df)
        output_report_df.to_csv(OUT_DIR / "output_type_classifier_report.csv", index=False)
        output_matrix_df.to_csv(OUT_DIR / "output_type_confusion_matrix.csv")
        top_token_records = []
        for label in output_type_classifier.labels:
            for token, margin in output_type_classifier.top_tokens_for_class(label, n=25):
                top_token_records.append(
                    {
                        "label": label,
                        "token": token,
                        "score_margin": margin,
                    }
                )
        pd.DataFrame(top_token_records).to_csv(
            OUT_DIR / "output_type_top_tokens.csv",
            index=False,
        )
        with open(
            OUT_DIR / "output_type_classifier_metrics.json",
            "w",
            encoding="utf-8",
        ) as file:
            json.dump(output_type_metrics, file, ensure_ascii=False, indent=2)
        model_artifacts["output_type_classifier_metrics"] = str(
            OUT_DIR / "output_type_classifier_metrics.json"
        )
        model_artifacts["output_type_classifier_report"] = str(
            OUT_DIR / "output_type_classifier_report.csv"
        )
        model_artifacts["output_type_confusion_matrix"] = str(
            OUT_DIR / "output_type_confusion_matrix.csv"
        )
        model_artifacts["output_type_top_tokens"] = str(
            OUT_DIR / "output_type_top_tokens.csv"
        )
    else:
        rprint(
            "[yellow]Skipping output_type classifier because there are too few "
            "classes or rows.[/yellow]"
        )
        output_type_metrics = {}
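
The baseline above calls `stratified_train_test_indices`, which is defined outside this excerpt. As a rough illustration of what such a helper might do, here is a minimal standard-library sketch. The function name `stratified_split_indices` and its rules for small classes are assumptions made for this example, not the tutorial's actual implementation.

```python
import random
from collections import defaultdict


def stratified_split_indices(labels, test_size=0.2, seed=42):
    """Split row indices into train/test while keeping class proportions.

    Hypothetical stand-in for the tutorial's helper; the real one may differ.
    """
    rng = random.Random(seed)

    # Group row indices by their label.
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    train_indices, test_indices = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        if len(indices) < 2:
            # A singleton class stays entirely in the training set.
            n_test = 0
        else:
            # Hold out at least one row, but never the whole class.
            n_test = min(max(1, round(len(indices) * test_size)), len(indices) - 1)
        test_indices.extend(indices[:n_test])
        train_indices.extend(indices[n_test:])

    # Shuffle so rows of the same class are not contiguous.
    rng.shuffle(train_indices)
    rng.shuffle(test_indices)
    return train_indices, test_indices
```

Splitting per class rather than over the whole list keeps rare labels represented in both partitions, which matters when one `output_type` dominates the traces.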
    rprint(Panel.fit("[bold]Baseline 2: Predict tool_name from context using pure Python Naive Bayes[/bold]"))
    tool_classifier_df = df[
        df["output_type"].eq("tool_use")
        & df["tool_name"].fillna("").astype(str).str.len().gt(0)
    ].copy()
    if len(tool_classifier_df) >= 50 and tool_classifier_df["tool_name"].nunique() >= 2:
        top_tools = tool_classifier_df["tool_name"].value_counts().head(12).index.tolist()
        tool_classifier_df["tool_label"] = tool_classifier_df["tool_name"].where(
            tool_classifier_df["tool_name"].isin(top_tools),
            "__OTHER__",
        )
        y_tool = tool_classifier_df["tool_label"].astype(str).tolist()
        X_tool_text = (
            tool_classifier_df["context"]
            .fillna("")
            .astype(str)
            .map(lambda text: text[:12000])
            .tolist()
        )
        if len(set(y_tool)) >= 2:
            train_indices, test_indices = stratified_train_test_indices(y_tool, test_size=0.2, seed=SEED)
            X_train = [X_tool_text[i] for i in train_indices]
            y_train = [y_tool[i] for i in train_indices]
            X_test = [X_tool_text[i] for i in test_indices]
            y_test = [y_tool[i] for i in test_indices]
            tool_classifier = PureMultinomialNB(
                max_features=20000,
                min_df=2,
                alpha=1.0,
            )
            tool_classifier.fit(X_train, y_train)
            tool_predictions = tool_classifier.predict(X_test)
            tool_metrics, tool_report_df = evaluate_predictions(y_test, tool_predictions)
            tool_matrix_df = confusion_matrix_df(y_test, tool_predictions)
            tool_metrics["train_rows"] = len(X_train)
            tool_metrics["test_rows"] = len(X_test)
            tool_metrics["vocab_size"] = len(tool_classifier.vocab)
            rprint("[bold]Tool classifier report:[/bold]")
            display(tool_report_df)
            display(tool_matrix_df)
            tool_report_df.to_csv(OUT_DIR / "tool_name_classifier_report.csv", index=False)
            tool_matrix_df.to_csv(OUT_DIR / "tool_name_confusion_matrix.csv")
            top_tool_token_records = []
            for label in tool_classifier.labels:
                for token, margin in tool_classifier.top_tokens_for_class(label, n=25):
                    top_tool_token_records.append(
                        {
                            "label": label,
                            "token": token,
                            "score_margin": margin,
                        }
                    )
            pd.DataFrame(top_tool_token_records).to_csv(
                OUT_DIR / "tool_name_top_tokens.csv",
                index=False,
            )
            with open(
                OUT_DIR / "tool_name_classifier_metrics.json",
                "w",
                encoding="utf-8",
            ) as file:
                json.dump(tool_metrics, file, ensure_ascii=False, indent=2)
            model_artifacts["tool_name_classifier_metrics"] = str(
                OUT_DIR / "tool_name_classifier_metrics.json"
            )
            model_artifacts["tool_name_classifier_report"] = str(
                OUT_DIR / "tool_name_classifier_report.csv"
            )
            model_artifacts["tool_name_confusion_matrix"] = str(
                OUT_DIR / "tool_name_confusion_matrix.csv"
            )
            model_artifacts["tool_name_top_tokens"] = str(
                OUT_DIR / "tool_name_top_tokens.csv"
            )
        else:
            rprint("[yellow]Skipping tool classifier because labels collapsed to one class.[/yellow]")
            tool_metrics = {}
    else:
        rprint(
            "[yellow]Skipping tool classifier because there are too few tool-use "
            "rows or tool classes.[/yellow]"
        )
        tool_metrics = {}
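
Both baselines rely on a `PureMultinomialNB` class that is defined outside this excerpt. To make the role of the `alpha=1.0` argument concrete, the sketch below shows one way a multinomial Naive Bayes classifier with additive (Laplace) smoothing could be written using only the standard library. The class name `TinyMultinomialNB`, its tokenizer, and the omission of `max_features` and `min_df` are all assumptions for illustration; this is not the tutorial's class.

```python
import math
import re
from collections import Counter


class TinyMultinomialNB:
    """Hypothetical minimal multinomial Naive Bayes over word counts."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # additive smoothing strength

    @staticmethod
    def tokenize(text):
        # Assumed tokenizer: lowercase runs of letters, digits, underscores.
        return re.findall(r"[a-z0-9_]+", str(text).lower())

    def fit(self, texts, labels):
        self.labels = sorted(set(labels))
        self.token_counts = {label: Counter() for label in self.labels}
        class_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.token_counts[label].update(self.tokenize(text))
        self.vocab = set()
        for counts in self.token_counts.values():
            self.vocab.update(counts)
        total_docs = len(labels)
        self.log_prior = {
            label: math.log(class_counts[label] / total_docs)
            for label in self.labels
        }
        self.total_tokens = {
            label: sum(self.token_counts[label].values())
            for label in self.labels
        }
        return self

    def predict_one(self, text):
        # Tokens never seen during training are ignored.
        tokens = [t for t in self.tokenize(text) if t in self.vocab]
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label in self.labels:
            denominator = self.total_tokens[label] + self.alpha * vocab_size
            score = self.log_prior[label]
            for token in tokens:
                smoothed = self.token_counts[label][token] + self.alpha
                score += math.log(smoothed / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    def predict(self, texts):
        return [self.predict_one(text) for text in texts]
```

With `alpha=1.0`, a token that never co-occurred with a class still receives a small nonzero probability, so a single unseen word cannot drive a class score to negative infinity.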
    rprint(Panel.fit("[bold]Building simple keyword search helper[/bold]"))
    def search_rows(keyword, limit=5, search_cols=("context", "cot", "completion", "text_payload")):
        keyword = str(keyword).lower()
        mask = pd.Series(False, index=df.index)
        for column in search_cols:
            mask = mask | (
                df[column]
                .fillna("")
                .astype(str)
                .str.lower()
                .str.contains(re.escape(keyword), regex=True)
            )
        hits = df[mask].head(limit)
        results = []
        for _, row in hits.iterrows():
            results.append(
                {
                    "uid": row.get("uid"),
                    "session": row.get("session"),
                    "output_type": row.get("output_type"),
                    "tool_name": row.get("tool_name"),
                    "context_preview": preview_text(row.get("context"), 400),
                    "payload_preview": preview_text(row.get("text_payload"), 400),
                }
            )
        return results
    example_queries = [
        "Bash",
        "Write",
        "browser",
        "test",
        "README",
    ]
    search_demo = {
        query: search_rows(query, limit=2)
        for query in example_queries
    }
    with open(
        OUT_DIR / "keyword_search_demo.json",
        "w",
        encoding="utf-8",
    ) as file:
        json.dump(search_demo, file, ensure_ascii=False, indent=2)
    rprint("[bold]Example keyword search results:[/bold]")
    rprint(safe_json_dumps(search_demo, max_chars=5000))
    summary = {
        "dataset_id": DATASET_ID,
        "flat_jsonl_filename": FLAT_JSONL_FILENAME,
        "output_directory": str(OUT_DIR),
        "repo_file_summary": file_summary,
        "rows": int(len(df)),
        "columns": list(df.columns),
        "output_type_distribution": (
            df["output_type"]
            .fillna("missing")
            .value_counts()
            .to_dict()
        ),
        "top_tools": (
            df.loc[df["output_type"].eq("tool_use"), "tool_name"]
            .replace("", "unknown")
            .value_counts()
            .head(20)
            .to_dict()
        ),
        "top_source_roots": (
            df["source_root"]
            .fillna("unknown")
            .value_counts()
            .head(20)
            .to_dict()
        ),
        "length_summary": {
            column: {
                "mean": float(df[column].mean()),
                "median": float(df[column].median()),
                "p90": float(df[column].quantile(0.90)),
                "p95": float(df[column].quantile(0.95)),
                "max": int(df[column].max()),
            }
            for column in [
                "context_chars",
                "cot_chars",
                "completion_chars",
                "text_payload_chars",
            ]
        },
        "possible_secret_rows": int(df["possible_secret_anywhere"].sum()),
        "plots": plot_paths,
        "model_artifacts": model_artifacts,
        "safe_exports": {
            "train": str(OUT_DIR / "fable5_no_cot_chat_train.jsonl"),
            "validation": str(OUT_DIR / "fable5_no_cot_chat_validation.jsonl"),
            "test": str(OUT_DIR / "fable5_no_cot_chat_test.jsonl"),
        },
        "analysis_files": {
            "csv": str(OUT_DIR / "fable5_analysis_index.csv"),
            "pickle": str(OUT_DIR / "fable5_analysis_index.pkl"),
            "keyword_search_demo": str(OUT_DIR / "keyword_search_demo.json"),
        },
    }
    with open(
        OUT_DIR / "analysis_summary.json",
        "w",
        encoding="utf-8",
    ) as file:
        json.dump(clean_for_json(summary), file, ensure_ascii=False, indent=2, default=str)
    FENCE = chr(96) * 3
    report_md = (
        "# Fable 5 Traces Advanced Tutorial Report\n\n"
        "## Dataset\n\n"
        f"- Dataset: `{DATASET_ID}`\n"
        f"- Flat JSONL: `{FLAT_JSONL_FILENAME}`\n"
        f"- Rows loaded: `{len(df):,}`\n"
        f"- Unique source sessions: `{df['session'].nunique(dropna=True):,}`\n"
        f"- Unique models: `{df['model'].nunique(dropna=True):,}`\n\n"
        "## Important safety note\n\n"
        "This tutorial treats the dataset as agent telemetry. It previews and analyzes commands, "
        "tool calls, file edits, and transcript text, but it never executes commands found inside "
        "the traces.\n\n"
        f"Potential secret-like patterns detected: `{int(df['possible_secret_anywhere'].sum()):,}` rows.\n"
        "Exports redact common API-key/token-like patterns.\n\n"
        "## Output type distribution\n\n"
        f"{FENCE}json\n"
        f"{json.dumps(clean_for_json(summary['output_type_distribution']), indent=2, ensure_ascii=False)}\n"
        f"{FENCE}\n\n"
        "## Top tools\n\n"
        f"{FENCE}json\n"
        f"{json.dumps(clean_for_json(summary['top_tools']), indent=2, ensure_ascii=False)}\n"
        f"{FENCE}\n\n"
        "## Saved files\n\n"
        "- `analysis_summary.json`\n"
        "- `fable5_analysis_index.csv`\n"
        "- `fable5_analysis_index.pkl`\n"
        "- `fable5_no_cot_chat_train.jsonl`\n"
        "- `fable5_no_cot_chat_validation.jsonl`\n"
        "- `fable5_no_cot_chat_test.jsonl`\n"
        "- plot PNG files\n"
        "- baseline classifier metrics, when enough rows/classes are available\n\n"
        "## Recommended next steps\n\n"
        "1. Inspect `fable5_no_cot_chat_train.jsonl` before any fine-tuning.\n"
        "2. Keep the dataset license in mind before model training or redistribution.\n"
        "3. Avoid training directly on raw terminal outputs without additional privacy and safety filtering.\n"
        "4. Start with the no-CoT chat export unless your research explicitly requires reasoning-trace supervision.\n"
    )
    with open(
        OUT_DIR / "REPORT.md",
        "w",
        encoding="utf-8",
    ) as file:
        file.write(report_md)
    rprint(
        Panel.fit(
            f"[bold green]Tutorial complete.[/bold green]\n\n"
            f"Artifacts saved in:\n{OUT_DIR}\n\n"
            f"Key files:\n"
            f"- {OUT_DIR / 'REPORT.md'}\n"
            f"- {OUT_DIR / 'analysis_summary.json'}\n"
            f"- {OUT_DIR / 'fable5_no_cot_chat_train.jsonl'}\n"
            f"- {OUT_DIR / 'fable5_analysis_index.csv'}",
            title="Done",
        )
    )
    display(
        pd.DataFrame(
            {
                "artifact": [
                    "Report",
                    "Summary JSON",
                    "No-CoT train export",
                    "No-CoT validation export",
                    "No-CoT test export",
                    "Analysis CSV",
                    "Analysis pickle",
                    "Keyword search demo",
                ],
                "path": [
                    str(OUT_DIR / "REPORT.md"),
                    str(OUT_DIR / "analysis_summary.json"),
                    str(OUT_DIR / "fable5_no_cot_chat_train.jsonl"),
                    str(OUT_DIR / "fable5_no_cot_chat_validation.jsonl"),
                    str(OUT_DIR / "fable5_no_cot_chat_test.jsonl"),
                    str(OUT_DIR / "fable5_analysis_index.csv"),
                    str(OUT_DIR / "fable5_analysis_index.pkl"),
                    str(OUT_DIR / "keyword_search_demo.json"),
                ],
            }
        )
    )
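
The generated report states that exports redact common API-key and token-like patterns, but the redaction code itself is not part of this excerpt. The sketch below illustrates the general idea with a few regular expressions. The pattern list, the `redact_secrets` name, and the `[REDACTED]` placeholder are assumptions for illustration only; real credential formats vary by provider, and a short pattern list like this will miss many of them.

```python
import re

# Hypothetical patterns for illustration; they are not the tutorial's list.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),      # "sk-" prefixed API-key-like strings
    re.compile(r"ghp_[A-Za-z0-9]{36}"),      # GitHub-token-like strings
    re.compile(r"AKIA[0-9A-Z]{16}"),         # AWS-access-key-ID-like strings
    re.compile(r"bearer\s+[A-Za-z0-9._\-]{20,}", re.IGNORECASE),
]


def redact_secrets(text, placeholder="[REDACTED]"):
    """Replace secret-like substrings and report how many were replaced."""
    redacted = str(text)
    hit_count = 0
    for pattern in SECRET_PATTERNS:
        redacted, replaced = pattern.subn(placeholder, redacted)
        hit_count += replaced
    return redacted, hit_count
```

Returning the hit count alongside the cleaned text makes it possible to flag rows for manual review instead of trusting the redaction silently.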


