Conversation
Bumps [immutable](https://github.com/immutable-js/immutable-js) from 5.1.4 to 5.1.5. - [Release notes](https://github.com/immutable-js/immutable-js/releases) - [Changelog](https://github.com/immutable-js/immutable-js/blob/main/CHANGELOG.md) - [Commits](immutable-js/immutable-js@v5.1.4...v5.1.5) --- updated-dependencies: - dependency-name: immutable dependency-version: 5.1.5 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com>
Add src/lib/agents-chart/core/color-decisions.ts and update the corresponding ECharts code
…ble-5.1.5 Bump immutable from 5.1.4 to 5.1.5
Bumps [tornado](https://github.com/tornadoweb/tornado) from 6.5.4 to 6.5.5. - [Changelog](https://github.com/tornadoweb/tornado/blob/master/docs/releases.rst) - [Commits](tornadoweb/tornado@v6.5.4...v6.5.5) --- updated-dependencies: - dependency-name: tornado dependency-version: 6.5.5 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com>
Bumps [pyjwt](https://github.com/jpadilla/pyjwt) from 2.11.0 to 2.12.0. - [Release notes](https://github.com/jpadilla/pyjwt/releases) - [Changelog](https://github.com/jpadilla/pyjwt/blob/master/CHANGELOG.rst) - [Commits](jpadilla/pyjwt@2.11.0...2.12.0) --- updated-dependencies: - dependency-name: pyjwt dependency-version: 2.12.0 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com>
Fix color settings for ECharts and Chart.js
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Remove user: "0:0" override in docker-compose.yml — the Dockerfile already creates /home/appuser/.data_formulator and chowns it to appuser before switching to USER appuser, so the override was causing the app to run as root and write to /root/.data_formulator, bypassing the mounted volume entirely. Pass --user with host uid:gid to docker run in DockerSandbox so the sandbox container UID matches the host user that created the bind-mounted output directory. Without this, the non-root sandbox user cannot write the output parquet file, silently breaking all Docker sandbox executions.
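The `--user` fix above can be sketched in a few lines of Python (illustrative only: `sandbox_run_args`, the image name, and the `/sandbox/out` mount point are invented here, not the project's actual DockerSandbox API):

```python
import os

def sandbox_run_args(image: str, output_dir: str, command: list[str]) -> list[str]:
    """Build a `docker run` argv whose container user matches the host user.

    Without --user, a container's non-root user typically cannot write into
    a bind-mounted directory owned by the host uid, which is exactly the
    silent sandbox failure described above.
    """
    uid_gid = f"{os.getuid()}:{os.getgid()}"  # host uid:gid, e.g. "1000:1000"
    return [
        "docker", "run", "--rm",
        "--user", uid_gid,                                    # match host ownership of the mount
        "-v", f"{os.path.abspath(output_dir)}:/sandbox/out",  # bind-mounted output directory
        image,
        *command,
    ]
```

The argv can then be handed to `subprocess.run(..., check=True)`; building it as a list rather than a shell string avoids quoting issues.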
Fix color update issue
Bumps [pyasn1](https://github.com/pyasn1/pyasn1) from 0.6.2 to 0.6.3. - [Release notes](https://github.com/pyasn1/pyasn1/releases) - [Changelog](https://github.com/pyasn1/pyasn1/blob/main/CHANGES.rst) - [Commits](pyasn1/pyasn1@v0.6.2...v0.6.3) --- updated-dependencies: - dependency-name: pyasn1 dependency-version: 0.6.3 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com>
…K encoding support

Implement cross-platform file encoding handling, including:
1. Add a readFileText function on the frontend to handle UTF-8 and GBK encodings
…ion logic

- Add a trusted encoding detection set, optimize the GBK-first strategy
- Add integration tests to verify Chinese CSV file processing
- Improve the encoding detection fallback chain, finally falling back to latin-1
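A fallback chain of this shape can be sketched as follows (illustrative only — the project's trusted-encoding set and GBK-first heuristics are not reproduced here, and `read_text_best_effort` is an invented name):

```python
def read_text_best_effort(data: bytes) -> tuple[str, str]:
    """Decode file bytes via a trusted-encoding chain with a latin-1 backstop.

    Tries each trusted encoding in order; latin-1 is the guaranteed final
    fallback because it maps every possible byte value to a character.
    """
    for encoding in ("utf-8", "gbk"):  # trusted encodings, tried in order
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    return data.decode("latin-1"), "latin-1"  # latin-1 never raises
```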
…but different extensions

Add test cases to verify the following scenarios:
1. Mismatch between the frontend preview ID and the backend workspace table name
2. Whether files with the same name but different extensions can coexist after upload
3. Index mismatch between the file array and the table array
Explicitly specify the minimum version requirement for openpyxl in both requirements.txt and pyproject.toml to ensure dependency compatibility
…me processing logic

- Add an original table name field to preserve table names before backend processing and display them in the frontend
- Refactor table name processing logic, centralizing it in the table_names.py module
- Update the frontend interface to display original table names and source information
…mbed to support interactive features

- Remove static SVG caching and rendering logic; use vega-embed instead to enable chart interactivity
- Delete the no-longer-needed PNG export and Vega editor open features
- Simplify component state management, focusing on interactive chart display
Add a unified diagnostic-information builder, AgentDiagnostics, for all agent pipelines to centrally manage the returned JSON structure, ensuring a single schema definition is shared between frontend and backend. Refactored the diagnostics generation logic for DataRecAgent, DataTransformationAgent, and DataLoadAgent, removed duplicate code, and added diagnostic support for DataLoadAgent. Also added relevant unit tests to verify functionality.
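A single-schema builder of this kind can be sketched as a small dataclass (hypothetical — the project's real AgentDiagnostics surely carries different fields; `code_patched` is the field mentioned elsewhere in this log):

```python
from dataclasses import dataclass, field, asdict
from typing import Any

@dataclass
class AgentDiagnostics:
    """One shared schema for diagnostic payloads returned by all agents.

    Serializing through a single dataclass keeps the frontend and backend
    in agreement on keys, instead of each agent hand-building its own dict.
    """
    agent: str
    status: str = "ok"
    dialog: list[dict[str, Any]] = field(default_factory=list)
    code_patched: bool = False  # whether generated code was auto-patched
    errors: list[str] = field(default_factory=list)

    def to_json(self) -> dict[str, Any]:
        # The one place where the wire format is defined.
        return asdict(self)
```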
…to avoid redundant computation

Change system_prompt from a local variable to an instance variable (self.system_prompt), avoiding repeated string concatenation in both the constructor and the run method and improving code reusability
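The caching pattern above amounts to building the prompt once in the constructor and reusing it (a minimal sketch with an invented class name, not the project's actual agent):

```python
class PromptCachingAgent:
    """Sketch of the system_prompt refactor described above.

    The prompt string is concatenated exactly once in __init__ and stored
    on self, instead of being rebuilt as a local variable on every run().
    """
    def __init__(self, task_description: str, examples: list[str]):
        # Built once; every run() call reuses the cached string.
        self.system_prompt = "\n\n".join([task_description, *examples])

    def run(self, user_input: str) -> list[dict[str, str]]:
        return [
            {"role": "system", "content": self.system_prompt},
            {"role": "user", "content": user_input},
        ]
```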
…taTransformationAgent

- Introduced an ensure_output_variable_in_code function to automatically append missing output variables in generated code.
- Updated logging to provide clearer diagnostics on whether the output variable was patched.
- Modified AgentDiagnostics to include a new code_patched field for better tracking of code modifications.
…utput variable detection

- Updated the supplement_missing_block function to request only the missing JSON or code piece, improving success rates for smaller models.
- Enhanced the ensure_output_variable_in_code function to provide a deterministic local fix for output-variable assignment, optimizing performance before sandbox execution.
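A deterministic local fix of this kind can be sketched with the `ast` module (an illustration under assumptions — the project's real ensure_output_variable_in_code may use different heuristics for choosing the aliased variable):

```python
import ast

def ensure_output_variable_in_code(code: str, output_var: str = "output") -> tuple[str, bool]:
    """Patch generated code that forgot to assign the expected output variable.

    If the code never assigns `output_var`, append an assignment aliasing the
    last top-level assigned name, so the sandbox can pick up the result
    without another LLM round trip. Returns (code, was_patched).
    """
    tree = ast.parse(code)
    assigned: list[str] = []
    for node in tree.body:
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    assigned.append(target.id)
    if output_var in assigned:
        return code, False  # nothing to patch
    if not assigned:
        return code, False  # no candidate to alias; let the sandbox report the error
    patched = code.rstrip() + f"\n{output_var} = {assigned[-1]}\n"
    return patched, True
```

Running this before sandbox execution turns a whole class of "output variable missing" failures into a cheap string append instead of a retry.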
…ement, and reliability

This update consolidates the main 0.7 work after 48a2b11, covering data ingestion, table management, agent robustness, frontend UX, internationalization, server-side model management, and broader automated test coverage. It improves file parsing and encoding support, adds safer filename and metadata handling, strengthens derive/refine error recovery and diagnostics, upgrades visualization and upload interactions, introduces Chinese/English language support across the app, and enables globally managed server-side model configuration with better security boundaries. In addition, this range significantly expands both frontend and backend tests to protect key workflows such as Excel parsing, Unicode table names, multimodal fallback behavior, JSON serialization, global model APIs, and rendering safety. Overall, the changes move 0.7 from a set of isolated feature additions into a more complete, stable, and deployment-ready release.
… language instructions

- Add a multilingual prompt message "Maximum exploration steps reached" for the exploration feature
- Change the data agent's recommended sub-agent language instruction from full mode to concise mode
- Fix a status display issue in SimpleChartRecBox when maximum steps are reached
- Fix a language issue when user clarification is needed
Added support for 13 new languages, including Japanese, Korean, French, and German, and added special handling rules for Japanese
Avoid unnecessary chart insight requests when the auto chart insights configuration is turned off
…resh to be visible when generating reports

Add a capturedImages cache to resolve React 18 batch-rendering issues: during report generation, temporarily store captured chart images in the capturedImages object, then update them to Redux state in a single batch at the end, ensuring React 18 can process these updates in batches
### Detailed Changes

1. Internationalization enhancements
   - Added a multilingual prompt message "Maximum exploration steps reached" for the exploration feature
   - Changed the data agent's recommended sub-agent language instruction from full mode to concise mode
   - Added support for 13 new languages, including Japanese, Korean, French, and German
   - Added special handling rules for Japanese
   - Fixed a language issue when user clarification is needed
   - Fixed a status display issue in SimpleChartRecBox when maximum steps are reached
2. Performance optimization
   - Avoid unnecessary chart insight requests when the auto chart insights configuration is turned off
3. Bug fixes
   - Fixed an issue where charts sometimes required a browser refresh to become visible when generating reports
   - Added a capturedImages cache to resolve React 18 batch-rendering issues
   - During report generation, captured chart images are temporarily stored in the capturedImages object and updated to Redux state in a single batch at the end, ensuring React 18 can process these updates properly
- Add error message sanitization across multiple routes to prevent sensitive information leakage
- Remove duplicate custom sanitization logic and unify usage of the new sanitize_error_message function
- Replace json.dumps with jsonify to maintain consistent response formatting
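A sanitizer of this shape can be sketched as follows (hypothetical — the project's actual sanitize_error_message rules are not shown in this log; the patterns and placeholder strings here are invented):

```python
import re

def sanitize_error_message(message: str) -> str:
    """Redact common sensitive fragments from an exception string.

    Masks filesystem paths, IP addresses, and anything resembling a
    key/token, then caps the length so stack-trace tails cannot leak.
    """
    sanitized = re.sub(r"(/[\w.\-]+)+", "[path]", message)             # POSIX paths
    sanitized = re.sub(r"[A-Za-z]:\\[^\s'\"]+", "[path]", sanitized)   # Windows paths
    sanitized = re.sub(r"\b\d{1,3}(\.\d{1,3}){3}\b", "[ip]", sanitized)
    sanitized = re.sub(r"\b(sk|key|token)[-_][\w\-]+", "[secret]", sanitized, flags=re.I)
    return sanitized[:300]  # bound the message length
```

Note that, as the CodeQL findings below this commit point out, even a sanitized exception string is weaker than logging server-side and returning a fixed generic message.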
- Remove package-lock.json in favor of yarn.lock - Update yarn.lock with latest dependency resolutions
Prevent npm lock file from being committed to version control as we use yarn
chore: migrate to yarn and enhance security error handling
```diff
     result = {'status': 'error'}

-    return json.dumps(result)
+    return jsonify(result)
```
Check warning — Code scanning / CodeQL: Information exposure through an exception (Medium)

Copilot Autofix:
In general, to fix this category of problem you should avoid returning exception messages or stack traces (even “sanitized” ones) directly to clients. Instead, log the full exception on the server for debugging, and send only a generic, pre-defined error message, optionally with a simple status or code that does not contain implementation details.
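The advice above boils down to one reusable pattern: log the full traceback server-side, return static text to the client. A minimal framework-agnostic sketch (the function and message names here are illustrative, not the project's code):

```python
import logging

logger = logging.getLogger("routes")

def handle_request(operation) -> tuple[dict, int]:
    """Run an operation and return (response_body, status_code).

    On failure, the full exception (with traceback) goes to the server
    log only; the client receives a fixed message with no implementation
    details — no str(e), sanitized or otherwise.
    """
    try:
        return {"status": "ok", "result": operation()}, 200
    except Exception:
        logger.exception("request failed")  # traceback stays server-side
        return {"status": "error", "message": "An internal error occurred"}, 500
```

In a Flask route the dict would simply be passed through `jsonify` before returning.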
For this specific case, the problematic flow is in test_model:
```python
    except Exception as e:
        logger.warning(f"Error testing model {content['model'].get('id', '')}: {e}")
        is_global = content['model'].get('is_global', False)
        result = {
            "model": content['model'],
            "status": 'error',
            "message": "Connection failed, please check server configuration" if is_global
                else sanitize_model_error(str(e)),
        }
```

The best fix with minimal functional change is:

- Keep logging the error server-side (possibly upgrade to `logger.exception` so the stack trace is recorded in logs).
- Stop passing `str(e)` through `sanitize_model_error` to the client.
- Replace the exception-derived user message with a generic message that does not depend on `e`. We can still differentiate global vs non-global if needed, but both should use safe, static text.

We do not need to change `sanitize_error_message` or the alias `sanitize_model_error` themselves; they may be used safely elsewhere. We only need to change the construction of `result` in the `except` block of `test_model` in `py-src/data_formulator/agent_routes.py`. No new imports or helpers are required.
```diff
@@ -234,13 +234,15 @@
             "message": ""
         }
     except Exception as e:
-        logger.warning(f"Error testing model {content['model'].get('id', '')}: {e}")
+        # Log full details server-side, but return only a generic message to the client.
+        logger.exception(f"Error testing model {content['model'].get('id', '')}: {e}")
         is_global = content['model'].get('is_global', False)
         result = {
             "model": content['model'],
             "status": 'error',
-            "message": "Connection failed, please check server configuration" if is_global
-            else sanitize_model_error(str(e)),
+            "message": "Connection failed, please check server configuration"
+            if is_global
+            else "Model test failed due to an internal error.",
         }
     else:
         result = {'status': 'error'}
```
```diff
     except Exception as e:
         logger.error(f"Failed to open workspace: {e}")
-        return jsonify(status="error", message=str(e)), 500
+        return jsonify(status="error", message=sanitize_error_message(str(e))), 500
```
Check warning — Code scanning / CodeQL: Information exposure through an exception (Medium)

Copilot Autofix:
In general, to fix this kind of issue you should avoid returning exception details to the client. Instead, log the full exception (including stack trace) on the server and send back a generic, non-sensitive error message such as “An internal error has occurred” or something similarly high-level. If you want to expose some context to the client, it should be a controlled, static message, not derived from Exception text.
For this specific code, the simplest, non-breaking change is to adjust the open_workspace route’s except block. We should keep logging the rich error detail on the server, but stop using sanitize_error_message(str(e)) in the HTTP response. Instead, return a fixed, generic error string. Because other endpoints already use a more structured sanitize_db_error_message pattern, this endpoint can simply respond with a generic message like "Failed to open workspace" or "An internal server error occurred while opening the workspace." without affecting upstream logic (clients are already checking status and possibly message as a human-readable string). We do not need to modify sanitize_error_message itself for this finding.
Concretely:

- In `py-src/data_formulator/tables_routes.py`, inside `open_workspace`, update the `except` block at lines 179–181:
  - Keep `logger.error(f"Failed to open workspace: {e}")` so developers see the details.
  - Change the `return jsonify(status="error", message=sanitize_error_message(str(e))), 500` call to instead return a fixed string, e.g. `message="Failed to open workspace"`.
- No additional imports or helper methods are required; we are only changing the error message content.
```diff
@@ -178,7 +178,7 @@
         return jsonify(status="ok", path=home_path)
     except Exception as e:
         logger.error(f"Failed to open workspace: {e}")
-        return jsonify(status="error", message=sanitize_error_message(str(e))), 500
+        return jsonify(status="error", message="Failed to open workspace"), 500


 @tables_bp.route('/list-tables', methods=['GET'])
```
```diff
         df = pd.DataFrame(json.loads(raw_data))
     except Exception as e:
-        return jsonify({"status": "error", "message": f"Invalid JSON data: {str(e)}, it must be a list of dictionaries"}), 400
+        return jsonify({"status": "error", "message": f"Invalid JSON data: {sanitize_error_message(str(e))}, it must be a list of dictionaries"}), 400
```
Check warning — Code scanning / CodeQL: Information exposure through an exception (Medium)

Copilot Autofix:
In general, to fix information exposure via exceptions, you should avoid returning raw (or semi-sanitized) exception messages to clients. Instead, log the detailed exception, including stack trace, on the server, and send back a generic, stable error message that does not depend on str(e) or any other implementation detail. If you wish to surface some context (e.g., “invalid JSON”), use a static message or one derived from validated input, not from the exception object.
For this specific case in `create_table` in `py-src/data_formulator/tables_routes.py`, the best minimal fix is:

- In the `except Exception as e:` block around `json.loads(raw_data)`, stop embedding `sanitize_error_message(str(e))` in the client-facing message.
- Instead:
  - Log the full exception and stack trace on the server using the module logger, e.g. `logger.exception(...)` or `logger.error(..., exc_info=True)`.
  - Return a generic error string such as "Invalid JSON data, it must be a list of dictionaries", without including `str(e)` at all.
- This preserves existing functionality (the route still returns a 400 error indicating invalid JSON) while removing any dependence on attacker-influenced exception text.

No changes are needed to `sanitize_error_message` itself for this fix.

Concretely:

- Edit the `except Exception as e:` block around lines 464–467 in `create_table` to:
  - Add a logging call to record the exception.
  - Replace the existing `jsonify({"status": "error", "message": f"...{sanitize_error_message(str(e))}..."})` with a variant that does not reference `e` or `sanitize_error_message`.

No additional imports or new helper methods are required; you can reuse the existing logger instance.
```diff
@@ -464,7 +464,11 @@
     try:
         df = pd.DataFrame(json.loads(raw_data))
     except Exception as e:
-        return jsonify({"status": "error", "message": f"Invalid JSON data: {sanitize_error_message(str(e))}, it must be a list of dictionaries"}), 400
+        logger.exception("Failed to parse raw_data as JSON when creating table.")
+        return jsonify({
+            "status": "error",
+            "message": "Invalid JSON data, it must be a list of dictionaries",
+        }), 400
     workspace.write_parquet(df, sanitized_table_name)
     row_count = len(df)
     columns = list(df.columns)
```
```python
    except Exception as e:
        logger.error("Error parsing file", exc_info=True)
        return jsonify({"status": "error", "message": sanitize_error_message(str(e))}), 400
```
Check warning — Code scanning / CodeQL: Information exposure through an exception (Medium)

Copilot Autofix:
General approach: avoid sending any content derived from the exception object back to the client. Instead, log the exception (with stack trace) on the server and return a generic, user-friendly error message that does not depend on e. The sanitizer can still be used elsewhere if needed, but for this endpoint we should not expose the parsed exception text.
Concrete best fix here: in parse_file’s except block (lines 540–542 in py-src/data_formulator/tables_routes.py), keep the logging statement as is (it already logs with exc_info=True), but replace the JSON response so that "message" is a fixed, generic string such as "Failed to parse file", independent of e. This change preserves existing functionality (client still receives a 400 with an error status) while eliminating any residual risk of leaking stack-trace or internal details via the exception string.
Required changes:

- File: `py-src/data_formulator/tables_routes.py`
  - In the `parse_file` function's `except Exception as e:` block, update the return statement on line 542 to use a static message instead of `sanitize_error_message(str(e))`.
- No changes are required in `py-src/data_formulator/sanitize.py` for this specific issue.
- No new imports or helper methods are needed.
```diff
@@ -539,7 +539,7 @@

     except Exception as e:
         logger.error("Error parsing file", exc_info=True)
-        return jsonify({"status": "error", "message": sanitize_error_message(str(e))}), 400
+        return jsonify({"status": "error", "message": "Failed to parse file"}), 400


 @tables_bp.route('/sync-table-data', methods=['POST'])
```
Add openpyxl and xlrd packages for Excel file read/write functionality, and add pytest as dev dependency to support testing
build: update i18next dependency to pinned version 25.8.19

Update i18next from ^25.8.13 to pinned version 25.8.19 to ensure dependency consistency
This pull request introduces comprehensive Docker support for Data Formulator, improves developer experience, and updates documentation and configuration for easier deployment and internationalization. The main changes include adding Docker and Docker Compose files, updating documentation to guide users on Docker usage, enhancing environment and ignore files for containerization, and adding i18n-related dependencies. There are also updates to default model configurations and improvements to the README for clarity and developer onboarding.
Dockerization & Deployment:

- Added a `Dockerfile` and `docker-compose.yml` for building and running Data Formulator with persistent workspace storage and health checks. [1] [2]
- Added a `.dockerignore` file to optimize the Docker build context and exclude unnecessary files.

Documentation & Developer Experience:

- Added `DEVELOPMENT.md` and updated `README.md` with detailed Docker usage instructions, quickstart guides, and clarified installation options. Also improved developer onboarding messaging. [1] [2] [3] [4]
- Added `.vscode/settings.json` to streamline Python development and common terminal tasks.

Configuration & Environment:

- Updated `.env.template` with new options for logging, data directory, and UI languages. Updated the default LLM model lists for the OpenAI, Azure, and Ollama providers. [1] [2] [3]

Frontend Internationalization:

- Added `i18next`, `i18next-browser-languagedetector`, and `react-i18next` to the dependencies in `package.json` to support future UI localization. [1] [2]
- Updated the `package.json` scripts to include `vitest` for testing.

This pull request introduces Docker support for the Data Formulator project, making it easier to run the application without local Python or Node.js setup. It also refactors CORS handling for better security and configuration, and includes several improvements and fixes to agent logic and logging. The changes span new Docker-related files, Python backend adjustments, and updates to agent code for more accurate metadata and logging.

Dockerization and Development Environment:

- Added a `Dockerfile` with a multi-stage build to bundle the frontend and backend, and a `docker-compose.yml` for easy orchestration and persistent workspace data. A `.dockerignore` is included to optimize builds. [1] [2] [3]
- Added `DEVELOPMENT.md` with Docker usage instructions and caveats about sandboxing in containerized environments.

Backend API and CORS Handling:

- Refactored CORS handling into an `@after_request` handler in `agent_routes.py`, removing duplicated and insecure `Access-Control-Allow-Origin: *` headers from individual endpoints. Now, CORS is controlled via the `CORS_ORIGIN` environment variable. [1] [2] [3] [4] [5] [6] [7] [8] [9] [10]

Agent Logic and Metadata Improvements:

- Updated `agent_data_rec.py` and `agent_data_transform.py` to include LLM token usage and clarify timing breakdowns. [1] [2]

Other Improvements:
These changes collectively improve deployment flexibility, security, and developer experience, while also enhancing the correctness and observability of agent operations.