Coding with Steven. Microsoft Data Formulator
https://www.youtube.com/watch?v=Yj9JVPUTGHE
This video introduces Microsoft’s Data Formulator, an AI-powered application designed to create rich data visualizations by leveraging a large language model (LLM) as its backend.
1. Setup and Installation:
- Steven outlines the installation process, which uses Python’s pip: `pip install data_formulator`.
- After installation, the tool can be launched using `data_formulator` or `python -m data_formulator`.
- A crucial step is configuring the AI model. Steven demonstrates using the Gemini API. Users need to obtain an API key from Google AI Studio, copy it, and paste it into Data Formulator. The specific model name used in the demo is `gemini-2.5-pro-exp-03-25`. A green status indicator confirms the model is ready.
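The install and launch steps above can also be scripted. The following is a minimal sketch and does not appear in the video: the helper names `is_installed` and `launch_command` are hypothetical, and the only assumption about the package is that it is importable as `data_formulator`, which the `python -m data_formulator` form implies.

```python
import importlib.util
import sys

# Module name taken from the `python -m data_formulator` command above.
PACKAGE = "data_formulator"


def is_installed(module_name: str = PACKAGE) -> bool:
    """Return True if the module can be located in the current environment."""
    return importlib.util.find_spec(module_name) is not None


def launch_command(module_name: str = PACKAGE) -> list[str]:
    """Build the equivalent of `python -m data_formulator` for this interpreter."""
    return [sys.executable, "-m", module_name]
```

If `is_installed()` returns `True`, passing `launch_command()` to `subprocess.call` should start the tool from the same interpreter. Otherwise, `pip install data_formulator` is the missing step.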
2. Key Features and Workflow:
- Users can load data from existing examples, local files, or directly from the clipboard.
- The interface prominently displays available data fields on the right-hand side.
- A unique feature is “Visualization challenges,” which provides pre-defined prompts categorized by difficulty (easy, medium, hard) to guide users in exploring their data.
- The tool supports various chart types, and for each, users can drag and drop data fields onto X/Y axes, legends, colors, and other visual properties.
3. Demonstrations:
Steven uses the “unemployment-across-industries” sample dataset to showcase Data Formulator’s capabilities:
- Scatter Plot: He attempts to create a scatter plot of `year` vs. `count`. The tool initially renders a bar-like chart, indicating that AI-generated results might require refinement or manual adjustment. He highlights that the underlying Python transformation code generated by the AI is accessible for inspection and understanding.
- Box Plot: A box plot of `year` vs. `count` (with aggregation set to average) is demonstrated. This visualization effectively shows the fluctuation of monthly unemployment counts per year.
- Auto-recommendation (Natural Language Query): Steven inputs the natural language query “trend of unemployment data” into the “Auto” chart type. The AI successfully generates a line chart plotting `date` against `rate`, with the different industry `series` as color-coded legends. He refers to this as a “functional data visualization.”
- Bar Chart: A bar chart is created plotting `year` on the X-axis and average `count` on the Y-axis. To add more detail, the `series` field (representing different industries) is used for color encoding. This provides a clear view of average monthly unemployment trends by year across various industries.
- Facets: Demonstrating advanced layout options, Steven drags `year` to the “columns” section, which separates the main chart into multiple smaller bar charts, each representing unemployment data for a specific year. This facilitates year-on-year comparisons.
- Heatmap: Steven requests a “correlation heat map of the average monthly unemployment rate by industries.” The AI transforms the data to calculate correlation coefficients between all industry pairs, presenting them in a new data table (`table-34`). The resulting heatmap visually represents these correlations, showing a diagonal of 1 (perfect self-correlation) and varying shades of blue for correlations between different industries.
- Custom Point (Bubble Chart): He uses the “Custom Point” chart type, setting `year` on X, `count` on Y, `month` to control opacity, and `count` to control bubble size. This demonstrates how multiple data dimensions can be incorporated into a single visualization.
- Dotted Line Chart: A dotted line chart is created with `month` on X, median `count` on Y, and `year` for color. This provides a functional visualization showing the median monthly unemployment count for each year as distinct lines, highlighting seasonal patterns or long-term trends.
4. Steven’s Key Takeaways:
- Understand Your Data: Like any data science task, a thorough understanding of the dataset is paramount before attempting visualizations.
- Careful Variable Arrangement: For complex visualizations or specific chart types, careful arrangement of variables on axes and within legends is crucial for meaningful results.
- LLM for Data Transformation: The “Formulate data” feature is powerful, allowing the LLM to generate Python code for custom data transformations (as seen with the correlation heatmap), which is then used to build the requested visualizations.
- Guided Exploration: The “Visualization challenges” on the lower right-hand side provide valuable guidance on what can be visualized with the current dataset, indicating the difficulty level of the AI’s task.
- Session Management: The tool allows users to reset, export, and import sessions, as well as add tables from the clipboard or files.
- Convenient Operators: Quick data field operators (count, sum, average, median, bin) are readily available for immediate use.
- Capabilities vs. Limitations: Data Formulator is not a full-fledged BI tool like Power BI or Excel. Its capabilities are limited, but it serves as a highly useful tool for quickly generating visualizations and corresponding code within a specific, confined analytical setting.
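The quick field operators mentioned above (count, sum, average, median) amount to group-and-aggregate steps. A minimal sketch of that idea follows, using made-up records and a hypothetical `aggregate` helper. It shows the concept behind charts such as average `count` by `year` and `series`, not Data Formulator's internal implementation, and it leaves out the bin operator.

```python
from collections import defaultdict
from statistics import mean, median

# Made-up illustrative records; NOT values from the real dataset.
records = [
    {"series": "Construction", "year": 2008, "month": 1, "count": 700},
    {"series": "Construction", "year": 2008, "month": 2, "count": 900},
    {"series": "Construction", "year": 2009, "month": 1, "count": 1400},
    {"series": "Government", "year": 2008, "month": 1, "count": 300},
    {"series": "Government", "year": 2008, "month": 2, "count": 500},
    {"series": "Government", "year": 2009, "month": 1, "count": 600},
]

# Operator names follow the list in the summary; the mapping itself is an assumption.
OPERATORS = {"count": len, "sum": sum, "average": mean, "median": median}


def aggregate(records, group_fields, value_field, operator):
    """Group records by `group_fields` and reduce `value_field` with the named operator."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[field] for field in group_fields)
        groups[key].append(rec[value_field])
    reduce_fn = OPERATORS[operator]
    return {key: reduce_fn(values) for key, values in groups.items()}


# Bar chart demo: average `count` per year, split by industry `series`.
avg_by_year_series = aggregate(records, ("year", "series"), "count", "average")

# Dotted line chart demo: median `count` per month, one line per year.
median_by_month_year = aggregate(records, ("month", "year"), "count", "median")
```

Note that `count` is both a field name in the dataset and an operator name, so the helper takes the two as separate arguments.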
Conclusion: Steven concludes by thanking viewers and encouraging them to like, share, and subscribe for more content on generative AI and large language model data analysis.