Data Flows & Data Preparation¶
Last updated: 2026-04-27 Tags: data flow, data preparation, ETL, transform, machine learning, output, dataset
π Full Oracle Documentation: Connecting OAC to Your Data β Curate Data Β· Visualizing Data β Data Flows
Summary¶
Data Flows are OAC's built-in ETL/ELT tool for preparing, cleansing, enriching, joining, and transforming data. They read from one or more sources (connections, datasets, Subject Areas), apply transformation steps, and write to an output (Dataset, Database, Object Storage). Data Flows can also apply Machine Learning models as pipeline steps.
What Is a Data Flow?¶
A Data Flow is a visual, step-by-step pipeline:
[Source 1] βββ
βββ [Join] β [Filter] β [Rename Columns] β [Add Calculation] β [Output: Dataset]
[Source 2] βββ
When to Use Data Flows¶
- Pre-join multiple tables before visualization
- Cleanse and standardize raw data (nulls, formats, case)
- Aggregate/summarize data for performance
- Apply ML models (scoring / prediction)
- Load results into a database table
- Schedule recurring data preparation jobs
Data Flow Sources¶
| Source Type | Description |
|---|---|
| Dataset | Existing OAC dataset |
| Connection (DB table) | Direct table from a registered connection |
| Subject Area | Logical SQL query result |
| Object Storage | OCI Object Storage file |
| Dropbox / Google Drive | Cloud file source |
Transformation Steps¶
Data Shaping¶
| Step | Purpose |
|---|---|
| Select Columns | Choose which columns to keep |
| Rename Columns | Change column display names |
| Add Columns | Add a calculated/derived column |
| Filter Rows | Keep or remove rows by condition |
| Sort | Order rows |
Data Cleaning¶
| Step | Purpose |
|---|---|
| Null Handling | Replace nulls with default value |
| Data Type Convert | Change column type (stringβdate, etc.) |
| Case Conversion | Upper, lower, title case |
| Trim Whitespace | Remove leading/trailing spaces |
Data Enrichment¶
| Step | Purpose |
|---|---|
| Add Custom Date | Extract year, quarter, month from date |
| Enrich (Geography) | Geocode address columns |
| Sentiment Analysis | Classify text column sentiment (positive/negative) |
| System Connection (ML) | Apply a trained OML model |
Data Integration¶
| Step | Purpose |
|---|---|
| Join | Inner, left, right, full outer join between two sources |
| Union Rows | Append rows from two sources (same schema) |
| Group By | Aggregate β SUM, COUNT, AVG by dimension |
| Branch | Split pipeline into two paths |
| Merge | Combine two branches back |
Machine Learning Steps¶
| Step | Purpose |
|---|---|
| Train Numeric Prediction | Train a regression model |
| Train Binary Classifier | Train a binary classification model |
| Train Multi-Classifier | Train a multi-class classification model |
| Train Clustering | Train a K-Means clustering model |
| Apply Model | Score rows using a trained model |
Output Targets¶
| Output | Description |
|---|---|
| Dataset | Save result as OAC in-memory dataset |
| Database Table | Write to a table in a registered connection |
| OCI Object Storage | Write as Parquet/CSV to OCI bucket |
| Dropbox / Google Drive | Write file to cloud storage |
Output Modes¶
- Replace: overwrite existing data
- Append: add rows to existing dataset
- Merge (Upsert): update matching rows, insert new ones
Scheduling Data Flows¶
Data Flows can run on a schedule:
- Open Data Flow β Schedule tab
- Set frequency: hourly, daily, weekly
- Set start time and timezone
- Optionally chain: run Data Flow B after Data Flow A completes
Sequences¶
A Sequence is an orchestration container that runs multiple Data Flows in order:
Sequence: "Daily ETL"
Step 1: Run "Load Sales Data" Data Flow
Step 2: Run "Calculate KPIs" Data Flow
Step 3: Run "Load to Reporting DB" Data Flow
Data Preparation (Dataset Editor)¶
Separate from Data Flows, the Dataset Editor provides in-situ data prep:
- Applied when creating/editing a Dataset
- Non-destructive transformations stored as metadata
- Recommendations engine suggests fixes (nulls, obvious mismatches)
- Changes are applied at query time (not persisted to source)
Recommendations¶
OAC analyzes the dataset and suggests:
- Fill nulls with median/mean/mode
- Correct date formats
- Standardize text values (e.g., "USA" and "United States" β same value)
- Extract parts from strings
Data Flow Expression Language¶
Used in Add Columns and Filter steps:
-- Calculated column
REVENUE - COST
-- Conditional
CASE WHEN REGION = 'EMEA' THEN 'Europe'
WHEN REGION = 'AMER' THEN 'Americas'
ELSE 'Other' END
-- Date extract
YEAR(ORDER_DATE)
-- String operations
UPPER(TRIM(CUSTOMER_NAME))
-- Null handling
COALESCE(DISCOUNT, 0)
Performance Tips¶
π‘ Tip: Use Group By in a Data Flow to pre-aggregate large tables. Workbook queries against aggregated datasets are much faster.
π‘ Tip: Write large Data Flow outputs to Oracle Autonomous Database (ADW) for best downstream query performance.
β οΈ Warning: Data Flows loading data into OAC in-memory datasets have a size limit (~250MB compressed). For larger data, write to a database and connect via Subject Area.
π‘ Tip: Use Sequences to orchestrate dependent Data Flows and ensure proper execution order.