Cloud Integration Test/CI Pipeline
We use GitHub Actions for cloud integration testing. Currently the integration test has 5 jobs:

- Running `./gradlew test` to verify that the Scala/Spark code passes all the tests
- Running `flake8` to lint the Python scripts and catch obvious syntax errors
- Running the built jar in a Databricks environment to make sure it passes the end-to-end test
- Running the built jar in an Azure Synapse environment to make sure it passes the end-to-end test
- Running the end-to-end test cases for the registry server to make sure the related code passes all the tests

The above 5 jobs run in parallel, and if any one of them fails, the integration test fails.
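For context, jobs declared in a GitHub Actions workflow run in parallel by default, and the workflow run is marked failed if any job fails. A minimal sketch of that shape (the job names, runner, and commands here are illustrative, not the actual Feathr workflow file):

```yaml
name: CI
on: [push, pull_request]

jobs:
  # Both jobs start at the same time; neither waits on the other.
  scala-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - run: ./gradlew test
  python-lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - run: flake8 .
```

Jobs only run sequentially if one declares a `needs:` dependency on another; with no `needs:`, all 5 jobs fan out in parallel as described above.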
Cloud Testing Pipelines
Since many cloud integration testing jobs can be run in parallel, the workflow is currently organized like this:
Optimizing Parallel Runs
Since Feathr uses cloud resources for CI testing, the following optimizations are in place:

- Set `pytest -n 4` to run 4 tests in parallel
- Use pre-existing Spark pools to reduce setup time. All the Spark jobs run on "instance pools" that keep a certain number of idle compute instances, so the setup time is short. For example, for Databricks:
"instance_pool_id":"0403-214809-inlet434-pool-l9dj3kwz"
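For reference, a cluster spec in a Databricks job submission can point at such a pool instead of provisioning fresh VMs on every run. A hedged sketch of where the field sits (the runtime version and worker count here are illustrative, only the pool ID is from this document):

```json
{
  "new_cluster": {
    "spark_version": "9.1.x-scala2.12",
    "num_workers": 2,
    "instance_pool_id": "0403-214809-inlet434-pool-l9dj3kwz"
  }
}
```

When `instance_pool_id` is set, the cluster draws its nodes from the pool's idle instances, skipping most of the VM provisioning time.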
More on GitHub Actions
The integration test is triggered on pushes and on new pull requests.
The integration test also skips changes limited to files in the /docs folder and files ending with .md.
For more info on GitHub Actions, refer to the documentation here.
push:
branches: [main]
paths-ignore:
- "docs/**"
- "**/README.md"
pull_request:
branches: [main]
paths-ignore:
- "docs/**"
- "**/README.md"