# Data Engineering

### **Data Pipeline Verification**

* **Run the pipeline to load data** and verify it completes without errors.
* **Ensure data cleaning and integrity checks** are running and passing.
* **Confirm that the database is populated** with up-to-date and accurate data.
* **Validate that CMS tables** are correctly filled with token project data.
* **Ensure that all necessary data transformations** are performed correctly.
* **Implement dynamic project list generation**:
  * Automate project-ranking updates daily, or as often as data is ingested, so the pipeline keeps pace with ranking changes.
  * Establish a **flagging system** to identify projects near the inclusion/exclusion threshold (e.g., top 300 projects by market cap).
* **Define and apply a policy for inclusion/exclusion**:
  * Set clear criteria for when a project should be included (e.g., top 300 by market cap) and when it should be excluded.
  * Implement **soft deletion** or archiving strategies for projects that fall out of the top 300, ensuring historical data is preserved while keeping the main database current and performant.
* **Define and apply a policy for whitelisting**
* **Ensure transformation consistency**:
  * Apply the same data transformation rules to all projects, whether newly included or previously included, to maintain consistency across the dataset.
  * Implement procedures to **backfill data** for projects re-entering the top 300.
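
The ranking, flagging, and soft-deletion items above could be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual implementation: the `Project` record, the `refresh_rankings` function, the `FLAG_MARGIN` band width, and the `archived_at` field are all hypothetical names chosen for the example. Only the top-300-by-market-cap threshold comes from the checklist.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

TOP_N = 300        # inclusion threshold named in the checklist
FLAG_MARGIN = 20   # hypothetical width of the "near threshold" band


@dataclass
class Project:
    """Hypothetical project record; field names are assumptions."""
    slug: str
    market_cap: float
    rank: Optional[int] = None
    near_threshold: bool = False
    archived_at: Optional[datetime] = None  # set instead of deleting the row


def refresh_rankings(projects, top_n=TOP_N, margin=FLAG_MARGIN, now=None):
    """Re-rank by market cap, flag borderline projects, soft-delete the rest."""
    now = now or datetime.now(timezone.utc)
    ordered = sorted(projects, key=lambda p: p.market_cap, reverse=True)
    for rank, project in enumerate(ordered, start=1):
        project.rank = rank
        # Flag anything within `margin` places of the cut-off, on either side.
        project.near_threshold = abs(rank - top_n) <= margin
        if rank <= top_n:
            # Included (or re-entering): clear the archive marker.
            # A re-entering project would also need its data backfilled.
            project.archived_at = None
        elif project.archived_at is None:
            # Newly excluded: archive the record but keep its history.
            project.archived_at = now
    return ordered
```

Running `refresh_rankings` after each ingestion keeps ranks, flags, and archive markers in step with the data. Because excluded projects are only marked with `archived_at`, their historical rows stay available for audits and for backfilling on re-entry.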

### **Data Access Testing**

* **Confirm that the agent can access and retrieve data** from the database as required.
* **Test that user queries correctly match and return data** related to hashtags, slugs, and CMS entries.
* **Ensure data retrieval times** are within acceptable limits and optimized for performance.
* **Account for project fluctuations**:
  * Regularly test the system's response to project rank changes to ensure accurate and efficient data retrieval even as projects enter and exit the top 300.
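
One way to exercise these access checks is a small timed lookup against a test database. The sketch below uses an in-memory SQLite database as a stand-in. The `projects` table, its `slug`, `hashtag`, `name`, and `archived` columns, the demo rows, and the `LATENCY_BUDGET_S` value are all assumptions made for illustration; the real schema and latency limit are not specified in this checklist. It also assumes that archived (out-of-top-300) projects should not be served, which should be confirmed against the inclusion/exclusion policy.

```python
import sqlite3
import time

LATENCY_BUDGET_S = 0.5  # hypothetical budget; replace with the agreed limit


def make_demo_db():
    """Build an in-memory stand-in for the real database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE projects (slug TEXT PRIMARY KEY, hashtag TEXT, "
        "name TEXT, archived INTEGER NOT NULL DEFAULT 0)"
    )
    conn.executemany(
        "INSERT INTO projects VALUES (?, ?, ?, ?)",
        [
            ("bitcoin", "btc", "Bitcoin", 0),
            ("ethereum", "eth", "Ethereum", 0),
            ("oldcoin", "old", "OldCoin", 1),  # demo row: archived project
        ],
    )
    return conn


def lookup_project(conn, term):
    """Match a user term against slug or hashtag, skipping archived rows."""
    normalized = term.strip().lstrip("#").lower()
    return conn.execute(
        "SELECT slug, name FROM projects "
        "WHERE archived = 0 AND (slug = ? OR hashtag = ?)",
        (normalized, normalized),
    ).fetchone()


def timed_lookup(conn, term):
    """Return the matched row and the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    row = lookup_project(conn, term)
    return row, time.perf_counter() - start


if __name__ == "__main__":
    conn = make_demo_db()
    row, elapsed = timed_lookup(conn, "#BTC")
    print(row, f"{elapsed:.6f}s", "within budget:", elapsed < LATENCY_BUDGET_S)
```

Re-running the same lookups after flipping a project's `archived` value gives a simple regression test for rank changes: a project that exits the top 300 should stop matching, and one that re-enters should match again.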

### **Data Quality Checks**

* **Conduct random sampling** to verify data accuracy and integrity.
* **Ensure that data deduplication processes** are effective.
* **Validate that data retention policies** are implemented and adhered to.
* **Implement regular audits**:
  * Perform regular audits to maintain historical data integrity, ensuring that records for projects excluded from the current top 300 remain accurate in the database.
* **Monitor project movements**:
  * Generate regular reports tracking which projects have entered or exited the top 300, providing insights into market trends and the overall performance of the data pipeline.
* **Set up automated alerts** for significant rank fluctuations, enabling quick identification and resolution of potential data issues.
* **Run URL checks** on all whitelisted projects.
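
The movement report and rank-fluctuation alerts above could be derived by comparing two ranking snapshots. The following is a rough sketch under stated assumptions: snapshots are plain `{slug: rank}` dictionaries, and the `rank_movements` function and its `alert_delta` threshold are hypothetical. What counts as a "significant" fluctuation is not defined in this checklist.

```python
def rank_movements(previous, current, top_n=300, alert_delta=50):
    """Compare two {slug: rank} snapshots.

    Returns the slugs that entered or exited the top `top_n`, plus any slug
    whose rank moved by at least `alert_delta` places (hypothetical threshold).
    """
    prev_top = {slug for slug, rank in previous.items() if rank <= top_n}
    curr_top = {slug for slug, rank in current.items() if rank <= top_n}
    alerts = {
        slug: (previous[slug], current[slug])
        # Only slugs present in both snapshots can be compared.
        for slug in previous.keys() & current.keys()
        if abs(current[slug] - previous[slug]) >= alert_delta
    }
    return {
        "entered": sorted(curr_top - prev_top),
        "exited": sorted(prev_top - curr_top),
        "alerts": alerts,
    }
```

A large jump flagged in `alerts` is not necessarily a data issue, but it is a cheap signal for deciding which projects to spot-check first during random sampling.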
