Data Engineering

Data Pipeline Verification

  • Run the pipeline to load data and verify that it completes without errors.

  • Ensure data cleaning and integrity checks are running and passing.

  • Confirm that the database is populated with up-to-date and accurate data.

  • Validate that CMS tables are correctly filled with token project data.

  • Ensure that all necessary data transformations are performed correctly.

  • Implement dynamic project list generation:

    • Update project rankings automatically, daily or on each data ingestion, so the pipeline keeps pace with rank changes.

    • Establish a flagging system to identify projects near the inclusion/exclusion threshold (e.g., projects ranked close to 300th by market cap).

  • Define and apply a policy for inclusion/exclusion:

    • Set clear criteria for when a project should be included (e.g., top 300 by market cap) and when it should be excluded.

    • Implement soft deletion or archiving strategies for projects that fall out of the top 300, ensuring historical data is preserved while keeping the main database current and performant.

  • Define and apply a policy for whitelisting.

  • Ensure transformation consistency:

    • Apply the same data transformation rules to all projects, whether newly included or previously included, to maintain consistency across the dataset.

    • Implement procedures to backfill data for projects re-entering the top 300.
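
The ranking, flagging, and soft-deletion items above could be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the `Project` class, the `archived` flag, the `apply_inclusion_policy` function, and the `MARGIN` buffer of 20 ranks are all assumptions made for the example; only the top-300 cutoff comes from the checklist.

```python
from dataclasses import dataclass

TOP_N = 300   # inclusion threshold from the checklist (top 300 by market cap)
MARGIN = 20   # hypothetical width of the "near the threshold" band


@dataclass
class Project:
    slug: str
    market_cap: float
    archived: bool = False  # soft-deletion flag; records are never dropped


def apply_inclusion_policy(projects, top_n=TOP_N, margin=MARGIN):
    """Rank projects by market cap, archive those outside the top N, and
    return the ranked list plus the slugs within `margin` ranks of the cutoff."""
    ranked = sorted(projects, key=lambda p: p.market_cap, reverse=True)
    near_threshold = []
    for rank, project in enumerate(ranked, start=1):
        # Recomputing the flag on every run also un-archives re-entering projects.
        project.archived = rank > top_n
        if abs(rank - top_n) <= margin:
            near_threshold.append(project.slug)
    return ranked, near_threshold
```

Because the flag is recomputed on every run, a project that climbs back into the top 300 is un-archived automatically. That re-entry is a natural point to trigger a backfill of the data it missed while archived.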

Data Access Testing

  • Confirm that the agent can access and retrieve data from the database as required.

  • Test that user queries correctly match and return data related to hashtags, slugs, and CMS entries.

  • Ensure data retrieval times stay within acceptable limits, and optimize queries that exceed them.

  • Account for project fluctuations:

    • Regularly test the system's response to project rank changes to ensure accurate and efficient data retrieval even as projects enter and exit the top 300.
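
A retrieval test along these lines might look like the sketch below. It uses an in-memory SQLite database as a stand-in. The `projects` table, its `slug`, `hashtag`, and `market_rank` columns, and the helper names are hypothetical and would need to be replaced with the real schema and connection.

```python
import sqlite3
import time


def make_demo_db():
    """Build a throwaway in-memory database with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE projects ("
        "slug TEXT PRIMARY KEY, hashtag TEXT, market_rank INTEGER)"
    )
    conn.executemany(
        "INSERT INTO projects VALUES (?, ?, ?)",
        [("bitcoin", "#btc", 1), ("ethereum", "#eth", 2)],
    )
    conn.commit()
    return conn


def timed_lookup(conn, slug):
    """Return (rows, elapsed_seconds) for a lookup by slug."""
    start = time.perf_counter()
    rows = conn.execute(
        "SELECT slug, hashtag, market_rank FROM projects WHERE slug = ?",
        (slug,),
    ).fetchall()
    return rows, time.perf_counter() - start
```

A test can then assert both that the expected row comes back and that the elapsed time stays under whatever latency budget the team agrees on. Rerunning the same check after a ranking update covers the case of projects entering or leaving the top 300.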

Data Quality Checks

  • Conduct random sampling to verify data accuracy and integrity.

  • Ensure that data deduplication processes are effective.

  • Validate that data retention policies are implemented and adhered to.

  • Implement regular audits:

    • Audit historical data on a fixed schedule so that records for projects excluded from the current top 300 remain accurate in the database.

  • Monitor project movements:

    • Generate regular reports tracking which projects have entered or exited the top 300, providing insights into market trends and the overall performance of the data pipeline.

  • Set up automated alerts for significant rank fluctuations, enabling quick identification and resolution of potential data issues.

  • Run URL checks on all whitelisted projects.
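
The movement report and rank-fluctuation alerts above could be derived by comparing two ranking snapshots, as in this sketch. The `rank_movements` function, the `{slug: rank}` snapshot shape, and the `alert_delta` of 50 ranks are assumptions chosen for illustration; the checklist does not define what counts as a significant fluctuation.

```python
def rank_movements(previous, current, top_n=300, alert_delta=50):
    """Compare two {slug: rank} snapshots.

    Returns the slugs that entered or exited the top N, plus the slugs
    whose rank moved by at least `alert_delta` places between snapshots.
    """
    prev_top = {slug for slug, rank in previous.items() if rank <= top_n}
    curr_top = {slug for slug, rank in current.items() if rank <= top_n}
    # Only projects present in both snapshots can have a measurable move.
    alerts = sorted(
        slug
        for slug in previous.keys() & current.keys()
        if abs(current[slug] - previous[slug]) >= alert_delta
    )
    return {
        "entered": sorted(curr_top - prev_top),
        "exited": sorted(prev_top - curr_top),
        "alerts": alerts,
    }
```

A large jump can be a genuine market move or a sign of bad input data, so each alert is best treated as a prompt to inspect the source records before any correction is made.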
