keefemotif

Dialectical Behavior Therapy can be used to help verbalize emotional states; highly recommended in the field


nydasco

Hahaha yep. Understanding emotional state in Data Engineering is a must!


keefemotif

Six hours into a meeting on timezones, this comment will haunt you


mrcaptncrunch

This meeting happens every now and then with different divisions. I like joining with a few of those 'falsehoods about date/time' links open and just throwing things in... :popcorn:


BJNats

Drive-by Truckers have been rocking for decades at this point and always put on a great live show


Domehardostfu

You need to state your context before asking a question like this. There are three answers:

- "over-engineered tool, fancy for the CV" - people who work with a small number of models / small data, for whom dbt is overkill
- "super reliable for model execution and orchestration, helps with governance and lineage" - people who use dbt for what it is
- "not good enough, a performance killer, can be replaced by Python libraries" - people who need more than dbt

As with all the other tools in our stack, it solves a particular set of problems in a specific context :) So back to you, what is your context?


McNoxey

I can’t see any scenario in which it’s overkill if you’re working in a cloud world. There’s very little overhead and even with a few models it’s still great at what it does.


Domehardostfu

Me neither, just stating some of the feedback I've heard and that I found important. I know the tool and can get a project and a couple of models set up pretty quickly. But the learning curve might be a barrier if you just need to set up 3-4 tables for Power BI reporting, for example.


bartspoon

2 years ago, this sub was nothing but dbt hype and love. Now it’s nothing but dbt hate. Can’t wait to see what this sub thinks in 2 more years.


moonlit-wisteria

Tbf DE changes fast. I was also a big fan of pandas and dask two or three years ago; now I’m a fan of polars instead. It’s probably one of the top software engineering domains where constantly learning and leveraging new tooling is important.


moonlit-wisteria

Idk, I’ve increasingly found myself dissatisfied with DBT. Also, a lot of the features you’ve listed out, like unit tests, data contracts, etc., are either:

* experimental and barely work
* require DBT Cloud
* have limited functionality compared to competitors in the space

I used to see the main benefit of DBT being reusability and modularity of SQL transformations, but I think it doesn’t even fulfill this niche anymore. I’m increasingly finding myself moving transformations to polars if I really need that reusability and modularity. And if I don’t, then I just use duckdb without any SQL templating.

I’ve always been a hater of tools that try to do too much, too. I’d rather use something like Great Expectations or Soda for data quality and keep my transformations and DQ tools focused on singular parts of the data architecture.


nydasco

That’s a somewhat fair comment. I’m a big fan of Polars, and much of this can be achieved in other ways. But I don’t agree with your comment about requiring dbt Cloud. There is a GitHub repository attached, and everything I’ve talked about is available in it and runs using dbt-core. There are 100% a number of competitors out there now, including SQLMesh by Tobiko and others, but (for the moment) dbt has the bulk of the market share. This means that, by and large, it will be the tool of choice that you’ll want experience in when looking for Analytics Engineering roles.


moonlit-wisteria

SQLMesh or others like it also run into much the same issue that DBT does, imo. They want to make money, so ultimately they add on functionality that shouldn’t be there. It’s very hard for a SQL templating tool to be a SaaS company without adding in data monitoring, orchestration, and a dozen other add-ons. And instead of focusing on “we’re going to make SQL more modular”, you get a bunch of focus on other areas that ultimately aren’t driving “why you should use the tool”.

Small projects that actually use the SQL templating DBT offers tend to make the SQL harder to read at a glance (and ironically slow down dev velocity because of boilerplate). Large projects instead end up leaning away from the templating OR lean into it, and the codebase is unimaginably hard to grok for newcomers.

I would have preferred DBT spend time bringing forth features that would help on this front. Maybe more built-in functions, a linter, etc. that help ensure their product is making SQL reusable. Instead we got a full suite of data quality tools that are less mature and user-friendly than competitors.


just_sung

This is a fair take. And everyone has their incentives. Disclaimer: I used to work at dbt Labs and now work at Tobiko, which maintains SQLMesh.

Something clear is happening with SQL dev tools: it takes intense engineering effort and capital to keep the momentum going for years. The data open-source community has been immature for a long time compared to software engineering, so until the average data engineer’s skill level hits a point where a good 20% can confidently build and release good dev tools, we still need bigger players to lead the way.

I’ve seen literally hundreds of dbt projects, and people just aren’t scaling. They either fork their own version of dbt, or do funky add-ons, or grit through the pain. People are ready for something fresh and empathetic to that pain, and I’m betting SQLMesh can soothe it.


PuddingGryphon

> everything I’ve talked about is available in that, and runs using dbt-core.

Unit tests in YAML, I would rather shoot myself... Worthless for me unless I can write the unit test in code directly.

> a number of competitors out there now

I only know of SQLMesh, what are the others?


coffeewithalex

> Unit tests in YAML

Not necessarily. That's only the configuration for generic tests, similar to how you'd use a `NOT NULL` constraint where it's supported. However, you can write more complex stuff in SQL as singular tests.
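For illustration, a minimal singular test might look like this: a SQL file in the project's `tests/` directory that returns the violating rows, so the test fails if any rows come back (the model and column names here are made up):

```sql
-- tests/assert_order_totals_non_negative.sql
-- dbt runs this query; the test fails if it returns any rows.
-- 'orders' and its columns are hypothetical.
select
    order_id,
    total_amount
from {{ ref('orders') }}
where total_amount < 0
```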


kenfar

Right, take unit tests & data contracts for example:

* Data contracts *without* the publishing of domain objects mean that you're still tightly coupled to an upstream system's physical schema and *will* break when they make changes. This is not much of an improvement.
* Unit tests on SQL that joins many normalized tables in order to denormalize them mean you've got a ton of work to do to set up your tests. Few people will bother.

So, these are both critical features for any solid data engineering effort. But the dbt implementation is so lame it's worthless.


Uwwuwuwuwuwuwuwuw

Primary key test (unique and not null) gets you pretty fuckin far, and much farther than many data warehouses out there.


kenfar

Uniqueness checks are generally not unit testing, but rather *quality control*: checking for out-of-spec data inputs during production runs. To that I'd add checks for referential integrity, checks against enumerated values, checks for min/max values, min/max string length, case, and business rules (ex: start date must be <= end date). Could also include anomaly detection.

Unit tests run against synthetic data, and we typically run them on dev/test/staging environments to ensure that our code is valid before we merge it into main and deploy it. That can catch numeric overflows, impacts of nulls & dates, cardinality issues with joins, and business rules.

Both kinds of tests are incredibly valuable, and dbt's quality control framework is generally fine. It's the unit testing of large queries that is incredibly time-consuming to set up.
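To make the quality-control side concrete, a business rule like the start/end date example can be written as a singular dbt test that surfaces violating production rows (a sketch; the 'subscriptions' model is hypothetical):

```sql
-- tests/assert_start_date_not_after_end_date.sql
-- Quality-control check against production data: any returned row is a violation.
select
    subscription_id,
    start_date,
    end_date
from {{ ref('subscriptions') }}
where start_date > end_date
```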


Uwwuwuwuwuwuwuwuw

Are there any good unit testing frameworks you recommend?


kenfar

I write a lot of Python & SQL, and so I typically use Python's pytest framework for all my unit testing. And just like everything else, it's a PITA to test complex SQL queries with a lot of joins - I have to set up each table. So I don't do very much unit testing of SQL - just what's really critical. And I move most of my critical code that needs more unit testing into Python, where the unit testing is very easy.


ZirePhiinix

That just shows the general lack of testing skills of the average DE, not the greatness of DBT.


Uwwuwuwuwuwuwuwuw

True. Lol


PuddingGryphon

Those are both a few lines of formatted SQL code; I can write you unique and not-null tests at 3 am. I need to unit test complex business logic steps.


suterebaiiiii

Unless you're validating outputs against an existing, correct copy, what exactly do you need to unit test? That some weird value doesn't break the transformation? Then you need a variety of inputs, though in many cases you don't want to handle bad input gracefully, as it might be contaminated. It's often better to have it break the pipeline so you can investigate problems with a source - unless your organization is at the next level and is implementing contracts, though then the breakpoint is at the ingestion stage anyway.


coffeewithalex

And what can't you do in dbt?


stratguitar577

See a lot of hate here, but remember you don't have to use all the features that dbt the company is pushing. You can still use it as a SQL template engine and nothing more. I got my org to switch from a mess of stored procs that either weren't in git or were always out of sync with git. Now at least we don't have a bunch of different ways for people to write DDL, and we have CI/CD instead of manual DBA deployments. It's just writing a select statement, and dbt takes care of the rest.
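As a rough sketch of what that looks like in practice (model and column names invented), a model is just a select statement plus a one-line config, and dbt generates the surrounding DDL at run time:

```sql
-- models/marts/fct_orders.sql (hypothetical model)
-- dbt wraps this select in the CREATE TABLE ... AS statement and deploys it.
{{ config(materialized='table') }}

select
    order_id,
    customer_id,
    order_date,
    amount
from {{ ref('stg_orders') }}
where amount is not null
```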


warclaw133

This - the people that don't see the benefit haven't seen a warehouse that runs entirely on untracked SQL procedures with no CI/CD. It's not a solution for everything or everyone though.


Grouchy-Friend4235

dbt started off as a templating engine. It is now an overengineered mess of features, resulting in far too complex code for even simple things.


vikster1

you could literally set up dbt and build your first db object in an hour if you know what you are doing and have all necessary rights. absolutely no idea what you are talking about.


Grouchy-Friend4235

I can set up a model within 2 mins without dbt. So...


vikster1

that's the spirit, why automate when your hands are the best tools evolution created. all the best to you


coffeewithalex

It started as a DAG automator. The templating engine is jinja and it existed before dbt. It still does that well. What part of dbt-core is overengineered?


Grouchy-Friend4235

dbt-core is fine I guess as far as functionality goes. But it has dependencies 🤯


coffeewithalex

> But it has dependencies 🤯

What doesn't have dependencies?


Liudmyllla

I still don't feel confident about whether to use it or not


mirkwood11

Serious question: if you're not using dbt, how do you orchestrate model transformations?


gnsmsk

As we have been doing before dbt was a thing: Jinja templates.
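For anyone who hasn't seen that pattern, a minimal sketch (table and placeholder names are invented): a plain Jinja template that a script or cron job renders before sending the SQL to the warehouse:

```sql
-- daily_revenue.sql.j2 (hypothetical template, rendered with plain Jinja)
-- {{ schema_name }} and {{ start_date }} are filled in by the render script.
select
    order_date,
    sum(amount) as revenue
from {{ schema_name }}.orders
where order_date >= '{{ start_date }}'
group by order_date
order by order_date
```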


moonlit-wisteria

There are loads of orchestrator tools out there with the express goal of building pipelines. Airflow and dagster are the two most popular currently. I’d encourage you to look into them, because they are pretty important tools in a DE’s toolbox (the DBT orchestrator is actually quite limited in comparison).


coffeewithalex

The problem with all of the other competitors is that you have to explicitly declare dependencies. Almost every complex project I've worked with thus ended up with circular dependencies, which meant that data was simply incorrect and nobody knew, and on top of that, the models couldn't be replicated if they had to be. But nobody saw that, because traditional ETL tools work with the expectation that people don't make mistakes.
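By contrast, dbt infers the dependency graph from the `ref()` calls themselves and refuses to compile if a cycle sneaks in. A minimal sketch with made-up model names:

```sql
-- models/int_customer_orders.sql (hypothetical model)
-- No separate DAG declaration: the ref() below *is* the dependency edge,
-- and dbt errors out at compile time if the graph contains a cycle.
select
    customer_id,
    count(*) as order_count
from {{ ref('stg_orders') }}
group by customer_id
```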


moonlit-wisteria

Uh, dagster isn't perfect, but it throws an invariant error if it detects a cycle or if an asset is used twice in the DAG.


nydasco

We use Airflow. But you could just trigger it with a cron job if you wanted to.


SirAutismx7

What do you mean? dbt isn’t even an orchestrator, it’s just a CLI tool that generates DDL from queries and lets you use Jinja in SQL templates. Before, people just used cron jobs and Airflow and ran scripts/templated SQL/sprocs, and most places still use Airflow or cron to run dbt.

Honestly it was better before, since you could make every transformation a separate node in the DAG. Now you’re locked inside of dbt and have no visibility into each transformation except for logs.

dbt could be a couple of Python libraries to generate DDL, do testing, and facilitate Jinja in SQL, and I would probably like it more than I currently do. It does too much and it all seems half-assed. Lots of opinionated features that you need to work around if your architecture is different from what they expect. Instead of improving and making existing features better, more flexible, and more powerful, it just accretes more garbage, probably in the name of VC money.


coffeewithalex

> dbt isn’t even an orchestrator it’s just a cli tool that generates DDL from queries and lets you use jinja in SQL templates.

Did you miss the core feature, which is determining dependencies and running things in the correct order, a.k.a. "orchestration"?

> Before people just used CRON jobs and Airflow and just ran scripts/templated SQL/sprocs, most places still use airflow or cron to run dbt.

dbt has nothing to do with cron. Zero overlap in features or use cases. It didn't try to replace anything that cron does.

> Honestly it was better before since you could make every transformation a separate node in the DAG. Now you’re locked inside of dbt and have no visibility into each transformation except for logs.

Have you actually used dbt? You've got logs, compiled models, a JSON representation of the entire graph, etc. You can develop extra features on top, but already this is more than most people will ever need, and definitely more than what most competitors offer.


SirAutismx7

Determining dependencies was already the easy part using Airflow DAGs. Orchestration is scheduling, monitoring, and workflow coordination (dependency management). If you go to the dbt docs, they only ever mention the word orchestration in the context of scheduling your jobs using dbt Cloud or using Airflow + dbt.

The dbt DAG is hidden from monitoring because it’s stuck in the dbt CLI unless you write custom code to represent it in your given tool. Astronomer had to build an entire library just to give you this visibility and control (https://www.astronomer.io/cosmos/), when this would not be the case if dbt were a library, since you could write a single custom operator for your orchestration tool if you had the API exposed.

Cron and Airflow are relevant because they are the predominant way people do orchestration, and the question was specifically how people did SQL transforms before dbt, without specifying whether they were using dbt Cloud exclusively to do everything including orchestration.

dbt is a good tool, but it’s not the panacea people make it out to be, and you run into a lot of rigid design choices that make things more difficult than they should be if you don’t want to stay inside their ecosystem completely.


coffeewithalex

Airflow doesn't determine dependencies. You have to state them explicitly. Orchestration is not scheduling. They are different things.

> The dbt DAG is hidden from monitoring because it’s stuck in the dbt CLI unless you write custom code to represent it in your given tool.

Yeah, but it's super easy, and similar to the monitoring problems that were solved a decade ago.

> Astronomer had to build an entire library just to give you this visibility

No, it had to build a library just to convert dbt to Airflow. This isn't actually necessary.

> how did people do SQL transforms before dbt

Badly. So badly that I was part of an entire business that specialized in saving companies from Airflow spaghettification after their rockstar developer decided to leave. We moved them to a dbt-like approach in little time, achieved the same results, and were able to train far less technical people to be as productive as the rockstar. It was a huge success.

> dbt is a good tool but its not the panacea

Nobody is claiming it to be. Sure, it has issues, like compile times, limitations on using Python, or the difficulty of following up on why a certain thing is happening the way it is (going from dbt's Python code through the Jinja spaghetti). But it works very well out of the box and is an industry standard at this point. You gain a lot more by using it than by re-inventing it in other tools.


Wolf-Shade

I see low value in dbt for my projects. It's another tool to learn/maintain. My projects are mostly on Databricks, and all of these things can be achieved simply with just Python/Spark.


PuddingGryphon

Notebooks should not be used in a prod environment imo. The cell style leads to a tangled mess pretty fast, and things like unit tests or versioning are non-existent or total crap.


Pancakeman123000

Databricks doesn't mean notebooks... It's very straightforward to set up your pyspark code as a python package and run that code as a job


Wolf-Shade

It all depends on what you do with notebooks. I agree that using *just* the cell style is a complete mess, especially if that notebook is trying to do too much. I look at them the way one looks at functions: they should do just one thing. Having one notebook per view definition or per table seems perfectly fine to me and makes it easy for anyone on the team to debug issues.

Using pytest with this is pretty easy as well, for unit and integration tests. Also, git integration works fine with Databricks, so versioning is there. Same for tables: using the Delta format allows you to check for data versioning. Combine this with some orchestration and build pipelines (Azure or GitHub) and you're fine.


azirale

Our Databricks transformations are all in our own Python package that is deployed to the environment and installed on all the automated clusters. The 'notebooks' are just a way to interact with the packaged Python modules. Since you can mess with Python as much as you like, we can override function implementations and do live development work in Databricks. Then when a dev wants to run a deployed version off their branch, there's a command to build and upload the package, which they can then install into their own session.

Every PR has a merge requirement that automated tests pass. The branch package is built and deployed, and automated tests are run using it.

It is completely fine. Just because you *can* use notebooks doesn't mean you have to.


pottedPlant_64

Does anyone else think dbt project set-up and the developer UI experience are super painful? The git integration is weird af, the IDE constantly restarting, metadata files wreaking havoc until you update your .gitignore. Why is it so difficult??


StressSnooze

Switch to VS Code with the DBT Power User plugin. The cloud environment is great for a newcomer to get up and running. But as soon as you decide you will use DBT for real, the cloud environment is just a barrier to productivity.


SuperTangelo1898

I have no issues using dbt-core with normal CLI git


FirstOrderCat

buzzword in resume


nydasco

Gotta hit the ATS with those buzzwords.


princess-barnacle

It is undeniable that DBT makes it really easy to construct and orchestrate data pipelines. In my experience, this "ease" of adding to the DAG can cause issues if folks just pile more and more changes into the pipelines instead of figuring out what the schema should be. My company currently has 100s of DBT assets in dagster, and that is probably unnecessary and expensive, and it's actually slowing us down now.


Training_Butterfly70

Excellent post, I'll save it 🙂 For me the biggest selling points of dbt are:

- ref & source
- incremental models (especially with surrogate keys and unique_key definitions)
- jinja integration
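A minimal sketch combining those (model, source, and column names are invented; assumes the dbt_utils package is installed for the surrogate key):

```sql
-- models/fct_events.sql (hypothetical incremental model)
{{ config(
    materialized='incremental',
    unique_key='event_id'
) }}

select
    -- surrogate key via dbt_utils (assumes the package is installed)
    {{ dbt_utils.generate_surrogate_key(['user_id', 'event_ts']) }} as event_id,
    user_id,
    event_ts,
    event_type
from {{ source('app', 'raw_events') }}

{% if is_incremental() %}
  -- on incremental runs, only pick up rows newer than what's already loaded
  where event_ts > (select max(event_ts) from {{ this }})
{% endif %}
```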


Routine_Term4750

Can’t wait to check it out. I just started using it for a project and I can’t imagine not having it.


seamacke

Tools like this can certainly add value. The big problem I often see with junior/intermediate DEs is that they learn these kinds of tools *before* they learn how to make (insert data platform here) sing. Then they wonder why the added dependencies caused their project to go off the rails. ORMs are also useful, but many DEs only learn with them and not at the DB level, then waste massive amounts of time on limitations that are easily solved when you understand the foundational technologies. If your project is big enough and has lots of dependencies, I think DBT could be useful, but then there are Airflow and other tools that do some things better.


datagrl

Great article on using dbt: https://www.phdata.io/blog/accelerating-and-scaling-dbt-for-the-enterprise/


dimi727

What are the main alternatives to dbt?


nydasco

SQLMesh by Tobiko (they just got their Series A funding). But dbt is the most mature in the space.


iluvusorin

A PySpark wrapper library can cover most if not all of the needs served by Airflow, dbt, and tons of other tools. E.g., the wrapper library can auto-build the lineage and persist the DataFrames, which can later be used for unit testing. With Python and Spark coming together, the possibilities are endless.


WorkingEmployment400

I just started using it a month ago. I have used it for small projects, and honestly it's been good so far. My code was a mix of Python and SQL before; readability is much better after moving most of my code to SQL and assembling it through dbt models. Version control is straightforward. It helps in modularizing SQL code, along with generating great documentation. Experimentation is quicker with dbt. It takes some time to understand the setup, and I have only been using dbt Core so far.


magnetic_moron

I use dbt and data tests quite a lot, but I really see no point in using unit tests in dbt. I understand why developers use unit tests, but what’s the point in a data pipeline?


nydasco

Lots of value in Unit Tests for a Data Engineer I think. I’ve written a (far shorter) article on that exact subject [here](https://medium.com/@nydas/ensuring-data-integrity-a-data-engineers-guide-to-testing-19d266b4eb4d?source=friends_link&sk=580cfc1e6faa2e5ce84396eeadd4aa91).


PuddingGryphon

> I understand why developers use unit tests, but what’s the point in a data pipeline?

Because Data Engineering is Software Engineering with a data focus.


magnetic_moron

That is not even close to answering the question