Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What should every developer know about databases? [closed]

People also ask

Do software engineers need to know about databases?

Every software engineer needs to have a basic knowledge of SQL (Structured Query Language) for querying databases or for creating or modifying tables, indexes, views, or even a stored procedure or a trigger when needed.

Do engineers need to know SQL?

Considering the prevalence of data-driven applications and SQL's advantages in interacting with relational databases, knowing SQL is an essential skill for any software engineer.

Do backend developers need to know SQL?

Though requirements vary between jobs, back end developers will need to have a passing familiarity with, if not command of, several technical languages and programs. These essential back end developer skills include but are not limited to: Python, Java, SQL, NoSQL, and Git.


The very first thing developers should know about databases is this: what are databases for? Not how do they work, nor how do you build one, nor even how do you write code to retrieve or update the data in a database. But what are they for?

Unfortunately, the answer to this one is a moving target. In the heydey of databases, the 1970s through the early 1990s, databases were for the sharing of data. If you were using a database, and you weren't sharing data you were either involved in an academic project or you were wasting resources, including yourself. Setting up a database and taming a DBMS were such monumental tasks that the payback, in terms of data exploited multiple times, had to be huge to match the investment.

Over the last 15 years, databases have come to be used for storing the persistent data associated with just one application. Building a database for MySQL, or Access, or SQL Server has become so routine that databases have become almost a routine part of an ordinary application. Sometimes, that initial limited mission gets pushed upward by mission creep, as the real value of the data becomes apparent. Unfortunately, databases that were designed with a single purpose in mind often fail dramatically when they begin to be pushed into a role that's enterprise wide and mission critical.

The second thing developers need to learn about databases is the whole data centric view of the world. The data centric world view is more different from the process centric world view than anything most developers have ever learned. Compared to this gap, the gap between structured programming and object oriented programming is relatively small.

The third thing developers need to learn, at least in an overview, is data modeling, including conceptual data modeling, logical data modeling, and physical data modeling.

Conceptual data modeling is really requirements analysis from a data centric point of view.

Logical data modeling is generally the application of a specific data model to the requirements discovered in conceptual data modeling. The relational model is used far more than any other specific model, and developers need to learn the relational model for sure. Designing a powerful and relevant relational model for a nontrivial requirement is not a trivial task. You can't build good SQL tables if you misunderstand the relational model.

Physical data modeling is generally DBMS specific, and doesn't need to be learned in much detail, unless the developer is also the database builder or the DBA. What developers do need to understand is the extent to which physical database design can be separated from logical database design, and the extent to which producing a high speed database can be accomplished just by tweaking the physical design.

The next thing developers need to learn is that while speed (performance) is important, other measures of design goodness are even more important, such as the ability to revise and extend the scope of the database down the road, or simplicity of programming.

Finally, anybody who messes with databases needs to understand that the value of data often outlasts the system that captured it.

Whew!


Good question. The following are some thoughts in no particular order:

  1. Normalization, to at least the second normal form, is essential.

  2. Referential integrity is also essential, with proper cascading delete and update considerations.

  3. Good and proper use of check constraints. Let the database do as much work as possible.

  4. Don't scatter business logic in both the database and middle tier code. Pick one or the other, preferably in middle tier code.

  5. Decide on a consistent approach for primary keys and clustered keys.

  6. Don't over index. Choose your indexes wisely.

  7. Consistent table and column naming. Pick a standard and stick to it.

  8. Limit the number of columns in the database that will accept null values.

  9. Don't get carried away with triggers. They have their use but can complicate things in a hurry.

  10. Be careful with UDFs. They are great but can cause performance problems when you're not aware how often they might get called in a query.

  11. Get Celko's book on database design. The man is arrogant but knows his stuff.


First, developers need to understand that there is something to know about databases. They're not just magic devices where you put in the SQL and get out result sets, but rather very complicated pieces of software with their own logic and quirks.

Second, that there are different database setups for different purposes. You do not want a developer making historical reports off an on-line transactional database if there's a data warehouse available.

Third, developers need to understand basic SQL, including joins.

Past this, it depends on how closely the developers are involved. I've worked in jobs where I was developer and de facto DBA, where the DBAs were just down the aisle, and where the DBAs are off in their own area. (I dislike the third.) Assuming the developers are involved in database design:

They need to understand basic normalization, at least the first three normal forms. Anything beyond that, get a DBA. For those with any experience with US courtrooms (and random television shows count here), there's the mnemonic "Depend on the key, the whole key, and nothing but the key, so help you Codd."

They need to have a clue about indexes, by which I mean they should have some idea what indexes they need and how they're likely to affect performance. This means not having useless indices, but not being afraid to add them to assist queries. Anything further (like the balance) should be left for the DBA.

They need to understand the need for data integrity, and be able to point to where they're verifying the data and what they're doing if they find problems. This doesn't have to be in the database (where it will be difficult to issue a meaningful error message for the user), but has to be somewhere.

They should have the basic knowledge of how to get a plan, and how to read it in general (at least enough to tell whether the algorithms are efficient or not).

They should know vaguely what a trigger is, what a view is, and that it's possible to partition pieces of databases. They don't need any sort of details, but they need to know to ask the DBA about these things.

They should of course know not to meddle with production data, or production code, or anything like that, and they should know that all source code goes into a VCS.

I've doubtless forgotten something, but the average developer need not be a DBA, provided there is a real DBA at hand.


Basic Indexing

I'm always shocked to see a table or an entire database with no indexes, or arbitrary/useless indexes. Even if you're not designing the database and just have to write some queries, it's still vital to understand, at a minimum:

  • What's indexed in your database and what's not:
  • The difference between types of scans, how they're chosen, and how the way you write a query can influence that choice;
  • The concept of coverage (why you shouldn't just write SELECT *);
  • The difference between a clustered and non-clustered index;
  • Why more/bigger indexes are not necessarily better;
  • Why you should try to avoid wrapping filter columns in functions.

Designers should also be aware of common index anti-patterns, for example:

  • The Access anti-pattern (indexing every column, one by one)
  • The Catch-All anti-pattern (one massive index on all or most columns, apparently created under the mistaken impression that it would speed up every conceivable query involving any of those columns).

The quality of a database's indexing - and whether or not you take advantage of it with the queries you write - accounts for by far the most significant chunk of performance. 9 out of 10 questions posted on SO and other forums complaining about poor performance invariably turn out to be due to poor indexing or a non-sargable expression.