As the clock ticked towards January 1, 2000...Y2K...I began to see a disturbing trend in IT. It seemed to me that Information Technology was devolving. Instead of learning from our mistakes and making better systems, systems and projects were getting worse. When the technology industry imploded in 2001 and 2002, it hit new lows. And it has not really recovered. And it won't until some fundamental changes are made.
I graduated from college in 1991 and began working for a company that developed medical information systems. I programmed in Cobol on the VMS operating system. Writing a functional program took weeks, sometimes months. And it was all character based on a 24 line green screen. Very simplistic, but it worked. It was often more than a year before new functionality made it from development through testing to an actual production client. And this was for inventory management! The process for laboratory, pharmacy and blood bank was even longer. We spent a lot of time clarifying requirements, desk testing logic, slowly developing smaller program units.
Now...it seems that development consists of throwing code together for weekly releases and hoping that it runs. Instead of testing, code is put into production and bug reports drive the next release. Imagine living in a house or riding in a car that was designed, built and tested in the same manner as software. I'd be living in a tent and riding my bike everywhere (come to think of it...that is not such a bad idea).
The tech bust cleared out promising talent. As technology boomed, companies were desperate for people to fill open positions. I have worked with developers who majored in music, psychology, sociology. They were hired because they "knew Java"...which usually consisted of a computer course in college or working their way through a book. Most of these people were Junior staff, focused on simple tasks and starting to learn good design/development skills from the hands and minds of mid-level and senior staff. When the bust hit, there was still the same amount of work...but less budget. The senior staff was kept on because they had so much knowledge and their skills made them very efficient. The junior staff was cheap. So the middle staff were the ones to go. Organizations lost people with good skills and outstanding promise...and were left with overworked senior staff who had little time left to pass on knowledge and skills to the junior staff. Now the junior staff was doing most of the work with almost no real skills.
Training has become very low priority. One of the first budget items cut in the tech bust was for training. I saw conference attendance drop by 1/3, user group meetings by 1/2 and almost no actual formal training. Companies no longer invested in their staff development and people were so busy with their own work (as well as not being sufficiently motivated) that they did not pursue training opportunities on their own. This meant that technical knowledge growth was crippled. Only the most self-motivated continued to grow in the areas that truly mattered the most...and those individuals are few and far between.
Organizations are not prepared to deal with technical advancement. Most organizations are set up to deal with labor focused career paths. People moved from a task position to a series of supervisory positions. Foreman, supervisor, manager...all positions where the focus was on managing people. As you were promoted, you managed more and more people. Eventually, you moved to more strategic positions, Director, Vice President, President. But what about the technical personnel who wanted to grow technically? Sorry, the only recognized career path was horizontal...you had to be a manager to get a raise in salary. However, Competition Company needs a person with more senior skills and is able to offer a 20% increase in salary...so the technologist bolts, taking a lot of corporate knowledge with them.
IT is now faced with a serious problem. 15 years ago, IT projects were slow, but they produced solid, working systems. Granted...not all the time...but most of the time. They weren't fancy, but they accomplished the task at hand. Now, we have systems that are unstable, bloated and, in some cases, unusable. Unless the IT community and organizations put forth the effort to stop and reverse this devolution of IT, it will continue.
People need to take control of their own education. Not just learn about the latest development method or programming language, but return to the basics. Learn good design and development skills. Learn about platforms such as Oracle, SQL Server, MySQL and Linux, in depth and in reality.
Users need to slow their appetite for new releases. Focus on doing core functionality properly.
Organizations need to support technical staff with vertical career options and proper training budgets.
Sunday, December 16, 2007
Tuesday, December 11, 2007
How useful is the wait interface?
I write this blog with all due respect for Cary Millsap, Gaja Vaidyanatha, K Gopal, Richmond Shee and Kirti Deshpande. They have all made amazing strides in troubleshooting and performance optimization and I hold them in the highest regard. In no way am I criticizing their ideas and hard work; I am simply challenging current thought as they have all challenged it (and continue to do so).
I was recently preparing a presentation on "Why Tune SQL?" for the UKOUG. This is a contrarian presentation in more ways than one. In terms of format, it contained all of 8 slides (1 title slide, 3 "blank" slides, 3 content slides and 1 end slide) for 45 minutes of talk (and I used all 45). The focus was on the idea that all of our attempts to tune SQL gain us little because SQL is not the root of the problem...but that is fodder for another blog entry.
One of the topics I covered was the lack of instrumentation in non-database tiers. As I was preparing this part of my presentation, a thought occurred to me. A very contrarian thought and one that is sure to cause some controversy. A subsequent discussion with a fellow member of the Oak Table has convinced me that this is an issue to be raised.
“Democracy is the worst form of government except for all those others that have been tried.” - Winston Churchill
How useful is the wait (or timed event) interface?
I do not ask this question to disparage the wait interface. Actually, my original assertion was “The wait interface is nearly useless”. This statement is far too aggressive and confrontational, but the idea remains the same. The critical point that I am trying to make is that there are some problems with the interface, but having some instrumentation is better than not having any!
My intention is not to dissuade people from using the wait interface. Quite the opposite. It has been and will continue to be a valuable tool for diagnosing performance problems. It cannot and will not be the only tool...no tool can be the one and only. Every tool offers you a view of the data/situation that the tool developer thinks is important. Every tool has limitations, blind spots, misinterpretations...flaws. This does not mean the tool is useless; we just have to know how and when to use the tool.
I point out the problems with the instrumentation for two reasons. First, we need to understand its limitations. Second, Oracle (and others) need to continue to enhance their instrumentation layer so we can see more, analyze more and ultimately optimize more.
What timed events don't tell you
The wait interface is part of the current implementation of Oracle kernel instrumentation. It was introduced in version 7 and continues to 11g. Each release adds to the number of events (calls) that are instrumented. New events are not always related to new functionality; they may be instrumentation of existing code.
Oracle instrumentation often tells us where the session is NOT spending time in the database tier. Unfortunately, the system tiers that are outside of the database lag far behind Oracle in terms of instrumentation. Application, network, operating system and storage tier instrumentation is nonexistent, inconsistent or plain hard to understand (at least for me). Nothing is more frustrating to me than to work with my counterparts in the non-DBA world and ask them where time is being spent. Much of the time they throw up their hands and say “I don't know” (usually followed by blaming the database tier for the poor performance).
Instrumentation can tell you that something has occurred or is occurring, but it cannot tell you why. I can see 'db file sequential read' as the top timed event, but that does not tell me why. Excess I/O, slow SAN, read consistent view generation? I can see that a session is spending a large amount of time requesting data from disk, but I cannot tell whether there are issues with the operating system, server hardware or storage array. A single statement can have multiple 'db file sequential read' events with the time ranging from several microseconds to several seconds. So...why such a spread in time? I can't tell you...because the wait interface can't tell you.
Or “SQL*Net message from client”, a supposed idle event. In reality, it is often a high contributor to response time because it is the 'bucket' where user and application time goes, certainly an area where response time is consumed. And yet, we are told to ignore this time and tools don't report on it. Time and again, I am asked to diagnose a user's session that is “hung” only to find that the database session is waiting on SQL*Net message from client...but the user is still seeing the spinning hourglass. Obviously, the problem is somewhere else...but where? And the user does not care; they want to get their work done, so the DBA sounds like he/she is passing the buck when, in reality, we probably know much more about what a session is doing/has done than any of the other tiers in the system stack.
What about all the other time?
Much of the time spent “inside” the database is not instrumented! Logical I/O, CPU activity, etc. are not instrumented. Some years ago, I did some research (along with Kirti Deshpande) on tracing logical I/O. We found several 'events' we could set to dump out information on logical I/O. All of it was undocumented, much of it was indecipherable and none of it included any timing information. I could look at trace files to see that a session was creating consistent versions of the block...but I could not tell how long it was taking.
What about CPU time that is not logical I/O? In 10g, some timing information is being exposed thanks to the DB Time model. I can see some summary information on parse time, PL/SQL time, Java time, etc. Definitely an improvement...but the devil is often in the details. I can query v$sesstat and see the “CPU used by this session” value increasing, but I can't tell you much more. My next step for diagnosing problems is to drill down...but the lack of more detailed instrumentation has stopped me.
Profiler obfuscation
When running extended SQL trace, it is not unusual for a single business process/user action to generate a trace file that is millions of lines long. While this makes for enlightening reading, it also slows troubleshooting. In order to summarize the information, the trace file is processed by a profiler, which aggregates the information and removes almost all of the detail.
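To make the aggregation step concrete, here is a minimal sketch of what a profiler does with the wait lines in a trace file. It assumes 10g-style lines of the form WAIT #n: nam='...' ela= ... with elapsed time in microseconds. The function name profile_waits and the sample lines are made up for illustration; this is not how any particular profiler is implemented.

```python
import re
from collections import defaultdict

# Matches the event name and elapsed microseconds in a 10g-style WAIT line.
# The line layout is an assumption; other versions may format it differently.
WAIT_RE = re.compile(r"^WAIT #\d+: nam='([^']+)' ela= *(\d+)")


def profile_waits(lines):
    """Aggregate WAIT lines into {event: (count, total_microseconds)}."""
    totals = defaultdict(lambda: [0, 0])
    for line in lines:
        match = WAIT_RE.match(line)
        if match:
            event, ela = match.group(1), int(match.group(2))
            totals[event][0] += 1
            totals[event][1] += ela
    return {event: tuple(values) for event, values in totals.items()}


# Made-up sample lines, for illustration only.
SAMPLE = [
    "WAIT #3: nam='db file sequential read' ela= 180 file#=4 block#=1201 blocks=1 obj#=5123 tim=1001",
    "PARSE #3:c=0,e=95,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,tim=1002",
    "WAIT #3: nam='db file sequential read' ela= 9000000 file#=4 block#=1202 blocks=1 obj#=5123 tim=1003",
    "WAIT #3: nam='SQL*Net message from client' ela= 2500000 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=1004",
]

profile = profile_waits(SAMPLE)
for event, (count, total_us) in sorted(profile.items()):
    print(f"{event}: {count} waits, {total_us / 1e6:.3f} s")
```

Note what is gone after aggregation: the two reads, one of 180 microseconds and one of 9 seconds, collapse into a single count and a single total.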
By removing these details, the profiler loses key troubleshooting information. For the same action performed by 2 different users, one has 100 “db file sequential reads” while the other has 100,000 “db file sequential reads”. Or the first session took 100 minutes for “db file sequential reads”, while the second took 1000 seconds. Why the difference? Different execution plans, read consistency, non-database problems are all possible explanations.
One of the effects of current profiling technology is that it hides variations in the detail. I can see that there were 10 “db file sequential read” events consuming 10 seconds. I do not know if there were ten 1-second events, or one 9-second event and nine events of 1/9th of a second each, unless I dive into the actual trace file and read each timed event call. And even that won't tell me why there was an even distribution or a skew.
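The two cases can be put side by side in a few lines. This is only a sketch: the bucket boundaries and the function name bucket_counts are my own choices for illustration, not something a given profiler provides.

```python
def bucket_counts(durations_s, bounds=(0.01, 0.1, 1.0, 10.0)):
    """Count durations per bucket.

    A duration lands in the first bucket whose upper bound it does not
    exceed; the last slot collects anything above the final bound.
    """
    counts = [0] * (len(bounds) + 1)
    for duration in durations_s:
        for i, upper in enumerate(bounds):
            if duration <= upper:
                counts[i] += 1
                break
        else:
            counts[-1] += 1
    return counts


# Ten 1-second events versus one 9-second event plus nine 1/9-second events.
even = [1.0] * 10
skewed = [9.0] + [1.0 / 9.0] * 9

for name, durations in (("even", even), ("skewed", skewed)):
    print(
        f"{name}: count={len(durations)} total={sum(durations):.3f}s "
        f"max={max(durations):.3f}s buckets={bucket_counts(durations)}"
    )
```

Count and total are the same for both lists, which is all an aggregated profile shows. Only the maximum and the bucket counts reveal the single 9-second outlier.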
Instrumentation is absolutely critical
We need two kinds of instrumentation at every level: time line instrumentation and resource instrumentation (time is a resource...but it is important enough that I separate it out). Without this instrumentation we are left with vast parts of the system stack we cannot analyze. We cannot optimize what we cannot analyze! We can only guess.
We have to find the balance between instrumentation and the intrusion effect. If we instrumented everything in detail...we would likely see that the time/resource required for instrumentation was greater than the time/resource required for the actual work.
Consider the alternative
The wait interface is somewhat useless...the only thing more useless is not having a wait interface at all. If we don't have it or, worse, don't use it...we are working blind. We guess.
What can we do? We can continue to raise the issue of instrumentation within our community and organizations. If you are a developer, look for ways to instrument your code. Perhaps not every call, but look at the main calls, the main groups of actions. Start somewhere!
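As one possible starting point for application code, here is a small sketch that times the main groups of actions rather than every call. Everything in it is hypothetical: the timed helper, the TIMINGS table and the load_order and price_order functions are stand-ins, and a real application would write these numbers to a log or monitoring system instead of a dictionary.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Per-step call counts and accumulated elapsed seconds.
TIMINGS = defaultdict(lambda: {"calls": 0, "seconds": 0.0})


@contextmanager
def timed(step):
    """Record how long the enclosed block takes under the given step name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        entry = TIMINGS[step]
        entry["calls"] += 1
        entry["seconds"] += time.perf_counter() - start


def load_order(order_id):
    # Hypothetical stand-in for a database or service call.
    with timed("load_order"):
        time.sleep(0.01)
        return {"id": order_id}


def price_order(order):
    # Hypothetical stand-in for in-process business logic.
    with timed("price_order"):
        return {**order, "total": 42.0}


for order_id in range(3):
    price_order(load_order(order_id))

for step, entry in TIMINGS.items():
    print(f"{step}: {entry['calls']} calls, {entry['seconds']:.4f} s")
```

Timing at the level of a step keeps the intrusion effect small while still answering the first question: where did the time go?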
I am reminded of the fable of the blind men and the elephant. Until we can see all of the elephant in detail, our understanding is incomplete and perhaps just plain wrong.
Wednesday, December 05, 2007
UKOUG
The 2007 UKOUG Conference and Exhibition is off to a great start. Not only is the content excellent and amongst the best I have ever seen, but the "networking" is first rate. Each year I meet "new" colleagues whom I have come to know over the years via email and oracle lists.
On Monday, I attended presentations by Riyaj Shamsudeen and Tom Kyte. Riyaj has a method of using PowerPoint that very clearly illustrates analytical SQL. He uses the software more as an explanation tool than a note-taking tool. I would love to see a presentation by Riyaj on how to develop technical presentations. Tom's presentation on 11g new features for DBAs gave me a great deal to think about.
Tuesday was the day for my own presentation on Statspack and analytical SQL. It was also the day to work a great deal on today's presentation on why we should not tune SQL. This particular topic has been very thought provoking for the last couple of years and I had a real epiphany yesterday (look for a blog entry next week on wait events).
There has also been a bit of unpleasantness in regards to a presentation containing someone else's material. One of the dangers of publishing papers/presentations/work is that others can easily take your material and pass it off as their own. I never object to someone who asks for permission to use some of my work in their own and gives me credit for it. That is part of being a member of the community. As a reminder, UKOUG is one of the best conferences and it is not their responsibility (nor should it be) to check presentations for plagiarism.
Time to get ready for my presentation...all 8 slides (including 3 blank or near blank ones, 1 intro and 1 closing). Yes...8 slides for 45 minutes.