allegro.tech

Application release process, or in fact software development process, as a release is the final stage of application development, is not an easy thing. Books and IT websites discuss many approaches and each has its supporters and opponents. On the one hand, you have product owners, project managers and customers who want a ready-to-use application as soon as possible. On the other hand, we developers and testers, would like to release an application of the highest quality, which may affect the delivery time. Balancing these needs is a hard nut to crack. Usually, both sides need to make some compromises to establish a common way of working. For developers and testers, it involves answering several questions concerning software development methods, skills, use of manual or automated testing, and storage of test cases and test logs. In this article I describe best practices and tips for starting a new project. I think that by following them, you will make the software development process as effective as possible and adjusted to conditions of your project.

First of all, what testing skills are necessary to deliver a high-quality product?

To answer this question, you need to know what you want to achieve by testing. The first thing that comes to mind is (especially in the case of commercial products) an application or a system that is ready to use, has no major bugs, and makes its end users happy and willing to use it. To get there, testers and QA engineers should not perceive testing in terms of simple verification, whether all features are working in accordance to the specified requirements. Their job is to make sure that these features fit user needs, and improve application usability, thus making the application as user-friendly as possible. It is an important skill, as UX specialists are not always involved in a project. Therefore, a tester must give feedback regarding application’s look, feel, and potential reaction of end users. What is more, you should not forget about performance - another factor that affects the way users perceive an application. No matter how pretty an app you have, users will be irritated if it is slow, even if no single bug slipped through testing. Naturally, usability or performance is only one aspect. The other two, equally important that should be also taken into consideration, are applies to security and compliance. Data leaks or other security issues may not only affect a company’s image, but also cause financial consequences. The same for compliance issues understood as lack of consistency with specific policies applied to e.g. aviation industry, medical devices or banking applications.

Waterfall or Agile. Always use as designed?

Companies usually choose one specific software development method for all of their projects. However, in seldom cases it is a customer who wants you to apply a certain methodology. Although development and test teams usually have no influence on the choice, they are the ones who decide how the method will be applied. Every software development framework has some rules of engagement. Unfortunately, most teams tend to perceive them as a set of unalterable principles, as something fixed that cannot be adapted to the real needs of a project and the team itself. For example, what happens when requirements or application design are subject to frequent changes that must be quickly implemented and tested? A waterfall model was not designed to deal with frequent changes, so theoretically agile should fit better here. On the other hand, both models may fail when there is no decision or the decision is changed too often. In such cases, it is difficult to find the right path to develop and release an application by strictly following one methodology’s principles. So how to find the most suitable way of working? Be flexible and instead of following the rules adjust them to the changing conditions of a project. Although it may sound as an agile manifesto, but agile is not always the best choice. In the event of large projects not subject to changes, with complex (and approved!) requirements, waterfall (or its variations) may be a better solution. This particular model is more predictable and reliable when a team does not need to release new versions too often. Waterfall may also work when it is crucial to have very good test coverage and very low internal to external defects found ratio. Obviously, the above requirements are difficult to meet when working in an agile way, with frequent releases and not enough time for bug fixing. Eventually, a team would have to come to terms with bugs found in a production environment.

Waterfall or Agile

Automated vs manual testing – only one or both?

Application testing is the most discussed topic. Should you execute only manual, repeatable, and thus boring tests or rely on fast and convenient automated testing? The answer is not that obvious. There are cases when automated testing, happens to be difficult to implement or time-consuming even though it looks very promising at first sight. Let’s take a look at some common opinions on manual and automated tests.

Automated tests

Faster and more effective
This is indeed true, but only with a stable test setup and well-designed test scripts. The initial work that involves setting up the whole environment contributes to the effectiveness of tests and their further use. If you fail at this stage, testers may spend more time solving test setup problems than on testing. Naturally, when the environment is stable, automated regression testing is faster than the manual one, and may be even run for each new build on a daily basis.
Cost effective
Test environment setup and test design are cost- and time-consuming. However, if it is done properly, automated testing is indeed cheaper and faster than the manual approach. Actually, it is easier to write automated tests than to deal with poorly designed setup.
Less tiresome
If regression testing is run on a regular basis, testers carrying out manual tests may become somewhat frustrated and bored of doing the same things again and again, which may affect their effectiveness and concentration. For this reason, testers are often more interested in developing automated tests for the regression testing purpose than to manually executing the same set of test cases every time.
You can run them on a regular basis
It is the main advantage of automated tests. As you can use them to test builds on a daily basis, the development team receives feedback almost immediately. However, there is a risk that the tests may become blind over time - test scenarios, if not updated, verify the same paths as at the first run. It may happen that a small change in the code will remodel some of the application features, but the tests will pass anyway. How is it possible? Because these tests do not “see” UI changes or strings displayed outside the defined fields. They only check if all features are working properly (although it depends on applied frameworks).

Manual testing

It simulates what the end user does
As automated tests are basically robots, they do not reflect the real user’s world. Testing frameworks operate by following a fixed pattern, while users may use an application in a completely different way, not covered by automated tests. Testers, unlike robots, have intuition, which is a substantial skill in the case of exploratory testing. Besides, manual tests allow QA engineers to check more specific things such as cooperation with an operating system. Naturally, there are frameworks that may test it, but they are not as flexible as QA engineers checking certain features manually.
Easy to start with
This sort of test is the best solution for new members, as skills necessary to carry out manual testing are easy to acquire. Well-designed test cases saved in a test management tool (such as ‘TestLink’, ‘HP Quality Center’, etc.) are easy to follow, so new team members can start the test execution on their own. Besides, as creating new test cases is not complicated even beginners can handle it.
Faster and more effective in the case of applications undergoing frequent changes
When an application undergoes changes, the QA team may not keep up with creating new automated tests. So in this particular case, manual testing is faster and more effective due to its flexibility. Anyway, it does not mean that automated tests are unnecessary.

After reading the previous paragraphs, finding the best solution should be easier. Testers or QA engineers should consider their choice well and consider all the factors mentioned above. Eventually, the best choice depends on knowledge and experience of QA engineers.

Waterfall or Agile

Testing tools – do you need them? Which should you choose?

Less experienced engineers often ask about testing tools. An absolute must-have is a test management tool to keep any requirements-to-test-cases coverage and track bug-to-test-cases. The market offers a lot of commercial and free tools such as HP Quality Center or TestLink mentioned above or a free Polish tool – TestArena. A decision concerning the choice of a tool should be carefully considered in terms of ROI (Return of Investment). Any potential migration of test cases between different tools following a change of decision may be time-consuming and sometimes difficult to execute. The same rule applies to defect tracking tools, with JIRA developed by Atlassian being probably the most popular one. Its main advantage is a JIRA Agile add-on (recently incorporated into a standard JIRA version) that allows users to manage user stories and linked test cases. Therefore, it can be used in an agile project as the only test management tool. All in all, Excel spreadsheets are insufficient to do the job.
Next thing is choosing a tool for designing and executing of automated tests, which depends on the type of developed application/system (e.g. web or mobile app) and applied technology. If you are dealing with websites, try Selenium. In the case of native Android apps, try Espresso, and for iOS – XCUITest. Nonetheless, test other frameworks to select the one that suits your project best.

Application release. Case study

I discussed advantages and disadvantages of various testing approaches and software development methods in previous paragraphs. Nevertheless, it turns out that releasing a reliable application is not easy. When we started a new project, a German version of Allegro iOS app, we had to find the best solution. We decided to adjust an agile model to the needs and conditions of our project for more effective QA work. The problem was that we couldn’t receive final mock-ups and user stories, as they were continuously modified. So we decided to base on those few requirements that were already agreed on and could be considered stable. We started writing manual test cases using our test management tool. It was a good idea, as we had a lot of questions about different states of an application, its behavior, edge cases, etc. Eventually, it resulted in fewer bugs at the end of the development stage. When TCs (‘Test Cases’) were ready, we asked developers, UX engineers and a product owner to review our tests. They pointed out cases we did not think of and clarified some information in TCs. It gave us better insight into how the application should work, and gave us great project documentation. We created manual test cases being a base for regression testing. But first, we used them for regular functional testing of new features. Then we included test cases created for new features to a new regression test set, and ran it one more time when a release candidate was ready.

Although it may seem that with an increasing number of regression test cases it took more time to execute the regression tests with each release, it did not. For each release-specific regression test, cases were chosen based on areas subject to changes. After testing new features for a specific release, there was no need to run all test cases for that feature if nothing was changed. It was sufficient to run the main test scenarios only. As a result, the set for regression testing was always different. Therefore, we knew how much time testing might take. And what happened when we found new bugs when running regression tests? In such situations, the product owner, QA engineers, UX specialists and developers discussed the bugs criticality. Such defect triage allowed us to decide what to fix in next releases, and what had to be fixed immediately. When developers created a new build with all necessary fixes, we ran the regression tests once again, but we used less test cases just to check areas with modified code and verify core functionalities. After finding a new critical issue, we repeated the process one more time. Fixes are always checked using separate branches, before being merged into next release candidate, but regression testing is performed on RC (‘Release Candidate’) with all necessary fixes.

You may wonder where is the automation in this process? We have one set of automated sanity tests for new builds. It covers main functionalities of an application and is run on all branches, so feedback concerning builds is quick. We also use the set as a basic check for regression tests. With an increasing number of automated tests, we use them to replace the manual ones in regression testing. But it does not mean that eventually all regression manual tests will be automated. At the very beginning of the project, before we developed the process described above, no automated tests were created. We considered it too time-consuming as some implemented functionalities, which were only a temporary solution, were supposed to be changed in the nearest future. In other words, we would spend a lot of time on test environment setup and test design to create tests to be executed only a few times, so the ROI would be very low. Therefore, it was better to focus more on manual testing.

Waterfall or Agile

Summary

As there is no perfect model that would fit every project, QA engineers should decide about the testing process by taking into account project features. A few factors need to be taken into consideration as well. So, how to find the happy medium? Be flexible and adjust the model, or some of its parts, so that it suits the specific conditions of your project.

This is a post for those seeking to accomplish business goals and ensure stability of the solutions developed while maintaining focus on people. The model of three basic psychological needs that I’m presenting here may be useful for leaders, agile coaches, and scrum masters. I also encourage developers to do some self-reflection. This is the knowledge I’ve gained at the World Conference of Transactional Analysis in Berlin. Transactional Analysis (TA) is a theory of interpersonal relationships developed by Eric Berne which has a practical application in various fields, including organizations.

Do you sometimes lack motivation even after getting positive feedback from your boss or colleagues? Or perhaps you feel that you don’t get acknowledged? Your friends would do anything to have a stable job like yours, while you keep complaining about what you don’t have? Or perhaps it’s the opposite: you’re overwhelmed with the number and the variety of challenges that the company puts in front of you? Or maybe you feel lost because you don’t have a clearly defined role and place in the team? You don’t fully understand your responsibilities and that frustrates you? With a similar level of discomfort at work, it is likely that your basic psychological needs are not being met which affects your feeling of satisfaction and fulfilment. Below is a description of these needs.

Psychological needs: the 3 hungers model

During her workshop “Psychological Needs – base to develop identity, energize leadership and manage (vanishing) boundaries”, Anke von Platen vividly presented “the 3S model” (strokes, stimuli and structure). These are our 3 basic psychological needs which can be fulfilled or unfulfilled, both at personal and professional level.

the 3 hungers model

Anke started her story with a metaphor. In the picture above, the cart represents the body (hardware), and the horse represents the mind (software). The horse may be running in circles if it doesn’t know where to go. It can be running and straining the body without giving it any rest. The driver manages the whole system. He pays attention to the body and mind. He feeds it and makes sure it gets some rest. Each of us is responsible for taking care of these two aspects; you’re the only person that is able to integrate these two parts so that they are stable and in harmony.

What the mind needs. Transactional Analysis (TA) talks out about 3 basic needs referred to as hungers:

1. Hunger of structure When your hunger of structure is satisfied, it gives you a feeling of clarity (“I understand”). For each person, it can be based on different things. For example, you may need clear business goals, information, clear responsibilities and processes, or a predictable people environment.

2. Hunger of stimuli When your hunger of stimuli is satisfied, it gives you a feeling of control (“I will cope”). I have a feeling of control if there is a good amount of challenges or stimuli for me so that I feel “Yes, I will be able to cope with it”. The feeling of control is based on experience, education, coping with challenges, control of feelings and emotions.

3. Hunger of strokes Strokes are the signs that we get from the people we work with which make us feel that what we do has value. As proof of recognition, we can get a raise or a bonus, and on a daily basis: a code review, listening, feedback, celebrating success, or development. When your hunger of strokes is satisfied, it gives you a feeling of appreciation (“It’s worthwhile”). Sometimes, when we’re not getting enough recognition and appreciation, we subconsciously prompt negative strokes.

Hunger of strokes

By satisfying these hungers, we get the answer to the question of WHAT to do, HOW to do it using our knowledge and experience, and WHY to do it, what we want to achieve by doing this, and what is it that we personally want. We can use this knowledge to perform an analysis using individual or team approach, looking into our own motivation or feeling of satisfaction at work. Each team member can perform self-assessment using these axes. The results can be then compiled for the whole team to see what the team needs to feel good and have the right conditions to work.

Chart 1

Example of self-assessment performed by an individual

The person on the diagram below is a highly motivated developer. He uses a stretch goal that was set for his team as his main source of energy. He has Clarity as to what is to be done and why, and has experience in the language and architecture that the team wants to use to build the solution (Control). He rated Appreciation as medium.

At this point, this doesn’t affect his satisfaction with work but if his frustration with the unfulfilled need for stimuli or structure gets worse, it’ll be enough to affect his overall satisfaction.

Chart 2

Example of self-assessment performed by a team: Below is a diagram of a team that has been struggling in the last few months with a large number of service errors. It’s a legacy service that was passed from another team. They haven’t had the time for a proper refactoring. The developers keep making patches. Two of them are really fed up with it. The low scores on the Control axis result from the lack of challenges and stimuli that work as a powerful driver for them when developing new functionalities. They are starting to question why their team even exists. The Product Owner appreciates their efforts but at this point, instead of acknowledgement, these two developers need more interesting challenges (Control) or at least Clarity on what they still need to improve and how long this will take. The other two developers are coping with the lack of challenges. They are glad that they’ve been able to work out a previously unfamiliar service which has earned them Appreciation from other teams. This explains the high scores on each axis.

Chart 3

The following questions can help you analyse intuitive scores:

WHAT? Do I know my responsibilities in the team? What is the business goal? Why are we developing this product?

HOW? Do I still find my job interesting or perhaps I feel overwhelmed by the level of complexity? Do I have sufficient resources to cope with?

WHY? How do I know that what I’m doing makes sense? How often do I get acknowledged? What signs of acknowledgement do I need from my co-workers/boss?

What I find interesting is the definition of these terms: what do clarity, control and appreciation mean to me as opposed to our team? Do we understand these terms the same way? What has been provided to us and what is it that we’re missing?

I encourage you to do some self-reflection on your role and the workplace. In her story, Anke often referred to a leader as a person that can manage a project or a team but can leaders successfully manage themselves? As you may guess, she encourages leaders to first make sure that their needs are being met at a satisfactory level so that they can later help their team do the same. This way, leaders become role models who can share their experience of change.

You can successfully use this model during a retrospective or a face-to-face meeting, following the guidance provided above. It was interesting for me to see the contrast between my self-assessment scores for work and for personal life. These are two separate worlds, or two systems where totally different rules can apply: we may care about different things, and accept different things as the norm or as deviation from the norm.

I suggested this exercise to a team leader. He had some interesting insights, and said that a few months ago his scores on each of the axes would have been completely different. The situation changed, and so did the extent to which his needs are being fulfilled. The question is: do we consciously contribute to the improvement of a situation, and if so, to what extent? Or do we tend to go with the flow hoping that we won’t drown? Another important conclusion from this conversation is that we need to accept that each of us, even when working under the same conditions (team, company), may understand the axes and their extreme values differently. One person may get extremely anxious about announced organizational changes, while another person may pay no attention to them (Clarity). Some people may need a lot of feedback and signs of appreciation from their colleagues, while others may derive their satisfaction from their own contentment with the completion of a task (Appreciation). A comparison of each team member’s rates on the axes may be a good reason to talk about what we’re doing, why we’re doing it, and how we feel about it as a team and individually.

At Allegro, feature velocity is a top priority. We believe that one of our critical competitive advantages is the rate at which we introduce new features. In order to achieve a high feature velocity, one of the architectural choices that Allegro made a while back was to move to microservice architecture. So when somebody uses Allegro, a request comes in (and we just have a hypothetical example here) to service D — and we can imagine a service D being a proxy or an API layer — and whatever that service is, it is not going to have all the information it needs to serve a response. So the service D is going to reach out to service C and service F, and F in turn will reach out to A, and it will in turn reach out to B and E, and you see that this very quickly gets complicated. Allegro has somewhere around 500 microservices, so you can imagine how complicated the communication map looks.

Intuition Engineering

One of the challenges our monitoring team faces is the need to get live feedback on hundreds of microservices in two datacenters, to get a gut feeling of whether the system as a whole is doing OK or not.

When you think monitoring, you think dashboards. And they are great for reactive monitoring — you’ve been alerted, go look at the dashboards, see what has changed. In fact, most standard data visualisations are not meant to be stared at in real time. They are useful for showing exact numbers and how they correlate to each other, and are meant to be more “something has happened, let’s go look, investigate, and figure what happened and when it started”. But we need to be able know what’s going on right now, to just take a quick glance and see that something could be going wrong.

We are not the only ones facing this challenge. The approach of getting a gut feeling about the holistic state of the system was popularised by Netflix who called it “Intuition Engineering”. Netflix developed a tool called Vizceral to aid their traffic team in performing traffic failover between AWS availability zones.

Vizceral

Let’s have a look at a video showing traffic failover simulation.

At first, you see the normal state. The middle circle represents the Internet, showing how many requests are coming in per second and the error rate for that traffic. There are a lot of dots moving from the Internet to the surrounding circles, with each of those circles representing one of the AWS regions that Netflix uses. Traffic is handled by all three regions, and what’s nice about this view is that it’s fairly easy to tell the volume. You can see that US-East-1 is taking the most traffic, US-West-2 is next, and EU-West-1 is trailing closely behind.

Then you see that some errors start to happen in US-East-1. Dots that represent requests have colours, normal is light blue, because colour theory shows that blue is a calm neutral colour. Red dots mean there was a request that resulted in an error response, and yellow dots mean that there was a request that resulted in a degraded response (such as one of the rows missing in a list of movies).

So errors start to happen in US-East-1, and traffic team starts to scale up other 2 regions so that they can serve traffic of the users being affected in US-East-1. And soon they start to proxy some users from US-East-1 into the other two regions. More and more traffic is being proxied out of US-East-1, and when all the traffic is being proxied out of US-East-1, they can flip the DNS (which they can’t flip first because it would overwhelm the other two regions). So they flip DNS and the traffic which causes all traffic to be sent to the other two regions while engineers work on US-East-1 and get it back up. And as soon as US-East-1 is fixed they can do it all in reverse: flip DNS again, and slowly dial back the proxying until the system gets back to a steady state.

Look at just how intuitively clear this visualization tool is. Even without comments it is fairly obvious what was happening in the simulation — where errors happen, and where the traffic flows. This is very hard to achieve with a dashboard. And it’s exactly what Vizceral is good for.

Phobos

Netflix opensourced the front end part of Vizceral, using which we created an internal tool called Phobos. It uses Vizceral, and yet it is quite different in many ways. First of all, we are not interested in traffic data, we are interested in how connections between microservices relate to specific business processes like login or purchase. If something goes wrong with a service A, the questions we’re interested in are: which other services might have degraded performance? And which business processes may be affected by it?

Phobos area view

Instead of datacentres or availability zones, the main view of Phobos shows business areas. Each area contains microservices related to a specific business process. For example, one area could contain all microservices that handle user data, another one could contain all services that handle listings, and so on. You can zoom into an area to see individual services within this area and their connections. Services that belong to the area are shown in green, other areas are shown in blue. Phobos is integrated with our monitoring stack, so alerts are shown in red. For each service there is a summary panel where you have a list of hosts, incoming and outgoing connections, links to dashboards, PagerDuty, deploy info and other integrations. You can drill even further into an individual service to see individual instances of this service, their alerts and connections.

Phobos has the ability to travel back in time, so you can see what the state of the system was yesterday during an outage, which is especially useful during root cause analysis and postmortems.

To create a map of connections between services we use a combination of two sources. Firstly, we use trace IDs from Zipkin protocol. Secondly, we collect information about individual TCP connections from netstat. While the original Vizceral operates on data about the volume of requests, in Phobos we use data about the number of TCP connections established between hosts.

Phobos service view

The frontend part of Vizceral, which is opesourced by Netflix, is written in WebGL with three.js library; and on top of Vizceral we use a Vizceral-React wrapper. The backend consists of three logical parts. First, there are host-runners — daemons that collect information about TCP connections between services and send this data via Apache Kafka to our Hadoop cluster. Second, there are Spark jobs that analyse connection data and store it in Cassandra database. And finally there is the Phobos backend itself, which is written in Python using Django Rest Framework. Phobos backend crunches the data from Cassandra and exposes it in JSON format via API endpoint in the form that Vizceral understands.

Phobos became an invaluable tool that gives us an interface to a very complex system, enabling us to develop an intuition about its state and health.

Spring is one of the most popular JVM-targeted frameworks. One of the reasons why it has become so popular is writing tests. Even before Spring Boot era, it was easy to run an embedded Spring application in tests. With Spring Boot, it became trivial. JUnit and Spock are two most popular frameworks for writing tests. They both provide great support and integration with Spring, but until recently it was not possible to leverage Spring’s @WebMvcTest in Spock. Why does it matter? @WebMvcTest is a type of an integration test that only starts a specified slice of Spring Application and thus its execution time is significantly lower compared to full end-to-end tests.
Things have changed with Spock 1.2. Let me show you, how to leverage this new feature.

@WebMvcTest

It is easy to write great tests (clear and concise) for most of the components in a typical Spring Application. We create a unit test, stub interactions with dependencies and voila. Things are not so easy when it comes to REST controllers. Until Spring Boot 1.4 testing REST controllers (and all the ’magic’ done by Spring MVC) required running full application, which of course took a lot of time. Not only startup time was the issue. Typically, one was also forced to setup entire system’s state to test certain edge cases. This usually made tests less readable. @WebMvcTest is here to change that and now, supported in Spock.

@WebMvcTest with Spock

In order to use Spock’s support for @WebMvcTest, you have to add a dependency on Spock 1.2-SNAPSHOT, as GA version has not been released yet (https://github.com/spockframework/spock).
For Gradle, add snapshot repository:

repositories{...maven{url"https://oss.sonatype.org/content/repositories/snapshots/"}}

and then the dependency:

dependencies{...testCompile(..."org.spockframework:spock-core:1.2-groovy-2.4-SNAPSHOT","org.spockframework:spock-spring:1.2-groovy-2.4-SNAPSHOT")}

Sample application

I have created a fully functional application with examples. All snippets in this article are taken from it. The application can be found here: https://github.com/rafal-glowinski/mvctest-spock. It exposes a REST API for users to register to some event. Registration requirements are minimal: a user has to provide a valid email address, name, and last name. All fields are required.

Starting with Rest Controller (most imports omitted for clarity):

...importjavax.validation.Valid@RestController@RequestMapping(path="/registrations")publicclassUserRegistrationController{privatefinalRegistrationServiceregistrationService;publicUserRegistrationController(RegistrationServiceregistrationService){this.registrationService=registrationService;}@PostMapping(consumes=APPLICATION_JSON_VALUE,produces=APPLICATION_JSON_VALUE)@ResponseStatus(HttpStatus.CREATED)publicExistingUserRegistrationDTOregister(@RequestBody@ValidNewUserRegistrationDTOnewUserRegistration){UserRegistrationuserRegistration=registrationService.registerUser(newUserRegistration.getEmailAddress(),newUserRegistration.getName(),newUserRegistration.getLastName());returnasDTO(userRegistration);}privateExistingUserRegistrationDTOasDTO(UserRegistrationregistration){returnnewExistingUserRegistrationDTO(registration.getRegistrationId(),registration.getEmailAddress(),registration.getName(),registration.getLastName());}...}

We tell Spring Web to validate incoming request body (@Valid annotation on function argument). If you are using Spring Boot 1.4.x then this will not work without an additional post-processor in Spring Configuration:

@SpringBootApplicationpublicclassWebMvcTestApplication{publicstaticvoidmain(String[]args){SpringApplication.run(WebMvcTestApplication.class,args);}@BeanpublicMethodValidationPostProcessormethodValidationPostProcessor(){returnnewMethodValidationPostProcessor();}}

Spring Boot 1.5.x is shipped with an additional ValidationAutoConfiguration that automatically creates an instance of MethodValidationPostProcessor if necessary dependencies are present on classpath.

Now, having a REST Controller ready, we need a class to deserialize JSON request into. A simple POJO with Jackson and Javax Validation API annotations is enough to do the trick:

importcom.fasterxml.jackson.annotation.JsonCreator;importcom.fasterxml.jackson.annotation.JsonProperty;importjavax.validation.constraints.NotNull;importjavax.validation.constraints.Pattern;importjavax.validation.constraints.Size;importstaticcom.rg.webmvctest.SystemConstants.EMAIL_REGEXP;publicclassNewUserRegistrationDTO{privatefinalStringemailAddress;privatefinalStringname;privatefinalStringlastName;@JsonCreatorpublicNewUserRegistrationDTO(@JsonProperty("email_address")StringemailAddress,@JsonProperty("name")Stringname,@JsonProperty("last_name")StringlastName){this.emailAddress=emailAddress;this.name=name;this.lastName=lastName;}@Pattern(regexp=EMAIL_REGEXP,message="Invalid email address.")@NotNull(message="Email must be provided.")publicStringgetEmailAddress(){returnemailAddress;}@NotNull(message="Name must be provided.")@Size(min=2,max=50,message="Name must be at least 2 characters and at most 50 characters long.")publicStringgetName(){returnname;}@NotNull(message="Last name must be provided.")@Size(min=2,max=50,message="Last name must be at least 2 characters and at most 50 characters long.")publicStringgetLastName(){returnlastName;}}

What we have here is a POJO with 3 fields. Each of these fields has Jackson’s @JsonProperty annotation and two more from Javax Validation API.

First test

Writing @WebMvcTest is trivial once you have a framework that supports it. Following example is a minimal working piece of code to create a @WebMvcTest in Spock (written in Groovy):

...importstaticorg.springframework.test.web.servlet.request.MockMvcRequestBuilders.postimportstaticorg.springframework.test.web.servlet.result.MockMvcResultMatchers.jsonPathimportstaticorg.springframework.test.web.servlet.result.MockMvcResultMatchers.status@WebMvcTest(controllers=[UserRegistrationController])// 1classSimplestUserRegistrationSpecextendsSpecification{@AutowiredprotectedMockMvcmvc// 2@AutowiredRegistrationServiceregistrationService@AutowiredObjectMapperobjectMapperdef"should pass user registration details to domain component and return 'created' status"(){given:Maprequest=[email_address:'john.wayne@gmail.com',name:'John',last_name:'Wayne']and:registrationService.registerUser('john.wayne@gmail.com','John','Wayne')>>newUserRegistration(// 3'registration-id-1','john.wayne@gmail.com','John','Wayne')when:defresults=mvc.perform(post('/registrations').contentType(APPLICATION_JSON).content(toJson(request)))// 4then:results.andExpect(status().isCreated())// 5and:results.andExpect(jsonPath('$.registration_id').value('registration-id-1'))// 5results.andExpect(jsonPath('$.email_address').value('john.wayne@gmail.com'))results.andExpect(jsonPath('$.name').value('John'))results.andExpect(jsonPath('$.last_name').value('Wayne'))}@TestConfiguration// 6staticclassStubConfig{DetachedMockFactorydetachedMockFactory=newDetachedMockFactory()@BeanRegistrationServiceregistrationService(){returndetachedMockFactory.Stub(RegistrationService)}}}

First, there is a @WebMvcTest (1) annotation on the class level. We use it to inform Spring which controllers should be started. In this example, UserRegistrationController is created and mapped onto defined request paths, but to make that happen we have to provide stubs for all dependencies of UserRegistrationController. We do it by writing a custom configuration class and annotating it with @TestConfiguration (6).

Now, when Spring instantiates UserRegistrationController, it passes the stub created in StubConfig as a constructor argument and we are able to perform stubbing in our tests (3). We perform an HTTP request (4) using injected instance of MockMvc (2). Finally, we execute assertions on the obtained instance of org.springframework.test.web.servlet.ResultActions (5). Notice that these were not typical Spock assertions, we used ones built into Spring. Worry not, there is a way to make use of one of the strongest features of Spock:

def"should pass user registration details to domain component and return 'created' status"(){given:Maprequest=[email_address:'john.wayne@gmail.com',name:'John',last_name:'Wayne']and:registrationService.registerUser('john.wayne@gmail.com','John','Wayne')>>newUserRegistration('registration-id-1','john.wayne@gmail.com','John','Wayne')when:defresponse=mvc.perform(post('/registrations').contentType(APPLICATION_JSON).content(toJson(request))).andReturn().response// notice the extra call to: andReturn()then:response.status==HttpStatus.CREATED.value()and:with(objectMapper.readValue(response.contentAsString,Map)){it.registration_id=='registration-id-1'it.email_address=='john.wayne@gmail.com'it.name=='John'it.last_name=='Wayne'}}

What is different with respect to the previous test is the extra call of andReturn() method on the ResultAction object to obtain an HTTP response. Having a response object, we can perform any assertions we need as we would do in any Spock test.

Testing validations

So, let us get back to validations we want to perform on incoming requests. The NewUserRegistrationDTO class has lots of additional annotations that describe what values are allowed for each of the fields. When any of these fields are recognized as having illegal values, Spring throws org.springframework.web.bind.MethodArgumentNotValidException. How do we return a proper HTTP Status and error description in such situation?

First, we tell Spring that we are handling the mapping of MethodArgumentNotValidException onto the ResponseEntity ourselves. We do this by creating a new class and annotating it with org.springframework.web.bind.annotation.ControllerAdvice. Spring recognizes all such classes and they are instantiated as if they were regular Spring Beans. Inside this class, we write a function that handles the mapping. In my sample application, it looks like this:

@ControllerAdvicepublicclassExceptionsHandlerAdvice{privatefinalExceptionMapperHelpermapperHelper=newExceptionMapperHelper();@ExceptionHandler(MethodArgumentNotValidException.class)publicResponseEntity<ErrorsHolder>handleException(MethodArgumentNotValidExceptionexception){ErrorsHoldererrors=newErrorsHolder(mapperHelper.errorsFromBindResult(exception,exception.getBindingResult()));returnmapperHelper.mapResponseWithoutLogging(errors,HttpStatus.UNPROCESSABLE_ENTITY);}}

What we have here is a function annotated with org.springframework.web.bind.annotation.ExceptionHandler. Spring recognizes this method and registers it as global exception handler. If MethodArgumentNotValidException is thrown outside of the scope of the Rest Controller, this function is called to produce the response — an instance of org.springframework.http.ResponseEntity. In this case, I have decided to return HTTP Status 422 — UNPROCESSABLE_ENTITY with my own, custom errors structure.

Here is a more complicated example, that shows full test setup (make sure to check the sources on GitHub):

@Unrolldef"should not allow to create a registration with an invalid email address: #emailAddress"(){given:Maprequest=[email_address:emailAddress,name:'John',last_name:'Wayne']when:defresult=doRequest(post('/registrations').contentType(APPLICATION_JSON).content(toJson(request))).andReturn()then:result.response.status==HttpStatus.UNPROCESSABLE_ENTITY.value()and:with(objectMapper.readValue(result.response.contentAsString,Map)){it.errors[0].code=='MethodArgumentNotValidException'it.errors[0].path=='emailAddress'it.errors[0].userMessage==userMessage}where:emailAddress||userMessage'john.wayne(at)gmail.com'||'Invalid email address.''abcdefg'||'Invalid email address.'''||'Invalid email address.'null||'Email must be provided.'}

Summary

This short article by no means covers all features of Spring’s Web Mvc Tests. There are lots of cool features available (testing against Spring Security) and more are coming. JUnit always gets the support first but if you are a Spock fan like me, then I hope you have found this article helpful.

We all make errors, but some errors seem so ridiculous we wonder how anyone, let alone we ourselves, could have done such a thing. This is, of course, easy to notice only after the fact. Below, I describe a series of such errors which we recently made in one of our applications. What makes it interesting is that initial symptoms indicated a completely different kind of problem than the one actually present.

Once upon a midnight dreary

I was woken up shortly after midnight by an alert from our monitoring system. Adventory, an application responsible for indexing ads in our PPC (pay-per-click) advertising system had apparently restarted several times in a row. In a cloud environment, a restart of one single instance is a normal event and does not trigger any alerts, but this time the threshold had been exceeded by multiple instances restarting within a short period. I switched on my laptop and dived into the application’s logs.

It must be the network

I saw several timeouts as the service attempted connecting to ZooKeeper. We use ZooKeeper (ZK) to coordinate indexing between multiple instances and rely on it to be robust. Clearly, a Zookeeper failure would prevent indexing from succeeding, but it shouldn’t cause the whole app to die. Still, this was such a rare situation (the first time I ever saw ZK go down in production) that I thought maybe we had indeed failed to handle this case gracefully. I woke up the on-duty person responsible for ZooKeeper and asked them to check what was going on.

Meanwhile, I checked our configuration and realized that timeouts for ZooKeeper connections were in the multi-second range. Obviously, ZooKeeper was completely dead, and given that other applications were also using it, this meant serious trouble. I sent messages to a few more teams who were apparently not aware of the issue yet.

My colleague from ZooKeeper team got back to me, saying that everything looked perfectly normal from his point of view. Since other users seemed unaffected, I slowly realized ZooKeeper was not to blame. Logs clearly showed network timeouts, so I woke up the people responsible for networking.

Networking team checked their metrics but found nothing of interest. While it is possible for a single segment of the network or even a single rack to get cut off from the rest, they checked the particular hosts on which my app instances were running and found no issues. I had checked a few side ideas in the meantime but none worked, and I was at my wit’s end. It was getting really late (or rather early) and, independently from my actions, restarts somehow became less frequent. Since this app only affected the freshness of data but not its availability, together with all involved we decided to let the issue wait until morning.

It must be garbage collection

Sometimes it is a good idea to sleep on it and get back to a tough problem with a fresh mind. Nobody understood what was going on and the service behaved in a really magical way. Then it dawned on me. What is the main source of magic in Java applications? Garbage collection of course.

Just for cases like this, we keep GC logging on by default. I quickly downloaded the GC log and fired up Censum. Before my eyes, a grisly sight opened: full garbage collections happening once every 15 minutes and causing 20-second long [!] stop-the-world pauses. No wonder the connection to ZooKeeper was timing out despite no issues with either ZooKeeper or the network!

These pauses also explained why the whole application kept dying rather than just timing out and logging an error. Our apps run inside Marathon, which regularly polls a healthcheck endpoint of each instance and if the endpoint isn’t responding within reasonable time, Marathon restarts that instance.

20-second GC pauses — certainly not your average GC log

Knowing the cause of a problem is half the battle, so I was very confident that the issue would be solved in no time. In order to explain my further reasoning, I have to say a bit more about how Adventory works, for it is not your standard microservice.

Adventory is used for indexing our ads into ElasticSearch (ES). There are two sides to this story. One is acquiring the necessary data. To this end, the app receives events sent from several other parts of the system via Hermes. The data is saved to MongoDB collections. The traffic is a few hundred requests per second at most, and each operation is rather lightweight, so even though it certainly causes some memory allocation, it doesn’t require lots of resources. The other side of the story is indexing itself. This process is started periodically (around once every two minutes) and causes data from all the different MongoDB collections to be streamed using RxJava, combined into denormalized records, and sent to ElasticSearch. This part of the application resembles an offline batch processing job more than a service.

During each run, the whole index is rebuilt since there are usually so many changes to the data that incremental indexing is not worth the fuss. This means that a whole lot of data has to pass through the system and that a lot of memory allocation takes place, forcing us to use a heap as large as 12 GB despite using streams. Due to the large heap (and to being the one which is currently fully supported), our GC of choice was G1.

Having previously worked with some applications which allocate a lot of short-lived objects, I increased the size of young generation by increasing both -XX:G1NewSizePercent and -XX:G1MaxNewSizePercent from their default values so that more data could be handled by the young GC rather than being moved to old generation, as Censum showed a lot of premature tenuring. This was also consistent with the full GC collections taking place after some time. Unfortunately, these settings didn’t help one bit.

The next thing I thought was that perhaps the producer generated data too fast for the consumer to consume, thus causing records to be allocated faster than they could be freed. I tried to reduce the speed at which data was produced by the repository by decreasing the size of a thread pool responsible for generating the denormalized records while keeping the size of the consumer data pool which sent them off to ES unchanged. This was a primitive attempt at applying backpressure, but it didn’t help either.

It must be a memory leak

At this point, a colleague who had kept a cooler head, suggested we do what we should have done in the first place, which is to look at what data we actually had in the heap. We set up a development instance with an amount of data comparable to the one in production and a proportionally scaled heap. By connecting to it with jvisualvm and running the memory sampler, we could see the approximate counts and sizes of objects in the heap. A quick look revealed that the number of our domain Ad objects was way larger than it should be and kept growing all the time during indexing, up to a number which bore a striking resemblance to the number of ads we were processing. But… this couldn’t be. After all, we were streaming the records using RX exactly for this reason: in order to avoid loading all of the data into memory.

Memory Sampler showed many more Ad objects than we expected

With growing suspicion, I inspected the code, which had been written about two years before and never seriously revisited since. Lo and behold, we were actually loading all data into memory. It was, of course not intended. Not knowing RxJava well enough at that time, we wanted to parallelize the code in a particular way and resolved to using CompletableFuture along with a separate executor in order to offload some work from the main RX flow. But then, we had to wait for all the CompletableFutures to complete… by storing references to them and calling join(). This caused the references to all futures, and thus also to all the data they referenced, to be kept alive until the end of indexing, and prevented the Garbage Collector from freeing them up earlier.

Is it really so bad?

This is obviously a stupid mistake, and we were quite disgusted at finding it so late. I even remembered a brief discussion a long time earlier about the app needing a 12 GB heap, which seemed a bit much. But on the other hand, this code had worked for almost two years without any issues. We were able to fix it with relative ease at this point while it would probably have taken us much more time if we tried fixing it two years before and at that time there was a lot of work much more important for the project than saving a few gigabytes of memory.

So while on a purely technical level having this issue for such a long time was a real shame, from a strategic point of view maybe leaving it alone despite the suspected inefficiency was the pragmatically wiser choice. Of course, yet another consideration was the impact of the problem once it came into light. We got away with almost no impact for the users, but it could have been worse. Software engineering is all about trade-offs, and deciding on the priorities of different tasks is no exception.

Still not working

Having more RX experience under our belt, we were able to quite easily get rid of the CompletableFutures, rewrite the code to use only RX, migrate to RX2 in the process, and to actually stream the data instead of collecting it in memory. The change passed code review and went to testing in dev environment. To our surprise, the app was still not able to perform indexing with a smaller heap. Memory sampling revealed that the number of ads kept in memory was smaller than previously and it not only grew but sometimes also decreased, so it was not all collected in memory. Still, it seemed as if the data was not being streamed, either.

So what is it now?

The relevant keyword was already used in this post: backpressure. When data is streamed, it is common for the speeds of the producer and the consumer to differ. If the producer is faster than the consumer and nothing forces it to slow down, it will keep producing more and more data which can not be consumed just as fast. There will appear a growing buffer of outstanding records waiting for consumption and this is exactly what happened in our application. Backpressure is the mechanism which allows a slow consumer to tell the fast producer to slow down.

Our indexing stream had no notion of backpressure which was not a problem as long as we were storing the whole index in memory anyway. Once we fixed one problem and started to actually stream the data, another problem — the lack of backpressure — became apparent.

This is a pattern I have seen multiple times when dealing with performance issues: fixing one problem reveals another which you were not even aware of because the other issue hid it from view. You may not be aware your house has a fire safety issue if it is regularly getting flooded.

Fixing the fix

In RxJava 2, the original Observable class was split into Observable which does not support backpressure and Flowable which does. Fortunately, there are some neat ways of creating Flowables which give them backpressure support out-of-the-box. This includes creating Flowables from non-reactive sources such as Iterables. Combining such Flowables results in Flowables which also support backpressure, so fixing just one spot quickly gave the whole stream backpressure support.

With this change in place, we were able to reduce the heap from 12 GB to 3 GB and still have the app do its job just as fast as before. We still got a single full GC with a pause of roughly 2 seconds once every few hours, but this was already much better than the 20 second pauses (and crashes) we saw before.

GC tuning again

However, the story was not over yet. Looking at GC logs, we still noticed lots of premature tenuring — on the order of 70%. Even though performance was already acceptable, we tried to get rid of this effect, hoping to perhaps also prevent the full garbage collection at the same time.

Lots of premature tenuring

Premature tenuring (also known as premature promotion) happens when an object is short-lived but gets promoted to the old (tenured) generation anyway. Such objects may affect GC performance since they stuff up the old generation which is usually much larger and uses different GC algorithms than new generation. Therefore, premature promotion is something we want to avoid.

We knew our app would produce lots of short-lived objects during indexing, so some premature promotion was no surprise, but its extent was. The first thing that comes to mind when dealing with an app that creates lots of short-lived objects is to simply increase the size of young generation. By default, G1GC can adjust the size of generations automatically, allowing for between 5% and 60% of the heap to be used by the new generation. I noticed that in the live app proportions of young and old generations changed all the time over a very wide range of proportions, but still went ahead and checked what would happen if I raised both bounds: -XX:G1NewSizePercent=40 and -XX:G1MaxNewSizePercent=90. This did not help, and it actually made matters much worse, triggering full GCs almost immediately after the app started. I tried some other ratios, but the best I could arrive at was only increasing G1MaxNewSizePercent without modifying the minimum value: it worked about as well as defaults but not better.

After trying a few other options, with as little success as in my first attempt, I gave up and e-mailed Kirk Pepperdine who is a renowned expert in Java performance and whom I had the opportunity to meet at Devoxx conference and during training sessions at Allegro. After viewing GC logs and exchanging a few e-mails, Kirk suggested an experiment which was to set -XX:G1MixedGCLiveThresholdPercent=100. This setting should force G1GC mixed collections to clean all old regions regardless of how much they were filled up, and thus to also remove any objects prematurely tenured from young. This should prevent old generation from filling up and causing a full GC at any point. However, we were again surprised to get a full garbage collection run after some time. Kirk concluded that this behavior, which he had seen earlier in other applications, was a bug in G1GC: the mixed collections were apparently not able to clean up all garbage and thus allowed it to accumulate until full GC. He said he had contacted Oracle about it but they claimed this was not a bug and the behavior we observed was correct.

Conclusion

What we ended up doing was just increasing the app’s heap size a bit (from 3 to 4 GB), and full garbage collections went away. We still see a lot of premature tenuring but since performance is OK now, we don’t care so much any more. One option we could try would be switching to CMS (Concurrent Mark Sweep) GC, but since it is deprecated by now, we’d rather avoid using it if possible.

Problem fixed — GC pauses after all changes and 4 GB heap

So what is the moral of the story? First, performance issues can easily lead you astray. What at first seemed to be a ZooKeeper or network issue, turned out to be an error in our own code. Even after realizing this, the first steps I undertook were not well thought out. I started tuning garbage collection in order to avoid full GC before checking in detail what was really going on. This is a common trap, so beware: even if you have an intuition of what to do, check your facts and check them again in order to not waste time solving a problem different from the one you are actually dealing with.

Second, getting performance right is really hard. Our code had good test coverage and feature-wise worked perfectly, but failed to meet performance requirements which were not clearly defined at the beginning and which did not surface until long after deployment. Since it is usually very hard to faithfully reproduce your production environment, you will often be forced to test performance in production, regardless of how bad that sounds.

Third, fixing one issue may allow another, latent one, to surface, and force you to keep digging in for much longer than you expected. The fact that we had no backpressure was enough to break the app, but it didn’t become visible before we had fixed the memory leak.

I hope you find this funny experience of ours helpful when debugging your own performance issues!

Due to the high interest and controversy concerning this blog post, we believe that it is worth adding some context on how we work and make decisions at Allegro. Each of more than 50 development teams at Allegro has the freedom to choose technologies from those supported by our PaaS. We mainly code in Java, Kotlin, Python and Golang. The point of view presented in the article results from the author’s experience.

Kotlin is popular, Kotlin is trendy. Kotlin gives you compile-time null-safety and less boilerplate. Naturally, it’s better than Java. You should switch to Kotlin or die as a legacy coder. Hold on, or maybe you shouldn’t? Before you start writing in Kotlin, read the story of one project. The story about quirks and obstacles becoming so annoying that we decided to rewrite.

We gave Kotlin a try, but now we are rewriting to Java 10

I have my favorite set of JVM languages. Java in /main and Groovy in /test are the best-performing duo for me. In summer 2017 my team started a new microservice project, and as usual, we talked about languages and technologies. There are a few Kotlin advocating teams at Allegro, and we wanted to try something new, so we decided to give Kotlin a try. Since there is no Spock counterpart for Kotlin, we decided to stick with Groovy in /test (Spek isn’t as good as Spock). In winter 2018, after few months of working with Kotlin on a daily basis, we summarized pros and cons and arrived at the conclusion that Kotlin made us less productive. We started rewriting this microservice to Java.

Here are the reasons why.

Name shadowing

Shadowing was my biggest surprise in Kotlin. Consider this function:

funinc(num:Int){valnum=2if(num>0){valnum=3}println("num: "+num)}

What will be printed when you call inc(1)? Well, in Kotlin, method arguments are values, so you can’t change the num argument. That’s good language design because you shouldn’t change method arguments. But you can define another variable with the same name and initialize it to whatever you wish. Now you have two variables named num in the method level scope. Of course, you can access only the one num at a time, so effectively, the value of the num is changed. Checkmate.

In the if body, you can add another num, which is less shocking (new block-level scope).

Okay, so in Kotlin, inc(1) prints 2. The equivalent code in Java, won’t compile:

voidinc(intnum){intnum=2;//error: variable 'num' is already defined in the scopeif(num>0){intnum=3;//error: variable 'num' is already defined in the scope}System.out.println("num: "+num);}

Name shadowing wasn’t invented by Kotlin. It’s common in programming languages. In Java, we get used to shadowing class fields with methods arguments:

publicclassShadow{intval;publicShadow(intval){this.val=val;}}

In Kotlin, shadowing goes too far. Definitely, it’s a design flaw made by Kotlin team. IDEA team tried to fix this by showing you the laconic warning on each shadowed variable: Name shadowed. Both teams work in the same company, so maybe they can talk to each other and reach a consensus on the shadowing issue? My hint — IDEA guys are right. I can’t imagine a valid use case for shadowing a method argument.

Type inference

In Kotlin, when you declare a var or val, you usually let the compiler guess the variable type from the type of expression on the right. We call it local variable type inference, and it’s a great improvement for programmers. It allows us to simplify the code without compromising static type checking.

For example, this Kotlin code:

vara="10"

would be translated by the Kotlin compiler into:

vara:String="10"

It was the real advantage over Java. I deliberately said was, because — good news — Java 10 has it and Java 10 is available now.

Type inference in Java 10:

vara="10";

To be fair, I need to add, that Kotlin is still slightly better in this field. You can use type inference also in other contexts, for example, one-line methods.

More about Local-Variable Type Inference in Java 10.

Compile time null-safety

Null-safe types are Kotlin’s killer feature. The idea is great. In Kotlin, types are by default non-nullable. If you need a nullable type you need to add ? to it, for example:

vala:String?=null// ok
valb:String=null// compilation error

Kotlin won’t compile if you use a nullable variable without the null check, for example:

println(a.length)// compilation error
println(a?.length)// fine, prints null
println(a?.length?:0)// fine, prints 0

Once you have these this two kind of types, non-nullable T and nullable T?, you can forget about the most common exception in Java — NullPointerException. Really? Unfortunately, it’s not that simple.

Things get nasty when your Kotlin code has to get along with Java code (libraries are written in Java, so it happens pretty often I guess). Then, the third kind of type jumps in — T!. It’s called platform type, and somehow it means T or T?. Or if we want to be precise, T! means T with undefined nullability. This weird type can’t be denoted in Kotlin, it can be only inferred from Java types. T! can mislead you because it’s relaxed about nulls and disables Kotlin’s null-safety net.

Consider the following Java method:

publicclassUtils{staticStringformat(Stringtext){returntext.isEmpty()?null:text;}}

Now, you want to call format(String) from Kotlin. Which type should you use to consume the result of this Java method? Well, you have three options.

First approach. You can use String, the code looks safe but can throw NPE.

fundoSth(text:String){valf:String=Utils.format(text)// compiles but assignment can throw NPE at runtime
println("f.len : "+f.length)}

You need to fix it with Elvis:

fundoSth(text:String){valf:String=Utils.format(text)?:""// safe with Elvis
println("f.len : "+f.length)}

Second approach. You can use String?, and then you are null-safe:

fundoSth(text:String){valf:String?=Utils.format(text)// safe
println("f.len : "+f.length)// compilation error, fine
println("f.len : "+f?.length)// null-safe with ? operator
}

Third approach. What if you just let the Kotlin do the fabulous local variable type inference?

fundoSth(text:String){valf=Utils.format(text)// f type inferred as String!
println("f.len : "+f.length)// compiles but can throw NPE at runtime
}

Bad idea. This Kotlin code looks safe, compiles, but allows nulls for the unchecked journey through your code, pretty much like in Java.

There is one more trick, the !! operator. Use it to force inferring f type as String:

fundoSth(text:String){valf=Utils.format(text)!!// throws NPE when format() returns null
println("f.len : "+f.length)}

In my opinion, Kotlin’s type system with all these scala-like !, ?, and !! is too complex. Why Kotlin infers from Java T to T! and not to T?? It seems like Java interoperability spoils Kotlin’s killer feature — the type inference. Looks like you should declare types explicitly (as T?) for all Kotlin variables populated by Java methods.

Class literals

Class literals are common when using Java libraries like Log4j or Gson.

In Java, we write the class name with .class suffix:

Gsongson=newGsonBuilder().registerTypeAdapter(LocalDate.class,newLocalDateAdapter()).create();

In Groovy, class literals are simplified to the essence. You can omit the .class and it doesn’t matter if it’s a Groovy or Java class.

defgson=newGsonBuilder().registerTypeAdapter(LocalDate,newLocalDateAdapter()).create()

Kotlin distinguishes between Kotlin and Java classes and has the syntax ceremony for it:

valkotlinClass:KClass<LocalDate>=LocalDate::classvaljavaClass:Class<LocalDate>=LocalDate::class.java

So in Kotlin, you are forced to write:

valgson=GsonBuilder().registerTypeAdapter(LocalDate::class.java,LocalDateAdapter()).create()

Which is ugly.

Reversed type declaration

In the C-family of programming languages, we have the standard way of declaring types of things. Shortly, first goes a type, then goes a typed thing (variable, fields, method, and so on).

Standard notation in Java:

intinc(inti){returni+1;}

Reversed notation in Kotlin:

funinc(i:Int):Int{returni+1}

This disorder is annoying for several reasons.

First, you need to type and read this noisy colon between names and types. What is the purpose of this extra character? Why are names separated from their types? I have no idea. Sadly, it makes your work in Kotlin harder.

The second problem. When you read a method declaration, first of all, you are interested in the name and the return type, and then you scan the arguments.

In Kotlin, the method’s return type could be far at the end of the line, so you need to scroll:

privatefungetMetricValue(kafkaTemplate:KafkaTemplate<String,ByteArray>,metricName:String):Double{...}

Or, if arguments are formatted line-by-line, you need to search. How much time do you need to find the return type of this method?

@BeanfunkafkaTemplate(@Value("\${interactions.kafka.bootstrap-servers-dc1}")bootstrapServersDc1:String,@Value("\${interactions.kafka.bootstrap-servers-dc2}")bootstrapServersDc2:String,cloudMetadata:CloudMetadata,@Value("\${interactions.kafka.batch-size}")batchSize:Int,@Value("\${interactions.kafka.linger-ms}")lingerMs:Int,metricRegistry:MetricRegistry):KafkaTemplate<String,ByteArray>{valbootstrapServer=if(cloudMetadata.datacenter=="dc1"){bootstrapServersDc1}...}

The third problem with reversed notation is poor auto-completion in an IDE. In standard notation, you start from a type name, and it’s easy to find a type. Once you pick a type, an IDE gives you several suggestions about a variable name, derived from selected type. So you can quickly type variables like this:

MongoExperimentsRepositoryrepository

Typing this variable in Kotlin is harder even in IntelliJ, the greatest IDE ever. If you have many repositories, you won’t find the right pair on the auto-completion list. It means typing the full variable name by hand.

repository:MongoExperimentsRepository

Companion object

A Java programmer comes to Kotlin.

“Hi, Kotlin. I’m new here, may I use static members?” He asks.
“No. I’m object-oriented and static members aren’t object-oriented,” Kotlin replies.
“Fine, but I need the logger for MyClass, what should I do?”
“No problem, use a companion object then.”
“And what’s a companion object?”
“It’s the singleton object bounded to your class. Put your logger in the companion object,” Kotlin explains.
“I see. Is it right?”

classMyClass{companionobject{vallogger=LoggerFactory.getLogger(MyClass::class.java)}}

“Yes!“
“Quite verbose syntax,” the programmer seems puzzled, “but okay, now I can call my logger like this — MyClass.logger, just like a static member in Java?”
“Um… yes, but it’s not a static member! There are only objects here. Think of it as the anonymous inner class already instantiated as the singleton. And in fact this class isn’t anonymous, it’s named Companion, but you can omit the name. See? That’s simple.“

I appreciate the object declaration concept — singletons are useful. But removing static members from the language is impractical. In Java, we are using static loggers for years. It’s classic. It’s just a logger, so we don’t care about object-oriented purity. It works, and it never did any harm.

Sometimes, you have to use static. Old good public static void main() is still the only way to launch a Java app. Try to write this companion object spell without googling.

classAppRunner{companionobject{@JvmStaticfunmain(args:Array<String>){SpringApplication.run(AppRunner::class.java,*args)}}}

Collection literals

In Java, initializing a list requires a lot of ceremony:

importjava.util.Arrays;...List<String>strings=Arrays.asList("Saab","Volvo");

Initializing a Map is so verbose, that lot of people use Guava:

importcom.google.common.collect.ImmutableMap;...Map<String,String>string=ImmutableMap.of("firstName","John","lastName","Doe");

In Java, we are still waiting for the new syntax to express collection and map literals. The syntax, which is so natural and handy in many languages.

JavaScript:

constlist=['Saab','Volvo']constmap={'firstName':'John','lastName':'Doe'}

Python:

list=['Saab','Volvo']map={'firstName':'John','lastName':'Doe'}

Groovy:

deflist=['Saab','Volvo']defmap=['firstName':'John','lastName':'Doe']

Simply, the neat syntax for collection literals is what you expect from a modern programming language, especially if it’s created from scratch. Instead of collection literals, Kotlin offers the bunch of built-in functions: listOf(), mutableListOf(), mapOf(), hashMapOf(), and so on.

Kotlin:

vallist=listOf("Saab","Volvo")valmap=mapOf("firstName"to"John","lastName"to"Doe")

In maps, keys and values are paired with the to operator, which is good, but why not use well-known : for that? Disappointing.

Maybe? Nope

Functional languages (like Haskell) don’t have nulls. Instead, they offer the Maybe monad (if you are not familiar with monads, read this article by Tomasz Nurkiewicz).

Maybe was introduced to the JVM world the long time ago by Scala as Option, and then, became adopted in Java 8 as Optional. Now, Optional are quite popular way of dealing with nulls in return types at API boundaries.

There is no Optional equivalent in Kotlin. It seems that you should use bare Kotlin’s nullable types. Let’s investigate this issue.

Typically, when you have an Optional, you want to apply a series of null-safe transformations and deal with null at the and.

For example, in Java:

publicintparseAndInc(Stringnumber){returnOptional.ofNullable(number).map(Integer::parseInt).map(it->it+1).orElse(0);}

No problem one might say, in Kotlin, for mapping you can use the let function:

funparseAndInc(number:String?):Int{returnnumber.let{Integer.parseInt(it)}.let{it->it+1}?:0}

Can you? Yes, but it’s not that simple. The above code is wrong and throws NPE from parseInt(). The monadic-style map() is executed only if the value is present. Otherwise, null is just passed by. That’s why map() is so handy. Unfortunately, Kotlin’s let doesn’t work that way. It’s just called on everything from the left, including nulls.

So in order make this code null-safe, you have to add ? before each let:

funparseAndInc(number:String?):Int{returnnumber?.let{Integer.parseInt(it)}?.let{it->it+1}?:0}

Now, compare readability of the Java and Kotlin versions. Which one do you prefer?

Read more about Optionals at Stephen Colebourne’s blog.

Data classes

Data classes are Kotlin’s way to reduce the boilerplate that is inevitable in Java when implementing Value Objects (aka DTO).

For example, in Kotlin, you write only the essence of a Value Object:

data classUser(valname:String,valage:Int)

and Kotlin generates good implementations of equals(), hashCode(), toString(), and copy().

It’s really useful when implementing simple DTOs. But remember, Data classes come with the serious limitation — they are final. You cannot extend a Data class or make it abstract. So probably, you won’t use them in a core domain model.

This limitation is not Kotlin’s fault. There is no way to generate the correct value-based equals() without violating the Liskov Principle. That’s why Kotlin doesn’t allow inheritance for Data classes.

Open classes

In Kotlin, classes are final by default. If you want to extend a class, you have to add the open modifier to it.

Inheritance syntax looks like this:

openclassBaseclassDerived:Base()

Kotlin changed the extends keyword into the : operator, which is already used to separate variable name from its type. Back to C++ syntax? For me it’s confusing.

What is controversial here is making classes final by default. Maybe Java programmers overuse inheritance. Maybe you should think twice before allowing for extending your class. But we live in the frameworks world, and frameworks love AOP. Spring uses libraries (cglib, jassist) to generate dynamic proxies for your beans. Hibernate extends you entities to enable lazy loading.

If you are using Spring, you have two options. You can put open in front of all bean classes (which is rather boring), or use this tricky compiler plugin:

buildscript{dependencies{classpathgroup:'org.jetbrains.kotlin',name:'kotlin-allopen',version:"$versions.kotlin"}}

Steep learning curve

If you think that you can learn Kotlin quickly because you already know Java — you are wrong. Kotlin would throw you in the deep end. In fact, Kotlin’s syntax is far closer to Scala. It’s the all-in bet. You would have to forget Java and switch to the completely different language.

On the contrary, learning Groovy is a pleasant journey. Groovy would lead you by the hand. Java code is correct Groovy code, so you can start by changing the file extension from .java to .groovy. Each time when you learn a new Groovy feature, you can decide. Do you like it or do you prefer to stay with the Java way? That’s awesome.

Final thoughts

Learning a new technology is like an investment. We invest our time and then the technology should pay off. I’m not saying that Kotlin is a bad language. I’m just saying that in our case, the costs outweighed the benefits.

Funny facts about Kotlin

In Poland, Kotlin is one of the best selling brands of ketchup. This name clash is nobody’s fault, but it’s funny. Kotlin sounds to our ears like Heinz.

Kotlin ketckup

The following article has two parts. The first part describes improving Allegro iOS app launch time by adopting static linking and sums it up with a speedup analysis. The second part describes how I managed to launch a custom macOS app using not-yet-fully-released dyld3 dynamic linker and also completes with an app launch speedup analysis.

Improving iOS app launch time

It takes some time to launch a mobile app, especially on a system with limited power of mobile CPU. Apple suggests 400ms as a good launch time. iOS performs zoom animation during the app launch – thus creating an opportunity to perform all CPU-intensive tasks. Ideally the whole launch process on iOS should be completed as soon as the app opening animation ends.

Apple engineers described some techniques to improve launch times in WWDC 2016 - Session 406: Optimizing App Startup Time. This wasn’t enough, so the very next year they announced a brand new dynamic linker in WWDC 2017 - Session 413: App Startup Time: Past, Present, and Future. Looking at the history of dyld, one can see that Apple is constantly trying to make their operating systems faster.

At Allegro we also try to make our apps as fast as possible. Aside from using Swift (Swift performs much better than ObjC in terms of launch time and app speed), we build our iOS apps using static linking.

Static linking

Allegro iOS app uses a lot of libraries. The app has modular architecture and each module is a separate library. Aside from that, Allegro app uses a lot of 3rd-party libraries, integrated using CocoaPods package manager. All these libraries used to be integrated as frameworks– a standard way of dylibs (dynamic libraries) distribution in Apple ecosystem. 57 nested frameworks is a number large enough to impact app launch time. iOS has a 20 seconds app launch time limit. Any app that hits that limit is instantly killed. Allegro app was often killed on a good old iPad 2, when the device was freshly started and all caches were empty.

Dynamic linker performs a lot of disk IO when searching for dependencies. Static linking eliminates the need for all that dylib searching – dependencies and executable become one. We decided to give it a try and to link at least some of our libraries statically into main executable, hence reducing frameworks count.

We wanted to do this gradually, framework by framework. We also wanted to have a possibility to turn the static linking off in case of any unexpected problem.

We decided to use a two-step approach:

compiling frameworks code to static libraries,
converting frameworks (dynamic library packages) to resource bundles (resources packages).

Compiling framework code as a static library

Xcode 9 provides MACH_O_TYPE = staticlib build setting – linker produces static library when the flag is set. As for libraries integrated through CocoaPods, we had to create a custom script in Podfile to set this flag only for selected external libraries during pod install (that is during dependencies installation, because CocoaPods creates new project structures for managed libraries with each reinstallation).

MACH_O_TYPE does a great job, but we performed static linking even before Xcode 9 was released. Although Xcode 8 had no support for static Swift linking, there is a way to perform static linking using libtool. In those dark times, we were just adding custom build phases with buildstatic script for selected libraries. This may seem like a hack, but it is really just a hefty usage of well-documented toolset… and it worked flawlessly.

That way we replaced our dynamic libraries with static libraries, but that was the easier part of the job.

Converting framework to resource bundle

Aside from dynamic libraries, a framework can also contain resources (images, nibs, etc.). We got rid of dynamic libraries, but we couldn’t leave resource-only-frameworks. Resource bundle is a standard way of wrapping resources in Apple ecosystem, so we created framework_to_bundle.sh script, which takes *.framework and outputs *.bundle with all the resources.

The resource-handling code was redesigned to automatically use the right resource location. Allegro iOS app has a Bundle.resourcesBundle(forModuleName:) method, which always finds the right bundle, no matter what linking type was used.

Results

Last time the Allegro iOS app launch time was measured, it still had 31 dynamic libraries – so merely 45% libraries were linked statically and results were already very promising. Our job with static linking revolution is not complete yet, the target is 100%.

We measured launch time on different devices for two app versions: one with all libraries dynamically linked and the other one with 26 libraries statically linked. What measurement method did we use? A stopwatch… yes, real stopwatch. DYLD_PRINT_STATISTICS=1 variable is a tool that can help identify the reason of a dynamic linker being slow, but it does not measure the whole launch time. We used a stopwatch and slow motion camera, to measure the time between an app icon tap and the app home screen being fully visible.

Each measurement in the following table is an average of 6 samples.

	iPhone 4s	iPad 2	iPhone 5c	iPhone 5s	iPhone 7+	iPad 2 cold launch
57 dylibs app launch time [s]	7.79	7.33	7.30	3.14	2.31	11.75
31 dylibs app launch time [s]	6.62	6.08	5.39	2.75	1.75	7.27
Launch speedup [%]	15.02	17.05	26.16	12.42	24.24	38.13

Allegro iOS app launch time decreased by about 2 seconds on iPhone 5c – this was a significant gain. The app launch time improved even more on freshly turned on iPad 2 – the difference was about 4.5 seconds, which was about 38% of the launch time with all libraries being dynamically linked.

Static linking pitfall

Having some statically linked library, beware of linking it with more than one dynamic library – this will result in static library objects being duplicated across different dynamic libraries and that could be a serious problem. We have created a check_duplicated_classes.sh script to be run as a final build phase.

That was the only major obstacle we’ve come across.

Dyld3

Dyld3, the brand new dynamic linker, was announced about a year ago at WWDC 2017. At the time of writing this article, we are getting close to WWDC 2018 and dyld3 is still not available for 3rd party apps. Currently only system apps use dyld3. I couldn’t wait any longer, I was too curious about its real power. I decided to try launching my own app using dyld3.

Looking for dyld3

I wondered: What makes system apps so special that they are launched with dyld3?

First guess: LC_LOAD_DYLINKER load command points to dyld3 executable…

$ otool -l /Applications/Calculator.app/Contents/MacOS/Calculator | grep"cmd LC_LOAD_DYLINKER"-A 2
          cmd LC_LOAD_DYLINKER
      cmdsize 32
         name /usr/lib/dyld (offset 12)

That was a bad guess. Looking through the rest of load commands and all the app sections revealed nothing particular. Do system applications use dyld3 at all? Let’s try checking that using lldb debugger:

$ lldb /Applications/Calculator.app/Contents/MacOS/Calculator
(lldb) rbreak dyld3
Breakpoint 1: 887 locations.
(lldb) r
Process 92309 launched: '/Applications/Calculator.app/Contents/MacOS/Calculator'(x86_64)
Process 92309 stopped
* thread #1, stop reason = breakpoint 1.154
    frame #0: 0x00007fff72bf6296 libdyld.dylib`dyld3::AllImages::applyInterposingToDyldCache(dyld3::launch_cache::binary_format::Closure const*, dyld3::launch_cache::DynArray<dyld3::loader::ImageInfo> const&)
libdyld.dylib`dyld3::AllImages::applyInterposingToDyldCache:
->  0x7fff72bf6296 <+0>: pushq  %rbp
    0x7fff72bf6297 <+1>: movq   %rsp, %rbp
    0x7fff72bf629a <+4>: pushq  %r15
    0x7fff72bf629c <+6>: pushq  %r14
Target 0: (Calculator) stopped.

lldb hit some dyld3-symbol during system app launch and did not during any custom app launch. Inspecting the backtrace and the assembly showed that /usr/lib/dyld contained both the old dyld2 and the brand new dyld3. There had to be some if that decided which dyldX should be used.

Reading assembly code is often a really hard process. Fortunately I remembered that some parts of apple code are open sourced, including dyld. My local binary had LC_SOURCE_VERSION = 551.3 and the most recent dyld source available was 519.2.2. Are those versions distant? I spent a few nights looking at local dyld assembly and corresponding dyld sources and didn’t see any significant difference. In fact I had a strange feeling that the source code exactly matched the assembly – it was a perfect guide for debugging.

What did I end up with? Hidden dyld3 can be activated on macOS High Sierra using one of the following two approaches:

setting dyld`sEnableClosures:
- dyld`sEnableClosures needs to be set by e.g. using lldb memory write (unfortunately undocumented DYLD_USE_CLOSURES=1 variable only works on Apple internal systems),
- /usr/libexec/closured needs be compiled from dyld sources (it needs a few modifications to compile),
- read invocation in callClosureDaemon needs to be fixed (I filed a bug report for this issue); for the sake of tests I fixed it with lldb breakpoint command and a custom lldb script that invoked read in a loop until it returned 0, or
dyld closure needs to be generated and saved to the dyld cache… but… what is a dyld closure?

Dyld closure

Louis Gerbarg mentioned the concept of dyld closure at WWDC 2017. Dyld closure contains all the informations needed to launch an app. Dyld closures can be cached, so dyld can save a lot of time just restoring them.

Dyld sources contain dyld_closure_util– a tool that can be used to create and dump dyld closures. It looks like Apple open source can rarely be compiled on a non-Apple-internal system, because it has a lot of Apple private dependencies (e.g. Bom/Bom.h and more…). I was lucky – dyld_closure_util could be compiled with just a couple of simple modifications.

I created a macOS app just to check dyld3 in action. The TestMacApp.app contained 20 frameworks, 1000 ObjC classes and about 1000~10000 methods each. I tried to create a dyld closure for the app, its JSON representation (36.5 MB) was pretty long - almost milion lines:

$ dyld_closure_util -create_closure ~/tmp/TestMacApp.app/Contents/MacOS/TestMacApp | wc -l
  832363

The basic JSON representation of a dyld closure looks as follows:

{"dyld-cache-uuid":"9B095CC4-22F1-3F88-8821-8DFD979AB7AD","images":[{"path":"/Users/kamil.borzym/tmp/TestMacApp.app/Contents/MacOS/TestMacApp","uuid":"D5BDC1D3-D09E-36D5-96E9-E7FFA7EE955E""file-inode":"0x201D8F8BC",//usedtocheckifdyldclosureisstillvalid"file-mod-time":"0x5B032E9A",//usedtocheckifdyldclosureisstillvalid"dependents":[{"path":"/Users/kamil.borzym/tmp/TestMacApp.app/Contents/Frameworks/Frm1.framework/Versions/A/Frm1"},{"path":"/Users/kamil.borzym/tmp/TestMacApp.app/Contents/Frameworks/Frm2.framework/Versions/A/Frm2"},/*...*/],/*...*/},{"path":"/Users/kamil.borzym/tmp/TestMacApp.app/Contents/Frameworks/Frm1.framework/Versions/A/Frm1","dependents":[/*...*/]},/*...*/],/*...*/}

Dyld closure contains a fully resolved dylib dependency tree. That means: no more expensive dylib searching.

Dyld3 closure cache

In order to measure dyld3 launch speed gain, I had to use the dyld3 activation method #2 – providing a valid app dyld closure. Although setting dyld`sEnableClosures creates a dyld closure during app launch, the closure is currently not being cached.

Dyld sources contain an update_dyld_shared_cache tool source code. Unfortunately this tool uses some Apple-private libraries, I was not able to compile it on my system. By pure accident I found that this tool is available in every macOS High Sierra in /usr/bin/update_dyld_shared_cache. Also the man update_dyld_shared_cache was present – this made the cache rebuild even simpler.

update_dyld_shared_cache sources showed that it generates dyld closures cache only for a set of predefined system apps. I could modify the tool binary to take TestMacApp.app into account, but I ended up renaming the test app to Calculator.app and moving it to /Applications– simple, but effective.

I updated the dyld closure cache:

sudo update_dyld_shared_cache -force

and restarted my system (as stated by man update_dyld_shared_cache). After that, my test app launched using dyld3! I verified that with lldb. Also setting DYLD_PRINT_WARNINGS=1 variable showed that the dyld closure was not generated, but taken from the dyld cache:

dyld: found closure 0x7fffef8f278c in dyld shared cache

Dyld3 performance

As I wrote earlier, the test app contained 20 frameworks, each framework having 1000 ObjC classes and 1000~10000 methods. I also created a simple dependency network between those frameworks: main app depended on all frameworks, 1st framework depended on 19 frameworks, 2nd framework depended on 18 frameworks, 3rd framework depended on 17 frameworks, and so on… After launching, the app just invoked exit(0). I used time to measure the time between invoking the launch command and app exit. I didn’t use DYLD_PRINT_STATISTICS=1, because, aside from the reasons presented above, dyld3 does not even support this variable yet.

Test platform was MacBook Pro Retina, 13-inch, Early 2015 (3,1 GHz Intel Core i7) with macOS High Sierra 10.13.4 (17E202). Unfortunately I didn’t have access to any significantly slower machine. Each measurement in the following tables is an average of 6 samples. Two types of launches were measured:

warm launch – without system restart,
cold launch – system restart between each measured time sample.

Statically linked app always launched very fast, but I could not see any significant difference between dyld2 and dyld3 loading time.

launch type	dyld2	dyld3	static
warm	0.737s	0.726s	0.676s
cold	1.166s	1.094s	0.871s

I tried measuring app launch from some slower drive configuration – an old USB drive (having terribly low sequential read speed of 17.1 MB/s). Disk IO was supposed to be a bottleneck of dyld2 loading. I faked /Application/Calculator.app path using ln -s /Volumes/USB/Calculator.app and regenerated dyld cache.

Next measurements looked much better. No difference at warm launch, but cold launch was 20% faster with dyld3 than with dyld2. Actually dyld3 cold launch was right in the middle, between dyld2 launch time and statically linked app launch time.

launch type	dyld2	dyld3	static
warm	0.722s	0.731s	0.679s
cold	3.687s	2.947s	2.276s

dyld3 status

Mind that dyld3 in still under development, it has not been released for 3rd party apps yet. I guess it is currently available for system apps not to increase their speed, but mainly to test dyld3 stability.

Louis Gerbarg said that dyld3 had its daemon. On macOS High Sierra there is no dyld3 daemon. closured is currently invoked by dyld3 as a command line tool with fork+execve. It does not even cache created dyld closures. For sure we will see a lot of changes in the near future.

Are you curious about my opinion? I think a fully working dyld3 with closured daemon will be shipped with the next major macOS version. I think this new dyld3 version will implement even faster in-memory closure cache. Everyone will feel a drastic app launch time improvement on all Apple platforms – launch time much closer to statically linked app launching than to the current dyld2 launching. I keep my fingers crossed.

In this post I present a story of a bug that hit us recently. Everything was caused by unexpected (although documented) behavior of Go built-in function append. This bug has lived silently for nearly a year in allegro/marathon-consul. Ensure you run the latest version.

The missing service

Dude, where is my service

At Allegro we build our infrastructure on the top of Mesos and Marathon. For discovery service we use Consul. Services registration is done by allegro/marathon-consul, a simple tool written in Go to register services started by Marathon in Consul.

One day we deployed a service. It was neither a new service nor a big release. It was just a regular deployment. But there was a problem. The service had two ports, each registered in Consul. After deployment, both ports had the same tags although they were configured differently. This might not sound like a serious issue but it was. The service was unavailable because its clients couldn’t find it due to invalid tags on a port responsible for handling clients’ requests.

Marathon-Consul hasn’t been touched for some time, so it was very unlikely that it was responsible for malformed registration. Application configuration in Marathon looked good. There were some global service tags on application level and additional tags on each port. Why Marathon-Consul messed this up?

We checked what had changed in the new deployment and the only differences were the service version and an additional service tag that was added. Why adding a new tag results in such a weird behavior? We deleted this tag and the service has registered correctly. We added it back and tags were filled wrong. We added a test to reproduce it and contributed a fix.

The bug

The bug lied in the following code:

commonTags:=labelsToTags(app.Labels)varintents[]RegistrationIntentfor_,d:=rangedefinitions{intents=append(intents,RegistrationIntent{Name:app.labelsToName(d.Labels,nameSeparator),Port:task.Ports[d.Index],Tags:append(commonTags,labelsToTags(d.Labels)...),// ◀ Wrong tags here})}funclabelsToTags(labelsmap[string]string)[]string{tags:=[]string{}forkey,value:=rangelabels{ifvalue=="tag"{tags=append(tags,key)}}returntags}

The bug is not easy to hit and probably that’s why it wasn’t covered in tests and nobody reported it before. To reproduce it, an application must have at least two ports with different tags on each. When commonTags size is power of two it worked but in other case — it didn’t. It’s a rare case a service has multiple ports (80% of our applications have only one port) and even rarer when ports have additional tags (8% of our ports have tags) and only one has multiple tagged ports.

The bug can be distilled to the example below. Let’s unroll the loop to just two iterations and use ints instead of structures. Then rename commonTags to x and fill it with some values. Finally, use y and z instead of intents[0] and intents[1]. What’s the output of the following code?

packagemainimport("fmt")funca(){x:=[]int{}x=append(x,0)x=append(x,1)// commonTags := labelsToTags(app.Labels)y:=append(x,2)// Tags: append(commonTags, labelsToTags(d.Labels)...)z:=append(x,3)// Tags: append(commonTags, labelsToTags(d.Labels)...)fmt.Println(y,z)}funcb(){x:=[]int{}x=append(x,0)x=append(x,1)x=append(x,2)// commonTags := labelsToTags(app.Labels)y:=append(x,3)// Tags: append(commonTags, labelsToTags(d.Labels)...)z:=append(x,4)// Tags: append(commonTags, labelsToTags(d.Labels)...)fmt.Println(y,z)}funcmain(){a()b()}

First guess could be

[0, 1, 2] [0, 1, 3]
[0, 1, 2, 3] [0, 1, 2, 4]

but in fact it results in

[0, 1, 2] [0, 1, 3]
[0, 1, 2, 4] [0, 1, 2, 4]

Function a() works as expected but behavior of b() is not what we were expecting.

Slices

To understand this not obvious behavior we need some background on how slices works and what happens when we call append.

Slice is a triple of pointer to first element, length and capacity (length ≤ capacity). Memory is a continuous block of data but slice uses only length of capacity.

According to documentation of append:

The append built-in function appends elements to the end of a slice. If it has sufficient capacity, the destination is resliced to accommodate the new elements. If it does not, a new underlying array will be allocated. Append returns the updated slice. It is therefore necessary to store the result of append, often in the variable holding the slice itself:

append allocates a new slice if new elements do not fit into the current slice, but when they fit they will be added at the end. append always returns a new slice but (as the slice is a triple of address, length and capacity) the new slice could have the same address and capacity and differs only on the length.

How slices grow?

One does not simply append to a slice

Above paragraph doesn’t answer why code works like this. To understand it, we need to go deeper in Go code. Let’s take a look at growslice function of Go runtime. It’s called by append when a slice doesn’t have enough capacity for all appended elements.

// growslice handles slice growth during append.// It is passed the slice element type, the old slice, and the desired new minimum capacity,// and it returns a new slice with at least that capacity, with the old data// copied into it.// The new slice's length is set to the old slice's length,// NOT to the new requested capacity.// This is for codegen convenience. The old slice's length is used immediately// to calculate where to write new values during an append.// TODO: When the old backend is gone, reconsider this decision.// The SSA backend might prefer the new length or to return only ptr/cap and save stack space.funcgrowslice(et*_type,oldslice,capint)slice

when slice needs to grow, it doubles its size. In fact there is more logic to handle growing heuristics, but in our case it grows just like this.

Connect the dots

Let’s go through b() step by step.

x:=[]int{}x=append(x,0)x=append(x,1)

Create a slice with 2 elements.

x=append(x,2)

Append one element. x is too small so it needs to grow. It doubles its capacity.

y:=append(x,3)

Append one element. Slice has free space at the end so 3 is stored there.

z:=append(x,4)

Append one element. Slice has free space at the end so 4 is stored there and overwrites 3 stored before.

All 3 slices: x, y and z point to the same memory block.

Why it’s working in a()? The Answer is really simple. There is a slice of capacity two and when we append one element it’s copied to a new space. That’s why we end up with x, y and z pointing to different memory blocks.

TL;DR

Be careful when using append. Don’t append to slices you want to keep unchanged. If you want to work on a copy of a slice data, you must explicitly copy it into a new slice.

What if I told you

Some of our engineers run their own tech blogs. We encourage them to move here, but for various reasons, they prefer to publish on private blogs. We respect their decisions. What we can do is to gather all the blog posts published by allegro.tech engineers around the web in the one place.

We prepared the digest with all these posts published this Fall.

Generating Test Data – jFairy

Nov 20 2017 · Łukasz Rosłonek

Test data has been always an issue. If you are running selenium or backend automated tests based on user-related scenarios, in order to make your tests more efficient you need to provide unique and realistic user test data. There’re many ways to deal with this: dumping samples of production databases, writing your own data generators…

InfluxDB in IoT world: Hosting and scaling on AWS (Part 2)

Nov 14 2017 · Ivan Vaskevych

In the previous part we took a bird’s-eye view of InfluxDB, it’s core features and some of the reasons to embrace the database in the wake of IoT data onslaught. In this part, we’re going to see how easy it is to install and start using InfluxDB on AWS…

Kafka Streams DSL vs Processor API

Nov 02 2017 · Marcin Kuthan

Kafka Streams is a Java library for building real-time, highly scalable, fault tolerant, distributed applications. The library is fully integrated with Kafka and leverages Kafka producer and consumer semantics (e.g: partitioning, rebalancing, data retention and compaction). What is really unique, the only dependency to run Kafka Streams application is a running Kafka cluster. Even local state…

Gathering valuable feedback for your growth

Oct 15 2017 · Marcin Konkel

Since few months, together with Viktor, we were exchanging our thoughts on gathering valuable feedback and maintaining our growth in various circumstances. We’ve found that feedback continues to be a problem. We don’t seem to give enough of it and what we give often holds low quality…

Swift 4 Cookbook - Linux & MacOS, part 1

Oct 11 2017 · Marcin Kliks

Swift is powerful, fast and safe programming language. From version 2 it’s open source, and officialy supported on Linux. Its native performance, type safety, interactive (REPL) playground and scripting posibility, is a joy to use. The main goal for Swift on Linux is to be compatible with Swift on macOS/ios/watchos, and there is an official Apple documentation for Foundation. Howerever, this…

Byte my code – new conference

Sep 30 2017 · Jarek Pałka

#Wrocław #Java Hey guys, just letting you know that ByteMyCode – a new Java event is taking place in Wroclaw. If these topics are interesting for you: Event Processing in Action Cloud Native Java Guide to Instantaneous Feedback Loops for Java Developers Bringing Structure to Performance Tuning UBS Innovation – DLT Or you’d like to…

Three levels of TDD

Sep 24 2017 · Michael Lewandowski

IntroductionI’ve been using TDD technique for a few years. Most of the time with satisfactory a result. But it wasn’t an easy journey; it was a trip full of ups and downs. During this period my thinking about TDD has changed dramatically, or maybe I have changed my perception of testing and software development during this time? Indeed, yes I have.Lasse Koskela in his book called “Test Driven:…

Idiomatic concurrency: flatMap() vs. parallel() - RxJava FAQ

Sep 14 2017 · Tomasz Nurkiewicz

Simple, effective and safe concurrency was one of the design principles of RxJava. Yet, ironically, it’s probably one of the most misunderstood aspects of this library. Let’s take a simple example: imagine we have a bunch of UUIDs and for each one of them we must perform a set of tasks. The first problem is to perform I/O intensive operation per each UUID, for example loading…

Apache Mesos is an open-source project to manage computer clusters. In this article we present one of its components called Executor and more specifically the Custom Executor. We also tell you why you should consider writing your own executor by giving examples of features that you can benefit from by taking more control over how tasks are executed. Our executor implementation is available at github.com/allegro/mesos-executor

TL;DR Apache Mesos is a great tool but in order to build a truly cloud native platform for microservices you need to get your hands dirty and write custom integrations with your ecosystem.

If you are familiar with Apache Mesos, skip introduction

Apache Mesos

“Sixty-four cores or 128 cores on a single chip look a lot like 64 machines or 128 machines in a data center.” – Ben Hindman

Apache Mesos is a tool to abstract your data center resources. In contrast to many container orchestrators such as Kubernetes, Nomad or Swarm, Mesos makes use of two-level scheduling. This means Mesos is responsible only for offering resources to higher level applications called frameworks or schedulers. It’s up to a framework if it accepts an offer or declines it.

Mesos development originates from Berkeley University. The idea was not totally new. Google already had a similar system called Borg for some time. Rapid development of Mesos started when one of its creators – Ben Hindman gave a talk at Twitter presenting Mesos – then development got a boost. Twitter employed Ben and decided to use Mesos as a resource management system and to build a dedicated framework – Aurora. Both projects were donated to Apache foundation and became top-level projects.

Currently Mesos is used by many companies including Allegro, Apple, Twitter, Alibaba and others. At Allegro we started using Mesos and Marathon in 2015. At that time there were no mature and battle-proven competitive solutions.

Mesos Architecture

A Mesos Cluster is built out of three main elements: Masters, Agents and Frameworks.

Mesos Architecture

Masters

A Mesos Master is the brain of the whole datacenter. It controls the state of a whole cluster. It receives resource information from agents and presents those resources to a framework as an offer. The Master is a communication hub between a framework and its tasks. At any given time only one Master is acting as a leader, other instances are in a standby mode. Zookeeper is used to elect this leader and notifies all instances when the leader abdicates.

Agents

Agents are responsible for running tasks, monitoring their state and present resources to the master. They also receive launch and kill task requests. Agents notify the Master when a task changes its state (for example fails).

Frameworks

Frameworks (aka schedulers) are not a part of Mesos. They are custom implementation of business logic. The best known frameworks are container orchestrators such as Google Kubernetes, Mesosphere Marathon and Apache Aurora. Frameworks receive resource offers from the Master and based on custom scheduling logic decide whether to launch a task with given resources. There could be multiple frameworks running at once. Each of them can have a different scheduling policy and a different purpose. This gives Mesos the ability to maximize resource usage by sharing same resources between different workloads. For example: batch jobs (e.g. Hadoop, Spark), stateless services, continuous integration (e.g., Jenkins), databases and caches (e.g., Arango, Cassandra, Redis).

Beside those big elements, there are also smaller parts of the whole architecture. One of them are executors.

Executor

An executor is a process that is launched on agent nodes to run the framework’s tasks. Executor is started by a Mesos Agent and subscribes to it to get tasks to launch. A single executor can launch multiple tasks. Communication between executors and agents is via Mesos HTTP API. Executor notifies an Agent about task state by sending task status information. Then, in turn, the Agent notifies the framework about the task state change. The Executor can perform health checks and any other operations required by a framework to run a task.

There are four types of executors:

Command Executor – Speaks V0 API and is capable of running only a single task.
Docker Executor – Similar to command executor but launches a docker container instead of a command.
Default Executor – Introduced in Mesos 1.0 release. Similar to command executor but speaks V1 API and is capable of running pods (aka task groups).
Custom Executor – Above executors are built into Mesos. A custom executor is written by a user to handle custom workloads. It can use V0 or V1 API and can run single or multiple tasks depending on implementation. In this article we are focusing on our custom implementation and what we achieved with it.

Why do we need a custom executor?

At Allegro we are using Mesos and Marathon as a platform to deploy our microservices. Currently we have nearly 600 services running on Mesos. For service discovery we are using Consul. Some services are exposed to the public and some are hidden by load balancers (F5, HAProxy, Nginx) or cache (Varnish).

At Allegro we try to follow the 12 Factor App Manifesto. This document defines how a cloud native application should behave and interact with environments they run on in order to be deployable in a cloud environment. Below we present how we achieve 3 out of 12 factors with the custom executor.

Allegro Mesos Executor

III. Config – Store config in the environment

The twelve-factor app stores config in environment variables (often shortened to env vars or env). Env vars are easy to change between deploys without changing any code; unlike config files, there is little chance of them being checked into the code repo accidentally; and unlike custom config files, or other config mechanisms such as Java System Properties, they are a language- and OS-agnostic standard.

With a custom executor we are able to place app’s configuration in its environment. Executor is not required in this process and could be replaced with Mesos Modules but it’s easier for us to maintain our Executor than a module (a C++ shared library). Configuration is kept in our Configuration Service backed by Vault. When the executor starts, it connects to configuration service and downloads configuration for a specified application. Config Service stores encrypted configurations. Authentication, by necessity, is performed with a certificate generated by a dedicated mesos module that was written years ago and convinced us we do not want to keep any logic there. The certificate is signed by an internal authority. Pictures below presents how communication looks in our installation. Mesos Agent obtains signed certificate from CA (certificate authority) and passes it to the executor in an environment variable. Previously, every application contained dedicated logic for reading this certificate and downloading its configuration from Configuration Service. We replaced this logic with our executor that, by using the certificate to authenticate, is able to download the decrypted configuration and pass it in environment variables to the task that is launched.

Config

IX. Disposability – Maximize robustness with fast startup and graceful shutdown

Processes shut down gracefully when they receive a SIGTERM signal from the process manager.

Although the Mesos Command Executor supports graceful shutdown in configuration it does not work properly with shell commands (see MESOS-6933).

Lifecycle

Above diagram presents the life cycle of a typical task. At the beginning its binaries are fetched and the executor is started (1). After starting, the executor can run some hooks (for example to load configuration from configuration service) then it starts the task and immediately starts health checking it (2). Our applications are plugged into discovery service (Consul), load balancers (F5, Nginx, HAProxy) and caches (Varnish) when they start answering the health check (3). When the instance is killed, it is first unregistered from all services (4) then SIGTERM is sent and finally (if the instance is still running) it receives a SIGKILL (5). This approach gives us nearly no downtime at deployment and could not be achieved without a custom executor. Below you can see a comparison of a sample application launched and restarted with and without our executor’s graceful shutdown feature. We deployed this application with our executor at 15:33 (first peak) and restarted it 3 times (there are some errors but fewer then before). Then we rolled back to the default command executor and restarted it a couple of times (more errors). There are errors due to missing cache warmup at start, but we see a huge reduction of errors during deployments.

Opbox

What’s more, with this approach we can notify the user that an external service errored using Task State Reason, so details are visible for the end user. Accidentally by implementing custom health checks and notifications we avoided MESOS-6790.

Previous solution (called Marathon-Consul) was based on Marathon Events. Marathon-Consul was consuming events about task starting and killing which register or deregister the instance from Consul. This caused a delay, instance was deregistered when it was already gone. With Executor we can guarantee that application will not get any traffic before we shut it down.

XI. Logs – Treat logs as event streams

A twelve-factor app never concerns itself with routing or storage of its output stream. It should not attempt to write to or manage logfiles. Instead, each running process writes its event stream, unbuffered, to stdout. During local development, the developer will view this stream in the foreground of their terminal to observe the app’s behavior.

Command executor redirects standard streams to stdout/stderr files. In our case we want to push logs into ELK stack. In Task definition we have all metadata about task (host, name, id, etc.) so we can enhance the log line and push it further. Logs are generated in key-value format that is easy to parse and read by both human and machine. This allow us to easily index fields and to reduce the cost of parsing JSONs by the application. Whole integration is done by the executor.

What could be improved

Mesos provides an old-fashioned way to extend its functionality – we have to write our own shared library in C++ (see Mesos Modules for more information). This solution has its advantages, but it also significantly increases the time needed for development and enforces the use of technology that is not currently used at our company. Additionally, errors in our code could propagate to the Mesos agent, possibly causing it to crash. We do not want to go back to the times of segmentation fault errors causing service failures. A more modern solution based on Google Protobuf (already used by Mesos), would be appreciated. Finally upgrading Mesos often required us to recompile all modules, thus maintaining different binaries for different Mesos versions

.so

A lot of solutions for Mesos maintain their own executors because other integration methods are not as flexible, for example:

We also made a mistake by putting our service integration logic directly into the executor binary. As a result, the integration code became tightly coupled with Mesos specific code, making it significantly more difficult to use it in other places – e.g. Kubernetes. We are in the process of separating this code into stand-alone binaries.

.so

Comparison with other solutions (read K8s)

Kubernetes provides a more convenient way to integrate your services with it. Container Lifecycle Hooks allow you to define an executable or HTTP endpoint which will be called during lifecycle events of the container. This approach allows the use of any technology to create integration with other services and reduces the risk of tight coupling the code with a specific platform. In addition, we have a built-in graceful shutdown mechanism integrated with the hooks, which ultimately eliminates the need to have the equivalent of our custom Mesos executor on this platform.

Conclusion

Mesos is totally customizable and able to handle all edge cases but sometimes it’s really expensive in terms of time, maintainability and stability. Sometimes it’s better to use solutions that work out of the box instead of reinventing the wheel.

First of all, what testing skills are necessary to deliver a high-quality product?

Waterfall or Agile. Always use as designed?

Waterfall or Agile

Automated vs manual testing – only one or both?

Automated tests

Faster and more effective
This is indeed true, but only with a stable test setup and well-designed test scripts. The initial work that involves setting up the whole environment contributes to the effectiveness of tests and their further use. If you fail at this stage, testers may spend more time solving test setup problems than on testing. Naturally, when the environment is stable, automated regression testing is faster than the manual one, and may be even run for each new build on a daily basis.
Cost effective
Test environment setup and test design are cost- and time-consuming. However, if it is done properly, automated testing is indeed cheaper and faster than the manual approach. Actually, it is easier to write automated tests than to deal with poorly designed setup.
Less tiresome
If regression testing is run on a regular basis, testers carrying out manual tests may become somewhat frustrated and bored of doing the same things again and again, which may affect their effectiveness and concentration. For this reason, testers are often more interested in developing automated tests for the regression testing purpose than to manually executing the same set of test cases every time.
You can run them on a regular basis
It is the main advantage of automated tests. As you can use them to test builds on a daily basis, the development team receives feedback almost immediately. However, there is a risk that the tests may become blind over time - test scenarios, if not updated, verify the same paths as at the first run. It may happen that a small change in the code will remodel some of the application features, but the tests will pass anyway. How is it possible? Because these tests do not “see” UI changes or strings displayed outside the defined fields. They only check if all features are working properly (although it depends on applied frameworks).

Manual testing

It simulates what the end user does
As automated tests are basically robots, they do not reflect the real user’s world. Testing frameworks operate by following a fixed pattern, while users may use an application in a completely different way, not covered by automated tests. Testers, unlike robots, have intuition, which is a substantial skill in the case of exploratory testing. Besides, manual tests allow QA engineers to check more specific things such as cooperation with an operating system. Naturally, there are frameworks that may test it, but they are not as flexible as QA engineers checking certain features manually.
Easy to start with
This sort of test is the best solution for new members, as skills necessary to carry out manual testing are easy to acquire. Well-designed test cases saved in a test management tool (such as ‘TestLink’, ‘HP Quality Center’, etc.) are easy to follow, so new team members can start the test execution on their own. Besides, as creating new test cases is not complicated even beginners can handle it.
Faster and more effective in the case of applications undergoing frequent changes
When an application undergoes changes, the QA team may not keep up with creating new automated tests. So in this particular case, manual testing is faster and more effective due to its flexibility. Anyway, it does not mean that automated tests are unnecessary.

Waterfall or Agile

Testing tools – do you need them? Which should you choose?

Application release. Case study

Waterfall or Agile

Summary

Psychological needs: the 3 hungers model

the 3 hungers model

What the mind needs. Transactional Analysis (TA) talks out about 3 basic needs referred to as hungers:

Hunger of strokes

Chart 1

Example of self-assessment performed by an individual

Chart 2

Chart 3

The following questions can help you analyse intuitive scores:

WHAT? Do I know my responsibilities in the team? What is the business goal? Why are we developing this product?

HOW? Do I still find my job interesting or perhaps I feel overwhelmed by the level of complexity? Do I have sufficient resources to cope with?

WHY? How do I know that what I’m doing makes sense? How often do I get acknowledged? What signs of acknowledgement do I need from my co-workers/boss?

Intuition Engineering

Vizceral

Let’s have a look at a video showing traffic failover simulation.

Phobos

Phobos area view

Phobos has the ability to travel back in time, so you can see what the state of the system was yesterday during an outage, which is especially useful during root cause analysis and postmortems.

Phobos service view

Phobos became an invaluable tool that gives us an interface to a very complex system, enabling us to develop an intuition about its state and health.

@WebMvcTest

@WebMvcTest with Spock

repositories{...maven{url"https://oss.sonatype.org/content/repositories/snapshots/"}}

and then the dependency:

dependencies{...testCompile(..."org.spockframework:spock-core:1.2-groovy-2.4-SNAPSHOT","org.spockframework:spock-spring:1.2-groovy-2.4-SNAPSHOT")}

Sample application

Starting with Rest Controller (most imports omitted for clarity):

...importjavax.validation.Valid@RestController@RequestMapping(path="/registrations")publicclassUserRegistrationController{privatefinalRegistrationServiceregistrationService;publicUserRegistrationController(RegistrationServiceregistrationService){this.registrationService=registrationService;}@PostMapping(consumes=APPLICATION_JSON_VALUE,produces=APPLICATION_JSON_VALUE)@ResponseStatus(HttpStatus.CREATED)publicExistingUserRegistrationDTOregister(@RequestBody@ValidNewUserRegistrationDTOnewUserRegistration){UserRegistrationuserRegistration=registrationService.registerUser(newUserRegistration.getEmailAddress(),newUserRegistration.getName(),newUserRegistration.getLastName());returnasDTO(userRegistration);}privateExistingUserRegistrationDTOasDTO(UserRegistrationregistration){returnnewExistingUserRegistrationDTO(registration.getRegistrationId(),registration.getEmailAddress(),registration.getName(),registration.getLastName());}...}

@SpringBootApplicationpublicclassWebMvcTestApplication{publicstaticvoidmain(String[]args){SpringApplication.run(WebMvcTestApplication.class,args);}@BeanpublicMethodValidationPostProcessormethodValidationPostProcessor(){returnnewMethodValidationPostProcessor();}}

Now, having a REST Controller ready, we need a class to deserialize JSON request into. A simple POJO with Jackson and Javax Validation API annotations is enough to do the trick:

importcom.fasterxml.jackson.annotation.JsonCreator;importcom.fasterxml.jackson.annotation.JsonProperty;importjavax.validation.constraints.NotNull;importjavax.validation.constraints.Pattern;importjavax.validation.constraints.Size;importstaticcom.rg.webmvctest.SystemConstants.EMAIL_REGEXP;publicclassNewUserRegistrationDTO{privatefinalStringemailAddress;privatefinalStringname;privatefinalStringlastName;@JsonCreatorpublicNewUserRegistrationDTO(@JsonProperty("email_address")StringemailAddress,@JsonProperty("name")Stringname,@JsonProperty("last_name")StringlastName){this.emailAddress=emailAddress;this.name=name;this.lastName=lastName;}@Pattern(regexp=EMAIL_REGEXP,message="Invalid email address.")@NotNull(message="Email must be provided.")publicStringgetEmailAddress(){returnemailAddress;}@NotNull(message="Name must be provided.")@Size(min=2,max=50,message="Name must be at least 2 characters and at most 50 characters long.")publicStringgetName(){returnname;}@NotNull(message="Last name must be provided.")@Size(min=2,max=50,message="Last name must be at least 2 characters and at most 50 characters long.")publicStringgetLastName(){returnlastName;}}

What we have here is a POJO with 3 fields. Each of these fields has Jackson’s @JsonProperty annotation and two more from Javax Validation API.

First test

Writing @WebMvcTest is trivial once you have a framework that supports it. Following example is a minimal working piece of code to create a @WebMvcTest in Spock (written in Groovy):

...importstaticorg.springframework.test.web.servlet.request.MockMvcRequestBuilders.postimportstaticorg.springframework.test.web.servlet.result.MockMvcResultMatchers.jsonPathimportstaticorg.springframework.test.web.servlet.result.MockMvcResultMatchers.status@WebMvcTest(controllers=[UserRegistrationController])// 1classSimplestUserRegistrationSpecextendsSpecification{@AutowiredprotectedMockMvcmvc// 2@AutowiredRegistrationServiceregistrationService@AutowiredObjectMapperobjectMapperdef"should pass user registration details to domain component and return 'created' status"(){given:Maprequest=[email_address:'john.wayne@gmail.com',name:'John',last_name:'Wayne']and:registrationService.registerUser('john.wayne@gmail.com','John','Wayne')>>newUserRegistration(// 3'registration-id-1','john.wayne@gmail.com','John','Wayne')when:defresults=mvc.perform(post('/registrations').contentType(APPLICATION_JSON).content(toJson(request)))// 4then:results.andExpect(status().isCreated())// 5and:results.andExpect(jsonPath('$.registration_id').value('registration-id-1'))// 5results.andExpect(jsonPath('$.email_address').value('john.wayne@gmail.com'))results.andExpect(jsonPath('$.name').value('John'))results.andExpect(jsonPath('$.last_name').value('Wayne'))}@TestConfiguration// 6staticclassStubConfig{DetachedMockFactorydetachedMockFactory=newDetachedMockFactory()@BeanRegistrationServiceregistrationService(){returndetachedMockFactory.Stub(RegistrationService)}}}

def"should pass user registration details to domain component and return 'created' status"(){given:Maprequest=[email_address:'john.wayne@gmail.com',name:'John',last_name:'Wayne']and:registrationService.registerUser('john.wayne@gmail.com','John','Wayne')>>newUserRegistration('registration-id-1','john.wayne@gmail.com','John','Wayne')when:defresponse=mvc.perform(post('/registrations').contentType(APPLICATION_JSON).content(toJson(request))).andReturn().response// notice the extra call to: andReturn()then:response.status==HttpStatus.CREATED.value()and:with(objectMapper.readValue(response.contentAsString,Map)){it.registration_id=='registration-id-1'it.email_address=='john.wayne@gmail.com'it.name=='John'it.last_name=='Wayne'}}

Testing validations

@ControllerAdvicepublicclassExceptionsHandlerAdvice{privatefinalExceptionMapperHelpermapperHelper=newExceptionMapperHelper();@ExceptionHandler(MethodArgumentNotValidException.class)publicResponseEntity<ErrorsHolder>handleException(MethodArgumentNotValidExceptionexception){ErrorsHoldererrors=newErrorsHolder(mapperHelper.errorsFromBindResult(exception,exception.getBindingResult()));returnmapperHelper.mapResponseWithoutLogging(errors,HttpStatus.UNPROCESSABLE_ENTITY);}}

Here is a more complicated example, that shows full test setup (make sure to check the sources on GitHub):

@Unrolldef"should not allow to create a registration with an invalid email address: #emailAddress"(){given:Maprequest=[email_address:emailAddress,name:'John',last_name:'Wayne']when:defresult=doRequest(post('/registrations').contentType(APPLICATION_JSON).content(toJson(request))).andReturn()then:result.response.status==HttpStatus.UNPROCESSABLE_ENTITY.value()and:with(objectMapper.readValue(result.response.contentAsString,Map)){it.errors[0].code=='MethodArgumentNotValidException'it.errors[0].path=='emailAddress'it.errors[0].userMessage==userMessage}where:emailAddress||userMessage'john.wayne(at)gmail.com'||'Invalid email address.''abcdefg'||'Invalid email address.'''||'Invalid email address.'null||'Email must be provided.'}

Summary

Once upon a midnight dreary

It must be the network

It must be garbage collection

Just for cases like this, we keep GC logging on by default. I quickly downloaded the GC log and fired up Censum. Before my very eyes, a grisly sight opened: full garbage collections happening once every 15 minutes and causing 20-second long [!] stop-the-world pauses. No wonder the connection to ZooKeeper was timing out despite no issues with either ZooKeeper or the network!

20-second GC pauses — certainly not your average GC log

It must be a memory leak

Memory Sampler showed many more Ad objects than we expected

Is it really so bad?

Still not working

So what is it now?

Fixing the fix

GC tuning again

Lots of premature tenuring

Conclusion

Problem fixed — GC pauses after all changes and 4 GB heap

I hope you find this funny experience of ours helpful when debugging your own performance issues!

Due to the high interest and controversy concerning this blog post, we believe that it is worth adding some context on how we work and make decisions at Allegro. Each of more than 50 development teams at Allegro has the freedom to choose technologies from those supported by our PaaS. We mainly code in Java, Kotlin, Python and Golang. The point of view presented in the article results from the author’s experience.

We gave Kotlin a try, but now we are rewriting to Java 10

Here are the reasons why.

Name shadowing

Shadowing was my biggest surprise in Kotlin. Consider this function:

funinc(num:Int){valnum=2if(num>0){valnum=3}println("num: "+num)}

In the if body, you can add another num, which is less shocking (new block-level scope).

Okay, so in Kotlin, inc(1) prints 2. The equivalent code in Java, won’t compile:

voidinc(intnum){intnum=2;//error: variable 'num' is already defined in the scopeif(num>0){intnum=3;//error: variable 'num' is already defined in the scope}System.out.println("num: "+num);}

Name shadowing wasn’t invented by Kotlin. It’s common in programming languages. In Java, we get used to shadowing class fields with methods arguments:

publicclassShadow{intval;publicShadow(intval){this.val=val;}}

Type inference

For example, this Kotlin code:

vara="10"

would be translated by the Kotlin compiler into:

vara:String="10"

It was the real advantage over Java. I deliberately said was, because — good news — Java 10 has it and Java 10 is available now.

Type inference in Java 10:

vara="10";

To be fair, I need to add, that Kotlin is still slightly better in this field. You can use type inference also in other contexts, for example, one-line methods.

More about Local-Variable Type Inference in Java 10.

Compile time null-safety

Null-safe types are Kotlin’s killer feature. The idea is great. In Kotlin, types are by default non-nullable. If you need a nullable type you need to add ? to it, for example:

vala:String?=null// ok
valb:String=null// compilation error

Kotlin won’t compile if you use a nullable variable without the null check, for example:

println(a.length)// compilation error
println(a?.length)// fine, prints null
println(a?.length?:0)// fine, prints 0

Consider the following Java method:

publicclassUtils{staticStringformat(Stringtext){returntext.isEmpty()?null:text;}}

Now, you want to call format(String) from Kotlin. Which type should you use to consume the result of this Java method? Well, you have three options.

First approach. You can use String, the code looks safe but can throw NPE.

fundoSth(text:String){valf:String=Utils.format(text)// compiles but assignment can throw NPE at runtime
println("f.len : "+f.length)}

You need to fix it with Elvis:

fundoSth(text:String){valf:String=Utils.format(text)?:""// safe with Elvis
println("f.len : "+f.length)}

Second approach. You can use String?, and then you are null-safe:

fundoSth(text:String){valf:String?=Utils.format(text)// safe
println("f.len : "+f.length)// compilation error, fine
println("f.len : "+f?.length)// null-safe with ? operator
}

Third approach. What if you just let the Kotlin do the fabulous local variable type inference?

fundoSth(text:String){valf=Utils.format(text)// f type inferred as String!
println("f.len : "+f.length)// compiles but can throw NPE at runtime
}

Bad idea. This Kotlin code looks safe, compiles, but allows nulls for the unchecked journey through your code, pretty much like in Java.

There is one more trick, the !! operator. Use it to force inferring f type as String:

fundoSth(text:String){valf=Utils.format(text)!!// throws NPE when format() returns null
println("f.len : "+f.length)}

Class literals

Class literals are common when using Java libraries like Log4j or Gson.

In Java, we write the class name with .class suffix:

Gsongson=newGsonBuilder().registerTypeAdapter(LocalDate.class,newLocalDateAdapter()).create();

In Groovy, class literals are simplified to the essence. You can omit the .class and it doesn’t matter if it’s a Groovy or Java class.

defgson=newGsonBuilder().registerTypeAdapter(LocalDate,newLocalDateAdapter()).create()

Kotlin distinguishes between Kotlin and Java classes and has the syntax ceremony for it:

valkotlinClass:KClass<LocalDate>=LocalDate::classvaljavaClass:Class<LocalDate>=LocalDate::class.java

So in Kotlin, you are forced to write:

valgson=GsonBuilder().registerTypeAdapter(LocalDate::class.java,LocalDateAdapter()).create()

Which is ugly.

Reversed type declaration

In the C-family of programming languages, we have the standard way of declaring types of things. Shortly, first goes a type, then goes a typed thing (variable, fields, method, and so on).

Standard notation in Java:

intinc(inti){returni+1;}

Reversed notation in Kotlin:

funinc(i:Int):Int{returni+1}

This disorder is annoying for several reasons.

The second problem. When you read a method declaration, first of all, you are interested in the name and the return type, and then you scan the arguments.

In Kotlin, the method’s return type could be far at the end of the line, so you need to scroll:

privatefungetMetricValue(kafkaTemplate:KafkaTemplate<String,ByteArray>,metricName:String):Double{...}

Or, if arguments are formatted line-by-line, you need to search. How much time do you need to find the return type of this method?

@BeanfunkafkaTemplate(@Value("\${interactions.kafka.bootstrap-servers-dc1}")bootstrapServersDc1:String,@Value("\${interactions.kafka.bootstrap-servers-dc2}")bootstrapServersDc2:String,cloudMetadata:CloudMetadata,@Value("\${interactions.kafka.batch-size}")batchSize:Int,@Value("\${interactions.kafka.linger-ms}")lingerMs:Int,metricRegistry:MetricRegistry):KafkaTemplate<String,ByteArray>{valbootstrapServer=if(cloudMetadata.datacenter=="dc1"){bootstrapServersDc1}...}

MongoExperimentsRepositoryrepository

repository:MongoExperimentsRepository

Companion object

A Java programmer comes to Kotlin.

classMyClass{companionobject{vallogger=LoggerFactory.getLogger(MyClass::class.java)}}

Sometimes, you have to use static. Old good public static void main() is still the only way to launch a Java app. Try to write this companion object spell without googling.

classAppRunner{companionobject{@JvmStaticfunmain(args:Array<String>){SpringApplication.run(AppRunner::class.java,*args)}}}

Collection literals

In Java, initializing a list requires a lot of ceremony:

importjava.util.Arrays;...List<String>strings=Arrays.asList("Saab","Volvo");

Initializing a Map is so verbose, that lot of people use Guava:

importcom.google.common.collect.ImmutableMap;...Map<String,String>string=ImmutableMap.of("firstName","John","lastName","Doe");

In Java, we are still waiting for the new syntax to express collection and map literals. The syntax, which is so natural and handy in many languages.

JavaScript:

constlist=['Saab','Volvo']constmap={'firstName':'John','lastName':'Doe'}

Python:

list=['Saab','Volvo']map={'firstName':'John','lastName':'Doe'}

Groovy:

deflist=['Saab','Volvo']defmap=['firstName':'John','lastName':'Doe']

Kotlin:

vallist=listOf("Saab","Volvo")valmap=mapOf("firstName"to"John","lastName"to"Doe")

In maps, keys and values are paired with the to operator, which is good, but why not use well-known : for that? Disappointing.

Maybe? Nope

Functional languages (like Haskell) don’t have nulls. Instead, they offer the Maybe monad (if you are not familiar with monads, read this article by Tomasz Nurkiewicz).

There is no Optional equivalent in Kotlin. It seems that you should use bare Kotlin’s nullable types. Let’s investigate this issue.

Typically, when you have an Optional, you want to apply a series of null-safe transformations and deal with null at the and.

For example, in Java:

publicintparseAndInc(Stringnumber){returnOptional.ofNullable(number).map(Integer::parseInt).map(it->it+1).orElse(0);}

No problem one might say, in Kotlin, for mapping you can use the let function:

funparseAndInc(number:String?):Int{returnnumber.let{Integer.parseInt(it)}.let{it->it+1}?:0}

So in order make this code null-safe, you have to add ? before each let:

funparseAndInc(number:String?):Int{returnnumber?.let{Integer.parseInt(it)}?.let{it->it+1}?:0}

Now, compare readability of the Java and Kotlin versions. Which one do you prefer?

Read more about Optionals at Stephen Colebourne’s blog.

Data classes

Data classes are Kotlin’s way to reduce the boilerplate that is inevitable in Java when implementing Value Objects (aka DTO).

For example, in Kotlin, you write only the essence of a Value Object:

data classUser(valname:String,valage:Int)

and Kotlin generates good implementations of equals(), hashCode(), toString(), and copy().

Open classes

In Kotlin, classes are final by default. If you want to extend a class, you have to add the open modifier to it.

Inheritance syntax looks like this:

openclassBaseclassDerived:Base()

Kotlin changed the extends keyword into the : operator, which is already used to separate variable name from its type. Back to C++ syntax? For me it’s confusing.

If you are using Spring, you have two options. You can put open in front of all bean classes (which is rather boring), or use this tricky compiler plugin:

buildscript{dependencies{classpathgroup:'org.jetbrains.kotlin',name:'kotlin-allopen',version:"$versions.kotlin"}}

Steep learning curve

Final thoughts

Funny facts about Kotlin

In Poland, Kotlin is one of the best selling brands of ketchup. This name clash is nobody’s fault, but it’s funny. Kotlin sounds to our ears like Heinz.

Kotlin ketckup

Improving iOS app launch time

Static linking

We wanted to do this gradually, framework by framework. We also wanted to have a possibility to turn the static linking off in case of any unexpected problem.

We decided to use a two-step approach:

compiling frameworks code to static libraries,
converting frameworks (dynamic library packages) to resource bundles (resources packages).

Compiling framework code as a static library

That way we replaced our dynamic libraries with static libraries, but that was the easier part of the job.

Converting framework to resource bundle

Results

Each measurement in the following table is an average of 6 samples.

	iPhone 4s	iPad 2	iPhone 5c	iPhone 5s	iPhone 7+	iPad 2 cold launch
57 dylibs app launch time [s]	7.79	7.33	7.30	3.14	2.31	11.75
31 dylibs app launch time [s]	6.62	6.08	5.39	2.75	1.75	7.27
Launch speedup [%]	15.02	17.05	26.16	12.42	24.24	38.13

Static linking pitfall

That was the only major obstacle we’ve come across.

Dyld3

Looking for dyld3

I wondered: What makes system apps so special that they are launched with dyld3?

First guess: LC_LOAD_DYLINKER load command points to dyld3 executable…

$ otool -l /Applications/Calculator.app/Contents/MacOS/Calculator | grep"cmd LC_LOAD_DYLINKER"-A 2
          cmd LC_LOAD_DYLINKER
      cmdsize 32
         name /usr/lib/dyld (offset 12)

$ lldb /Applications/Calculator.app/Contents/MacOS/Calculator
(lldb) rbreak dyld3
Breakpoint 1: 887 locations.
(lldb) r
Process 92309 launched: '/Applications/Calculator.app/Contents/MacOS/Calculator'(x86_64)
Process 92309 stopped
* thread #1, stop reason = breakpoint 1.154
    frame #0: 0x00007fff72bf6296 libdyld.dylib`dyld3::AllImages::applyInterposingToDyldCache(dyld3::launch_cache::binary_format::Closure const*, dyld3::launch_cache::DynArray<dyld3::loader::ImageInfo> const&)
libdyld.dylib`dyld3::AllImages::applyInterposingToDyldCache:
->  0x7fff72bf6296 <+0>: pushq  %rbp
    0x7fff72bf6297 <+1>: movq   %rsp, %rbp
    0x7fff72bf629a <+4>: pushq  %r15
    0x7fff72bf629c <+6>: pushq  %r14
Target 0: (Calculator) stopped.

What did I end up with? Hidden dyld3 can be activated on macOS High Sierra using one of the following two approaches:

setting dyld`sEnableClosures:
- dyld`sEnableClosures needs to be set by e.g. using lldb memory write (unfortunately undocumented DYLD_USE_CLOSURES=1 variable only works on Apple internal systems),
- /usr/libexec/closured needs be compiled from dyld sources (it needs a few modifications to compile),
- read invocation in callClosureDaemon needs to be fixed (I filed a bug report for this issue); for the sake of tests I fixed it with lldb breakpoint command and a custom lldb script that invoked read in a loop until it returned 0, or
dyld closure needs to be generated and saved to the dyld cache… but… what is a dyld closure?

Dyld closure

$ dyld_closure_util -create_closure ~/tmp/TestMacApp.app/Contents/MacOS/TestMacApp | wc -l
  832363

The basic JSON representation of a dyld closure looks as follows:

{"dyld-cache-uuid":"9B095CC4-22F1-3F88-8821-8DFD979AB7AD","images":[{"path":"/Users/kamil.borzym/tmp/TestMacApp.app/Contents/MacOS/TestMacApp","uuid":"D5BDC1D3-D09E-36D5-96E9-E7FFA7EE955E""file-inode":"0x201D8F8BC",//usedtocheckifdyldclosureisstillvalid"file-mod-time":"0x5B032E9A",//usedtocheckifdyldclosureisstillvalid"dependents":[{"path":"/Users/kamil.borzym/tmp/TestMacApp.app/Contents/Frameworks/Frm1.framework/Versions/A/Frm1"},{"path":"/Users/kamil.borzym/tmp/TestMacApp.app/Contents/Frameworks/Frm2.framework/Versions/A/Frm2"},/*...*/],/*...*/},{"path":"/Users/kamil.borzym/tmp/TestMacApp.app/Contents/Frameworks/Frm1.framework/Versions/A/Frm1","dependents":[/*...*/]},/*...*/],/*...*/}

Dyld closure contains a fully resolved dylib dependency tree. That means: no more expensive dylib searching.

Dyld3 closure cache

I updated the dyld closure cache:

sudo update_dyld_shared_cache -force

dyld: found closure 0x7fffef8f278c in dyld shared cache

Dyld3 performance

warm launch – without system restart,
cold launch – system restart between each measured time sample.

Statically linked app always launched very fast, but I could not see any significant difference between dyld2 and dyld3 loading time.

launch type	dyld2	dyld3	static
warm	0.737s	0.726s	0.676s
cold	1.166s	1.094s	0.871s

launch type	dyld2	dyld3	static
warm	0.722s	0.731s	0.679s
cold	3.687s	2.947s	2.276s

dyld3 status

Throughout my studies at university and work in the industry I switched my primary programming language from Java to C# and back again to Java. This article gathers some of my thoughts on using both languages. It’s not intended to be a comprehensive comparison of Java and C#. There are a lot of other resources on the Internet that cover this topic. Instead, I want to focus on what I personally liked about both languages and how it felt to transition between them.

From Java to C#

Over the course of my computer science studies, Java was my primary programming language. It had everything I needed and felt much easier to use than C++. Around that time I also briefly tried C# but was instantly discouraged by the horribly slow Visual Studio 2008 IDE. So I stuck with Java for 5 years and considered myself a Java developer. That was until I found an interesting job offer that required C#. I decided to give it another try. I wrote some small projects using it and, to my surprise, I almost didn’t notice it was a different language. Actually, the thing that was most difficult for me to get used to was coding conventions, like starting method names with a capital letter. Everything else felt very familiar. So I dug a bit deeper into the language and found things that I actually considered an improvement over Java. First and foremost LINQ.

LINQ

When I first started to use C#, LINQ was the feature that I liked most. The name is an acronym for Language Integrated Query. It lets you query collections and other data sources in a way that’s both consistent and convenient to use. For example, let’s say you have a list of users and you want to narrow it down to only these whose names start with “A”. In plain C#, you could write this as:

IEnumerable<User> result = new List<User>();
for (User user in users) {
    if (user.name.StartsWith('A')) {
        result.Add(user);
    }
}

This code would work just fine, but when you glance at it, it’s not immediately obvious what it does. Also, it’s a lot of typing. Using LINQ this code could be simplified to:

List<User> result = users.Where(user -> user.name.StartsWith('A'));

Or if you prefer a more SQL-like syntax:

IEnumerable<User> result =
    from users
    where user.name.StartsWith('A')
    select user;

In addition to filtering, LINQ also supports various methods for transforming (or projecting) results to a different form (like Select or SelectMany), aggregation (for example Aggregate, or GroupBy) or joining multiple collections (for example Join, or Zip). .NET 4 further extended LINQ by adding support for parallel processing (PLINQ). All in all, the feature is quite powerful. People even wrote some one line queries to solve riddles like Sudoku. Of course LINQ doesn’t come without some drawbacks. Most notably, queries are notoriously difficult to debug as they are treated as just one statement by the debugger. The fact that query results are lazily evaluated also complicates stepping through the code a bit.

After I switched to C#, Java 8 introduced a feature very similar to LINQ called Streams which was then further improved in Java 9. Our previous example of filtering a list of users could look like this:

List<User>result=users.stream().filter(user->user.name.startsWith("A")).collect(toList());

The syntax of both technologies is very similar. The set of available convenience methods is a bit different but you can fill in the gaps by combining other, more basic, methods.

When learning C# I really liked how LINQ simplified the source code by removing unnecessary loops. C# also has some other features that help make the code more concise, such as for example properties or optional parameters.

Properties and Optional Parameters

Properties are an alternative to Java getters and setters. Having to add field accessors was one of the things that I liked the least about Java. Sure, most Java IDEs can generate them on demand, but it’s still another action you need to do. The code also gets more cluttered. There are some libraries that let you add accessors by annotating a class (such as Lombok), but this only really works if your IDE understands them. C# properties are built into the core language and offer a concise syntax for accessing fields. Instead of declaring a field with two accompanying accessor methods you can simply write:

public int SomeProperty {get; set;}

If required, you can provide custom logic after the get or set keywords, or remove the setter altogether if the property should be read-only.

Another feature that helps reduce code clutter is allowing optional method arguments. This reduces the need for overloading method names with different sets of parameters. For example, let’s take a standard IndexOf() method used to find the index of a particular character in a string. It usually has two versions: one that allows you to specify only the character you’re looking for and another which lets you specify where to start the search. This could be written as two methods:

public int IndexOf(char character) {
    return IndexOf(character, 0);
}

public int IndexOf(char character, int startIndex) {
    // find the first occurence of character starting from startIndex
}

Using optional parameters we only need one method:

public int IndexOf(char character, int startIndex = 0) {
    // find the first occurence of character starting from startIndex
}

Generics

Both Java and C# support generic types, but the way the feature is implemented differs significantly. In Java generics only exist at the language level. The runtime environment doesn’t support type parameters so the compiler removes them in a process called type erasure. It replaces all instances of a type parameter with the top level Object class or a constraint defined for the type parameter. It then inserts casts to preserve type safety. In contrast, C# compiler only verifies type constraints without removing the parameters. Actual code generation is deferred until the class is loaded. At that point, the type parameter is replaced with the actual runtime type. Both approaches to generics have their advantages and disadvantages. I won’t go into too much detail here as there are a lot of other resources that explain the topic better than I can. Setting aside some differences in resource consumption, in most use cases both approaches work just as well. But type erasure can cause some inconveniences. For example consider these two methods:

privatevoiddoSomething(List<String>list){...}privatevoiddoSomething(List<URL>list){...}

After type erasure takes place, the type of items contained in the collections is replaced by Object. Because of this, the two methods end up with an identical signature and the code would fail to compile. To overcome this, one needs to modify the names of the methods.

Type erasure can also make it more difficult to use reflection. For example, let’s say we have two classes A and B, and that B inherits from A. Let’s also create a list of type A which contains a mix of instances of A and B.

public class A {
}

public class B : A {
}

...

List<A> list = new List<A>();
list.Add(new A());
list.Add(new B());

Now let’s write a generic method that starts with logging the type of the list and then processes it some way:

public void Process<T>(List<T> list) where T : A {
    Console.WriteLine("Processing a list of {0}", typeof(T).Name);
    
    // do something else
}

Calling the method as:

Process(list);

would print the following line:

Processing a list of A

This approach would not work in Java. It relies on the fact that we can get the actual runtime value of T using the typeof operator. Since Java’s type erasure replaces T with Object, there’s no way we can determine the type of items in the list. We could try to iterate through them and check their types using the getClass() method, but, because our list contains a mixture of different classes, we would need to walk up the inheritance tree and calculate a common ancestor for them. This could get tricky if the classes implemented the same interfaces. A better approach, would be to define an additional Class parameter. We can then use it to explicitly tell the method what type we’re processing.

public<TextendsA>voidprocess(List<T>value,Class<T>type){System.out.printf("Processing a list of %s",type.getName());// do something else}

The method would then be called like this:

process(list,A.class);

Adding the extra information in this case might not seem like a huge problem, but having to pass the type explicitly is a bit unintuitive since the method already has a type parameter.

From C# to Java

Over the 5 years I was using C# I really got used to the convenience offered by these and other language features. At some point I decided it was time to change jobs and found an interesting position at Allegro. The primary language required was Java. Because of that, I decided to refresh my skills a bit and started a small project. I must say, transitioning back to Java wasn’t as smooth as moving to C# earlier. The language felt way less expressive. It seemed I needed to do much more to get the same result. But that was before I tried Spring.

Spring

I never really used Spring before I switched to C#. I wanted to give it a try and was very impressed. Without any real knowledge of the framework I managed to get a simple REST service running within a few minutes. This ease of configuration is something .NET WCF framework really lacks. The acronym stands for Windows Communication Foundation. It’s a framework for building services-oriented applications. It handles all the details of sending messages over the network, supporting multiple message patterns (like a request-response model, or a duplex channel), different transport protocols, encodings, and has a host of other features. It’s quite powerful but the drawback is it relies on a rather complicated XML configuration file. It’s not easy to get everything set up correctly without doing some research upfront. In Spring the XML configuration file is optional. The more convenient method is to configure the application in code by adding annotations and registering bean classes. Spring also decreases the entry cost by hiding all of the configuration settings you don’t need at first. Spring Boot sets some sensible defaults for you. Later, if you need to, you can always configure them the way you want, but until then, the framework takes care of everything.

Another thing I liked about Spring was the built-in dependency container. It requires the user to just add a few annotations on classes and the framework takes care of wiring them together. In general, I really liked how Spring makes use of annotations. They exist in C# as well but are nowhere near as utilised. I was surprised to see how easy it was to add validation for input messages of my service or to gather metrics for endpoints. All this without cluttering business logic with extra code that doesn’t really belong there. For example the below code creates a service endpoint that accepts PUT requests, with a URL path parameter that’s not null and a request body that’s validated using annotations defined for fields in the Request class.

@PutMapping(value="/{pathParameter}")publicResponseupdate(@PathVariable("pathParameter")@NotNullStringpathParameter,@RequestBody@Valid@NotNullRequestrequestBody)

But all this convenience comes at a cost. When using Spring, it sometimes feels there’s some sort of magic going on inside the framework. Sometimes even too much magic. For example, when accessing databases using Spring you can use repository interfaces. These are ordinary Java interfaces to which you add methods using a defined naming convention. Spring then generates a class with query implementations based on method names and injects it whenever you use the interface. Things get even more implicit if you want to add some custom queries. I read the documentation for this a few times and still didn’t quite believe the approach described there would just work. The solution was to add a class with the same name as the interface and an “Impl” suffix. It doesn’t need to implement the original interface or have any other connection to it. The name alone is enough for Spring to know it should be merged into the automatically generated class I mentioned earlier. This kind of implicit behavior makes it a bit difficult to understand what’s going on in a Spring application, especially when you’re not familiar with some of the framework’s features. But, all in all, Spring does a great job when it comes to configuring most common use cases.

Spring wasn’t the only thing I liked about Java. Some core language features also caught my attention, such as for example Optionals.

Optionals

Java Optionals are something C# could really use. It’s not uncommon to have C# programs cluttered with if (something == null) conditions everywhere. C# 6.0 introduced an optional chaining operator which improved this a bit by allowing method calls like this:

String value = GetSomething()?.GetSomethingElse()?.AndExtractString();

The new operator makes sure a method is only invoked on a non-null reference. Otherwise it just returns null. But you most likely still need to guard against a null when you try to use the final value returned from the call chain. Java has a much more flexible mechanism for handling null values. For example, it lets you define default values to use like this.

Stringvalue=Optional.ofNullable(GetSomething()).orElse("default");

It also allows you to transform the object or filter based on a predicate:

Stringvalue=Optional.ofNullable(getSomething()).map(something->something.getSomethingElse()).filter(somethingElse->somethingElse.isOk).map(somethingElse->somethingElse.andExtractString()).orElse("default");

The syntax really resembles Java Streams that I mentioned earlier and C# LINQ. It’s a really powerful way to handle possible null values in the code without dotting business logic with null checks.

Some final thoughts

So, after using both Java and C#, which language do I like more? The most honest answer is: I don’t really know. If I were to compare only core language features, C# seems to be more expressive, but Spring easily makes up for the difference. Both Java and C# are general purpose programming languages that let you code pretty much whatever you want. Is one better suited for some projects than the other? Probably so. C# seems to be a more natural choice when the main platform you want to target is Windows. There are also official APIs available to use all Microsoft services like OneDrive or Active Directory. On the other hand, Java seems to be a better choice outside of Windows platform. Then again, recent actions taken by Microsoft seem to indicate a change in the previous approach to favour its own platform and language. Acquisition of Xamarin and increasing support for .NET Core makes it easier to use C# on operating systems other than Windows. Client libraries and official code samples in most popular languages are also added to most of Microsoft’s services opening them to developers that don’t use C#. Java is also picking up momentum by shortening release cycles of new versions and adding a lot of interesting features. With all that, the line between the two worlds becomes increasingly blurry and the choice between C# and Java will often boil down to personal or company preference.

And what about picking a language when starting to learn programming? Looking back, do I think it was good to start with Java? I think so. The core language is probably easier to learn than C#. LINQ, Events and Properties can be a bit confusing at first. On the other hand, Spring gives Java a big advantage. And once you get a bit more experience in programming, the language isn’t that important anymore. After all, it’s just a way to express our ideas. It’s a bit like with natural languages: you can write poetry no matter which one you use. What really matters is the developer that turns an idea into an amazing application.

Lately my colleague Michał described how he tracked a Java memory leak. Although that problem was completely solved, new complications suddenly appeared on the horizon. As it usually happens, everything started with an alert from our monitoring system which woke me up in the middle of the night.

Introduction

Before I start a story I need to give you a little explanation about what Adventory is and how it was designed. Adventory is a microservice which is a part of our PPC Advertising Platform and its job is to take data from MongoDB and put it into Elasticsearch (2.4.6 at the time of writing this article). We do this because our index changes so quickly and from performance point of view we decided that it would be better to build a fresh Elasticsearch index each time rather than to update an existing one. For this purpose we have two Elasticsearch clusters. One of them serves queries while the other one indexes, and after each indexing they switch their roles.

Investigation

Our monitoring system told me that our indexing time increased and started hitting 20 minutes. That was too much. A couple of months ago we started with 2 minutes and the number of documents increased no more than 3—4 times since then. It should be much lower than 20 minutes. We hadn’t changed code related to indexing recently so it seemed to be a bug related to increasing load or amount of data in our system. Unfortunately we didn’t realise that things were getting worse until we crossed the critical line. We store metrics only for one week so we didn’t know when exactly indexing times got worse, we didn’t know if they increased suddenly or kept increasing slowly and consistently. Now, regardless of the answer, we needed to fix this quickly because we couldn’t accept these times.

The problem here is that since we were streaming data all the way, we couldn’t see which part of the process was slowing everything down. With this architecture it’s hard to see one metric with a separate value for each step because many steps execute in parallel. For this reason looking at the metrics gave me no certainty which step of the whole process was the bottleneck.

First attempt

First I took a look at the times of each step of the indexing part. It turned out that a forcemerge lasted ~7 minutes in each indexing. I turned this step off and started looking at metrics. It turned out that this part had a crucial impact on search latency and without this step search times grew drastically. I gave this idea another try and switched max_num_segments parameter from 1 (which compacts all the data in the index) back to its default (checking if a merge needs to execute, and if so, executing it) but it didn’t work either. So this was a dead end.

Second attempt

Undeterred by previous failure I tried another idea. Let’s change the way we transform data. When we had written this service there was no support for streaming data in Spring Data MongoDB. We needed to sort data with a given limit and implement our own equivalent of streaming. It was clearly not efficient. I rewrote this to use streams. By the way, we hit the hidden feature of MongoDB: default idle timeout. After fixing this, there was no good news again. This change had no significant effect on performance.

Third attempt

Due to lack of progress my colleague took over the helm and tried some different configuration options. Some were unsuitable:

setting refresh_interval to —1— since we do forcemerge in the end we don’t need to refresh during indexing,
turning off _field_names field — this is Elasticsearch’s feature which comes at a price and since we didn’t take advantage of it we could easily disable it,
playing with translog options.

Some helped a little:

turning off doc_values for fields we don’t sort or group,
turning off _all field,
storing in _source only those fields’ values which we really need, although this one might be tricky because these fields will be harder to debug in the future,
increasing the number of threads on which we process all data. We needed to be careful with this since too many threads may cause memory problems.

After this step, indexing times dropped from ~20 minutes to ~12 minutes. It was a significant improvement but still not enough.

The final attempt

Reviewing recently added features, I found one which drew my attention. We had added another result filtering method, this time by key-value pairs. The structure in the index was following:

{"filters":{"key1":["value1","value2","value3"],"key2":["value1","value2","value3"]}}

I turned these fields off for a couple of indexing runs and times dropped to ~4.5 minutes. The problem was that we had many different keys and because of the way the index was organized internally it could be a performance issue. Fortunately, since we only used filters for search, we could refactor it a little bit to the following structure:

{"filters":["key1_value1","key1_value2","key1_value3","key2_value4","key2_value5","key2_value6"]}

Due to this change we stopped forcing Elasticsearch to create heavyweight structures for storing many keys. Instead, we made a kind of enum values from key-value pairs and stored them in Elasticsearch as a flat list. This way is optimal for both indexing and searching.

Conclusion

Why didn’t we earlier realise that indexing times had increased? Because they increased slowly over time rather than in a sudden peak and not enough to hit our threshold level in the monitoring system. We hit the threshold a couple of days after deployment of filtering (as described before) when we had higher traffic. Unfortunately, there is no obvious solution to these kinds of problems. The best we can do is to observe metrics after each deployment and maybe even put them permanently on a screen in the developers’ room. It may be worth considering setting up a tool for anomaly detection as well.

We messed up. On July 18^th, 2018, at noon, Allegro went down and was unavailable for twenty minutes. The direct cause was a special offer in which one hundred Honor 7C phones whose regular price is around PLN 850 (about € 200), were offered at a price of PLN 1 (less than € 1). This attracted more traffic than we anticipated and at the same time triggered a configuration error in the way services are scaled out. This caused the site to go down despite there being plenty of CPUs, RAM, and network capacity available in our data centers.

In order to make up for the issues and apologize, we made it possible to finish the transaction afterwards to buyers who managed to buy the phone at the low price but whose transactions were aborted as the system went down.

But we believe that we also owe our customers and the tech community an explanation of how the crash came about and what technical measures we are putting in place in order to make such events less likely in the future. We prepare internal postmortems after any serious issue in order to analyze the causes and learn from our mistakes. This text is based on such an internal postmortem, prepared by multiple people from the teams that took part in dealing with the outage.

Architecture overview

First of all, let’s start with an overview of our architecture. As you probably already know from our blog, our system is based on microservices which run in a private cloud environment. In the typical case of a user searching for an offer, clicking on it to view details, and then buying, following major services are involved:

Listing — prepares data related to item listing (search result) pages
Search — responsible for low-level search in offers, based on keywords, parameters and other criteria
Transaction — allows items to be bought
Opbox— responsible for frontend rendering of the data returned by backend services
Item — service for frontend rendering of item pages

Outage timeline

The special offer was to start at noon sharp, and a direct link to its item page had been published before. At 11:15 we manually scaled out Listing service in order to be prepared for increased incoming traffic.

Search service traffic around noon. The number of requests per unit of time rose before noon, causing some requests to fail after reaching a high enough level. Apart from natural changes in traffic, this chart also shows the time of low traffic caused by frontend services failing around 12:05 and traffic rising again after those issues were resolved.

At 11:50, traffic to the major services was already 50% higher than the day before at the same time of day. At 11:55, further traffic increase caused response times of major services to rise, forcing us to scale out these services. A minute or two later, response times from Search and Listing services rose even more, forcing further scaling.

By 11:58, almost all resources in the part of the cluster provisioned for these services had been reserved even though only a fraction of the cluster’s capacity (or even that particular compartment) was actually used. When an application is deployed to our cloud, it declares the amount of resources such as processor cores and memory which it needs for each instance. These resources are reserved for a particular instance and can’t be used by others even if the owner is not really consuming them. Some services share their cluster space with others while others have separate compartments due to special needs.

As we later found out, due to a misconfiguration, some services reserved much more resources than they actually needed. This lead to a paradoxical situation in which there were plenty of resources available in the cluster but since they were reserved, they couldn’t be assigned to any other services. This prevented more instances from starting despite resources being there. Some other compartments within the cluster were not even affected at all, with lots of CPUs idling and tons of RAM laying around unused.

Listing service response times (avg median - average between instances of the median value, max p99 - maximum between instances of 99^th percentile). Response times stayed stable despite growing traffic but after reaching saturation, they increased very quickly, only to fall due to frontend services failing and later successful scaling of Listing service.

Seconds before noon, the price of the special offer was decreased to PLN 1 in order to ensure that at 12:00 sharp it would already be visible in all channels, and the first sales took place.

Also just before noon, traffic peaked at 200%-300% of the traffic from previous day, depending on service. At this stage, traffic was at its highest but due to excessive resource reservations, in some parts of the cluster we could not use available CPUs and RAM for starting new service instances. Meanwhile, the frontend service, Opbox, was starting to fail. This caused a decrease in traffic to the backend services. It was still quite high, though, and autoscaler started to spin up new instances of Search service. We manually added even more instances, but the resource reservations created previously prevented us from scaling up as far as to decrease response times significantly.

Increased response times caused some Opbox instances to not report their health status to the cluster correctly and at 12:05 the cluster started killing off unresponsive instances. While automated and manual scaling efforts continued, before 12:15 we started adding more resources to the cluster. At the same time, we started shutting down some non-critical services in order to free CPU and memory. Around 12:20, the situation was fully under control and Allegro became responsive again.

Analysis

What is going on inside a service which experiences traffic higher than it can handle with available resources? As response times increase, the autoscaler tries to scale up the service. On the other hand, instances whose health endpoint can’t respond within a specified timeout, are automatically shut down. During the outage, autoscaler did not respond quickly enough to rising traffic and we had to scale up manually. There were also some bad interactions between the autoscaler scaling services up and the cluster watchdog killing off unresponsive instances.

Excessive resource reservations were a major cause of problems since they prevented more instances from being started even though there were still plenty of resources available. As the probably most important action resulting from this postmortem, we plan to change the cluster’s approach to reserving resources so that there is less waste and resources are not locked out of the pool if they are not really used.

Apart from the obvious resources of the cluster: CPU and RAM, another resource which can become saturated are the connection pools for incoming and outgoing network connections as well as file descriptors associated with them. If we run out of them, our service becomes unresponsive even if CPU and RAM are available, and this is what happened to some of the backend services during the outage. By better tuning the configuration of thread and connection pools as well as the retry policies, we will be able to mitigate the impact of high traffic the next time it happens.

Undertow thread count in Listing service. A sudden increase is visible during the time when there were too few instances to handle incoming traffic. Compare with the graph of response times above.

In most cases, requests which time out, are repeated after a short delay. Under normal conditions, the second or third attempt usually succeeds, so these retries can often fix the situation and allow a response to still be delivered to the end user. However, if the whole cluster is maxed out, retries only increase the load while the whole request fails anyway. In such a situation, a circuit breaker should prevent further requests, but as we found out during postmortem analysis, one of the circuit breakers between our services was not correctly configured: the failure threshold for triggering it was set to a high value which we didn’t reach even during such a serious surge in traffic. Apart from fixing this, we are also adding an additional layer of circuit breakers directly after the frontend service.

The role of rate limiters is to cut off incoming traffic which displays suspicious patterns before it even enters the system. Such rate limiters did in fact kick in and were the cause of many “blank pages” seen by our users during the outage. Unfortunately, the coverage of the site by rate limiters was not complete, so while some pages were protected very well, others were not. The “blank page” had an internal retry, so a user looking at such a page was actually still generating requests to the system once in a while, further increasing the load. On the other hand, upon seeing that the site was broken, some users tried to manually refresh the pages they were viewing or to enter allegro.pl into the address bar and searching for the phone’s name, thus generating even more search requests manually.

Another takeaway was the observation that new Opbox instances had issues while starting under high load. Newly started instances very quickly reached “unresponsive” status and were automatically killed. We will try out several ideas which should make the service start up faster even if it gets hit with lots of requests right away.

Finally, by introducing smart caches, we should be able to eliminate the need for many requests altogether. Due to personalisation, item pages are normally not cached and neither is the data returned by backend services used for rendering those pages. However, we plan to introduce a mechanism which will be able to tell backend services to generate simplified, cacheable responses under special conditions. This will allow us to decrease load under heavy traffic.

Closing remarks

Apart from the need of introducing the improvements mentioned above, we learned a few other interesting things.

First off, we certainly learned that traffic drawn in by an attractive offer can outgrow our expectations. We should have been ready for more than we were, both in terms of using cluster capacity effectively and in terms of general readiness to handle unexpected situations caused by a sudden surge in traffic. Apart from technical insights, we also learned some lessons on the business side of things, related to dealing with attractive offers and organizing promotions, for example that publishing a direct link to the special offer ahead of time was a rather bad idea.

Interestingly enough, the traffic which brought us down, was in large part bots rather than human users. Apparently, some people were so eager to buy the phone cheaply that they used automated bots in order to increase their chances of being in the lucky hundred. Some even shared their custom solutions online. Since we want to create a level playing field for all users, we plan to make it harder for bots to participate in this kind of offers.

Even though it may have looked as if the site had gone down due to an exhaustion of resources such as processing power or memory, actually plenty of these resources were available. However, an invalid approach to reserving resources made it impossible at one point to use them for starting new instances of the services which we needed to scale out.

I think that despite the outage taking place, the way we handled it validated our approach to architecture. Thanks to the cloud, we were able to scale out all critical services as long as the resource limits allowed us to. Using microservices, we were able to scale different parts of the system differently which made it possible to use the available cluster more effectively. Loose coupling and separation of concerns between the services allowed us to safely shut down those parts of the system which were not critical in order to make room for more instances of the critical services.

Our decentralized team structure was a mixed bag, but with advantages outweighing disadvantages. It certainly lead to some communication overhead and occasional miscommunication, but on the other hand, it allowed teams responsible for different services to act mostly independently, which increased our reaction speed overall. Note, that “decentralized team structure” does not mean “free for all”. In particular, during an outage, there is a formal command structure for coordinating the whole effort, but it does not mean micromanagement.

We know that Allegro is an important place for our customers, and every day we work hard to make it better. We hope that the information contained in this postmortem will be interesting for the IT community. We are implementing actions outlined in a much more detailed internal report in order to make such events less probable in the future. Even in failure there is opportunity for learning.

Allegro Engineering Team