Sunday, January 31, 2016

OpenGL rendering performance test #1 - Procedural grass

This is the first of several OpenGL rendering performance tests done to determine the best strategy for rendering of various type of procedural and non-procedural content.
This particular test concerns with rendering of small procedurally generated meshes and the optimal way to render them to achieve the highest triangle throughput on different hardware.

Test setup:
  • rendering a large number of individual grass blades using various rendering methods
  • a single grass blade in the test is made of 7 vertices and 5 triangles
    grass-strip.png
  • grass is organized in tiles with 256x256 blades, making 327680 effective triangles, rendered using either:
    • a single draw call rendering 327680 grass triangles at once, or
    • an instanced draw call with different instance mesh sizes and corresponding number of instances, rendering the same scene
  • rendering methods:
    • arrays via glDrawArrays / glDrawArraysInstanced, using a triangle strip with 2 extra vertices to transition from one blade to another (except for instanced draw call with a single blade per instance)
    • indexed rendering via glDrawElements / glDrawElementsInstanced, separately as a triangle strip or a triangle list, with index buffer containing indices for 1,2,4 etc up to all 65536 blades at once
    • a geometry shader (GS) producing blades from point input, optionally also generating several blades in one invocation
      GS performance showed up to be dependent on the number of interpolants used by the fragment shader, so we also tested a varying number of these
  • position of grass blades is computed from the blade coordinate within the grass tile, derived purely from gl_InstanceID and gl_VertexID values; rendering does not use any vertex attributes, it uses only small textures storing the ground and grass height, looked up using the blade coordinate
  • also tested randomizing the order in which the blades are rendered in tile, it seems to boost performance on older GPUs a bit

This test generates lots of geometry from little input data, which might be considered a distinctive property of procedural rendering.

Goals of the test are to determine the fastest rendering method for procedural geometry across different graphics cards and architectures, the optimal mesh size per (internal) draw call, the achievable triangle throughput and factors that affect it.

Results

Performance results for the same scene rendered divided into varying size instances. The best results overall were obtained with indexed triangle list rendering, shown in the following graph, measured as triangle throughput at different instance mesh sizes:

On Nvidia GPUs and on older AMD chips (1st gen GCN and earlier) it’s good to keep mesh size above 80 triangles in order not to lose performance significantly. On newer AMD chips the threshold seems to be much higher - above 1k, and after 20k the performance goes down again.
Unfortunately I haven’t got a Fury card here to test if this holds for the latest parts, but anyone can run the tests and submit the results to us (links at the end).

In any case, mesh size around 5k triangles is a good one that works well across different GPUs. Interestingly both vendors start to have issues at different ends - on Nvidia, performance drops with small meshes/lots of instances (increasing CPU side overhead), whereas AMD cards start having problem with larger meshes (but not with array rendering).

Conclusion: with small meshes, always group several instances into one draw call so that resulting effective mesh size is around 1 - 20k.

Geometry shader performance roughly corresponds to the performance of instanced vertex shader rendering with small instance mesh sizes, which in all cases lie below the peak performance. This also shows as a problem on newer AMD cards with disproportionally low performance with geometry shaders.
Note that there’s one factor that can still shift the disadvantage in some cases - the ability to implement culling as a fast discard in GS, especially in this test where lots of off-screen geometry can be discarded.
mtris-gs.png
GS performance is also affected by the number of interpolants (0 or 4 floats in the graph), but mainly on Nvidia cards.

The following graph shows the performance as a percentage of given card’s theoretical peak Mtris/s performance (core clock * triangles per clock of given architecture). However, the resulting numbers for AMD cards seem to be too low.

Perhaps a better comparison is the performance per dollar graph that follows after this one.


Performance per dollar, using the prices as of Dec 2015 from http://www.videocardbenchmark.net/gpu_value.html


Results for individual GPUs


Arrays are generally slower here because of 2 extra vertices needed to cross between grass blades in triangle strips, and only match the indexed rendering when using per-blade instances, where the extra vertices aren’t needed.

“Randomized” versions just reverse the bits in computed blade index to ensure that blade positions are spread. This seems to help a bit on older architectures.

AMD7970-grass.png

Unexpected performance drop on smaller meshes on newer AMD cards (380, 390X)
AMDR9_380-grass.png

Slower rendering on Nvidia with small meshes is due to sudden CPU-side overhead.
NV750Ti-grass.png
NV960-grass.png
With more powerful Nvidia GPUs it’s also best to aim for meshes larger than 1k, as elsewhere minor performance bump becomes slightly more prominent:
NV980Ti-grass.png

Older Nvidia GPUs show comparatively worse geometry shader performance than newer ones:
NV780-grass.png


Test sources and binaries


All tests can be found at https://github.com/hrabcak/draw_call_perf

If anyone wants to contribute their benchmark results, the binaries can be downloaded from here: Download Outerra perf test

There are 3 tests: grass, buildings and cubes. The tests can be launched by running their respective bat files. Each test run lasts around 4-5 seconds, but there are many tested combinations so the total time is up to 15 minutes.

Once each test completes, you will be asked to submit the test results to us. The test results are stored in a CSV file, and include the GPU type, driver version and performance measurements.
We will be updating the graphs to fill the missing pieces.

The other two tests will be described in subsequent blog posts.

@cameni

Wednesday, May 27, 2015

Evaluation of 30m elevation data in Outerra

In September 2014 it was announced that the 30m (1") SRTM dataset will be released with global coverage (previously available only for US region). This was eagerly expected by a lot of people, especially simulator fans, as the 90m data are lacking finer erosion patterns.

Final release is planned for September 2015, but as of now already a large part of the world is ready, with the exception of Northeast Africa and Southwest Asia. I decided to do some early testing of the data, and here's a comparison video showing the differences. Atmospheric haze was lowered for the video to better show the differences:


Depending on the location, the 30m dataset can be a huge enhancement, adding lots of smaller scale details.

Note that "30m" actually refers to 38m OT dataset that was created by re-projecting to OT quad-sphere mapping and fractal-resampling from 30m (1") sources, while the original dataset is effectively a 76m one, produced by bilinear resampling (which washes out some details by itself).

Here are also some animated pics, switching between 38/30m and 76/90m data:



As it can be seen, details are added both to flat parts and the slopes. Increased detail roughly triples the resulting dataset size, rising from 12.5GB to 39GB, excluding the color data which remain the same.

However, an interesting thing is that a 76/30 dataset (76m fractal resampling from 30m sources) still looks way better than the original dataset made from the 90m data, while staying roughly the same size. The following animation shows the difference between 76/30 and 38/30 data:


The extra detail in 38/30 data visible on OT terrain is actually not fully caused by the differences in detail; it seems that the procedural refinement that continues producing more detail is overreacting and producing a lot of small dips, as can be seen on the snow pattern.

Quality of the data

The problem seems to be that the 30m SRTM data are still noticeably filtered and smoothed. When the resampled 38m grid is overlaid on some known mountain crests, it's obvious that it's way more smooth than the real ones. This has some negative consequences when the data are further refined procedurally, and the elevations coming from real data are already missing a part of the spectrum because of the filtering.

The effective resolution seems to be actually closer to 90m, but it's still way better than the 90m source data, which were produced via a 3x3 averaging filter - that means even worse loss of detail.

30m sources can definitely serve to make a better 76m OT base world dataset, but I'm not certain if the increased detail alone justifies the threefold increase in size, compared to the 76/30 compilation (not to the original 76/90 one).
There are still things that can be done with the 30m data though. For example, attempt to reconstruct the lost detail by applying a crest sharpening filter on the source data. We can also use even more detailed elevation data as inputs to the dataset compiler where possible, while keeping the output resolution at 38 meters.

Issues

Apart from the filtering problem there are some other issues that show up in the new data, some of which are present in the old dataset as well but grew worse with the increase of detail.

First issue are various holes and false values that existed in the acquired radar data because of some regions being under heavy clouds during all the passes of the SRTM mission. While the 90m data were mostly fixed in many of these cases (except for some patterns in mountainous regions), new 30m sources are still containing many of them. It might be useful to create an in-game page to report these bugs by the users, crowd-sourcing it.

Another issue is that dense urban areas with tall or large buildings have them baked into elevations. It was also present in 90m data, but here it is more visible. For example, this is the Vehicle Assembly Building in Launch Complex 39:


The plan is to filter these out using urban area masks, which will be useful also to level the terrain in cities. One potential problem with that is that the urban mask data are available only in considerably coarser resolution than the elevation data, which may cause some unwanted effects.


Bathymetric data


Together with higher resolution land data we also went to use enhanced precision ocean depth data, released recently in 500m resolution. Previously used dataset had resolution of 1km, which was insufficient especially in coastal areas.

Unfortunately, the effective resolution of these data is still 1km or worse in most places, and the way the data were upscaled introduces nasty artifacts, since OT now takes these data as mandatory and cannot refine them using fractal (much more natural-looking) algorithms. The result is actually much worse than the original 1km sources (Hawaiian island):


Just as with the land data, any artificial resampling is making things worse. Fortunately, for important coastal areas there are plenty of other sources with much finer resolution, that we can combine with the global 1km bathymetric data. This is how the preliminary test of these sources looks like (ignore the land fills in bays):



Sunday, December 7, 2014

TitanIM, an Outerra-based military simulation platform

Revealed at the world's largest modeling, simulation and training conference oriented at military use, I/ITSEC (Orlando 1-5.Dec 2014), TitanIM is a new simulation platform built on top of the Outerra engine, utilizing its ability to render the whole planet with the full range of detail levels from space down to the blades of grass.



Military simulation was always one of the best fitting areas for the use of the engine. Unlike most other procedural engines, Outerra focuses on using real world data, enhancing it by seamless procedural refinement, which allows it to render accurate geography with first-person level ground details that does not need an extraordinary amount of streamed data to achieve geo-typical terrain. Supported  scale range allows it to combine all types of simulation environments into a single world and eventually into a single battlefield, which is something that's highly desired in this field.

Over the years we have been in contact with several companies in the military simulation business, which were interested in using the technology. As probably many people know, Bohemia Interactive Simulations (BIS), makers of VBS, is a major player in the "serious games" field. What is probably less known is that the company was originally founded as Bohemia Interactive Australia by David Lagettie, an Australian who saw the potential in Operation Flashpoint game, and went to use it for military simulation and training software, which soon saw a widespread adoption.



Later, around the time BIA was relocated to Prague, he left and founded Virtual Simulation Systems (VSS), a company developing all kinds of simulation hardware used in weapon and vehicle/aircraft simulators. Several of these were actually used at the ITSEC demo, shown on the screens below.


A new era: TitanIM/Outerra


TitanIM is a company founded by David Lagettie to develop a simulation platform based on the Outerra engine, in close cooperation with us. Right now Outerra engine isn't generally available, still being in the development phase, and so any early projects have to be developed with our direct participation. We have worked with TitanIM for some time already, providing specialized API and the functionality they require for specific tasks of that domain. This effort culminated at this year's I/ITSEC conference where TitanIM was officially revealed, although several projects committed to using Titan platform even before the official launch.

Here's a quickly made video showing some (not all) simulators shown at the I/ITSEC:



Titan booth was shared with two well-known companies that are already using Titan for their hardware simulators: Laser Shot and Intelligent Decisions (ID), showing diversity of applications even in this early phase.


A couple of photos of the simulators demoed:


Complete Aircrew Training System (CATS) with UH-60 Helicopter simulator



Boat platform, taking data from Outerra vehicle simulation and driving the platform servos.

Phil inside the F18 simulator using Oculus DK2, with accurate cockpit construction matching the rendered 3D model.

Overall it was a great success, with the whole Titan team working hard to get everything connected and working. These guys are seriously dedicated and insanely hard working; Phil (TitanIM co-founder and COO) had to be forcibly sent to get a bit of sleep after running for 3 days without rest, with other guys usually getting only short naps a day too.

We also decided to grant the exclusive license to Outerra engine to TitanIM for military use (direct or indirect use by the military), to secure its position, since we are already participating on it quite closely. This probably won't be good news for some other interested parties, but as many people are pointing out, competition can only be good in this field. With Outerra engine powering TitanIM, a global integrated simulation platform is possible for the first time, connecting all simulation areas - space, air, ground and water, into a single limitless world.

What does this mean for Outerra: apart from gaining an experienced partner handling simulation aspects that we could not cover by ourselves, lots of the stuff done for Titan will also get back to the Outerra engine and our games and simulators. We are also getting access to other connected companies, especially the hardware makers, making the engine more robust and universal in the process. It also allowed us to grow, to hire more people into our office in Bratislava, and the results will be showing up soon.