Playing Games with Python

August 4, 2017

I've been working for a while now on a bot that plays video games.

I needed a way to capture the screen and send input to the game fast enough for a bot to interact with it in real time. I decided to go with Windows because most games run on Windows. The only computers I have with GPUs also run Windows, and that will be useful once I get into some machine learning.

Windows has an API that lets you capture the screen. It's called the Desktop Duplication API and it's fairly new (it was introduced in Windows 8). You ask it for the latest frame of a display and it hands you a DirectX texture containing the screen image.

The process works like this.

  1. Capture the display that the game is running on.
  2. Copy the section of the screen that contains the game to another texture backed by CPU memory instead of GPU memory.
  3. Copy the pixels out of the texture to a numpy array.

Here's how it works. I keep all the state in a big ol' struct.

struct CaptureState {
    ID3D11Device *device;
    ID3D11DeviceContext *device_context;
    
    HWND capture_window;
    IDXGIOutputDuplication *output_duplication;
    
    int captured_display_left;
    int captured_display_top;
    int captured_display_right;
    int captured_display_bottom;

    int capture_window_left;
    int capture_window_top;
    int capture_window_right;
    int capture_window_bottom;
    
    ID3D11Texture2D *capture_texture;
    ID3D11Texture2D *region_copy_texture;
    IDXGISurface *region_copy_surface;

    int width;
    int height;
    int components;
};

Initialization starts like this.

int
captureStateInit(CaptureState* state, const char* window_name) {
    // Windows COM api stuff, sorta odd if you've never seen it before.
    IDXGIFactory1 *dxgi_factory = NULL;
    HRESULT hr = CreateDXGIFactory1(__uuidof(IDXGIFactory1), (void**)&dxgi_factory);
    if (FAILED(hr)) {
        return -1;
    }

    D3D_FEATURE_LEVEL supported_feature_levels[] = {
        D3D_FEATURE_LEVEL_11_1,
        D3D_FEATURE_LEVEL_11_0,
        D3D_FEATURE_LEVEL_10_1,
        D3D_FEATURE_LEVEL_10_0,
        D3D_FEATURE_LEVEL_9_3,
        D3D_FEATURE_LEVEL_9_2,
        D3D_FEATURE_LEVEL_9_1,
    };

    D3D_FEATURE_LEVEL fl;

    hr = D3D11CreateDevice(NULL, D3D_DRIVER_TYPE_HARDWARE, NULL, D3D11_CREATE_DEVICE_DEBUG,
                           supported_feature_levels, LEN(supported_feature_levels),
                           D3D11_SDK_VERSION, &state->device, &fl, &state->device_context);

    if (FAILED(hr)) {
        return -1;
    }

    state->output_duplication = NULL;
    
    state->captured_display_left = 0;
    state->captured_display_top = 0;
    state->captured_display_right = 0;
    state->captured_display_bottom = 0;
    
    state->capture_texture = NULL;
    state->region_copy_texture = NULL;
    state->region_copy_surface = NULL;

    state->capture_window = NULL;

Next we need to find the window. The Windows API for enumerating windows takes a callback that it invokes once per top-level window. The callback gets a window handle and a pointer to user data. We build a window buffer and a callback that pushes handles onto it.

struct WindowBuf {
    HWND windows[100];
    int num_windows;
};

BOOL CALLBACK
enumWindowsProc(HWND hwnd, LPARAM lParam) {
    WindowBuf *buf = (WindowBuf*)lParam;

    // Only record visible windows, and stop once the buffer is full.
    if (IsWindowVisible(hwnd) && buf->num_windows < (int)LEN(buf->windows)) {
        buf->windows[buf->num_windows++] = hwnd;
    }

    return TRUE;
}

Now we can call EnumWindows, which invokes our callback for every top-level window. Once enumeration finishes we can search the buffer for the window we want.

    WindowBuf buf;
    buf.num_windows = 0;
    EnumWindows(enumWindowsProc, (LPARAM)&buf);

    char s[128];
    for (int p=0; p<buf.num_windows; p++) {
        HWND w = buf.windows[p];
        GetWindowText(w, (LPSTR)s, 128);
        if (strcmp(s, window_name) == 0) {
            state->capture_window = w;
            printf("Found the window! %s\n", s);
            break;
        }
    }

Now that we have a handle to the window we can request more info about it. We want to know where on the screen (or screens) the window is so we can capture that display and copy that region out of it.

    WINDOWINFO info;
    info.cbSize = sizeof(WINDOWINFO); // GetWindowInfo requires cbSize to be set
    GetWindowInfo(state->capture_window, &info);

    state->capture_window_left = info.rcClient.left;
    state->capture_window_top = info.rcClient.top;
    state->capture_window_right = info.rcClient.right;
    state->capture_window_bottom = info.rcClient.bottom;

Next we start the output duplication. We iterate through the adapters, and the outputs on each adapter, to find the right display. An adapter is basically a GPU and an output is a display attached to it. My laptop only has one display, so this isn't tested on fancy multi-monitor setups and there are probably some gotchas there.

    // find the display that has the window on it.
    IDXGIAdapter1 *adapter;
    for (int adapter_index = 0;
         dxgi_factory->EnumAdapters1(adapter_index, &adapter) != DXGI_ERROR_NOT_FOUND;
         adapter_index++) {
        // enumerate outputs
        IDXGIOutput *output;
        for (int output_index = 0;
             adapter->EnumOutputs(output_index, &output) != DXGI_ERROR_NOT_FOUND;
             output_index++) {
            DXGI_OUTPUT_DESC output_desc;
            output->GetDesc(&output_desc);
            if (output_desc.AttachedToDesktop) {
                // printf("this display dimensions (%i,%i,%i,%i)\n",
                //        output_desc.DesktopCoordinates.top,
                //        output_desc.DesktopCoordinates.left,
                //        output_desc.DesktopCoordinates.bottom,
                //        output_desc.DesktopCoordinates.right);
                if (output_desc.DesktopCoordinates.left <= state->capture_window_left &&
                    output_desc.DesktopCoordinates.right >= state->capture_window_right &&
                    output_desc.DesktopCoordinates.top <= state->capture_window_top &&
                    output_desc.DesktopCoordinates.bottom >= state->capture_window_bottom) {

                    // printf("Display output found. DeviceName=%ls  AttachedToDesktop=%d Rotation=%d DesktopCoordinates={(%d,%d),(%d,%d)}\n",
                    //         output_desc.DeviceName,
                    //         output_desc.AttachedToDesktop,
                    //         output_desc.Rotation,
                    //         output_desc.DesktopCoordinates.left,
                    //         output_desc.DesktopCoordinates.top,
                    //         output_desc.DesktopCoordinates.right,
                    //         output_desc.DesktopCoordinates.bottom);
                    
                    state->captured_display_left = output_desc.DesktopCoordinates.left;
                    state->captured_display_right = output_desc.DesktopCoordinates.right;
                    state->captured_display_bottom = output_desc.DesktopCoordinates.bottom;
                    state->captured_display_top = output_desc.DesktopCoordinates.top;

                    // QueryInterface for IDXGIOutput1 rather than C-casting.
                    IDXGIOutput1 *output1 = NULL;
                    output->QueryInterface(__uuidof(IDXGIOutput1), (void**)&output1);
                    hr = output1->DuplicateOutput(state->device, &state->output_duplication);
                    output1->Release();
                    if (FAILED(hr)) {
                        printf("Output Duplication Failed\n");
                        printf("%#x\n", hr);
                        return -1;
                    }
                    // printf("Output Duplicated\n");
                }
            }
            output->Release();
        }
        adapter->Release();
    }

    state->width = state->capture_window_right - state->capture_window_left;
    state->height = state->capture_window_bottom - state->capture_window_top;
    state->components = 4;

We know the size of the window and have duplicated its output, so we need a buffer to copy the image data into. I do this by creating an empty numpy array in Python and passing its raw pointer back to C to get filled out.
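That buffer setup looks roughly like this on the Python side (the sizes here are made up; in the real code they come from the C init function):

```python
import numpy as np

# Hypothetical capture dimensions; the real values come from the DLL.
height, width, components = 480, 640, 4

# One flat byte buffer, allocated once and reused for every frame.
raw_buffer = np.empty(height * width * components, dtype=np.uint8)

# numpy exposes the underlying allocation as a raw address, which cffi
# can cast to a uint8_t* and hand to the C side to fill in place.
raw_pointer = raw_buffer.ctypes.data
```

Because the C code writes into the array's memory directly, no copy happens on the Python side at all.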

Now we're ready to capture frames. We'll call this function every time we want to see the screen.

int
captureStateCaptureFrame(CaptureState* state, uint8_t* copy_to_buffer) {
    DXGI_OUTDUPL_FRAME_INFO capture_frame_info;
    IDXGIResource *resource;

    HRESULT hr = state->output_duplication->AcquireNextFrame(0,
                                                             &capture_frame_info,
                                                             &resource);
    if (FAILED(hr)) {
        // no new frame
        return 0;
    }

    resource->QueryInterface(__uuidof(ID3D11Texture2D), (void**)&state->capture_texture);
    resource->Release();

AcquireNextFrame gives us a Direct3D texture of the whole screen. The data lives in GPU memory, so first we have to copy it to a texture backed by CPU memory. We also don't want the whole screen, so we create the second texture at the size of the window region. The first time through we initialize this intermediate texture.

    if (!state->region_copy_texture) {
        D3D11_TEXTURE2D_DESC capture_texture_desc;
        state->capture_texture->GetDesc(&capture_texture_desc);

        D3D11_TEXTURE2D_DESC region_texture_desc;
        ZeroMemory(&region_texture_desc, sizeof(region_texture_desc));

        region_texture_desc.Width = state->width;
        region_texture_desc.Height = state->height;
        region_texture_desc.MipLevels = 1;
        region_texture_desc.ArraySize = 1;
        region_texture_desc.SampleDesc.Count = 1;
        region_texture_desc.SampleDesc.Quality = 0;
        region_texture_desc.Usage = D3D11_USAGE_STAGING;
        region_texture_desc.Format = capture_texture_desc.Format;
        region_texture_desc.BindFlags = 0;
        region_texture_desc.CPUAccessFlags = D3D11_CPU_ACCESS_READ;
        region_texture_desc.MiscFlags = 0;

        hr = state->device->CreateTexture2D(&region_texture_desc, NULL, &state->region_copy_texture);
        if (FAILED(hr)) {
            return -1;
        }
    }

Now we can copy a region of the screen texture into our CPU-backed texture. Then we take that texture, get a Surface from it, and Map it, which finally gives us a raw pointer to the pixel data.

    // copy region of screen to texture
    D3D11_BOX source_region;
    source_region.left = state->capture_window_left;
    source_region.right = state->capture_window_right;
    source_region.top = state->capture_window_top;
    source_region.bottom = state->capture_window_bottom;
    source_region.front = 0;
    source_region.back = 1;

    state->device_context->CopySubresourceRegion(state->region_copy_texture, 0, 0, 0, 0, state->capture_texture, 0, &source_region);
    state->region_copy_texture->QueryInterface(__uuidof(IDXGISurface), (void**)&state->region_copy_surface);

    DXGI_MAPPED_RECT rect;
    hr = state->region_copy_surface->Map(&rect, DXGI_MAP_READ);
    if (FAILED(hr)) {
        return -1;
    }

We can now copy from this pointer into our buffer. We have to obey the Pitch of the mapped rect because the rows of the texture are not necessarily packed: DirectX pads each row to whatever stride makes its copies efficient, so a source row can be wider in memory than width * 4 bytes.

    uint8_t *dest = copy_to_buffer;
    uint8_t *src = rect.pBits;
    for (int row = 0; row < state->height; row++) {
        memcpy(dest, src, state->width * 4);
        dest += state->width * 4;
        src += rect.Pitch;
    }
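To see what the pitch handling above is doing, here's the same copy simulated in numpy with made-up sizes: the source rows are padded to a 32-byte stride, and we pull out only the real pixel bytes from each row.

```python
import numpy as np

# Made-up sizes for illustration.
width, height, components = 3, 2, 4
row_bytes = width * components   # 12 bytes of real pixel data per row
pitch = 32                       # what DirectX might pad each row out to

# Fake mapped surface: height rows of `pitch` bytes, where only the first
# row_bytes of each row hold pixels (marked row + 1 here so we can check).
src = np.zeros(height * pitch, dtype=np.uint8)
for row in range(height):
    src[row * pitch:row * pitch + row_bytes] = row + 1

# The tight destination buffer: exactly row_bytes per row, no padding.
dest = np.empty(height * row_bytes, dtype=np.uint8)
for row in range(height):
    dest[row * row_bytes:(row + 1) * row_bytes] = \
        src[row * pitch:row * pitch + row_bytes]
```

If you memcpy'd `height * row_bytes` in one shot instead, the padding bytes would shear every row after the first.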

Once we have the data copied over we can unmap the surface, release what we grabbed this frame, and release the frame.

    state->region_copy_surface->Unmap();
    state->region_copy_surface->Release();
    state->capture_texture->Release();
    state->output_duplication->ReleaseFrame();
    
    return 1;
}

To make it easy to call from Python we keep a single static CaptureState in the file. We could change this later if we want multiple instances, but it works fine for my use case. We also have to wrap the entry points in an extern "C" block because Python calls C, not C++. We compile it as a DLL that we can load and call into from Python. The whole file looks like this. (There are a couple of other helper and test functions thrown in too.)

#include <windows.h>
#include <d3d11.h>
#include <dxgi1_2.h>

#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define LEN(e) (sizeof(e)/sizeof(e[0]))

uint8_t test_image_rgba[] = {
    255,0,0,255, 0,255,0,255, 0,0,255,255,
    0,255,255,255, 255,0,255,255, 255,255,0,255
};

struct WindowBuf {
    HWND windows[100];
    int num_windows;
};

BOOL CALLBACK
enumWindowsProc(HWND hwnd, LPARAM lParam) {
    WindowBuf *buf = (WindowBuf*)lParam;

    // Only record visible windows, and stop once the buffer is full.
    if (IsWindowVisible(hwnd) && buf->num_windows < (int)LEN(buf->windows)) {
        buf->windows[buf->num_windows++] = hwnd;
    }

    return TRUE;
}

void
listProcesses() {
    WindowBuf buf;
    buf.num_windows = 0;
    EnumWindows(enumWindowsProc, (LPARAM)&buf);

    char s[128];
    for (int p=0; p<buf.num_windows; p++) {
        HWND w = buf.windows[p];
        GetWindowText(w, (LPSTR)s, 128);
        if (strcmp(s, "") != 0) {
            printf("%s\n", s);
        }
    }
}

struct CaptureState {
    ID3D11Device *device;
    ID3D11DeviceContext *device_context;
    
    HWND capture_window;
    IDXGIOutputDuplication *output_duplication;
    
    int captured_display_left;
    int captured_display_top;
    int captured_display_right;
    int captured_display_bottom;

    int capture_window_left;
    int capture_window_top;
    int capture_window_right;
    int capture_window_bottom;
    
    ID3D11Texture2D *capture_texture;
    ID3D11Texture2D *region_copy_texture;
    IDXGISurface *region_copy_surface;

    int width;
    int height;
    int components;
};

int
captureStateInit(CaptureState* state, const char* window_name) {
    // Windows COM api stuff, sorta odd if you've never seen it before.
    IDXGIFactory1 *dxgi_factory = NULL;
    HRESULT hr = CreateDXGIFactory1(__uuidof(IDXGIFactory1), (void**)&dxgi_factory);
    if (FAILED(hr)) {
        return -1;
    }

    D3D_FEATURE_LEVEL supported_feature_levels[] = {
        D3D_FEATURE_LEVEL_11_1,
        D3D_FEATURE_LEVEL_11_0,
        D3D_FEATURE_LEVEL_10_1,
        D3D_FEATURE_LEVEL_10_0,
        D3D_FEATURE_LEVEL_9_3,
        D3D_FEATURE_LEVEL_9_2,
        D3D_FEATURE_LEVEL_9_1,
    };

    D3D_FEATURE_LEVEL fl;

    hr = D3D11CreateDevice(NULL, D3D_DRIVER_TYPE_HARDWARE, NULL, D3D11_CREATE_DEVICE_DEBUG,
                           supported_feature_levels, LEN(supported_feature_levels),
                           D3D11_SDK_VERSION, &state->device, &fl, &state->device_context);

    if (FAILED(hr)) {
        return -1;
    }

    state->output_duplication = NULL;
    
    state->captured_display_left = 0;
    state->captured_display_top = 0;
    state->captured_display_right = 0;
    state->captured_display_bottom = 0;
    
    state->capture_texture = NULL;
    state->region_copy_texture = NULL;
    state->region_copy_surface = NULL;

    state->capture_window = NULL;

    // find the window we want
    WindowBuf buf;
    buf.num_windows = 0;
    EnumWindows(enumWindowsProc, (LPARAM)&buf);

    char s[128];
    for (int p=0; p<buf.num_windows; p++) {
        HWND w = buf.windows[p];
        GetWindowText(w, (LPSTR)s, 128);
        if (strcmp(s, window_name) == 0) {
            state->capture_window = w;
            printf("Found the window! %s\n", s);
            break;
        }
    }

    if (!state->capture_window) {
        printf("Couldn't find window\n");
        return -1;
    }

    WINDOWINFO info;
    info.cbSize = sizeof(WINDOWINFO); // GetWindowInfo requires cbSize to be set
    GetWindowInfo(state->capture_window, &info);

    state->capture_window_left = info.rcClient.left;
    state->capture_window_top = info.rcClient.top;
    state->capture_window_right = info.rcClient.right;
    state->capture_window_bottom = info.rcClient.bottom;

    // find the display that has the window on it.
    IDXGIAdapter1 *adapter;
    for (int adapter_index = 0;
         dxgi_factory->EnumAdapters1(adapter_index, &adapter) != DXGI_ERROR_NOT_FOUND;
         adapter_index++) {
        // enumerate outputs
        IDXGIOutput *output;
        for (int output_index = 0;
             adapter->EnumOutputs(output_index, &output) != DXGI_ERROR_NOT_FOUND;
             output_index++) {
            DXGI_OUTPUT_DESC output_desc;
            output->GetDesc(&output_desc);
            if (output_desc.AttachedToDesktop) {
                // printf("this display dimensions (%i,%i,%i,%i)\n",
                //        output_desc.DesktopCoordinates.top,
                //        output_desc.DesktopCoordinates.left,
                //        output_desc.DesktopCoordinates.bottom,
                //        output_desc.DesktopCoordinates.right);
                if (output_desc.DesktopCoordinates.left <= state->capture_window_left &&
                    output_desc.DesktopCoordinates.right >= state->capture_window_right &&
                    output_desc.DesktopCoordinates.top <= state->capture_window_top &&
                    output_desc.DesktopCoordinates.bottom >= state->capture_window_bottom) {

                    // printf("Display output found. DeviceName=%ls  AttachedToDesktop=%d Rotation=%d DesktopCoordinates={(%d,%d),(%d,%d)}\n",
                    //         output_desc.DeviceName,
                    //         output_desc.AttachedToDesktop,
                    //         output_desc.Rotation,
                    //         output_desc.DesktopCoordinates.left,
                    //         output_desc.DesktopCoordinates.top,
                    //         output_desc.DesktopCoordinates.right,
                    //         output_desc.DesktopCoordinates.bottom);
                    
                    state->captured_display_left = output_desc.DesktopCoordinates.left;
                    state->captured_display_right = output_desc.DesktopCoordinates.right;
                    state->captured_display_bottom = output_desc.DesktopCoordinates.bottom;
                    state->captured_display_top = output_desc.DesktopCoordinates.top;

                    // QueryInterface for IDXGIOutput1 rather than C-casting.
                    IDXGIOutput1 *output1 = NULL;
                    output->QueryInterface(__uuidof(IDXGIOutput1), (void**)&output1);
                    hr = output1->DuplicateOutput(state->device, &state->output_duplication);
                    output1->Release();
                    if (FAILED(hr)) {
                        printf("Output Duplication Failed\n");
                        printf("%#x\n", hr);
                        return -1;
                    }
                    // printf("Output Duplicated\n");
                }
            }
            output->Release();
        }
        adapter->Release();
    }

    state->width = state->capture_window_right - state->capture_window_left;
    state->height = state->capture_window_bottom - state->capture_window_top;
    state->components = 4;

    // Return the size of the buffer needed to copy captures to.
    return state->width * state->height * state->components;
}

int
captureStateCaptureFrame(CaptureState* state, uint8_t* copy_to_buffer) {
    DXGI_OUTDUPL_FRAME_INFO capture_frame_info;
    IDXGIResource *resource;

    HRESULT hr = state->output_duplication->AcquireNextFrame(0,
                                                             &capture_frame_info,
                                                             &resource);
    if (FAILED(hr)) {
        // no new frame
        return 0;
    }

    resource->QueryInterface(__uuidof(ID3D11Texture2D), (void**)&state->capture_texture);
    resource->Release();

    if (!state->region_copy_texture) {
        D3D11_TEXTURE2D_DESC capture_texture_desc;
        state->capture_texture->GetDesc(&capture_texture_desc);

        D3D11_TEXTURE2D_DESC region_texture_desc;
        ZeroMemory(&region_texture_desc, sizeof(region_texture_desc));

        region_texture_desc.Width = state->width;
        region_texture_desc.Height = state->height;
        region_texture_desc.MipLevels = 1;
        region_texture_desc.ArraySize = 1;
        region_texture_desc.SampleDesc.Count = 1;
        region_texture_desc.SampleDesc.Quality = 0;
        region_texture_desc.Usage = D3D11_USAGE_STAGING;
        region_texture_desc.Format = capture_texture_desc.Format;
        region_texture_desc.BindFlags = 0;
        region_texture_desc.CPUAccessFlags = D3D11_CPU_ACCESS_READ;
        region_texture_desc.MiscFlags = 0;

        hr = state->device->CreateTexture2D(&region_texture_desc, NULL, &state->region_copy_texture);
        if (FAILED(hr)) {
            return -1;
        }
    }
    
    // copy region of screen to texture
    D3D11_BOX source_region;
    source_region.left = state->capture_window_left;
    source_region.right = state->capture_window_right;
    source_region.top = state->capture_window_top;
    source_region.bottom = state->capture_window_bottom;
    source_region.front = 0;
    source_region.back = 1;

    state->device_context->CopySubresourceRegion(state->region_copy_texture, 0, 0, 0, 0, state->capture_texture, 0, &source_region);
    state->region_copy_texture->QueryInterface(__uuidof(IDXGISurface), (void**)&state->region_copy_surface);

    DXGI_MAPPED_RECT rect;
    hr = state->region_copy_surface->Map(&rect, DXGI_MAP_READ);
    if (FAILED(hr)) {
        return -1;
    }
    
    uint8_t *dest = copy_to_buffer;
    uint8_t *src = rect.pBits;
    for (int row = 0; row < state->height; row++) {
        memcpy(dest, src, state->width * 4);
        dest += state->width * 4;
        src += rect.Pitch;
    }

    state->region_copy_surface->Unmap();
    state->region_copy_surface->Release();
    state->capture_texture->Release();
    state->output_duplication->ReleaseFrame();
    
    return 1;
}

// We make the state static and global to make it easier to interact with python.
static CaptureState capture_state;

extern "C" {
    int
    get_image(uint8_t* copy_to) {
        memcpy((void*)copy_to, (void*)test_image_rgba, sizeof(test_image_rgba));
        return 1;
    }

    void
    list_processes() {
        listProcesses();
    }

    int
    init(const char* window_name) {
        return captureStateInit(&capture_state, window_name);
    }

    int
    get_capture_height() {
        return capture_state.height;
    }

    int
    get_capture_width() {
        return capture_state.width;
    }

    int
    get_capture_num_components() {
        return capture_state.components;
    }

    int
    capture_frame(uint8_t* copy_to_buffer) {
        return captureStateCaptureFrame(&capture_state, copy_to_buffer);
    }

}

I build it like this.

@echo off

cl -Zi -W3 -nologo^
   ../aissac.cpp^
   Dxgi.lib D3D11.lib user32.lib gdi32.lib shell32.lib Shcore.lib^
   -LD /link^
   /EXPORT:get_image^
   /EXPORT:init^
   /EXPORT:get_capture_height^
   /EXPORT:get_capture_width^
   /EXPORT:get_capture_num_components^
   /EXPORT:capture_frame^
   /EXPORT:list_processes || goto :error

goto :EOF

:error
popd
exit /b %errorlevel%

And then it can be used from python like this.

import numpy as np

from cffi import FFI
ffi = FFI()
lib = ffi.dlopen('aissac.dll')
ffi.cdef('''
    void list_processes();
    int init(const char* window_name);
    int get_capture_height();
    int get_capture_width();
    int get_capture_num_components();
    int capture_frame(uint8_t* copy_to_buffer);

    int get_image(uint8_t* copy_to);
''')

c_window_name = ffi.new("char[]", b"Binding of Isaac: Afterbirth+")
buffer_size = lib.init(c_window_name)

capture_height = lib.get_capture_height()
capture_width = lib.get_capture_width()
capture_components = lib.get_capture_num_components()

raw_buffer = np.empty(buffer_size, np.uint8)

while True:
    cap = lib.capture_frame(ffi.cast("uint8_t *", raw_buffer.ctypes.data))
    # reshape to get it as a hxwxc numpy tensor instead of just one array.
    capture = raw_buffer.reshape((capture_height,
                                  capture_width,
                                  capture_components))
    # do something

It might be possible to copy from a DirectX texture straight into a PyTorch tensor without going through system memory. There are some DirectX / CUDA interop libraries that I haven't looked at yet. That would be faster if we're only passing frames to PyTorch. I haven't tried it because even with the copy through system memory this runs at 60fps.

There were a few issues that stumped me during development. On my laptop the process needs to run on the integrated GPU, not the NVIDIA GPU, to get access to the screen; presumably only the GPU actually driving the display can duplicate its output. Also, the pixel format of the DirectX texture is BGRA, not RGBA, so if you interpret the raw pixels as a typical image format the colors will look wrong. You can swizzle them yourself, or if you're displaying with DirectX or OpenGL you can set the pixel format. Since my bot code will work off raw pixels, the channel order doesn't really matter.
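If you do want RGBA, the swizzle is a one-liner with numpy once the capture is reshaped to height x width x 4 (the pixel values below are made up):

```python
import numpy as np

# A fake 1x2 BGRA capture: one blue pixel, one red pixel.
capture_bgra = np.array([[[255, 0, 0, 255],    # B,G,R,A -> blue
                          [0, 0, 255, 255]]],  # B,G,R,A -> red
                        dtype=np.uint8)

# Reorder the channel axis from B,G,R,A to R,G,B,A.
capture_rgba = capture_bgra[:, :, [2, 1, 0, 3]]
```

Fancy indexing like this makes a copy, so it's worth skipping unless something downstream actually needs RGBA.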

Next I have to figure out how to write a program that plays the game.

If you're interested in OSX I have some similar code you can check out here. It's not wrapped in Python and the API is a little weird: you pass it a callback that fires when the screen refreshes, and the callback runs on an OSX-managed thread, so getting the data back to your program takes some trickery. I have an example in there but it's probably not the best way.

It's my hope to combine both of these (and a version for linux) into a library. I'm also very interested in running GUI apps headless in linux containers so a whole cloud of bots can play them. I haven't figured that one out yet.