Generating Character Names in Unity from External Files (Part II)

Check out the (free!) Unity asset, complete with readme and a basic usage guide, on GitHub!

Where we started

Last time we discussed our design goals for the Cultural Names asset and we jumped right in with some simple .csv parsing. We also talked through the way we chose to organize each culture into a single dictionary, with the original filename serving as the key and the value a jagged array of all the possible components in a name: surname, forenames (both male and female), titles, and suffixes.

Building a name from the arrays

Now that the dictionary is loaded into memory whenever we call the method outlined last time, getting a name is very straightforward. We generally prefer .NET’s System.Random class to Unity’s, so we create a static instance of Random:

using System;

static System.Random rnd = new System.Random();

Then we can use rnd.Next(maxValue), a helpful overload that returns a random int from zero up to, but not including, the maximum value given as a parameter. You can define a minimum value if needed, but for our purposes we want it to be zero since that’s the starting index of our jagged array, and the exclusive upper bound means passing an array’s Length can never produce an out-of-range index. The method we’ve defined takes the array and a NameType as parameters; the NameType is an enum that defines whether the name is a surname, forename, etc. It returns a string that’s a random name taken from that list. You’ll note we cast the NameType to an int; this is an easy way to use enums to index into an array, but it’s important not to assign explicit values to the enum’s members when you declare it, otherwise the default numbering (0, 1, 2, and so on, in declaration order) won’t line up with the array.

/// <summary>
/// Gets a single random name element from a given jagged array.
/// </summary>
public static string GetNameElement(string[][] array, NameType nameType)
{
    return array[(int)nameType][rnd.Next(array[(int)nameType].Length)];
}

Remember, the first dimension of these jagged arrays defines which component of the name we’re addressing, so the index is simply the NameType cast to an int; the second dimension is the arbitrarily long list of names. One tricky thing that always trips me up is the syntax for accessing the dimensions in a jagged array. If you want to get the Length of a jagged array’s first dimension, it’s just array.Length as you would normally do; if you want the Length of the second dimension, though, it’s array[first].Length, where first is the index of the inner array you’re in. Pretty obvious, but I always forget!
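To make the indexing concrete, here is a minimal sketch with made-up sample data. The JaggedArrayDemo class, the order of the enum members, and the sample names are all assumptions for illustration; the asset’s real NameType enum and CSV contents may differ.

```csharp
using System;

public static class JaggedArrayDemo
{
    // Assumed member order, for illustration only. No explicit values are
    // assigned, so the members are numbered 0, 1, 2 in declaration order.
    public enum NameType { SURNAME, MALE, FEMALE }

    // Made-up sample data: one inner array per NameType, each a different length.
    public static readonly string[][] Sample =
    {
        new[] { "Smith", "Jones", "Taylor", "Brown" }, // NameType.SURNAME (index 0)
        new[] { "John", "Peter" },                     // NameType.MALE    (index 1)
        new[] { "Mary", "Anne", "Ruth" }               // NameType.FEMALE  (index 2)
    };

    static readonly Random rnd = new Random();

    // Number of name categories: the length of the outer array.
    public static int CategoryCount(string[][] array)
    {
        return array.Length;
    }

    // Number of names in one category: the length of that inner array.
    public static int NameCount(string[][] array, NameType nameType)
    {
        return array[(int)nameType].Length;
    }

    // Picks one random entry from the chosen category.
    public static string Pick(string[][] array, NameType nameType)
    {
        string[] row = array[(int)nameType];
        // Next(n) returns a value from 0 to n - 1, so the index is always valid.
        return row[rnd.Next(row.Length)];
    }
}
```

With this sample data, CategoryCount(Sample) is 3, while NameCount(Sample, NameType.MALE) is 2 and NameCount(Sample, NameType.SURNAME) is 4, which is exactly the "jagged" part.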

But how do we give this handy little method the right array? That’s handled by a check to the dictionary using TryGetValue(). Here’s the method that generates a random forename:

/// <summary>
/// Get a random name of a given kind, not guaranteed to be unique
/// </summary>
public static string RandomForename(Gender gender, string culture)
{
    string[][] names; //assigned by TryGetValue() through the out parameter

    NameType nametype = (gender == Gender.MALE) ? NameType.MALE : NameType.FEMALE;

    if(loadedNames.TryGetValue(culture, out names))
    {
        return GetNameElement(names, nametype);
    }

    Debug.Log("Name generation failed. Check the 'culture' parameter for errors.");
    return " - ";
}

The advantage of TryGetValue() is that it doesn’t throw an exception when the key is missing, as the dictionary’s indexer would; it simply returns false. We catch that case here and write a message to Unity’s debug log suggesting a check of the culture parameter.
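The difference between the two lookup styles can be sketched as follows. LookupDemo, its dictionary, and the "english" key are hypothetical stand-ins rather than part of the asset, and the sketch avoids Unity types so it can run anywhere.

```csharp
using System;
using System.Collections.Generic;

public static class LookupDemo
{
    // Hypothetical stand-in for the loaded-names dictionary; the key is made up.
    static readonly Dictionary<string, string[][]> loaded =
        new Dictionary<string, string[][]>
        {
            { "english", new[] { new[] { "Smith" }, new[] { "John" }, new[] { "Mary" } } }
        };

    // Returns true and fills 'names' when the culture exists; otherwise
    // returns false and sets 'names' to null, without throwing.
    public static bool TryFind(string culture, out string[][] names)
    {
        return loaded.TryGetValue(culture, out names);
    }

    // The indexer, by contrast, throws KeyNotFoundException on a missing key.
    public static bool IndexerThrows(string culture)
    {
        try
        {
            string[][] found = loaded[culture];
            return found == null; // key exists, so nothing was thrown
        }
        catch (KeyNotFoundException)
        {
            return true;
        }
    }
}
```

A misspelled culture such as "englsh" makes TryFind return false, whereas the indexer would throw, which is why the lookup in RandomForename() can fall through to a log message instead of crashing.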

So we’re grabbing names now! In the asset there’s also a method for generating a random full name, returning a Name struct (more on that later). But you may have noticed an issue: duplicate names. As alluded to in the comments on the method above, there’s no guarantee the name returned is unique; it’s merely returning a random name from that array without knowledge of past names. So how do we guarantee names aren’t duplicated?

Preventing duplicates

Preventing duplicate names could be done in a number of ways, but we elected to use a hashing system. For the uninitiated, a hash is the output of a hash function: an algorithm that takes input of arbitrary size and returns a bit string of fixed size that’s easy to compare. Cryptographic hash functions were devised for cryptography, although they are also very useful for detecting duplicate data or data corruption. The tremendous advantage is that it is incredibly unlikely that two hashes would collide (that is, different starting data producing identical hashes), and hashing is relatively easy to use.

Could we have simply made a list of all the names that have been generated and compared those strings? Yes. But this is good practice, and more fun, and it scales nicely. We’re using the MD5 algorithm for this project, but do not use it for actual cryptographic purposes! It’s vulnerable to collision attacks and is essentially worthless for protecting against intentional tampering. It is, however, still elegant and quick, perfect for catching accidental corruption or, as here, duplicates. The operations are handled by .NET’s built-in System.Security.Cryptography namespace, and the general strategy is as follows: generate a random name; hash that name; compare it against an existing list of hashed names; re-roll a new random name if it conflicts, otherwise add this new hash to the list of generated names and return the shining new unique name. The hashing method looks like this:

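using System.Text;
using System.Security.Cryptography;

/// <summary>
/// Hashes a name using MD5 algorithm and returns as a string.
/// </summary>
static string HashName(Name name)
{
    StringBuilder hashedName = new StringBuilder();

    //UTF-8 keeps accented and other non-ASCII characters distinct;
    //Encoding.ASCII would replace each of them with '?'
    byte[] array = Encoding.UTF8.GetBytes(name.ToString());

    //MD5 is IDisposable, so release it once the hash is computed
    using (MD5 md5Hasher = MD5.Create())
    {
        //hash the byte array and format each byte as two lowercase hex digits
        foreach (byte b in md5Hasher.ComputeHash(array))
        {
            hashedName.Append(b.ToString("x2"));
        }
    }

    return hashedName.ToString();
}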

You’ll note the MD5 hasher takes a byte array for its input, so we first convert the name to a string, then to a byte array, then feed that byte array into the hasher. The hasher returns another byte array, which we format as a hexadecimal string. We have a static list of hashed names that is stored in the NameBuilder class and some simple methods for adding and removing hashes from this list; make sure to look at the full source code on GitHub in the #Initialization and List Management region to see the implementation. So now let’s look at the final method that uses this hasher and compares the generated name:

///<summary>
/// Returns a random name guaranteed to be unique from a given culture.
///</summary>
public static Name RandomUniqueName(Gender gender, string culture)
{
    string forename = " - ";
    string surname = " - ";
    string[][] names; //assigned by TryGetValue() through the out parameter
    Name name;

    NameType nameType = (gender == Gender.MALE) ? NameType.MALE : NameType.FEMALE;

    if(loadedNames.TryGetValue(culture, out names))
    {
        int iterations = 0;
        do
        {
            forename = GetNameElement(names, nameType);
            surname = GetNameElement(names, NameType.SURNAME);
            name = new Name(forename, surname);
            iterations++;
            if(iterations > 20) //sanity check to prevent infinite loops
            {
                Debug.Log("Too few names in the CSV; multiple collisions have occurred.");
                break;
            }
        }
        while(!IsUniqueName(name));

        AddName(name); //add this name to the hashed list
        return name;
    }

    Debug.Log("Name generation failed. Check the 'culture' parameter for errors.");
    return new Name(forename, surname);
}

Of note is the way we test the name. We generate a name (both forename and surname) and use a do while loop to try to find a unique name. The IsUniqueName() method returns a bool and looks like this:

/// <summary>
/// Measures given name against all hashes stored in static list to prevent duplicates.
/// </summary>
static bool IsUniqueName(Name name)
{
    string hashedName = HashName(name);

    return !activeHashedNames.Contains(hashedName);
}

You’ll note we’re counting the iterations of the do while loop. This is critical, because in some edge cases you could conceivably loop forever and cause major issues. One edge case would be if the modder only put one or two names into each of the NameTypes for a particular culture; in that case all possible names would be quickly used up, leaving only colliding names for the generator. We arbitrarily capped it at twenty attempts before failure; this is much higher than you’d ever see with a sufficiently large pool of names, but low enough that you won’t see any performance issues. Be aware that when the cap is hit, the loop breaks and the method still returns the last name it generated, so in that edge case the result can be a duplicate; the log message is the only signal.
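If you would rather have the caller know when the pool is exhausted, one possible variation is a Try-style method that reports failure instead of returning a name. This is a sketch of an alternative, not the asset’s approach: UniquePickDemo, TryPickUnique(), and Reset() are hypothetical names, and the registry stores plain strings in a HashSet rather than MD5 hashes to keep the example short.

```csharp
using System;
using System.Collections.Generic;

public static class UniquePickDemo
{
    static readonly Random rnd = new Random();

    // Hypothetical registry of names already handed out.
    static readonly HashSet<string> used = new HashSet<string>();

    // Tries up to maxAttempts times to build an unused "forename surname" pair.
    // Returns false (and sets fullName to null) when every attempt collided.
    public static bool TryPickUnique(string[] forenames, string[] surnames,
                                     int maxAttempts, out string fullName)
    {
        for (int attempt = 0; attempt < maxAttempts; attempt++)
        {
            string candidate = forenames[rnd.Next(forenames.Length)] + " " +
                               surnames[rnd.Next(surnames.Length)];

            // HashSet.Add returns false if the value was already present, so
            // the uniqueness check and the registration happen in one call.
            if (used.Add(candidate))
            {
                fullName = candidate;
                return true;
            }
        }

        fullName = null;
        return false;
    }

    // Clears the registry, for example when starting a new game.
    public static void Reset()
    {
        used.Clear();
    }
}
```

With a pool of one forename and one surname, the first call succeeds and every later call returns false, so the caller can decide whether to fall back to a duplicate, switch cultures, or report the problem.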

Conclusion

The whole name is stored in a Name struct with a series of public strings inside; this is just a container that keeps everything together and makes it neat to pass names around to the parts of your code that need them.
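For readers who haven’t opened the source yet, a container of this kind might look roughly like the sketch below. NameSketch is a hypothetical, pared-down stand-in: the two fields and the ToString() format are assumptions, and the asset’s actual Name struct may hold more parts (such as titles or suffixes) and format them differently.

```csharp
// Hypothetical minimal version of a name container, for illustration only.
public struct NameSketch
{
    public string forename;
    public string surname;

    public NameSketch(string forename, string surname)
    {
        this.forename = forename;
        this.surname = surname;
    }

    // Two names with the same parts produce the same string, which is what
    // makes string-based (or hash-based) duplicate detection work.
    public override string ToString()
    {
        return forename + " " + surname;
    }
}
```

Because the struct overrides ToString(), it can be passed straight to logging or hashing code without the caller needing to know how the parts are joined.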

Well, that wraps up our discussion of Cultural Names! Don’t forget to grab the asset from GitHub and let us know what you think about it. Leave an issue if you have a problem or a suggestion.